A pharmaceutical company might do a controlled clinical study of a new drug. Half of the test subjects get the new drug, and half get a placebo. The null hypothesis is that the drug and the placebo are equally effective. The company hopes to prove that the null hypothesis is false, and that the drug is better than the placebo. The proof consists of a trial with more subjects doing better with the drug, and a statistical argument that the difference was unlikely to be pure luck.
To see how a statistical disproof of a null hypothesis would work, consider a trial consisting of 100 coin tosses. The null hypothesis is that heads and tails are equally likely. That means that we would get about 50 heads in a trial, on average, with the typical variation measured by a number called "sigma" (the standard deviation). The next step in the analysis is to figure out what sigma is. For n tosses of a fair coin, sigma is half the square root of n, so for 100 tosses sigma is 5. That means that the number of heads in a typical trial run will differ from 50 by about 5. About two thirds of the trials will be within one sigma, or between 45 and 55. About 95% will be within two sigmas, or between 40 and 60. More than 99% will be within three sigmas, or between 35 and 65.
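The sigma figures above can be checked directly. The following is a minimal sketch, not a standard library routine: the helper names binomial_sigma and prob_heads_between are made up for illustration. It computes sigma for 100 fair tosses and the exact binomial probability of landing within one, two, and three sigmas of 50. Because the exact counts include the endpoints, the fractions come out somewhat higher than the two-thirds and 95% rules of thumb.

```python
from math import comb, sqrt

def binomial_sigma(n: int, p: float = 0.5) -> float:
    """Standard deviation of the number of heads in n tosses."""
    return sqrt(n * p * (1 - p))

def prob_heads_between(n: int, lo: int, hi: int, p: float = 0.5) -> float:
    """Exact probability that the number of heads is in [lo, hi]."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(lo, hi + 1))

n = 100
sigma = binomial_sigma(n)  # 5.0 for 100 fair tosses
print(f"sigma = {sigma}")

for width in (1, 2, 3):
    lo = int(n / 2 - width * sigma)
    hi = int(n / 2 + width * sigma)
    share = prob_heads_between(n, lo, hi)
    print(f"within {width} sigma ({lo} to {hi} heads): {share:.4f}")
```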
Thus you can prove that a coin is biased by tossing it 100 times. If you get more than 65 heads, then either you were very unlucky or the chance of heads was more than 50%. A company can show that its drug is effective by giving it to 100 people, and showing that it is better than the placebo more than 65 times. Then the company can publish a study saying that the probability of getting a result at least that extreme under the (counterfactual) null hypothesis is 0.01 or less. That probability is called the p-value. With a p-value of 0.01, the company can claim that the drug is effective, at what is loosely called 99% confidence.
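As a rough sketch of the arithmetic behind such a claim, the one-sided p-value for the coin example is the chance of seeing at least the observed number of heads when the coin is fair. The function name p_value_at_least is hypothetical, and a real drug study would use a test suited to its design rather than this coin model.

```python
from math import comb

def p_value_at_least(heads: int, n: int = 100) -> float:
    """One-sided p-value: chance of `heads` or more in n tosses of a fair coin."""
    # Each of the 2**n toss sequences is equally likely under the null hypothesis,
    # so count the sequences with at least `heads` heads.
    favorable = sum(comb(n, k) for k in range(heads, n + 1))
    return favorable / 2**n

# "More than 65 heads" means 66 or more.
for observed in (55, 60, 66):
    print(f"{observed} or more heads: p = {p_value_at_least(observed):.5f}")
```

Under this model, 66 or more heads falls below the 0.01 threshold, while 55 or more heads is nowhere near it.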
The p-value is the leading statistic for getting papers published and drugs approved, but it does not really confirm a hypothesis. It just shows an inconsistency between a dataset and a counterfactual hypothesis.
As a practical matter, the p-value is just a statistic that gives journal editors an easier decision on whether to publish a paper. A p-value under 0.05 is considered statistically significant; anything larger is not. Significance does not mean that the paper's conclusions are probably true.
A Nature mag editor writes:
scientific experiments don't end with a holy grail so much as an estimate of probability. For example, one might be able to accord a value to one's conclusion not of "yes" or "no" but "P<0.05", which means that the result has a less than one in 20 chance of being a fluke. That doesn't mean it's "right".
This explanation is essentially correct, but some scientists who should know better argue that it is wrong and anti-science. A fluke is an accidental (and unlikely) outcome under the (counterfactual) null hypothesis. The scientific paper says that either the experiment was a fluke or the null hypothesis was wrong. The frequentist philosophy that underlies the computation does not allow giving a probability on a hypothesis. So the reader is left to deduce that the null hypothesis was wrong, assuming the experiment was not a fluke.
One thing that never gets emphasised enough in science, or in schools, or anywhere else, is that no matter how fancy-schmancy your statistical technique, the output is always a probability level (a P-value), the "significance" of which is left for you to judge – based on nothing more concrete or substantive than a feeling, based on the imponderables of personal or shared experience. Statistics, and therefore science, can only advise on probability – they cannot determine The Truth. And Truth, with a capital T, is forever just beyond one's grasp.
The core of the confusion is over the counterfactual. Some people would rather ignore the counterfactual, and instead think about a subjective probability for accepting a given hypothesis. Those people are called Bayesians, and they argue that their methods are better because they more completely use the available info. But most science papers use the logic of p-values to reject counterfactuals, because assuming the counterfactual requires you to believe that the experiment was a fluke.
Hypotheses are often formulated by combing datasets and looking for correlations. For example, if a medical database shows that some of the same people suffer from obesity and heart disease, one might hypothesize that obesity causes heart disease. Or maybe that heart disease causes obesity. Or that overeating causes both obesity and heart disease, but they otherwise don't have much to do with each other.
The major caution with this approach is that correlation does not imply causation. A correlation can tell you that two measures are related, but the operation is symmetrical and cannot say what causes what. Establishing causality requires some counterfactual analysis. The simplest way in a drug study is to randomly give some patients a placebo instead of the drug in question. That way, the intended treatment can be compared to counterfactuals.
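The symmetry is easy to see in a small sketch. The numbers below are made up purely for illustration, and the pearson helper is a hand-written version of the usual Pearson correlation coefficient, not a call into any particular library. Swapping the two measures gives the same value, so the coefficient alone cannot say which one is the cause.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data, invented for this example only.
body_mass_index = [22.0, 25.5, 27.0, 31.0, 34.5, 38.0]
heart_risk_score = [1.0, 1.5, 2.5, 2.0, 3.5, 4.0]

r_forward = pearson(body_mass_index, heart_risk_score)
r_backward = pearson(heart_risk_score, body_mass_index)
print(r_forward, r_backward)  # the two values agree
```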
A counterfactual theory of causation has been worked out by Judea Pearl and others. His 2000 book begins:
Neither logic, nor any branch of mathematics had developed adequate tools for managing problems, such as the smallpox inoculations, involving cause-effect relationships. Most of my colleagues even considered causal vocabulary to be dangerous, avoidable, ill-defined, and nonscientific. "Causality is endless controversy," one of them warned. The accepted style in scientific papers was to write "A implies B" even if one really meant "A causes B," or to state "A is related to B" if one was thinking "A affects B."
His theory is not particularly complex, and could have been worked out a century earlier. Apparently there was resistance to analyzing counterfactuals. Even the great philosopher Bertrand Russell hated causal analysis.
Many theories seem like plausible explanations for observations, but ultimately fail because they offer no counterfactual analysis. For example, the famous theories of Sigmund Freud tell us how to interpret dreams, but do not tell us how to recognize a false interpretation. A theory is not worth much without counterfactuals.