Statistical hypothesis testing
From Wikipedia, the free encyclopedia
A statistical hypothesis test is a method of making statistical decisions from and about experimental data. Null-hypothesis testing just answers the question of "how well the findings fit the possibility that that chance factors alone might be responsible."[1] This is done by asking and answering a hypothetical question. One use is deciding whether experimental results contain enough information to cast doubt on conventional wisdom.
One may be faced with the problem of making a definite decision with respect to an uncertain hypothesis which is known only through its observable consequences. A statistical hypothesis test, or more briefly, hypothesis test, is an algorithm to state the alternative (for or against the hypothesis) which minimizes certain risks.
This article describes the commonly used frequentist treatment of hypothesis testing. From the Bayesian point of view, it is appropriate to treat hypothesis testing as a special case of normative decision theory (specifically a model selection problem) and it is possible to accumulate evidence in favor of (or against) a hypothesis using concepts such as likelihood ratios known as Bayes factors.
There are several preparations we make before we observe the data.
- The null hypothesis must be stated in mathematical/statistical terms that make it possible to calculate the probability of possible samples assuming the hypothesis is correct. For example: The mean response to treatment being tested is equal to the mean response to the placebo in the control group. Both responses have the normal distribution with this unknown mean and the same known standard deviation ... (value).
- A test statistic must be chosen that will summarize the information in the sample that is relevant to the hypothesis. In the example given above, it might be the numerical difference between the two sample means, m1 − m2.
- The distribution of the test statistic is used to calculate the probability of sets of possible values (usually an interval or union of intervals). In this example, the difference between sample means would have a normal distribution with a standard deviation equal to the common standard deviation times the factor $\sqrt{1/n_1 + 1/n_2}$, where n1 and n2 are the sample sizes.
- Among all the sets of possible values, we must choose one that we think represents the most extreme evidence against the hypothesis. That is called the critical region of the test statistic. The probability of the test statistic falling in the critical region when the null hypothesis is correct is called the alpha value (or size) of the test.
- The probability that a sample falls in the critical region when the parameter is θ, where θ is a value under the alternative hypothesis, is called the power of the test at θ. The power function of a critical region is the function that maps θ to the power of the test at θ.
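The preparations above can be sketched in code for the treatment-versus-placebo example. This is a minimal illustration, not a definitive implementation: it assumes a two-sided critical region, and the function names and the numbers (a common standard deviation of 10, 50 subjects per group, alpha of 0.05) are hypothetical choices that do not come from the article.

```python
from math import sqrt
from statistics import NormalDist


def difference_sd(sigma, n1, n2):
    """Standard deviation of m1 - m2 when both groups share a known sigma."""
    return sigma * sqrt(1.0 / n1 + 1.0 / n2)


def critical_cutoff(sigma, n1, n2, alpha):
    """Cutoff c of the two-sided critical region |m1 - m2| > c (an assumed shape)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * difference_sd(sigma, n1, n2)


def power(theta, sigma, n1, n2, alpha):
    """Probability that m1 - m2 falls in the critical region when the true
    difference between the population means is theta."""
    c = critical_cutoff(sigma, n1, n2, alpha)
    dist = NormalDist(mu=theta, sigma=difference_sd(sigma, n1, n2))
    return dist.cdf(-c) + (1.0 - dist.cdf(c))


# Hypothetical numbers, chosen only for illustration.
sigma, n1, n2, alpha = 10.0, 50, 50, 0.05
print("sd of difference:", difference_sd(sigma, n1, n2))
print("size of the test:", power(0.0, sigma, n1, n2, alpha))  # power at the null value
print("power at theta=5:", power(5.0, sigma, n1, n2, alpha))
```

Evaluating the power function at the null value (θ = 0) gives back the size of the test, which is a quick check that the critical region was built as intended.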
After the data is available, the test statistic is calculated and we determine whether it is inside the critical region.
If the test statistic is inside the critical region, then our conclusion is one of the following:
- Reject the null hypothesis. (Therefore the critical region is sometimes called the rejection region, while its complement is the acceptance region.)
- An event of probability less than or equal to alpha has occurred.
The researcher has to choose between these logical alternatives. In the example we would say: the observed response to treatment is statistically significant.
If the test statistic is outside the critical region, the only conclusion is that
- There is not enough evidence to reject the null hypothesis.
This is not the same as evidence in favor of the null hypothesis. That we cannot obtain using these arguments, since lack of evidence against a hypothesis is not evidence for it. On this basis, statistical research progresses by eliminating error, not by finding the truth.
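The decision step described above might look as follows for the treatment-versus-placebo example. The sketch assumes a two-sided critical region and a known common standard deviation; the function name `z_test_decision` and the sample values are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_decision(treatment, placebo, sigma, alpha=0.05):
    """Compute the test statistic and report which conclusion applies.

    Assumes both groups are normal with the same known standard deviation
    sigma, and a two-sided critical region of size alpha.
    """
    n1, n2 = len(treatment), len(placebo)
    diff = mean(treatment) - mean(placebo)
    z = diff / (sigma * sqrt(1.0 / n1 + 1.0 / n2))
    cutoff = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    if abs(z) > cutoff:
        # Inside the critical region: reject, or accept that an event of
        # probability at most alpha has occurred.
        return z, "reject the null hypothesis"
    # Outside the critical region: this is not evidence for the null.
    return z, "not enough evidence to reject the null hypothesis"


# Hypothetical measurements, for illustration only.
treatment = [12.1, 14.3, 13.8, 15.0, 13.2, 14.9]
placebo = [11.0, 12.2, 10.9, 12.8, 11.5, 12.0]
z, conclusion = z_test_decision(treatment, placebo, sigma=1.5)
print(round(z, 2), conclusion)
```

Note that the second return value never says "accept the null hypothesis": a statistic outside the critical region only means the data did not provide enough evidence against it.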
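See legend defining symbols at bottom of table.
| Name | Formula | Assumptions |
| One-sample z-test | $z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$ | (Normal distribution or n > 30) and σ known. (z is the distance from the mean in standard deviations. It is possible to calculate a minimum proportion of a population that falls within k standard deviations; see Chebyshev's inequality.) |
| Two-sample z-test | $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$ | Normal distribution and independent observations and (σ1 and σ2 known) |
| One-sample t-test | $t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$, $df = n - 1$ | (Normal population or n > 30) and σ unknown |
| Paired t-test | $t = \dfrac{\bar{d} - d_0}{s_d / \sqrt{n}}$, $df = n - 1$ | (Normal population of differences or n > 30) and σ unknown |
| One-proportion z-test | $z = \dfrac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$ | n·p > 10 and n(1 − p) > 10 |
| Two-proportion z-test, equal variances | $z = \dfrac{p_1 - p_2}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}}$, $\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$ | n1·p1 > 5 and n1(1 − p1) > 5 and n2·p2 > 5 and n2(1 − p2) > 5 and independent observations |
| Two-proportion z-test, unequal variances | $z = \dfrac{p_1 - p_2}{\sqrt{p_1(1 - p_1)/n_1 + p_2(1 - p_2)/n_2}}$ | n1·p1 > 5 and n1(1 − p1) > 5 and n2·p2 > 5 and n2(1 − p2) > 5 and independent observations |
| Two-sample pooled t-test | $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{1/n_1 + 1/n_2}}$, $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$, $df = n_1 + n_2 - 2$ | (Normal populations or n1 + n2 > 40) and independent observations and σ1 = σ2 and (σ1 and σ2 unknown) |
| Two-sample unpooled t-test | $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$, $df = \dfrac{(s_1^2/n_1 + s_2^2/n_2)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$ or, conservatively, $df = \min\{n_1, n_2\} - 1$ | (Normal populations or n1 + n2 > 40) and independent observations and σ1 ≠ σ2 and (σ1 and σ2 unknown) |
| Definition of symbols | n = sample size; x̄ = sample mean; μ0 = population mean; σ = population standard deviation; s = sample standard deviation; t = t statistic; df = degrees of freedom; n1 = sample 1 size; n2 = sample 2 size; s1 = sample 1 std. deviation; s2 = sample 2 std. deviation; sp = pooled std. deviation; d̄ = mean of the paired differences; sd = std. deviation of the paired differences; d0 = hypothesized mean difference; p = hypothesized proportion; p̂ = sample proportion (pooled over both samples in the two-proportion test); x1, x2 = number of successes in samples 1 and 2; p1 = proportion 1; p2 = proportion 2; μ1 = population 1 mean; μ2 = population 2 mean; min{n1,n2} = minimum of n1 or n2 |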
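Several of the statistics named in the table can be computed with a few lines of code. The sketch below uses the standard textbook forms of the one-sample z, one-sample t, pooled t and unpooled (Welch) t statistics, with the null difference between means taken to be zero; the function names are illustrative, and looking up the resulting statistic in a t or normal table is left out.

```python
from math import sqrt
from statistics import mean, stdev, variance


def one_sample_z(sample, mu0, sigma):
    """z statistic when the population standard deviation sigma is known."""
    return (mean(sample) - mu0) / (sigma / sqrt(len(sample)))


def one_sample_t(sample, mu0):
    """t statistic and degrees of freedom when sigma is unknown."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n)), n - 1


def pooled_t(a, b):
    """Two-sample t statistic assuming equal population variances."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return t, n1 + n2 - 2


def unpooled_t(a, b):
    """Two-sample t statistic without the equal-variance assumption,
    with the Welch-Satterthwaite approximation for the degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical samples, for illustration only.
a = [5.1, 4.9, 5.6, 5.8, 5.0]
b = [4.2, 4.8, 4.4, 4.1, 4.9]
print(one_sample_t(a, 5.0))
print(pooled_t(a, b))
print(unpooled_t(a, b))
```

The pooled and unpooled statistics coincide when the two samples have the same size and the same sample variance; they diverge as the variances or sample sizes become unequal.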
The statistics for some other tests have their own page on Wikipedia, including the Wald test and the likelihood ratio test.
Hypothesis testing is largely the product of Ronald Fisher, Jerzy Neyman, Karl Pearson and his son Egon Pearson. Fisher was an agricultural statistician who emphasized rigorous experimental design and methods to extract a result from few samples assuming Gaussian distributions. Neyman (who teamed with the younger Pearson) emphasized mathematical rigor and methods to obtain more results from many samples and a wider range of distributions. Modern hypothesis testing is an (extended) hybrid of the Fisher and Neyman/Pearson formulations, methods and terminology developed in the early 20th century.
Some statisticians have commented that pure "significance testing" has what is actually a rather strange goal: detecting the existence of a "real" difference between two populations. In practice a difference can almost always be found given a large enough sample; what is typically the more relevant goal of science is a determination of causal effect size. The amount and nature of the difference, in other words, is what should be studied. Many researchers also feel that hypothesis testing is something of a misnomer. In practice a single statistical test in a single study never "proves" anything.[citation needed]
"hypothesis testing: generally speaking, this is a misnomer since much of what is described as hypothesis testing is really null-hypothesis testing." (The Sage Dictionary of Statistics, p76, Duncan Cramer, Dennis Howitt, 2004, ISBN 076194138X)
"Statistics do not prove anything." "Billions of supporting examples for absolute truth are outweighed by a single exception." "...in statistics, we can only try to disprove or falsify." (The Tao of Statistics, p91, Keller, 2006, ISBN 1-4129-2473-1)
Even when you reject the null hypothesis, effect sizes should be taken into consideration. If the effect is statistically significant but the effect size is very small, then it is a stretch to consider the effect theoretically important.[citation needed]
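The gap between statistical significance and effect size can be made concrete with a small calculation. The sketch below assumes a two-sample z-test with a known common standard deviation and equal group sizes, and it uses the mean difference divided by the standard deviation as a standardized effect size; that choice of measure, the function names and the numbers are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(mean_diff, sigma, n_per_group):
    """p-value of a two-sample z-test with equal group sizes and known sigma."""
    z = mean_diff / (sigma * sqrt(2.0 / n_per_group))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def standardized_effect(mean_diff, sigma):
    """Mean difference expressed in units of the standard deviation."""
    return mean_diff / sigma


# Hypothetical observed difference of 0.2 on a scale whose sd is 10.
mean_diff, sigma = 0.2, 10.0
for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(mean_diff, sigma, n)
    d = standardized_effect(mean_diff, sigma)
    print(f"n per group = {n:>9}: p = {p:.4f}, standardized effect = {d:.3f}")
```

The standardized effect stays at one fiftieth of a standard deviation throughout, while the p-value moves from clearly non-significant to far below any conventional threshold as the sample grows.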
Philosophical criticism of hypothesis testing includes consideration of borderline cases.
Any process that produces a crisp decision from uncertainty is subject to claims of unfairness near the decision threshold. (Consider close election results.) The premature death of a laboratory rat during testing can impact doctoral theses and academic tenure decisions.
"... surely, God loves the .06 nearly as much as the .05." (Rosnow, R.L. & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-1284.)
The statistical significance required for publication has no mathematical basis, but is based on long tradition.
"It is usual and convenient for experimenters to take 5% as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results." (Mathematics of a Lady Tasting Tea, Ronald Fisher)
Fisher, in the cited article, designed an experiment to achieve a statistically significant result based on sampling 8 cups of tea.
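The arithmetic behind the eight-cup design can be checked directly. The sketch below assumes the usual description of the experiment, in which four cups have milk poured first, four have tea poured first, and the taster must pick out the four milk-first cups; that split is not stated in the text above, and the function name is illustrative.

```python
from math import comb


def prob_at_least(k_correct, cups=8, milk_first=4):
    """Chance of identifying at least k_correct of the milk-first cups
    when the taster picks milk_first cups purely at random."""
    total = comb(cups, milk_first)
    favorable = sum(
        comb(milk_first, k) * comb(cups - milk_first, milk_first - k)
        for k in range(k_correct, milk_first + 1)
    )
    return favorable / total


print(prob_at_least(4))  # all four correct: 1 chance in 70
print(prob_at_least(3))  # three or more correct: 17 chances in 70
```

Under these assumptions only a perfect score falls below the 5% level, since one chance in 70 is about 0.014 while 17 in 70 is about 0.24. Eight cups is therefore roughly the smallest balanced design in which a significant result is possible at all.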
Pedagogic criticism of null-hypothesis testing includes the counter-intuitive formulation, the terminology and confusion about the interpretation of results.
"Despite the stranglehold that hypothesis testing has on experimental psychology, I find it difficult to imagine a less insightful means of transiting from data to conclusions." (Loftus, ibid.)
Students find it difficult to understand the formulation of statistical null-hypothesis testing. In rhetoric, examples often support an argument, but a mathematical proof "is a logical argument, not an empirical one". A single counterexample results in the rejection of a conjecture. Karl Popper defined science by its vulnerability to disproof by data. Null-hypothesis testing shares the mathematical and scientific perspective rather than the more familiar rhetorical one.
Students also find the terminology confusing. While Fisher disagreed with Neyman-Pearson about the theory of testing, their terminologies have been blended. The blend is not seamless or standardized. While this article teaches a pure Fisher formulation, even it mentions Neyman-Pearson terminology (Type II error and the alternative hypothesis). The typical introductory statistics text is less consistent. The Sage Dictionary of Statistics would not agree with the title of this article, which it would call null-hypothesis testing. "...there is no alternate hypothesis in Fisher's scheme: Indeed, he violently opposed its inclusion by Neyman and Pearson." (Cohen, J. 1990. Things I have learned (so far). American Psychologist 45: 1304-1312.) In discussing test results, "significance" often has two distinct meanings in the same sentence: one is a probability, the other is a subject-matter measurement (such as currency).
There is widespread and fundamental disagreement on the interpretation of test results.
"A little thought reveals a fact widely understood among statisticians: The null hypothesis, taken literally (and that's the only way you can take it in formal hypothesis testing), is almost always false in the real world.... If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null hypothesis is always false, what's the big deal about rejecting it?" (Cohen, J. 1990. Things I have learned (so far). American Psychologist 45: 1304-1312.) (The above criticism only applies to point hypothesis tests. If one were testing, for example, whether a parameter is greater than zero, it would not apply.)
"How has the virtually barren technique of hypothesis testing come to assume such importance in the process by which we arrive at our conclusions from our data?" (Loftus, G.R. 1991. On the tyranny of hypothesis testing in the social sciences. Contemporary Psychology 36: 102-105.)
Null-hypothesis testing just answers the question of "how well the findings fit the possibility that that chance factors alone might be responsible." (The Sage Dictionary of Statistics, p76, Duncan Cramer, Dennis Howitt, 2004, ISBN 076194138X)
Null-hypothesis significance testing does not determine the truth or falseness of claims. It determines whether confidence in a claim based solely on a sample-based estimate exceeds a threshold. It is a research quality assurance test, widely used as one requirement for publication.
Practical criticism of hypothesis testing includes the sobering observation that published test results are often contradicted. Mathematical models support the conjecture that most published medical test results are flawed. Null-hypothesis testing has not achieved the goal of a low error probability in medical journals.
"Contradiction and initially stronger effects are not unusual in highly cited research of clinical interventions and their outcomes." Ioannidis JPA (2005) Contradicted and initially stronger effects in highly cited clinical research. JAMA 294: 218-228.
"Most Research Findings Are False for Most Research Designs and for Most Fields" Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
- Comparing means test decision tree
- Counternull
- Multiple comparisons
- Omnibus test
- Behrens-Fisher problem
- Bootstrapping (statistics)
- Checking if a coin is fair
- Falsifiability
- Fisher's method for combining independent tests of significance
- Null hypothesis
- P-value
- Statistical theory
- Statistical significance
- Type I error, Type II error
- [1] The Sage Dictionary of Statistics, p76, Duncan Cramer, Dennis Howitt, 2004, ISBN 076194138X
- A Guide to Understanding Hypothesis Testing
- Bayesian critique of classical hypothesis testing
- Critique of classical hypothesis testing highlighting long-standing qualms of statisticians
- Analytical argumentations of probability and statistics
- Laws of Chance Tables - used for testing claims of success greater than what can be attributed to random chance














