Confounding variable
A confounding variable (also confounding factor, lurking variable, confound, or confounder) is an extraneous variable in a statistical model that correlates with both the dependent variable and the independent variable. The methodologies of scientific studies therefore need to control for these factors to avoid what is known as a type 1 error: a 'false positive' conclusion that the dependent variable is in a causal relationship with the independent variable. Such a relation between two observed variables is termed a spurious relationship. Thus, confounding is a major threat to the validity of inferences made about cause and effect (internal validity), as the observed effects should be attributed to the confounder rather than the independent variable.
For example, assume that a child's weight and a country's gross domestic product (GDP) rise with time. A person carrying out an experiment could measure weight and GDP, and conclude that a higher GDP causes children to gain weight. However, the confounding variable, time, was not accounted for, and is the real cause of both rises.
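The weight and GDP example can be illustrated with a small simulation. This is only a sketch under stated assumptions: the growth rates, noise levels, and the 120-month window are invented for illustration, and removing a linear time trend by least squares is one simple way of "accounting for" time, not the only one.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def detrend(ys, ts):
    """Residuals of ys after a least-squares linear fit on ts."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    slope = (sum((t - mt) * (y - my) for t, y in zip(ts, ys))
             / sum((t - mt) ** 2 for t in ts))
    return [y - my - slope * (t - mt) for t, y in zip(ts, ys)]

rng = random.Random(0)
months = list(range(120))

# Hypothetical data: both series depend on time only, with independent noise.
weight = [10 + 0.2 * t + rng.gauss(0, 1) for t in months]   # child's weight, kg
gdp = [100 + 0.5 * t + rng.gauss(0, 3) for t in months]     # GDP, arbitrary units

raw_r = pearson(weight, gdp)
adj_r = pearson(detrend(weight, months), detrend(gdp, months))

print(f"correlation ignoring time:      {raw_r:.2f}")
print(f"correlation after removing time: {adj_r:.2f}")
```

In this simulated setting the raw correlation is close to 1 even though neither series influences the other, and it largely disappears once the shared time trend is removed from both.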
By definition, a confounding variable is associated with both the probable cause and the outcome. The confounder must not lie on the causal pathway between the cause and the outcome: if A is thought to be the cause of disease C, the confounding variable B may not be solely caused by A, and B must not always lead to C. For example, being female does not always lead to smoking tobacco, and smoking tobacco does not always lead to cancer. Therefore, any study that tries to elucidate the relation between being female and cancer should take smoking into account as a possible confounder. In addition, a confounder is always a risk factor that has a different prevalence in the two risk groups (e.g. females/males) (Hennekens, Buring & Mayrent, 1987).
Though criteria for causality in statistical studies have been researched intensely, Pearl has shown that confounding variables cannot be defined in terms of statistical notions alone; some causal assumptions are necessary.[1] In a 1965 paper, Austin Bradford Hill proposed a set of causal criteria.[2] Many working epidemiologists take these as a good place to start when considering confounding and causation. However, these are of heuristic value at best. When causal assumptions are articulated in the form of a causal graph, a simple criterion, called the backdoor criterion, is available to identify sets of confounding variables.
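As a brief sketch of how the backdoor criterion is used, the notation below is an assumption made for illustration: X is the putative cause, Y the outcome, and Z a set of measured variables in the causal graph. Z satisfies the criterion if no member of Z is a descendant of X and Z blocks every path between X and Y that contains an arrow into X. When such a set exists, the causal effect of X on Y can be computed from observational quantities by adjusting for Z:

```latex
P\bigl(y \mid \mathrm{do}(x)\bigr) \;=\; \sum_{z} P(y \mid x, z)\, P(z)
```

The left-hand side is the distribution of Y under an intervention that sets X to x; the right-hand side involves only ordinary conditional and marginal probabilities.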
There are two basic ways to exclude or control confounding variables in a study.[3] The first is to eliminate confounding by study design, usually by matching; the second is to control for confounding by measuring the known confounders and using stratification, multivariate analysis, or both. The basic problem in epidemiology is that confounding can never be removed entirely, either because the confounding variables cannot be measured with complete accuracy or because they are not known. This leads to 'residual confounding', epidemiological jargon for incompletely controlled confounding.
Matching is often used in case-control studies. Typically, for each case identified, a control is selected of the same gender, close in age, and possibly from the same area or the same health care facility. This serves two purposes. First, it makes choosing controls easier: in a typical situation there are far more potential controls than cases, so it is very useful to have a guide for selecting them. Second, it means that cases and controls are reasonably similar in age and sex. This has many advantages. While multivariate analysis can accommodate some differences in these variables, so many other variables depend on age and sex that it is much simpler if the two groups have similar age and sex profiles.
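A minimal sketch of one-to-one matching on sex and age follows. The `Person` record, the `match_controls` helper, the two-year age tolerance, and the greedy nearest-age rule are all hypothetical choices made for this illustration; real studies use a variety of matching schemes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Person:
    pid: str
    sex: str
    age: int

def match_controls(cases, pool, max_age_gap=2):
    """Greedily pair each case with an unused control of the same sex
    whose age is within max_age_gap years; None if no control qualifies."""
    available = list(pool)
    pairs = {}
    for case in cases:
        candidates = [p for p in available
                      if p.sex == case.sex
                      and abs(p.age - case.age) <= max_age_gap]
        if not candidates:
            pairs[case.pid] = None
            continue
        best = min(candidates, key=lambda p: abs(p.age - case.age))
        available.remove(best)  # each control is used at most once
        pairs[case.pid] = best
    return pairs

# Hypothetical cases and pool of potential controls.
cases = [Person("case1", "F", 54), Person("case2", "M", 61),
         Person("case3", "F", 30)]
pool = [Person("c1", "F", 52), Person("c2", "M", 60), Person("c3", "F", 55),
        Person("c4", "M", 45), Person("c5", "F", 40)]

matched = match_controls(cases, pool)
for case_id, control in matched.items():
    print(case_id, "->", control.pid if control else "no match")
```

Here the third case has no eligible control within the tolerance, which shows a practical cost of tight matching: some cases may go unmatched unless the pool is large.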
- Stratification is an important and underestimated analytical approach. It involves preparing separate analyses within subgroups of the study population. Typically these would be groups of age and sex. This allows one to examine the relationship between the outcome and the explanatory variables in subsets in which the relationship may be simpler and clearer. This is a way of assessing statistical interaction.
- Multivariate analysis is the term used to describe a series of statistical methods used to analyse the relationship between an outcome variable and a number of potential explanatory variables. Simpler methods commonly used in the analysis of epidemiological studies are logistic regression for case-control studies and Poisson regression for cohort studies.
More recently, epidemiologists have used a very wide range of models, including survival analysis, generalised additive models, and fully Bayesian models. There are some very tricky ideas, and some major assumptions, inherent in the whole process of 'controlling' for confounding with multivariate models, but it remains practically important.[4]
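Stratification can be sketched with a small numerical example. The counts below are invented, the stratifying variable (smoking) is assumed for illustration, and the Mantel-Haenszel pooled odds ratio is used here as one standard way of combining strata; it is not named in the text above.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table:
    a = exposed cases,   b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases."""
    return (a * d) / (b * c)

def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio over a list of (a, b, c, d) strata."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

# Hypothetical counts, stratified by the suspected confounder.
strata = {
    "smokers":     (40, 60, 20, 30),
    "non-smokers": (5, 45, 10, 90),
}

# Crude table: ignore the strata and add the counts cell by cell.
crude_table = tuple(sum(cells) for cells in zip(*strata.values()))
crude_or = odds_ratio(*crude_table)
stratum_ors = {name: odds_ratio(*t) for name, t in strata.items()}
pooled_or = mantel_haenszel_or(list(strata.values()))

print(f"crude OR: {crude_or:.2f}")
for name, value in stratum_ors.items():
    print(f"OR among {name}: {value:.2f}")
print(f"Mantel-Haenszel OR: {pooled_or:.2f}")
```

With these made-up counts the crude odds ratio is about 1.71, while the odds ratio within each stratum and the pooled estimate are both 1.00: the apparent association arises only because exposure is more common among smokers, who also have the higher risk.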
These sites contain descriptions or examples of lurking variables:
- Linear Regression (Yale University)
- Scatterplots (Simon Fraser University)
- Pearl, J. "Why there is no statistical test for confounding, why many think there is, and why they are almost right," UCLA Computer Science Department, Technical Report R-256, January 1998
- [1] Pearl, Judea (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press. ISBN 0-521-77362-8.
- [2] Bradford Hill, Austin (1965). "The environment or disease: association or causation?". Proc R Soc Med 58 (May): 295-300. PMID 14283879.
- [3] Mayrent, Sherry L (1987). Epidemiology in Medicine. Lippincott Williams & Wilkins. ISBN 0-316-35636-0.
- [4] Freedman, David (2005). Statistical Models: Theory and Practice. Cambridge University Press. ISBN 0-521-67105-1.
- Anecdotal evidence
- Joint effect
- Simpson's paradox
- Procedural confound
- Operational confound
- Person confound