Missing data present a problem in statistical analyses. If missingness is correlated with the outcome of interest, then ignoring it will bias the results of statistical tests. In addition, most statistical software packages (e.g., SAS, Stata) automatically drop observations that have missing values for any variables to be used in an analysis. This practice reduces the analytic sample size, lowering the power of any tests carried out.
There are three approaches to dealing with missing data:
- Impute the missing data: fill in the missing values
- Model the probability of missingness: this is a good option if imputation is infeasible; in certain cases it can account for much of the bias that would otherwise occur
- Ignore the missing data: a poor choice, but by far the most common one
Brief descriptions of alternative approaches to handling missing data follow. The published literature on this subject is large and ever-expanding; readers are encouraged to consult PubMed and other bibliographic sources for the latest research. A separate literature presents formal models of missingness and shows how to use the resulting estimates to adjust regression statistics. These models are not described here, but interested readers can often find articles on the topic in the journal Statistics in Medicine.
I. Listwise (or Case) deletion
In this approach, any case (observation) that contains a missing value for a relevant variable is dropped. Although easy to understand and to perform, it runs the risk of causing bias. SAS and other statistical software packages perform listwise deletion automatically in order to allow the data matrix to be inverted, a necessity for regression analysis. As a result, it amounts to the “default option” in applied statistics.
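As a minimal sketch of what listwise deletion does to the analytic sample, the following uses made-up records in which None marks a missing value. The variable names and data are hypothetical and are not drawn from any particular package or dataset.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000, "outcome": 1},
    {"id": 2, "age": None, "income": 41000, "outcome": 0},
    {"id": 3, "age": 51, "income": None, "outcome": 1},
    {"id": 4, "age": 29, "income": 38000, "outcome": 0},
    {"id": 5, "age": 45, "income": 61000, "outcome": None},
]

def listwise_delete(rows, variables):
    """Keep only rows with an observed value for every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

complete = listwise_delete(records, ["age", "income", "outcome"])
print(len(records), "cases before;", len(complete), "after listwise deletion")
```

Each of the three dropped cases is missing only one value, yet all of its observed information is discarded as well.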
II. Single Imputation
Single imputation refers to filling in a missing value with a single replacement value. There are two general approaches: arbitrary methods and conditional mean imputation.
Some researchers use arbitrary methods to impute missing data. These include using (a) the mean of all observed values for all people, (b) the mean observed value for the same person in other time periods, (c) the mean of the previous and following values for the person, if they exist, or (d) the most recent observed value for the person. The last of these, known as last-observation-carried-forward (LOCF), is quite common. Other arbitrary methods can be devised as well.
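Two of these rules can be sketched in a few lines. The function names and the example series below are illustrative assumptions. The mean-fill function corresponds to method (a) or (b) depending on whether it is given values pooled across people or one person's own series.

```python
from statistics import mean

def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_locf(series):
    """LOCF: carry the most recent observed value forward in time.
    Leading missing values stay missing because nothing precedes them."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical repeated measures for one person, in time order.
scores = [10, None, 14, None, None, 12]
mean_filled = impute_mean(scores)
locf_filled = impute_locf(scores)
print(mean_filled)
print(locf_filled)
```

Both rules insert the same value wherever a gap occurs, so the filled series understates the variability of the underlying data.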
A less arbitrary method that is commonly performed is hot deck imputation. In this method, all observations are divided into groups with similar characteristics. An example might be "White women ages 38-45." To impute a missing value, the researcher randomly draws a value for that variable from the pool of people having similar characteristics. Creating a larger number of subgroups yields some improvement in accuracy, but it can also lead to very small sample sizes within some subgroups. The advantage of hot decking is that it reflects both the mean and variance of the underlying data. The primary drawbacks are the lack of guidance in creating subgroups and the possibility of creating subgroups with very few observations. Moreover, hot deck imputation can produce bias similar to case deletion; see Schafer and Graham (2002) for an illustrative example.
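The hot deck procedure can be sketched as follows. The grouping variables, the blood-pressure values, and the function name are hypothetical, and the choice to leave a value missing when its subgroup has no donors is one assumption among several possible fallbacks.

```python
import random
from collections import defaultdict

def hot_deck_impute(rows, target, group_keys, seed=0):
    """Fill missing `target` values with a random draw from observed
    donors in the same subgroup (defined by `group_keys`)."""
    rng = random.Random(seed)
    donors = defaultdict(list)
    for r in rows:
        if r[target] is not None:
            donors[tuple(r[k] for k in group_keys)].append(r[target])
    imputed = []
    for r in rows:
        new = dict(r)  # copy so the original rows are left untouched
        if new[target] is None:
            pool = donors.get(tuple(r[k] for k in group_keys))
            if pool:  # assumption: leave missing if the subgroup has no donors
                new[target] = rng.choice(pool)
        imputed.append(new)
    return imputed

# Hypothetical data: systolic blood pressure by sex and age band.
rows = [
    {"sex": "F", "age_band": "38-45", "sbp": 118},
    {"sex": "F", "age_band": "38-45", "sbp": 126},
    {"sex": "F", "age_band": "38-45", "sbp": None},
    {"sex": "M", "age_band": "38-45", "sbp": 131},
    {"sex": "M", "age_band": "46-55", "sbp": None},
]
imputed_rows = hot_deck_impute(rows, "sbp", ["sex", "age_band"])
print(imputed_rows)
```

The last row illustrates the small-subgroup problem noted above: its subgroup contains no observed donors, so nothing can be drawn for it.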
Conditional mean imputation
In this strategy, the analyst estimates a regression model in which the dependent variable has missing values for some observations. In the second step, the estimated regression coefficients are used to predict (impute) missing values of that variable. The proper regression model depends on the form of the dependent variable. A probit or logit is used for binary variables, Poisson or other count models for integer-valued variables, and OLS or related models for continuous variables.
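For a continuous variable, the two steps might look like the sketch below. It assumes a single predictor and ordinary least squares, and the ages and blood pressures are invented for illustration. A real analysis would normally use several predictors and a regression routine from a statistical package.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def conditional_mean_impute(x, y):
    """Step 1: fit the model on complete cases.
    Step 2: replace each missing y with its predicted value."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    a, b = fit_simple_ols([c[0] for c in complete], [c[1] for c in complete])
    return [a + b * xi if yi is None else yi for xi, yi in zip(x, y)]

# Hypothetical data: age is fully observed, blood pressure has one gap.
age = [30, 40, 50, 60, 45]
sbp = [110, 120, 130, 140, None]
sbp_filled = conditional_mean_impute(age, sbp)
print(sbp_filled)
```

Every imputed value falls exactly on the fitted line, so this approach, like other single imputation methods, understates the variability of the imputed variable.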
III. Model the Missingness
In some cases a missing value can be represented by a new binary variable. For example, suppose that the highest grade completed is modeled as a series of three binary variables: grades 1-8, grades 9-12, or beyond grade 12. If some values are missing, the researcher can create a fourth binary variable that represents "education value missing." This method has two advantages: no cases are dropped due to the missing education values, and unobserved similarity among people with missing education values will be captured by the new term. Continuous variables such as age or income cannot be modeled in this way, however, unless they are converted into a series of nonoverlapping binary variables (such as Age 0-17, Age 18-35, Age 36-55, and Age 56+).
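The education example can be coded as in the sketch below. The function and variable names are hypothetical, and the grade cut-points follow the three categories described above.

```python
def education_dummies(highest_grade):
    """Map highest grade completed (or None if missing) to four
    mutually exclusive 0/1 indicators, one of which flags missingness."""
    known = highest_grade is not None
    return {
        "grade_1_8": int(known and 1 <= highest_grade <= 8),
        "grade_9_12": int(known and 9 <= highest_grade <= 12),
        "beyond_12": int(known and highest_grade > 12),
        "educ_missing": int(not known),
    }

for grade in (6, 11, 16, None):
    print(grade, education_dummies(grade))
```

In a regression, one of the four indicators would be omitted as the reference category, and every case, including those with missing education, stays in the analytic sample.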
IV. Multiple Imputation
The methods described above are single imputation approaches. They share two general drawbacks. First, researchers must adjust the standard errors of the eventual statistics (such as estimated regression coefficients) to account for the uncertainty behind imputed data. Second, they may cause systematic bias. Consider LOCF: both research and common sense suggest that attrition from clinical studies will more often come from patients who are doing poorly than from those doing well. Carrying forward earlier observations will tend to improve the average outcome as a result.
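The direction of the LOCF bias can be seen in a deliberately tiny, made-up example with two patients. The scores are hypothetical (higher means better health), and the "true" series stands for what would have been recorded with complete follow-up, which is never available in practice.

```python
def locf(series):
    """Carry the most recent observed value forward over missing visits."""
    filled, last = [], None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# Hypothetical health scores at three visits (higher = better).
patients = [
    {"true": [70, 72, 74], "observed": [70, 72, 74]},    # improving, completes study
    {"true": [70, 60, 50], "observed": [70, 60, None]},  # declining, drops out
]

true_final = [p["true"][-1] for p in patients]
locf_final = [locf(p["observed"])[-1] for p in patients]

true_mean = sum(true_final) / len(true_final)
locf_mean = sum(locf_final) / len(locf_final)
print("mean final score with full follow-up:", true_mean)
print("mean final score after LOCF:", locf_mean)
```

Because the patient who drops out was still declining, the carried-forward value is better than the value that would have been observed, and the average final score is overstated.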
Multiple imputation yields significant improvements in the statistical properties of the imputed values. A good summary of multiple imputation comes from Faris et al. (2002):
Multiple imputation methods randomly draw observations from a fitted distribution for the covariates and the outcome variable. For each imputed data set, the missing data are filled in with values drawn randomly [with replacement] from the distribution. Analyses are performed on each data set as though the data had been completely observed. The results of these analyses are then pooled to provide point and variance estimates for the effects of interest (p. 186).
Multiple imputation (MI) avoids both problems associated with single imputation. Proper standard errors are estimated as part of the process, thereby reflecting the added uncertainty that comes from using imputed data. And MI produces unbiased estimates of the eventual statistics under reasonable assumptions.
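The pooling step is commonly carried out with the combining rules given in Rubin (1987). The sketch below assumes the analysis has already been run on each imputed data set; the five coefficient estimates and their squared standard errors are invented for illustration, and the function name is hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine results from m imputed data sets.
    estimates: point estimate from each data set
    variances: squared standard error from each data set
    Returns (pooled estimate, total variance)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average sampling variance
    between = variance(estimates)   # variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical regression coefficient estimated on m = 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]
pooled, total_var = pool_rubin(estimates, variances)
print("pooled estimate:", pooled)
print("pooled standard error:", total_var ** 0.5)
```

The between-imputation term is what single imputation leaves out: it grows when the imputed data sets disagree, widening the pooled standard error to reflect uncertainty about the missing values.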
Most published analyses using MI assume that data are "missing at random" (MAR), although this is not a requirement of MI (Schafer and Graham 2002). MAR implies that, conditional on the observed data, the probability of a datum being missing does not depend on the unobserved values; missingness may still be related to past and present values of observed variables. Unfortunately it is not possible to confirm the MAR assumption from the data. MAR plays the role of a null hypothesis: one can "reject" or "not reject" it, but one cannot prove it definitively. At a minimum, one should not assert that the data are MAR if a factor that probably caused missingness is not found in the observed variables. The richer the set of variables (covariates) one has, the easier it is to justify an assertion of MAR.
To understand MI, a good place to start is Patrician (2002). The author walks through MI in a clear fashion and provides relevant SAS code in an appendix. Three more statistically sophisticated references are books: Little and Rubin (1987), Rubin (1987), and Schafer (1997). Among journal articles, two good sources are Little (1992) and Schafer (1999). Schafer and Graham (2002) use simulated data to illustrate the biases and coverage probabilities of several common imputation models.
Many MI methods exist. Explanations and further references can be found in Allison (2000), Crawford et al. (1995), Faris et al. (2002), Hunsberger et al. (2001), LaPlante et al. (2002), Lavori et al. (1995), Mazumdar et al. (1999), and Revicki et al. (2001).
Multiple imputation is now available in a host of statistical software packages.
References
Allison PD. Multiple imputation for missing data: a cautionary tale. Sociological Methods and Research 2000;28:301-309.
Crawford SL, Tennstedt SL, McKinlay JB. A comparison of analytic methods for non-random missingness of outcomes data. Journal of Clinical Epidemiology 1995;48(2):209-219.
Faris PD, Ghali WA, Brant R, Norris CM, Galbraith PD, Knudtson ML, the APPROACH Investigators. Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses. Journal of Clinical Epidemiology 2002;55:184-191.
Fitzmaurice GM, Laird NM, Shneyer L. An alternative parameterization of the general linear mixture model for longitudinal data with non-ignorable drop-outs. Statistics in Medicine 2001;20:1009-1021.
Grasdal A. The performance of sample selection estimators to control for attrition bias. Health Economics 2001;10:385-398.
Horton NJ, Lipsitz SR. Multiple imputation in practice: comparison of software packages for regression models with missing variables. The American Statistician 2001;55(3):244-254.
Hunsberger S, Murray D, Davis CE, Fabsitz RR. Imputation strategies for missing data in a school-based multi-centre study: the Pathways study. Statistics in Medicine 2001;20:305-316.
LaPlante MP, Harrington C, Kang T. Estimating paid and unpaid hours of personal assistance services in activities of daily living provided to adults living at home. Health Services Research 2002;37(2):397-415.
Lavori P, Dawson R, Shera D. A multiple imputation strategy for clinical trials with truncation of patient data. Statistics in Medicine 1995;14:1913-1925.
Little RJA. Regression with missing X’s: a review. Journal of the American Statistical Association 1992;87(420):1227-1237.
Little RJA, Rubin DB. Statistical analysis with missing data. New York: Wiley, 1987.
Mazumdar S, Liu KS, Houck PR, Reynolds CF. Intent-to-treat analysis for longitudinal clinical trials: coping with the challenge of missing values. Journal of Psychiatric Research 1999;33:87-95.
Patrician PA. Multiple imputation for missing data. Research in Nursing & Health 2002;25:76-84.
Revicki DA, Gold K, Buckman D, Chan K, Kallich JD, Woolley JM. Imputing physical health status scores missing owing to mortality: results of a simulation comparing multiple techniques. Medical Care 2001;39(1):61-71.
Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley, 1987.
Schafer JL. Analysis of incomplete multivariate data. London: Chapman and Hall, 1997.
Schafer JL. Multiple imputation: a primer. Statistical Methods in Medical Research 1999;8:3-15.
Schafer JL, Graham JW. Missing data: our view of the state of the art. Psychological Methods 2002;7(2):147-177.
Wu MC, Albert PS, Wu BU. Adjusting for drop-out in clinical trials with repeated measures: design and analysis issues. Statistics in Medicine 2001;20:93-108.