Originally published by Sharon Choong Kam Chong on Nov. 7th, 2015.
Statisticians are often asked to apply their skills and knowledge to solving real-life problems. However, budding statisticians undertaking their first project can struggle with aspects of quantitative analysis that are not taught in introductory statistics classes. In this post, I focus on one question: what can be done to tackle the problem of missing data in regression analysis?
Some data are obtained by voluntary submission; others can only be collected from different non-standardized sources; others yet are sensitive and cannot be disclosed without consent, or are impractical to obtain due to measurement difficulties or some other reason. Thus, most data sets in real life are incomplete.
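Say we pestered our survey participant a million times to fill out the questionnaire without success. We contacted the database manager who didn’t know where to find the missing data. We failed to find more complete alternative sources. What can we do next? First, we can classify the missing data. To do this, we must know how our data was collected. Technically, there are 3 types of missing data: missing completely at random (MCAR), where the probability that a value is missing is unrelated to any variable; missing at random (MAR), where it depends only on variables we have observed; and missing not at random (MNAR), where it depends on the missing value itself.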
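To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is made up for illustration: the variables `x` and `z`, the 30%, 80% and 10% missingness rates, and the helper names. It removes values from `x` in three different ways and compares the mean of what is left with the mean of the full data.

```python
import random
from statistics import mean

random.seed(0)
n = 10_000

# z is always observed; x is correlated with z and will lose some values.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]

def drop(values, prob_missing):
    """Replace values[i] with None with probability prob_missing[i]."""
    return [None if random.random() < p else v
            for v, p in zip(values, prob_missing)]

# MCAR: every value has the same 30% chance of being missing.
x_mcar = drop(x, [0.3] * n)
# MAR: the chance depends only on the observed variable z.
x_mar = drop(x, [0.8 if zi > 0 else 0.1 for zi in z])
# MNAR: the chance depends on the (possibly unseen) value of x itself.
x_mnar = drop(x, [0.8 if xi > 0 else 0.1 for xi in x])

def observed_mean(values):
    """Mean of the values that are not missing."""
    return mean(v for v in values if v is not None)

full_mean = mean(x)
print(f"{'full data':<10}: {full_mean:.3f}")
for name, values in [("MCAR", x_mcar), ("MAR", x_mar), ("MNAR", x_mnar)]:
    print(f"{name:<10}: {observed_mean(values):.3f}")

# Under MAR, the distortion goes away once we condition on z.
# Here we compare means within the z > 0 group only.
full_z_pos = mean(xi for xi, zi in zip(x, z) if zi > 0)
mar_z_pos = mean(v for v, zi in zip(x_mar, z) if zi > 0 and v is not None)
print(f"z > 0 group, full data: {full_z_pos:.3f}")
print(f"z > 0 group, MAR data : {mar_z_pos:.3f}")
```

In a run of this sketch, the MCAR mean should stay close to the full-data mean, while the MAR and MNAR means drift away from it. The difference is that the MAR drift can be corrected using `z`, whereas the MNAR drift cannot be corrected from the observed data alone.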
Once we know what we are dealing with, we can decide on the appropriate method to treat it. Here are 3 common ways that missing data is treated.
I. Deletion.
This is plain old removal and works best for MCAR data. There are two ways this can be done.
Listwise deletion involves removing every observation that has a missing value in any variable before running any regression. It can still give unbiased parameter estimates (β’s) even if the data are not MCAR, but the results may be biased if the probability of a value being missing depends on y.
Pairwise deletion involves removing observations only when the variable of the missing value is needed for the regression. It allows more data to be used but regression results cannot be compared with each other since the samples are different for each regression.
For example, say x1 has a missing value for observation 2. When y is regressed on x1, observation 2 is excluded under both methods. When y is regressed on x2, observation 2 is kept under pairwise deletion but still excluded under listwise deletion.
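The example can be sketched in a few lines of Python. The numbers and the helper names (`listwise`, `pairwise`, `ols_slope`) are invented for illustration; missing values are stored as `None`.

```python
# Toy data: observation 2 (the second row) is missing x1. Values are made up.
rows = [
    {"y": 2.1, "x1": 1.0, "x2": 5.0},
    {"y": 3.9, "x1": None, "x2": 4.0},
    {"y": 6.2, "x1": 3.0, "x2": 3.5},
    {"y": 7.8, "x1": 4.0, "x2": 2.0},
    {"y": 10.1, "x1": 5.0, "x2": 1.0},
]

def is_complete(row, columns):
    """True if the row has a value for every column in `columns`."""
    return all(row[c] is not None for c in columns)

def listwise(rows):
    """Keep only rows with no missing value in any column."""
    all_columns = list(rows[0].keys())
    return [r for r in rows if is_complete(r, all_columns)]

def pairwise(rows, columns):
    """Keep rows that are complete for the columns this regression needs."""
    return [r for r in rows if is_complete(r, columns)]

def ols_slope(rows, y, x):
    """Slope of a simple least-squares regression of column y on column x."""
    xs = [r[x] for r in rows]
    ys = [r[y] for r in rows]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

listwise_rows = listwise(rows)
pairwise_x1 = pairwise(rows, ["y", "x1"])
pairwise_x2 = pairwise(rows, ["y", "x2"])

print("listwise sample size:", len(listwise_rows))
print("pairwise sample size, y on x1:", len(pairwise_x1))
print("pairwise sample size, y on x2:", len(pairwise_x2))
print("slope of y on x2, listwise:", ols_slope(listwise_rows, "y", "x2"))
print("slope of y on x2, pairwise:", ols_slope(pairwise_x2, "y", "x2"))
```

The two regressions of y on x2 use different samples, which is why their results are not directly comparable.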
II. Mean Substitution.
This is a form of “simple imputation”, and again assumes MCAR data. It involves calculating the sample mean of the variable from the data available, and filling the missing values with this sample mean. Thus, it preserves the original sample size, but the standard errors of the parameter estimates are biased downward (reduced). It also ignores the correlations between variables: every filled-in value sits at the mean regardless of the other variables, so the estimated correlations are in all likelihood weaker than the true associations.
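A minimal sketch of mean substitution, using made-up numbers, shows both side effects: the variance of the filled-in variable shrinks, and its correlation with y is no stronger than in the complete cases. The `correlation` helper is written out by hand so the block needs only the standard library.

```python
from math import sqrt
from statistics import mean, variance

# Made-up data: x has two missing values (None); y is fully observed.
x = [1.0, 2.0, None, 4.0, 5.0, None, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]

def correlation(a, b):
    """Pearson correlation between two equally long lists."""
    ma, mb = mean(a), mean(b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab / sqrt(saa * sbb)

# Mean substitution: fill every gap with the mean of the observed values.
observed = [v for v in x if v is not None]
x_mean = mean(observed)
x_filled = [x_mean if v is None else v for v in x]

# Complete cases, for comparison.
pairs = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
cc_x = [p[0] for p in pairs]
cc_y = [p[1] for p in pairs]

var_observed = variance(observed)
var_filled = variance(x_filled)
r_complete = correlation(cc_x, cc_y)
r_filled = correlation(x_filled, y)

print("variance of x, observed values only:", var_observed)
print("variance of x, after mean substitution:", var_filled)
print("correlation with y, complete cases:", r_complete)
print("correlation with y, after mean substitution:", r_filled)
```

The smaller variance combined with the larger apparent sample size is what pushes the standard errors down.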
III. Multiple Imputation.
If the data is MAR, there are measured variables in the regression which can indirectly predict the probability of the data being missing. If there is such a dependence, we can regress the variable with missing values on the variables without missing values, and use the fitted model to predict the missing values! This is called Regression Imputation and is another form of “simple imputation”. However, this method has the problem of overfitting: the imputed values fall exactly on the fitted line, so the correlation estimates come out high but describe the imputation model more than the relationship we are interested in.
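As a rough sketch of Regression Imputation, the block below fills the gaps in a variable x1 using a simple regression of x1 on a fully observed variable x2. The data and the helper names (`fit_line`, `r_squared`) are invented for illustration, and a real analysis would normally use more than one predictor.

```python
from statistics import mean

# Made-up data: x1 has two missing values (None); x2 is fully observed.
x1 = [1.1, 2.3, None, 3.9, 5.2, None, 7.1, 7.8]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(xs, ys):
    """Share of the variance in ys explained by a straight line in xs."""
    intercept, slope = fit_line(xs, ys)
    my = mean(ys)
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))
    tss = sum((b - my) ** 2 for b in ys)
    return 1 - rss / tss

# Fit x1 on x2 using only the rows where x1 is observed.
complete = [(a, b) for a, b in zip(x1, x2) if a is not None]
obs_x1 = [a for a, _ in complete]
obs_x2 = [b for _, b in complete]
intercept, slope = fit_line(obs_x2, obs_x1)

# Fill each gap with the fitted value; observed values are left alone.
x1_imputed = [intercept + slope * b if a is None else a
              for a, b in zip(x1, x2)]

r2_before = r_squared(obs_x2, obs_x1)
r2_after = r_squared(x2, x1_imputed)
print("R squared, complete cases:", r2_before)
print("R squared, after regression imputation:", r2_after)
```

Because the imputed points carry no random error, the fit after imputation looks at least as good as before, which is the overstated association described above.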
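In comes Multiple Imputation. It involves estimating imputed data to fill the missing values, but unlike Regression Imputation, it repeats the imputation several times with random variation, effectively introducing error to account for the overfitting problem. It is designed for MAR data (and so also works with MCAR data); some empirical studies have also reported approximately unbiased results on MNAR data, but this is not guaranteed. Specifically, there are 3 steps involved in Multiple Imputation: impute the missing values several times to create several completed data sets, run the regression on each completed data set, and pool the results into a single set of estimates and standard errors.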
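The impute, analyse and pool cycle can be sketched as follows. This is a simplified illustration rather than a full implementation: the data are simulated, the number of imputations `m = 20` is an arbitrary choice, and the imputation step adds random noise around a single fitted line instead of also drawing the regression coefficients at random, as proper Multiple Imputation software does. The pooling step uses Rubin's rules.

```python
import random
from math import sqrt
from statistics import mean, variance

random.seed(1)

# Simulated data with a true slope of 2. x is more likely to be missing
# when the observed y is large, so the missingness is MAR.
n = 200
x_full = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x_full]
x = [None if random.random() < (0.5 if yi > 1.0 else 0.1) else xi
     for xi, yi in zip(x_full, y)]

def fit_line(xs, ys):
    """OLS of ys on xs: (intercept, slope, residual variance, slope variance)."""
    k = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(xs, ys))
    sigma2 = rss / (k - 2)
    return intercept, slope, sigma2, sigma2 / sxx

# Imputation model: regress x on y using the complete cases.
obs = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
a0, a1, s2, _ = fit_line([yi for _, yi in obs], [xi for xi, _ in obs])

m = 20  # number of imputed data sets (arbitrary choice)
slopes, slope_vars = [], []
for _ in range(m):
    # Step 1: impute each gap with the prediction plus random noise.
    x_imp = [a0 + a1 * yi + random.gauss(0, sqrt(s2)) if xi is None else xi
             for xi, yi in zip(x, y)]
    # Step 2: run the regression of interest on the completed data set.
    _, b1, _, var_b1 = fit_line(x_imp, y)
    slopes.append(b1)
    slope_vars.append(var_b1)

# Step 3: pool with Rubin's rules.
pooled_slope = mean(slopes)
within = mean(slope_vars)        # average sampling variance
between = variance(slopes)       # variance across imputations
total_var = within + (1 + 1 / m) * between
pooled_se = sqrt(total_var)

print("pooled slope:", pooled_slope)
print("pooled standard error:", pooled_se)
```

The between-imputation term is what separates this from a single imputation: it widens the standard error to reflect the fact that the filled-in values are guesses.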
One last point: As a statistics student, you might have heard of maximum likelihood when you learned about logistic regressions. Since maximum likelihood estimation finds the parameters (β’s) that produce the highest log-likelihood given all the available data, it can provide unbiased parameter estimates when the data are MCAR or MAR, without deleting or imputing anything! For MNAR data, the missingness mechanism has to be modelled as well. The catch is that the parameters’ standard errors can be biased downward. But there are ways this can be taken into account.
No matter which method we end up using, it is important to be able to explain why we chose this method and whether the assumptions underlying our choice are reasonable.
Sharon Choong Kam Chong
USS Blogger