Missing Values in Regression Analysis

Originally published by Sharon Choong Kam Chong on Nov. 7th, 2015.

Statisticians are often asked to apply their skills and knowledge to real-life problems. However, budding statisticians undertaking their first project can struggle with aspects of quantitative analysis that are not taught in introductory statistics classes. In this blog post, I focus on one question: what can be done to tackle the problem of missing data in regression analysis?

Some data are obtained by voluntary submission; others can only be collected from different non-standardized sources; others still are sensitive and cannot be disclosed without consent, or are impractical to obtain because of measurement difficulties or some other reason. Thus, most data sets in real life are incomplete.

Say we pestered our survey participant a million times to fill in the questionnaire, without success. We contacted the database manager, who didn’t know where to find the missing data. We failed to find more complete alternative sources. What can we do next? First, we can classify the missing data. To do this, we must know how our data were collected. Technically, there are 3 types of missing data:

Missing Completely at Random (MCAR): The probability that a value is missing is independent of both the y and x variables, e.g. a technical bug deleted some random values in an electronic income database.
Missing at Random (MAR): The probability that a value is missing depends on other, observed x variable(s) but not on the missing value itself, e.g. people working in the service sector are more likely to not report their income.
Missing Not at Random (MNAR): The probability that a value is missing depends on the missing value itself, e.g. people with high income do not want to report their income.
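
To make the three types concrete, here is a small simulation sketch. Everything in it is made up for illustration: the income figures, the sector variable, the missingness probabilities and the helper name observed_mean are all assumptions, not part of any real data set.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 20000
rows = []
for _ in range(n):
    service = random.random() < 0.5                   # hypothetical x variable: works in the service sector
    income = random.gauss(40 if service else 60, 10)  # hypothetical income, in thousands
    rows.append((service, income))

def observed_mean(rows, p_missing):
    """Mean income among the rows that survive a missingness rule.

    p_missing(service, income) returns the probability that income is missing.
    """
    kept = [inc for svc, inc in rows if random.random() >= p_missing(svc, inc)]
    return sum(kept) / len(kept)

true_mean = sum(inc for _, inc in rows) / n

# MCAR: every value has the same 30% chance of being lost.
mcar_mean = observed_mean(rows, lambda svc, inc: 0.3)
# MAR: missingness depends only on the observed sector variable.
mar_mean = observed_mean(rows, lambda svc, inc: 0.6 if svc else 0.1)
# MNAR: missingness depends on the income value itself.
mnar_mean = observed_mean(rows, lambda svc, inc: 0.6 if inc > 60 else 0.1)

print(round(true_mean, 1), round(mcar_mean, 1), round(mar_mean, 1), round(mnar_mean, 1))
```

In this setup the MCAR mean stays close to the true mean, while the MAR and MNAR means drift away from it. The difference is that under MAR the drift can be explained by the sector variable we observe, whereas under MNAR it is driven by the very incomes we never see.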

Once we know what we are dealing with, we can decide on the appropriate method to treat it. Here are 3 common ways that missing data is treated.

I. Deletion.

This is plain old removal and works best for MCAR data. There are two ways this can be done.

Listwise deletion removes every observation that has a missing value in any variable before any regression is run. It can still give unbiased parameter estimates (β’s) even if the data are not MCAR, provided the probability of a value being missing does not depend on y; if it does, the results may be biased.

Pairwise deletion removes an observation only when the variable with the missing value is needed for the regression at hand. It allows more data to be used, but regression results cannot be compared directly with each other, since each regression is run on a different sample.

For example, say x1 has a missing value for observation 2. When y is regressed on x1, observation 2 is excluded under both methods. When y is regressed on x2, observation 2 is kept under pairwise deletion but still excluded under listwise deletion.
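
The same example can be sketched in a few lines of Python. The toy data and the function names listwise and pairwise are hypothetical, chosen only to mirror the description above; None marks a missing value.

```python
# Hypothetical toy data: None marks a missing value.
data = [
    {"y": 3.0, "x1": 1.0,  "x2": 5.0},
    {"y": 4.0, "x1": None, "x2": 6.0},   # observation 2: x1 is missing
    {"y": 6.0, "x1": 3.0,  "x2": 7.0},
    {"y": 8.0, "x1": 4.0,  "x2": None},  # observation 4: x2 is missing
]

def listwise(rows):
    """Keep only rows with no missing value in any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, variables):
    """Keep rows that are complete for the variables this regression uses."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

print(len(listwise(data)))               # rows usable in every regression
print(len(pairwise(data, ["y", "x1"])))  # rows usable when regressing y on x1
print(len(pairwise(data, ["y", "x2"])))  # rows usable when regressing y on x2
```

Listwise deletion leaves 2 rows for every regression, while pairwise deletion leaves 3 rows for each regression, but not the same 3 rows. That is exactly why the pairwise results are harder to compare.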

II. Mean Substitution.

This is a form of “simple imputation”, and again assumes MCAR data. It involves calculating the sample mean of the variable from the available data and filling the missing values with this sample mean. It preserves the original sample size, but it shrinks the variable’s variance, so the standard errors of the parameter estimates are biased downward (reduced). It also ignores the correlations between variables, so the estimated associations are in all likelihood weaker than the true ones.
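
A quick sketch shows why the standard errors shrink. The numbers below are invented for illustration; the point is only that the mean is preserved while the variance drops.

```python
from statistics import mean, variance

x = [2.0, 4.0, None, 8.0, None, 10.0]  # hypothetical variable with two missing values

observed = [v for v in x if v is not None]
fill = mean(observed)                   # sample mean of the available data
imputed = [fill if v is None else v for v in x]

print(mean(observed), mean(imputed))          # the mean is unchanged
print(variance(observed), variance(imputed))  # the sample variance gets smaller
```

Every filled-in value sits exactly at the mean, so it adds to the sample size without adding any spread. The variable looks more precisely measured than it really is.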

III. Multiple Imputation.

If the data are MAR, there are measured variables in the regression which can predict the probability of a value being missing. If there is such a dependence, we can regress the incomplete variable on the variables with non-missing values and use the fitted model to predict the missing value! This is called Regression Imputation and is another form of “simple imputation”. However, this method has the problem of overfitting: the imputed values fall exactly on the fitted line, so the correlation estimates are inflated and describe the imputation model rather than the relationship we are interested in.
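
Here is a minimal sketch of Regression Imputation with a single predictor. The data and the helper fit_line are hypothetical, and a real analysis would normally use a statistics package and more than one predictor.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with one predictor."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Hypothetical data: x2 is fully observed, x1 has missing values (None).
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x1 = [2.1, 3.9, None, 8.2, None, 11.8]

# Fit x1 on x2 using only the complete pairs.
pairs = [(u, v) for u, v in zip(x2, x1) if v is not None]
a, b = fit_line([u for u, _ in pairs], [v for _, v in pairs])

# Replace each missing x1 with its prediction from the fitted line.
x1_imputed = [a + b * u if v is None else v for u, v in zip(x2, x1)]
print(x1_imputed)
```

The two imputed values carry no residual noise at all, which is the overfitting problem in miniature: they make x1 and x2 look more tightly related than the observed pairs suggest.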

In comes Multiple Imputation. It also estimates imputed data to fill the missing values, but unlike Regression Imputation, it repeats the imputation several times with random draws, deliberately introducing error to account for the overfitting problem. It assumes MAR data; with MNAR data its estimates can still be biased, so the results should be interpreted with more caution. Specifically, there are 3 steps involved in Multiple Imputation:

Imputation: Fill in each of the missing values of the incomplete data set m times. Markov chain Monte Carlo (MCMC) is often used to simulate random draws so that m imputed values are obtained for each missing value. Traditionally, m = 3 to 5. This results in m complete data sets (i.e. having no missing values).
Analysis: Perform regression analysis as usual for each of the m complete data sets to obtain m regression results.
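Pooling: Combine the m regression results into one by computing the mean of the m estimates, its variance, and its confidence interval or p-value, along with other statistical inference measures of interest. The standard pooling rules, known as Rubin’s rules, combine the variance within each imputed data set with the variance between them.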
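
The pooling step is the least intuitive of the three, so here is a sketch of it using Rubin’s rules, the usual pooling rules for Multiple Imputation. The slope estimates and squared standard errors below are invented stand-ins for the output of the Analysis step, and pool_rubin is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, sqrt(total)

# Hypothetical slope estimates and squared standard errors from m = 5 imputed data sets.
betas = [1.9, 2.1, 2.0, 2.2, 1.8]
ses_squared = [0.04, 0.05, 0.04, 0.05, 0.04]

beta_pooled, se_pooled = pool_rubin(betas, ses_squared)
print(beta_pooled, se_pooled)
```

The pooled standard error is larger than the average standard error from any single imputed data set. The extra term is the disagreement between the imputations, which is how Multiple Imputation carries the uncertainty about the missing values into the final result.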

One last point: As a statistics student, you might have heard of maximum likelihood when you learned about logistic regression. Since maximum likelihood estimation finds the parameters (β’s) that produce the highest log-likelihood given all the observed data, it can provide unbiased parameter estimates when the data are MCAR or MAR, without deleting or imputing anything! The catch is that the parameters’ standard errors can be biased downward. But there are ways this can be taken into account.

No matter which method we end up using, it is important to be able to explain why we chose this method and whether the assumptions underlying our choice are reasonable.

Sharon Choong Kam Chong

USS Blogger