Handling Outliers and Missing Data in Regression Models Using R: Simulation Examples

This paper reviews two important problems in regression analysis, outliers and missing data, together with methods for handling them. Two applications are introduced to study these methods through R code, providing practical guidance for researchers who face these problems when fitting regression models in R. Finally, we conduct a Monte Carlo simulation study to compare different methods of handling missing data in the regression model. The simulation results indicate that, under our simulation factors, the k-nearest neighbors (KNN) method is the best method for estimating missing values in regression models.


Introduction
Consider the following linear regression model:

$y = X\beta + \varepsilon$, (1)

where $y$ is an $(n \times 1)$ vector of the dependent variable, $\beta$ is a $(k \times 1)$ vector of unknown parameters, $X$ is the $(n \times k)$ regression matrix, and $\varepsilon$ is an $(n \times 1)$ error vector. The classical assumptions for this model are:

A1: $E(\varepsilon) = 0$ and $Var(\varepsilon) = \sigma^2 I_n$.
A2: $X$ is a non-stochastic matrix.
A3: $X$ is a full column rank matrix, i.e., $rank(X) = k$.

The OLS estimator of the model in Eq. (1) is:

$\hat{\beta}_{OLS} = (X'X)^{-1} X' y$. (2)

The OLS estimator is highly sensitive to outliers and missing values in the dataset, so many studies have proposed methods to handle these problems and obtain a more efficient estimate of $\beta$.
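This sensitivity is easy to demonstrate. The following minimal sketch (our own simulated example, not one of the paper's applications) fits OLS with lm() on clean data and then again after contaminating a single observation:

```r
# OLS is pulled strongly by a single contaminated observation
set.seed(123)
x <- 1:30
y <- 2 + 3 * x + rnorm(30, sd = 1)   # true model: intercept 2, slope 3

fit_clean <- lm(y ~ x)               # OLS on clean data
y_out <- y
y_out[30] <- y_out[30] + 100         # one gross Y-outlier
fit_out <- lm(y_out ~ x)             # OLS on contaminated data

coef(fit_clean)                      # close to (2, 3)
coef(fit_out)                        # both coefficients distorted by one point
```

Even with 29 of 30 observations generated exactly from the model, the single outlier shifts the fitted slope and intercept noticeably.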
In this paper, we review the basics of robust estimators for regression models when the dataset contains outliers, and the common methods for handling missing data in regression models. Moreover, we provide R code for handling both problems (outliers and missing data) in a dataset. We also investigate the efficiency of several methods for handling missing data in regression by conducting a simulation study.
The rest of the paper is organized as follows: Section 2 provides the background and the basics of the robust regression. Section 3 presents some different methods to handle the missing data in regression models. Section 4 presents two applications using R-codes. Section 5 displays the Monte Carlo simulation study. Section 6 involves the concluding remarks.

Robust Regression Estimators
There are two categories of outliers: outliers in the Y-dimension (the response variable) and outliers in the X-dimension (the explanatory variables).
Detecting or diagnosing outliers is a very important process in regression analysis, so some methods concerning the detection of outliers will be illustrated. These are statistics that focus attention on observations having an undue influence on the OLS estimator; see Barnett and Lewis [2]. Robust estimation provides an alternative to OLS estimation when the classical assumptions are unfulfilled; see Alma [3].
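The standard influence diagnostics are available in base R for any lm() fit. A minimal sketch on simulated data (the planted outlier and variable names are ours):

```r
# Common outlier diagnostics for an OLS fit
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
y[50] <- y[50] + 10                 # plant a Y-outlier in the last row
fit <- lm(y ~ x)

rstudent(fit)                       # externally studentized residuals (Y-outliers)
hatvalues(fit)                      # leverage values (X-outliers)
cooks.distance(fit)                 # overall influence on the OLS estimates

which(cooks.distance(fit) > 4 / length(y))  # a common rule-of-thumb cutoff
```

Here the planted observation shows up clearly in the studentized residuals and in Cook's distance, while hatvalues() would flag unusual x-values instead.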
Generally, the goal of robust regression is to develop methods that are resistant to the possibility that one or several unknown outliers may occur anywhere in the data.
Robust regression can be used in any situation where OLS regression can be applied. It generally gives better accuracy than OLS because it uses a weighting mechanism to down-weight influential observations. It is particularly useful when there are no compelling reasons to exclude outliers from the dataset.
The term robust estimator refers to an estimator that is protected against violations of the classical linear regression model assumptions; see Andersen [4], Gervini and Yohai [5], and Abonazel and Rabie [6].
Robust estimation methods are designed to circumvent some limitations of traditional parametric and nonparametric estimation methods when the data contain outliers, so robust methods are resistant to the influence of outliers. Therefore, robust estimation has been discussed in many papers for several regression models other than the linear model, such as the count regression model [7], the semiparametric partially linear model [8,9], and others. In the following, we review some of these methods.

M-Estimation
Huber [10] introduced the M-estimation method, which is now the most common robust regression method; see Fox [11]. M-estimation is a generalization of maximum likelihood estimation in the context of location models, and it is nearly as efficient as OLS. Rather than minimizing the sum of squared errors, the M-estimation principle is to minimize a robust function of the residuals; see Huber [12]. The M-estimator of $\beta$ is

$\hat{\beta}_M = \arg\min_{\beta} \sum_{i=1}^{n} \rho\!\left(\frac{e_i}{\hat{\sigma}}\right),$

where $e_i = y_i - x_i'\beta$ denotes the i-th residual, $\rho(\cdot)$ is a robust loss function, and $\hat{\sigma}$ is a robust scale estimate. Setting the derivative with respect to $\beta$ to zero, we obtain the normal equations

$\sum_{i=1}^{n} \psi\!\left(\frac{e_i}{\hat{\sigma}}\right) x_i = 0,$

where $\psi = \rho'$ is the first derivative function of $\rho$ and is called the influence function. The iteratively reweighted least squares (IRLS) method is used to solve these nonlinear normal equations (see Ruckstuhl [13]).
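In R, an M-estimate can be computed with rlm() from the MASS package (shipped with R), which implements exactly this IRLS scheme; the simulated data below are our own illustration:

```r
library(MASS)                      # rlm() performs M-estimation via IRLS

set.seed(42)
x <- 1:40
y <- 5 + 0.5 * x + rnorm(40)       # true model: intercept 5, slope 0.5
y[c(5, 20)] <- y[c(5, 20)] + 25    # two Y-outliers

fit_ols <- lm(y ~ x)
fit_m   <- rlm(y ~ x, psi = psi.huber, maxit = 50)  # Huber M-estimate

coef(fit_ols)
coef(fit_m)                        # typically much closer to (5, 0.5)
fit_m$w[c(5, 20)]                  # IRLS weights: outliers are down-weighted
</r>
```

The final IRLS weights (fit_m$w) make the down-weighting mechanism visible: clean observations keep weight near 1, while the contaminated ones receive small weights.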

S-Estimation
The S-estimator ("S" for "scale statistic") is a member of the class of high breakdown-point (BDP) estimators introduced by Rousseeuw and Yohai [15].
S-estimation is based on the residual scale of M-estimation. A weakness of the M-estimation method is its lack of consideration of the data distribution: its scale is not a function of the overall data, because it uses only the median as the weighting value.
S-estimation uses the residual standard deviation to overcome this weakness of the median. The S-estimator is defined by

$\hat{\beta}_S = \arg\min_{\beta} \hat{\sigma}_S(e_1, \dots, e_n),$

where $\hat{\sigma}_S$ is the minimum robust scale estimator, i.e., the solution of

$\frac{1}{n} \sum_{i=1}^{n} \rho\!\left(\frac{e_i}{\hat{\sigma}_S}\right) = K,$

with $K$ a suitable constant ensuring consistency under normal errors. Differentiating this criterion yields the estimating equations of the S-estimator. Pitselis [16] showed that the S-estimator is more robust than the M-estimator.
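An S-estimate can be computed in R with lqs() from the MASS package, whose method = "S" option fits a high-breakdown S-estimator (an alternative is lmrob() in the robustbase package, if installed). A minimal sketch on our own simulated data:

```r
library(MASS)                    # lqs() can compute an S-estimator

set.seed(7)
x <- rnorm(60)
y <- 1 + 2 * x + rnorm(60)       # true model: intercept 1, slope 2
y[1:6] <- y[1:6] + 15            # 10% contamination in Y

fit_s <- lqs(y ~ x, method = "S")  # high breakdown-point S-estimate
coef(fit_s)                        # close to the true (1, 2)
fit_s$scale                        # the robust residual scale estimate(s)
```

Despite 10% gross contamination, the S-estimate recovers coefficients near the true values, and its robust scale is not inflated by the outliers.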

MM-Estimation
Yohai [17] introduced another robust estimator, MM-estimation, which combines S-estimation with M-estimation and achieves both a high BDP and high efficiency. Yohai [17] also showed that MM-estimators are highly efficient and, compared with M-estimators, are not sensitive to leverage points. Recently, Almetwally and Almongy [18] studied the efficiency of some robust estimators in a simulation study, and they concluded that the best robust estimator is the MM-estimator.
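MM-estimation is available in MASS via rlm(method = "MM"), which computes an initial S-estimate and then an efficient M-step. The sketch below (our own simulated example) plants bad leverage points, the case where plain M-estimation struggles:

```r
library(MASS)                     # rlm(method = "MM"): S-estimate, then M-step

set.seed(99)
x <- rnorm(80)
y <- -1 + 3 * x + rnorm(80)       # true model: intercept -1, slope 3
x[1:8] <- x[1:8] + 10             # leverage points in X ...
y[1:8] <- y[1:8] - 30             # ... paired with bad responses

fit_ols <- lm(y ~ x)
fit_mm  <- rlm(y ~ x, method = "MM", maxit = 50)

coef(fit_ols)                     # badly distorted by the bad leverage points
coef(fit_mm)                      # typically close to the true (-1, 3)
```

The contrast illustrates the property cited from Yohai [17]: the MM-estimate stays near the true coefficients while OLS is pulled far away by 10% bad leverage points.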

Missing Data in Regression Models
Missing data is a common and important topic in statistics, and many methods have been proposed to handle it. But before turning to these methods, we have to understand why data go missing.

Missing Data Types (Mechanisms)
It is helpful to know why data are missing. There are three general missingness mechanisms, moving from the simplest to the most general (see Rubin [19]):

Missing Completely at Random (MCAR)
The missingness is independent both of the observable data and of the unobservable data.

Missing at Random (MAR)
The missingness is not related to the missing values themselves, but it is related to some of the observed data.

Missing not at Random (MNAR)
The missingness is related to the unobserved values themselves. MNAR is called "non-ignorable" because the missing data mechanism itself has to be modeled when you deal with the missing data: you have to include a model for why the data are missing.
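The three mechanisms can be made concrete with a small simulation (our own sketch; the deletion probabilities are arbitrary choices for illustration):

```r
# Simulating MCAR, MAR and MNAR missingness in a variable y
set.seed(2024)
n <- 1000
x <- rnorm(n)                      # fully observed covariate
y <- 2 + x + rnorm(n)              # variable that will receive missing values

# MCAR: missingness independent of everything
y_mcar <- ifelse(runif(n) < 0.2, NA, y)

# MAR: missingness depends only on the observed x
y_mar  <- ifelse(runif(n) < plogis(x), NA, y)

# MNAR: missingness depends on the unobserved value of y itself
y_mnar <- ifelse(runif(n) < plogis(y - 2), NA, y)

mean(y)                            # true mean, about 2
mean(y_mcar, na.rm = TRUE)         # still about 2 under MCAR
mean(y_mnar, na.rm = TRUE)         # biased: large y values go missing more often
```

Comparing the observed means shows why MNAR is the hard case: simply analyzing the complete cases gives a biased answer, whereas under MCAR it does not.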

Missing Data Patterns
The missingness pattern is very important because it affects the choice of how to deal with missing values, see Van Buuren [20]. Figure 1 shows various data patterns in multivariate data.
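A quick way to inspect the missingness pattern of a dataset is with base R alone (dedicated tools such as md.pattern() in the mice package give a nicer tabulation, if that package is installed). The toy data frame below is our own example:

```r
# Inspecting the missingness pattern with base R
df <- data.frame(
  x1 = c(1, NA, 3, 4, NA),
  x2 = c(NA, 2, 3, NA, 5),
  x3 = 1:5
)

colSums(is.na(df))                                 # missing count per variable
table(apply(is.na(df), 1, paste, collapse = ""))   # distinct row-wise patterns
```

The row-wise pattern table reveals whether the missingness is, e.g., univariate, monotone, or general, which in turn guides the choice of handling method.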

Handling Missing Data
Note that the methods for handling missing data differ depending on the type of data (variable); therefore, no single method can be used for every dataset. Many references discuss these methods, such as Carpenter and Kenward [21], Berglund and Heeringa [22], Raghunathan, et al. [23], El-Sheikh, et al. [24], and Abonazel and Ibrahim [25]. Figure 2 summarizes some of the methods for handling missing data.
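Two of the simplest methods from this family, listwise deletion and mean imputation, can be sketched in base R; the simulated data and 20% MCAR deletion rate below are our own illustration (more refined methods such as KNN imputation, e.g. kNN() in the VIM package, or multiple imputation with mice follow the same workflow):

```r
# Listwise deletion vs. mean imputation of the response, under MCAR
set.seed(5)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)        # true model: intercept 1, slope 2
y_miss <- y
y_miss[sample(100, 20)] <- NA      # 20% MCAR missingness in y

# (a) Listwise deletion: lm() drops incomplete rows by default
fit_cc <- lm(y_miss ~ x)

# (b) Mean imputation: replace each NA with the observed mean
y_mean <- ifelse(is.na(y_miss), mean(y_miss, na.rm = TRUE), y_miss)
fit_mean <- lm(y_mean ~ x)

coef(fit_cc)                       # approximately unbiased under MCAR
coef(fit_mean)                     # slope attenuated toward zero
```

The comparison shows why the choice of method matters: mean imputation of the response flattens the fitted slope, because the imputed points carry no information about x.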

R-Applications
In this section, we provide two applications using R code. The first application displays the full steps of a regression analysis when the dataset includes outliers. The second application displays different methods for estimating missing values, compares these methods, and then selects the best estimation method among them. These applications can be considered practical guides for researchers handling these problems (outliers and missing data) in regression using R.
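The comparison logic of the second application can be sketched as follows: mask values that are actually known, impute them by each candidate method, and score each method by the root mean squared error (RMSE) against the true values. This is a minimal base-R skeleton with two trivial imputers (our own example, not the paper's full code):

```r
# Skeleton of an imputation comparison: mask known values, impute, score by RMSE
set.seed(11)
x_true <- rnorm(200)
x_obs <- x_true
idx <- sample(200, 40)
x_obs[idx] <- NA                       # artificially delete 20% of the values

rmse <- function(est) sqrt(mean((est - x_true[idx])^2))

imp_mean   <- rep(mean(x_obs, na.rm = TRUE), length(idx))
imp_median <- rep(median(x_obs, na.rm = TRUE), length(idx))

c(mean = rmse(imp_mean), median = rmse(imp_median))
# the method with the smallest RMSE is selected
```

Any other imputer (KNN, regression imputation, multiple imputation) slots into the same skeleton: produce estimates for the masked positions and pass them to rmse().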

Conclusion
In this paper, some methods for handling outliers and missing data have been studied using R. Practical evidence was provided for researchers dealing with these problems (outliers and missing data) in regression with R. We conclude that the OLS residuals must be examined first; if they contain outliers, a robust estimation method should be used instead of OLS to obtain an efficient estimate of the regression model. In the case of missing data, different handling methods must be examined to determine a good estimate of the missing values, because no single method is suitable for all datasets. According to our simulation study, the KNN method is better than the other methods for estimating missing values in regression models.