Handling Critical Multicollinearity Using a Parametric Approach

In regression analysis, correlation between the response and the explanatory variables is necessary, but correlation among the explanatory variables themselves is undesirable. This paper focuses on five methodologies for handling critical multicollinearity: Partial Least Squares Regression (PLSR), Ridge Regression (RR), Ordinary Least Squares Regression (OLS), Least Absolute Shrinkage and Selection Operator (LASSO) regression, and Principal Component Regression (PCR). Monte Carlo simulations comparing the methods were carried out, with the sample size greater than or equal to the number of explanatory variables (n > p) in most cases, and the Average Mean Square Error (AMSE) and Akaike Information Criterion (AIC) values were computed. The results show that PCR is the most efficient at handling critical multicollinearity, having the lowest AMSE and AIC values for all the sample sizes and levels considered.


Introduction
In regression analysis there are several assumptions about the model, concerning multicollinearity, non-constant variance (heteroscedasticity), linearity, and autocorrelation. If one or more of these assumptions is violated, then the model at hand is no longer reliable and is not acceptable for estimating the population parameters [1]. Multicollinearity (or collinearity) is a statistical phenomenon in multiple linear regression analysis in which two or more independent (predictor) variables are highly correlated with each other, or intercorrelated. The presence of multicollinearity violates one of the core assumptions of multiple linear regression analysis and as such is problematic: the estimated regression coefficients are no longer reliable [2]. This problem can create inaccurate estimates of the regression coefficients, inflate the standard errors of the regression coefficients, deflate the partial t-tests for the regression coefficients, give false non-significant p-values, and degrade the predictability of the model [3]. A variety of informal and formal methods have been developed for detecting the presence of serious multicollinearity. Begin by studying pairwise scatter plots of the independent variables, looking for near-perfect relationships [4]. Then inspect the correlation matrix for high correlations. Unfortunately, multicollinearity does not always show up when the variables are considered two at a time. Thus, we consider the variance inflation factors (VIF), which measure how much the variances of the estimated regression coefficients are inflated compared to when the independent variables are not linearly related. VIFs over 10 indicate collinear variables. Also, eigenvalues of the correlation matrix of the independent variables near zero indicate multicollinearity. Instead of looking at the numerical size of an eigenvalue alone, use the condition number: large condition numbers indicate multicollinearity [5].
Investigating the signs of the regression coefficients can also help in detecting multicollinearity, as variables whose regression coefficients are opposite in sign from what you would expect may indicate its presence. Correcting multicollinearity depends on its source, as the solutions vary. If the multicollinearity has been created by the data collection, collect additional data over a wider X-subspace. If the choice of the linear model has increased the multicollinearity, simplify the model by using variable selection techniques. If an observation or two has induced the multicollinearity, remove those observations. Above all, use care in selecting the variables at the outset. When these steps are not possible, you might try Ridge Regression or other suitable approaches such as Principal Component Regression (PCR), Partial Least Squares Regression (PLSR), and Ordinary Least Squares Regression (OLS) [6]. This paper looks at five different regression methods and explores which performs best for handling the multicollinearity problem using simulated data sets.

Materials and Methods
Various methods have been developed to cope with multicollinearity problems. Among such methods are Ridge Regression, Principal Component Regression, Partial Least Squares Regression, the Least Absolute Shrinkage and Selection Operator (LASSO), and Ordinary Least Squares.
In this study, we consider the true model y = Xβ + ε. We simulate data sets containing severe multicollinearity among all explanatory variables using R, with 100 iterations for each setting. The explanatory variables are generated by

x_ij = (1 − ρ²)^(1/2) z_ij + ρ z_i(p+1),  i = 1, …, n,  j = 1, …, p,

where the z_ij are independent standard normal pseudo-random numbers and ρ is specified so that the theoretical correlation between any two explanatory variables is given by ρ². The dependent variable for each set of explanatory variables is then generated from the true model, with the parameter vector β chosen arbitrarily and ε a normal error term. To measure the amount of multicollinearity in the data set, the variance inflation factor (VIF) is examined. The performances of the OLS, LASSO, Ridge Regression (RR), PLSR and PCR methods are compared based on their AMSE and AIC values. Cross-validation is used to find a value of the shrinkage parameter for RR and LASSO.
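As an illustration, a generating scheme of this kind, in which all regressor pairs share the same theoretical correlation, can be sketched in NumPy (the function name, seed, and parameter values below are ours, not the paper's):

```python
import numpy as np

def simulate_collinear(n, p, rho, beta, sigma=1.0, seed=None):
    """Generate n observations on p regressors that share a common
    latent component, so any two columns have correlation rho**2."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, p + 1))            # independent N(0,1) draws
    # x_ij = sqrt(1 - rho^2) * z_ij + rho * z_i,(p+1)
    X = np.sqrt(1.0 - rho**2) * z[:, :p] + rho * z[:, [p]]
    y = X @ beta + sigma * rng.standard_normal(n)  # response with noise
    return X, y

# rho chosen so the theoretical pairwise correlation rho**2 = 0.99
X, y = simulate_collinear(n=200, p=6, rho=np.sqrt(0.99),
                          beta=np.ones(6), seed=1)
```

Each column has unit variance, and any two columns have covariance ρ², so the off-diagonal entries of the sample correlation matrix cluster tightly around 0.99.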

Ridge Regression
Ridge regression was developed by Hoerl and Kennard as a modification of the least squares method that allows biased estimators of the regression coefficients [7]. Although the estimators are biased, the biases are small enough for these estimators to be substantially more precise than unbiased estimators. Therefore, these biased estimators are preferred over unbiased ones, since they have a larger probability of being close to the true parameter values. The ridge regression estimator of the coefficient vector is found by solving for β̂_R in the equation [8]

(X′X + δI) β̂_R = X′y,

where δ ≥ 0 is often referred to as a shrinkage parameter. Thus, the solution for the ridge estimator is given by

β̂_R = (X′X + δI)⁻¹ X′y.

The matrix X′X considered in ordinary least squares is replaced by (X′X + δI), where δ is a small positive quantity. Since the matrix V of eigenvectors diagonalizes X′X, it also diagonalizes (X′X + δI). Thus,

V′(X′X + δI)V = Λ + δI = diag(λ₁ + δ, …, λ_p + δ).
The eigenvalues of the new matrix (X′X + δI) are λ_j + δ for j = 1, …, p, where adding δ to the main diagonal effectively replaces each λ_j by λ_j + δ. From the properties of the ridge estimator, the role of δ is revealed in moderating the variance of the estimators. The impact of the eigenvalues on the variances of the ridge regression coefficients can be illustrated as

Σ_j Var(β̂_R)_j = σ² Σ_j λ_j / (λ_j + δ)².

Therefore, δ in ridge regression moderates the damaging impact of the small eigenvalues that result from collinearity. There are various procedures for choosing the shrinkage parameter δ. The ridge trace is a very pragmatic procedure: δ is increased until stability is indicated in all coefficients. A plot of the coefficients against δ, which pictorially displays the trace, often helps the analyst make a decision regarding the appropriate value of δ. However, stability does not imply that the regression coefficients have converged: as δ grows, variances reduce and the coefficients become more stable [9]. Therefore, the value of δ is chosen at the point for which the coefficients no longer change rapidly.
A C_p-like statistic based on the same variance-bias trade-off is one of the proposed procedures. The statistic is used by simply plotting C_δ against δ and taking the δ-value for which C_δ is minimized. The statistic is given by Filzmoser and Croux [10] as

C_δ = SSE_δ / σ̂² − n + 2 + 2 tr(H_δ),

where SSE_δ is the residual sum of squares using ridge regression and tr(H_δ) is the trace of H_δ = X(X′X + δI)⁻¹X′. Notice that H_δ plays the same role as the HAT matrix in ordinary least squares. In ordinary least squares, residuals are helpful in identifying outliers which do not appear to be consistent with the rest of the data, while the HAT matrix is used to identify "high leverage" points which are outliers among the independent variables. The HAT matrix is given by H = X(X′X)⁻¹X′. Its trace is tr(H) = p, where p is the number of adjustable model parameters to be estimated from the available data set, and 0 ≤ h_ii ≤ 1 for all diagonal elements. The estimate σ̂² comes from the residual mean square of the ordinary least squares fit. The other criterion, which represents a prediction approach, is the generalized cross-validation (GCV) given by Hocking [11]:
GCV(δ) = Σᵢ (yᵢ − ŷᵢ,δ)² / [n − (1 + tr(H_δ))]²,

where the value of 1 in the denominator accounts for the fact that the constant term is not involved in H_δ. The procedure is to choose δ so as to minimize GCV, by simply plotting GCV against δ.
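The ridge estimator and a shrinkage-parameter search can be sketched in NumPy as follows. This illustration uses the common intercept-free form of GCV, (SSE_δ/n) / (1 − tr(H_δ)/n)², rather than the form with the +1 constant-term correction above; the function name and grid are ours:

```python
import numpy as np

def ridge_path_gcv(X, y, deltas):
    """Ridge estimates over a grid of shrinkage values delta,
    scored by GCV. Returns the coefficient path and the best delta."""
    n, p = X.shape
    coefs, gcvs = [], []
    for d in deltas:
        A = X.T @ X + d * np.eye(p)
        b = np.linalg.solve(A, X.T @ y)      # (X'X + dI)^{-1} X'y
        H = X @ np.linalg.solve(A, X.T)      # ridge "hat" matrix H_d
        resid = y - X @ b
        sse = resid @ resid
        gcv = (sse / n) / (1.0 - np.trace(H) / n) ** 2
        coefs.append(b)
        gcvs.append(gcv)
    best = int(np.argmin(gcvs))
    return np.array(coefs), deltas[best]
```

As δ grows, the coefficient vector shrinks toward zero, which is the stabilizing behaviour the ridge trace displays.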

LASSO Regression
The lasso was originally introduced in the context of least squares, and it is instructive to consider this case first, since it illustrates many of the lasso's properties in a straightforward setting.
Consider a sample consisting of N cases, each of which consists of p covariates and a single outcome. Let y_i be the outcome and x_i = (x_i1, …, x_ip)′ be the covariate vector for the i-th case. Then the objective of the lasso is to solve [12]

min over β₀, β of  (1/N) Σ_{i=1}^N (y_i − β₀ − x_i′β)²  subject to  Σ_{j=1}^p |β_j| ≤ t.

Here t is a prespecified free parameter that determines the amount of regularisation. Letting X be the covariate matrix, so that X_ij = (x_i)_j and x_i′ is the i-th row of X, the expression can be written more compactly as

min over β₀, β of  (1/N) ‖y − β₀1_N − Xβ‖²₂  subject to  ‖β‖₁ ≤ t,

where ‖β‖_p = (Σ_{j=1}^p |β_j|^p)^(1/p) is the standard ℓ^p norm and 1_N is an N × 1 vector of ones. Denoting the mean of the covariate rows by x̄ and the mean of the responses by ȳ, the resulting estimate for β₀ will end up being β̂₀ = ȳ − x̄′β, so that, as noted by Xie and Kalivas [13],

y_i − β̂₀ − x_i′β = y_i − (ȳ − x̄′β) − x_i′β = (y_i − ȳ) − (x_i − x̄)′β,

and therefore it is standard to work with variables that have been centered (made zero-mean). Additionally, the covariates are typically standardized, (1/N) Σ_i x_ij² = 1, so that the solution does not depend on the measurement scale. It can be helpful to rewrite [14]

min over β of  (1/N) ‖y − Xβ‖²₂  subject to  ‖β‖₁ ≤ t

in the so-called Lagrangian form

min over β of  (1/N) ‖y − Xβ‖²₂ + λ‖β‖₁,

where the exact relationship between t and λ is data dependent.
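The Lagrangian form is commonly solved by cyclic coordinate descent with soft-thresholding; a minimal NumPy sketch follows (the choice of solver and the function names are ours, and the data are assumed centered and standardized as described above):

```python
import numpy as np

def soft_threshold(z, g):
    """Soft-thresholding operator: shrink z toward zero by g."""
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Lasso in the Lagrangian form (1/N)||y - Xb||^2 + lam*||b||_1,
    solved by cyclic coordinate descent on centered data."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding the j-th coordinate
            r_j = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r_j
            beta[j] = soft_threshold(z, n * lam / 2) / col_sq[j]
    return beta
```

Large λ drives every coefficient exactly to zero, which is the variable-selection behaviour that distinguishes the lasso from ridge regression; as λ → 0 the solution approaches the OLS fit.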

Principal Component Regression
Let V be the p × p matrix whose columns are the normalized eigenvectors of X′X, and let λ₁ ≥ λ₂ ≥ … ≥ λ_p be the corresponding eigenvalues. Let Z = XV. Then the columns of Z are the sample principal components of X. The regression model can be written as

y = Xβ + ε = XVV′β + ε = Zα + ε,  where α = V′β.

Under this formulation, the least squares estimator of α is given by Wethrill [15] as

α̂ = (Z′Z)⁻¹Z′y,

and hence the principal component estimator of β is defined by β̂_PC = Vα̂. Calculation of OLS estimates via principal component regression may be numerically more stable than direct calculation. Critical multicollinearity will be detected as very small eigenvalues. To rid the data of the multicollinearity, principal component regression omits the components associated with small eigenvalues.
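The principal component estimator can be sketched in NumPy as follows (the function name and the choice of the number of retained components r are ours):

```python
import numpy as np

def pcr(X, y, r):
    """Principal component regression keeping the r components with
    the largest eigenvalues of X'X."""
    evals, V = np.linalg.eigh(X.T @ X)           # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    V = V[:, order[:r]]                          # r leading eigenvectors
    Z = X @ V                                    # sample principal components
    alpha = np.linalg.solve(Z.T @ Z, Z.T @ y)    # LS fit on the components
    return V @ alpha                             # beta_PC = V alpha
```

When r = p (no components omitted), β̂_PC coincides with the OLS estimate, which is a useful sanity check; dropping the components with small eigenvalues is what removes the multicollinearity.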

Ordinary Least Square Regression
Ordinary least-squares (OLS) regression is a generalized linear modelling technique that may be used to model a single response variable recorded on at least an interval scale. The technique may be applied to single or multiple explanatory variables, including categorical explanatory variables that have been appropriately coded [16]. The OLS regression model can be extended to include multiple explanatory variables by simply adding additional variables to the equation. The form of the model is the same as for simple regression, with a single response variable Y, but this time Y is predicted by multiple explanatory variables x₁, x₂, …, x_p:

Y = α + β₁x₁ + β₂x₂ + … + β_p x_p + ε.
The interpretation of the parameters from the above model is basically the same as for the simple regression model, but the relationship can no longer be graphed on a single scatter plot. The intercept α indicates the value of Y when all of the explanatory variables are zero. Each parameter β_j indicates the average change in Y that is associated with a unit change in x_j, whilst controlling for the other explanatory variables in the model. Model fit can be assessed by comparing deviance measures of nested models. For example, the effect of variable x₁ on Y in a three-variable model can be calculated by comparing the nested models [17]

Y = α + β₂x₂ + β₃x₃ + ε  and  Y = α + β₁x₁ + β₂x₂ + β₃x₃ + ε.

The change in deviance between these models indicates the effect that x₁ has on the prediction of Y when the effects of x₂ and x₃ have been accounted for (it is, therefore, the unique effect that x₁ has on Y after taking x₂ and x₃ into account). The overall effect of all three explanatory variables on Y can be assessed by comparing the full model with the intercept-only model Y = α + ε. The significance of the change in the deviance scores can be assessed through the calculation of an F-statistic (these are, however, provided as a matter of course by most software packages). As with simple OLS regression, it is a simple matter to compute the R² statistic.
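The nested-model comparison described above can be sketched in NumPy; the function name and the two-versus-three-variable setup are illustrative:

```python
import numpy as np

def f_test_nested(X_reduced, X_full, y):
    """F-statistic comparing a reduced OLS model against a full model
    that adds extra explanatory variables (both fit with an intercept)."""
    def rss(X):
        Xc = np.column_stack([np.ones(len(y)), X])   # add intercept alpha
        b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        r = y - Xc @ b
        return r @ r, Xc.shape[1]
    rss_r, k_r = rss(X_reduced)
    rss_f, k_f = rss(X_full)
    df1 = k_f - k_r                  # number of added parameters
    df2 = len(y) - k_f               # residual degrees of freedom
    return ((rss_r - rss_f) / df1) / (rss_f / df2)
```

When the added variable genuinely predicts Y, the drop in residual sum of squares is large relative to the residual mean square, and the F-statistic is correspondingly large.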

Partial Least Square Regression
Partial least squares (PLS) regression is a technique that reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on these components instead of on the original data. PLS regression is especially useful when the predictors are highly collinear, or when there are more predictors than observations and ordinary least squares regression either produces coefficients with high standard errors or fails completely [18]. Unlike multiple regression, PLS does not assume that the predictors are fixed. This means that the predictors can be measured with error, making PLS more robust to measurement uncertainty.
PLS regression is primarily used in the chemical, drug, food, and plastic industries. A common application is to model the relationship between spectral measurements (NIR, IR, UV), which include many variables that are often correlated with each other, and chemical composition or other physico-chemical properties. In PLS regression, the emphasis is on developing predictive models. Therefore, it is not usually used to screen out variables that are not useful in explaining the response.
Unlike least squares regression, PLS can fit multiple response variables in a single model. Because PLS regression models the response variables in a multivariate way, the results can differ significantly from those calculated for the response variables individually. Multiple responses should be modelled separately only when they are uncorrelated.
As in multiple linear regression, the main purpose of partial least squares regression is to build a linear model Y = XB + E, where Y is an n-by-m response matrix (n cases, m variables), X is an n-by-p predictor matrix, B is a p-by-m regression coefficient matrix, and E is a noise term with the same dimensions as Y. Usually, the variables in X and Y are centered by subtracting their means and scaled by dividing by their standard deviations.
Both principal components regression and partial least squares regression produce factor scores as linear combinations of the original predictor variables, so that there is no correlation between the factor score variables used in the predictive regression model. For example, suppose we have a data set with response variables Y (in matrix form) and a large number of predictor variables X (in matrix form), some of which are highly correlated. A regression using factor extraction for this type of data computes the factor score matrix T=XW for an appropriate weight matrix W, and then considers the linear regression model Y=TQ+E, where Q is a matrix of regression coefficients (loadings) for T, and E is an error (noise) term. Once the loadings Q are computed, the above regression model is equivalent to Y=XB+E, where B=WQ, which can be used as a predictive regression model [19].
Principal components regression and partial least squares regression differ in the methods used in extracting factor scores. In short, principal components regression produces the weight matrix W reflecting the covariance structure between the predictor variables, while partial least squares regression produces the weight matrix W reflecting the covariance structure between the predictor and response variables. For establishing the model, partial least squares regression produces a p by c weight matrix W for X such that T=XW, i.e., the columns of W are weight vectors for the X columns producing the corresponding n by c factor score matrix T. These weights are computed so that each of them maximizes the covariance between responses and the corresponding factor scores. Ordinary least squares procedures for the regression of Y on T are then performed to produce Q, the loadings for Y (or weights for Y) such that Y=TQ+E. Once Q is computed, we have Y=XB+E, where B=WQ, and the prediction model is complete.
One additional matrix necessary for a complete description of partial least squares regression procedures is the p by c factor loading matrix P, which gives a factor model X = TP + F, where F is the unexplained part of the X scores [20]. We can now describe the algorithms for computing partial least squares regression.
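One classic algorithm of this kind is NIPALS-style PLS1 for a single response. A minimal NumPy sketch follows; the implementation details are ours and cover only the single-response case, not the multivariate Y discussed above:

```python
import numpy as np

def pls1(X, y, c):
    """PLS1 (single response) via NIPALS-style deflation: each weight
    vector w maximizes the covariance between the X-scores and y."""
    Xk, yk = X.copy(), y.copy()
    W, P, Q = [], [], []
    for _ in range(c):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)        # weight: direction of max covariance
        t = Xk @ w                       # factor score
        p = Xk.T @ t / (t @ t)           # X loading
        q = (yk @ t) / (t @ t)           # y loading
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - t * q                  # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    # regression coefficients: B = W (P'W)^{-1} Q, so that y ~ X B
    return W @ np.linalg.solve(P.T @ W, Q)
```

With c equal to the number of (full-rank) predictors, the extracted components span the whole predictor space and the PLS1 coefficients reduce to the OLS solution, which makes a convenient check of the implementation.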

Tolerance and Variance Inflation Factor (VIF)
Suppose we have a regression model with p regressors and an intercept; then the variance of the j-th partial regression coefficient is given by Coxe [21] as

Var(β̂_j) = σ² / Σᵢ(x_ij − x̄_j)² · 1 / (1 − R_j²),

where β̂_j is the (partial) regression coefficient of the regressor x_j and R_j² is the R² in the (auxiliary) regression of x_j on the remaining predictors. The factor VIF_j = 1 / (1 − R_j²) is the variance inflation factor. Rule of thumb: the criterion for using the VIF as a detection method is that the higher the VIF, the more certain we are that multicollinearity is present [22][23][24]. But how high should be considered high? We can use the rule of thumb that if VIF_j is greater than 10, that variable is considered highly collinear [25]. Some statisticians also use the tolerance to detect multicollinearity, where Tolerance(β̂_j) = 1 − R_j² = 1/VIF_j.
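The VIF computation via auxiliary regressions can be sketched in NumPy as follows (the function name is ours):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing the
    j-th column of X on the remaining columns (with an intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        # auxiliary regression of x_j on the other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ b
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1.0 - resid @ resid / tss
        out[j] = 1.0 / (1.0 - r2)
    return out
```

Independent columns give VIFs near 1, while a near-duplicate column pushes its VIF far beyond the rule-of-thumb cutoff of 10.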

Comparing the Performance
To evaluate the performances of the methods studied, the Average Mean Square Error (AMSE) of the regression coefficient estimates β̂ is measured. The AMSE is defined by Osuji, et al. [26] as

AMSE(β̂) = (1/R) Σ_{i=1}^R (β̂_i − β)′(β̂_i − β),

where β̂_i denotes the estimated parameter vector in the i-th simulation run and R is the number of runs. An AMSE value close to zero indicates that the slope and intercept are correctly estimated. In addition, the Akaike Information Criterion (AIC) is also used as a performance criterion, with formula [27]

AIC = 2k − 2 ln(L̂),  where L̂ = p(x | θ̂, M),

θ̂ are the parameter values that maximize the likelihood function, x is the observed data, n is the number of data points in x, and k is the number of parameters estimated by the model. The best model is indicated by the lowest values of AIC and AMSE.
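The two performance criteria can be sketched in NumPy as follows. For the AIC we use the Gaussian form with the error variance profiled out of the likelihood, which drops additive constants; that simplification, and the function names, are ours:

```python
import numpy as np

def amse(beta_hats, beta_true):
    """Average Mean Square Error over simulation runs:
    the mean of (beta_hat_i - beta)'(beta_hat_i - beta)."""
    diffs = np.asarray(beta_hats) - np.asarray(beta_true)
    return np.mean(np.sum(diffs ** 2, axis=1))

def aic_gaussian(y, y_hat, k):
    """AIC = 2k - 2 ln(L_hat) for a Gaussian model with the variance
    profiled out: 2k + n*ln(RSS/n), up to an additive constant."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return 2 * k + n * np.log(rss / n)
```

Because the additive constant is the same for every model fitted to the same data, it cancels in comparisons, and the model with the lowest AIC is unchanged by the simplification.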

Results and Discussion
From the simulation study, the AMSE values of the estimated regression parameters β̂ for each specified case are calculated. These AMSE values indicate to what extent the slope and intercept are correctly estimated, so the goal is to obtain an AMSE value close to zero. Figs. 1-7 show the values for each method used. From Table 2, where p = 2, 4, 6, 10, 20, 50, and 100 and n = 60, 100, 150, 200, 400, 1000 observations, PCR performed best compared to the other methods, having the lowest AMSE values. To choose the most ideal model, we use the AIC of the models obtained using the five methods under review [28][29][30]. The Akaike Information Criterion values for all methods, with different numbers of independent variables and sample sizes, are presented in Table 3 and displayed as bar graphs in Figs. 8-14. These figures show that the greater the sample size, the lower the value of the Akaike Information Criterion; in contrast to sample size, the number of explanatory variables does not seem to affect the value of the Akaike Information Criterion. LASSO has the highest AIC values at every level of explanatory variables and sample sizes; as one of the regularized methods, it has the highest AIC values compared to RR and PCR. The differences in AIC values between PCR and RR are small. PCR is the most ideal method among those selected, including on the basis of the Akaike Information Criterion. This is consistent with the result in Table 1, where PCR has the smallest AMSE value among all the methods applied in the study. PCR is effective and efficient for both small and large numbers of regressors.

Conclusion
Based on the simulation results on data containing severe multicollinearity among all explanatory variables, it can be concluded that the RR and PCR methods are capable of overcoming the severe multicollinearity problem. By contrast, the LASSO method does not resolve the problem very well when all variables are severely correlated, even though LASSO does better than OLS. Overall, PCR performs best at estimating the regression coefficients on data containing severe multicollinearity among all explanatory variables.

Future Research
The performance of the five methods can also be compared for high-dimensional regressors, where p > n. It is known that the problem of multicollinearity is present in data sets where the number of variables is high compared to the number of observations.