L1 Penalized Regression Procedures for Feature Selection

Received: 05/Sept/2018, Accepted: 13/Oct/2018, Online: 31/Oct/2018

Abstract: In high-dimensional regression analysis, large numbers of independent variables arise in many scientific fields and machine learning applications. To select the predictors that are relevant to the response, statistical feature selection should be performed. When there are many predictor variables, highly correlated variables, or both, traditional methods such as forward, backward and mixed stepwise variable selection fail. Alternatives are needed, namely L1 penalized regression procedures, which provide higher prediction accuracy and computational efficiency. This paper demonstrates such procedures, particularly the least absolute shrinkage and selection operator (LASSO), which performs shrinkage and variable selection simultaneously, and its variants. When the data set contains extreme observations, robust regression estimators combined with the LASSO tolerate outliers with comparatively greater accuracy. The performance of these procedures is analyzed with numerical illustrations, using the Median Squared Error (MSE) as the performance measure.


I. INTRODUCTION
Datasets with outliers or heavy-tailed errors are commonly encountered in many scientific fields and real-time applications. In regression analysis, such extreme observations may appear in the response variable or in the predictor variables. In this case, the Ordinary Least Squares (OLS) estimator fails to produce reliable estimates of the true parameters.

Another major problem in linear regression is variable selection. Variable selection, or feature selection, has become an important and widely used task in statistics. For high-dimensional models, penalized estimators are now widely preferred to maximum likelihood estimators. As the number of predictor variables increases, the predictive model becomes less effective because most covariates are inactive in the model. This causes over-fitting or under-fitting, makes computation very complex, and reduces prediction power because of the noise. The effects of the covariates also become difficult to interpret. Variable selection is therefore necessary in the predictive model, and many penalized regression procedures have been established over the past few decades to perform feature selection in regression models. The standard lasso and its variants were developed to shrink the coefficients in the model towards zero, setting some of them exactly to zero. In some cases, it is reasonable to perform feature selection by grouping features. Yuan and Lin (2006) proposed the group lasso, in which coefficients are grouped. Like the lasso, the group lasso suffers from estimation inefficiency and inconsistency in variable selection. To overcome these limitations, Wang and Leng (2008) proposed the adaptive group lasso, which selects relevant features in a grouped way by adding a weight vector. It is consistent in variable selection and satisfies the oracle property.

In this paper, Section II briefly recalls the various lasso-type methods. Section III demonstrates the performance of the various penalty methods with real data. The paper concludes with a discussion in the last section.

II. PENALIZATION METHODS
The lasso and its variants are briefly summarized in this section.

A. LASSO
The lasso estimator minimizes the residual sum of squares subject to a constraint of the form $\sum_{j=1}^{p} |\beta_j| \le t$. In the equivalent penalized form,

$$\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^{T}\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$

where $\lambda \ge 0$ is a tuning parameter which controls the amount of shrinkage and is the same for all regression coefficients. The lasso not only shrinks coefficients towards zero but also selects the significant covariates. However, the least squares criterion used in lasso regression is known to be very sensitive to outliers.
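
The way the lasso shrinks and selects at the same time can be illustrated with a minimal coordinate descent sketch. It is not the procedure used in the paper (whose analysis was done in R); the function names `soft_threshold` and `lasso_cd`, the simulated data and the value `lam=40.0` are all assumptions made for illustration, and the predictors and response are assumed to be centred so that no intercept is needed.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding: move z toward zero by gamma, and set it to zero if |z| <= gamma."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for sum((y - X b)^2) + lam * sum(|b|).

    Hypothetical helper: assumes centred X and y (no intercept).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # squared norm of each column
    for _ in range(n_iter):
        for j in range(p):
            # partial residual that leaves out the contribution of feature j
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # exact one-dimensional minimiser of the penalised criterion
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Simulated data (assumption): two active and three inactive predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ beta_true + 0.3 * rng.normal(size=100)
X = X - X.mean(axis=0)
y = y - y.mean()

beta_hat = lasso_cd(X, y, lam=40.0)
print(beta_hat)
```

In this sketch the soft-thresholding step is what produces exact zeros: a coefficient whose correlation with the partial residual is below `lam / 2` is dropped from the model, while the remaining coefficients are shrunk towards zero.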

B. LAD-lasso
The standard lasso does not perform well when the regression errors contain extreme observations. To obtain a robust estimator, Wang et al. combined the least absolute deviation (LAD) criterion and the lasso penalty to produce the LAD-lasso estimator, which is defined as follows:

$$\hat{\beta}^{LAD\text{-}lasso} = \arg\min_{\beta} \sum_{i=1}^{n} |y_i - x_i^{T}\beta| + \lambda_n \sum_{j=1}^{p} |\beta_j|.$$

With a suitable $\lambda_n$, the LAD-lasso satisfies the oracle property.
Besides, as Zou (2006) showed, with an appropriate $\lambda_n$ and a weight vector, the adaptive LAD-lasso satisfies the oracle property. Moreover, the resulting estimator is not affected by skewed errors, since the squared loss is replaced by the L1 loss. However, this loss penalizes small errors strongly. Specifically, when the error is not skewed, it is less efficient than the adaptive lasso. In this case, Huber's criterion with the lasso is preferable.

C. Huber lasso
The Huber lasso estimator minimizes Huber's criterion together with an adaptive lasso penalty, in which each coefficient is weighted by an entry of a weights vector, and Huber's criterion involves a scale parameter $s > 0$ for the distribution. The criterion $H_{adl}$ is thus a combination of Huber's loss function and the adaptive lasso penalty. Hence, the resulting estimator tolerates more extreme observations and selects variables simultaneously. Here, robustness is controlled by the shape parameter $M$. Huber suggested $M = 1.345$ to obtain robustness efficiently for normally distributed data. Generally, Huber's method tolerates more extreme observations in the dataset. However, for normally distributed data its efficiency is comparatively low.
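
To make the role of the shape parameter $M$ concrete, the following sketch compares the squared, absolute and Huber losses on a small, a moderate and a very large residual. The paper's own formula for Huber's function did not survive in this text, so the parameterization used here (quadratic, $z^2$, for $|z| \le M$ and linear, $2M|z| - M^2$, beyond it) is an assumption based on one common convention, and the scale $s$ is fixed at 1 for simplicity.

```python
import numpy as np

def huber(z, M=1.345):
    """Huber's function under an assumed convention:
    z**2 for |z| <= M, and 2*M*|z| - M**2 for |z| > M."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= M, z ** 2, 2.0 * M * np.abs(z) - M ** 2)

# Illustrative residuals: small, moderate, and a gross outlier.
residuals = np.array([0.1, 1.0, 10.0])

squared_loss = residuals ** 2
absolute_loss = np.abs(residuals)
huber_loss = huber(residuals)

for r, sq, ab, hu in zip(residuals, squared_loss, absolute_loss, huber_loss):
    print(f"residual={r:5.1f}  squared={sq:8.3f}  absolute={ab:6.3f}  huber={hu:7.3f}")
```

For the small residual the Huber loss coincides with the squared loss, whereas the absolute loss is larger, which is the sense in which the L1 loss penalizes small errors strongly. For the gross outlier the Huber loss grows only linearly and stays far below the squared loss.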

D. LTS
The least trimmed squares (LTS) estimator is extended by adding a penalty with parameter $\lambda$, which leads to the sparse LTS estimator, a combination of the lasso and the LTS estimator.

The penalized MTE performs well in robust estimation and variable selection under high-dimensional regression. It also enjoys consistency, asymptotic normality and the oracle property under fixed-dimensional regression.
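
The trimming idea behind the sparse LTS estimator can be sketched by evaluating its objective directly. The paper's formula is not available in this text, so the form below (the sum of the `h` smallest squared residuals plus `h * lam` times the L1 norm of the coefficients) is an assumption following the usual definition of sparse LTS; the function name `sparse_lts_objective` and the toy data are hypothetical.

```python
import numpy as np

def sparse_lts_objective(beta, X, y, lam, h):
    """Assumed sparse LTS objective: the h smallest squared residuals
    plus an L1 penalty scaled by h * lam."""
    sq_resid = (y - X @ beta) ** 2
    trimmed_sum = np.sort(sq_resid)[:h].sum()   # the largest n - h residuals are ignored
    return trimmed_sum + h * lam * np.abs(beta).sum()

# Toy data (assumption): the last response is a gross outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.0, 3.0, 100.0])
beta = np.array([1.0])

obj_trimmed = sparse_lts_objective(beta, X, y, lam=0.1, h=3)  # outlier trimmed away
obj_full = sparse_lts_objective(beta, X, y, lam=0.1, h=4)     # outlier included
print(obj_trimmed, obj_full)
```

With `h = 3` the outlying observation does not enter the criterion at all, so a coefficient that fits the bulk of the data is not penalized for missing the outlier. With `h = 4` the same coefficient receives a very large objective value.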

III. NUMERICAL STUDY
The performance of various penalization procedures has been studied on real data, and the results obtained are presented in this section. The penalization methods are applied to the Boston housing price data set, which is taken from the 1970 census. There are 506 observations in total, each having 13 predictor variables, namely crim (1), zn (2), indus (3), chas (4), nox (5), rm (6), age (7), dis (8), rad (9), tax (10), ptratio (11), black (12), lstat (13), and a dependent variable medv. As the dataset contains outliers, they were detected using Cook's distance and removed. The analysis was carried out in R. The results, such as the variables selected and the MSE under the various procedures with and without outliers, are summarized in the following table.
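
The outlier screening and the performance measure described above can be sketched as follows. This is not the paper's R analysis and it does not use the Boston data: the data are simulated, the helper names `cooks_distance` and `median_squared_error` are hypothetical, and the cut-off `4 / n` for Cook's distance is a common rule of thumb assumed here because the paper does not state its threshold.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of each observation for an OLS fit with intercept."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])          # design matrix with intercept
    k = Z.shape[1]
    Z_pinv = np.linalg.pinv(Z)
    resid = y - Z @ (Z_pinv @ y)                  # OLS residuals
    leverage = np.einsum("ij,ji->i", Z, Z_pinv)   # diagonal of the hat matrix
    s2 = resid @ resid / (n - k)                  # residual variance estimate
    return resid ** 2 / (k * s2) * leverage / (1.0 - leverage) ** 2

def median_squared_error(y, y_hat):
    """Median of the squared prediction errors."""
    return float(np.median((y - y_hat) ** 2))

# Simulated data (assumption): observation 0 is made an outlier in the response.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)
y[0] += 30.0

D = cooks_distance(X, y)
flagged = np.where(D > 4.0 / n)[0]                # assumed rule-of-thumb cut-off
keep = np.setdiff1d(np.arange(n), flagged)

# Refit on the retained observations and evaluate the median squared error.
Z_keep = np.column_stack([np.ones(keep.size), X[keep]])
coef, *_ = np.linalg.lstsq(Z_keep, y[keep], rcond=None)
mse_clean = median_squared_error(y[keep], Z_keep @ coef)
print(flagged, mse_clean)
```

Using the median rather than the mean of the squared errors keeps the performance measure itself from being dominated by a few extreme residuals, which matters when procedures are compared on data with outliers.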

IV. CONCLUSION
The performance of various lasso penalty methods was studied on the Boston housing price data set with and without outliers. In the numerical study, the efficiency of variable selection and the accuracy of prediction were compared with those of the standard lasso. In terms of the median squared error, all the robust procedures perform well when compared with the standard lasso. Further, it is noted that the robust procedures MTE and Huber lasso perform better when there are extreme observations. It is concluded that the robust procedures perform well even with extreme observations and in the presence of multicollinearity among the variables.