Empirical Robust Multivariate Regression Parameter Estimation Using Median Approach

The main purpose of multivariate regression analysis is the estimation of model parameters. The maximum likelihood method is not appropriate for estimation problems when the data contain outliers or extreme observations. It is therefore necessary to find a parameter estimation method in which the value of the estimator is not much affected by small changes in the data. This paper introduces a robust method for multivariate regression based on robust estimation of the location and scatter matrix of the predictor and response variables, with the Comedian method taken as the robust estimator of location and scatter. The finite-sample efficiency and robustness of the estimator are investigated through simulation. The efficiency of the proposed robust estimator is compared with that of the maximum likelihood estimator, the minimum covariance determinant estimator and the orthogonalized Gnanadesikan-Kettenring estimator in terms of mean squared error. The proposed estimator combines high robustness with high efficiency. The proposed method is illustrated on a real data set.
Keywords: Multivariate Regression, Outlier Detection, Comedian Approach, Finite Sample Efficiency.


I. INTRODUCTION
In statistical modeling, regression analysis is a process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between dependent variables (responses) and independent variables (predictors). More specifically, regression analysis helps to understand how the typical value of the dependent variables changes when any one of the independent variables is varied. A multivariate regression model thus allows us to assess the impact of multiple predictor variables on one or more dependent variables in the same model.
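Let us denote the location of the joint (x, y) variables by μ and the scatter matrix by Σ. Partitioning μ and Σ yields the notation

$$\mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}.$$

Traditionally, μ and Σ are estimated by classical procedures such as the maximum likelihood method.
Let $\hat{\mu}$ and $\hat{\Sigma}$ be the maximum likelihood estimators of μ and Σ. Then the maximum likelihood estimators for the slope matrix ℬ, the intercept vector α and the error scatter matrix $\Sigma_\varepsilon$ of the regression model $y = \mathcal{B}^{T} x + \alpha + \varepsilon$ are given by

$$\hat{\mathcal{B}} = \hat{\Sigma}_{xx}^{-1}\,\hat{\Sigma}_{xy} \qquad (3)$$

$$\hat{\alpha} = \hat{\mu}_y - \hat{\mathcal{B}}^{T}\hat{\mu}_x \qquad (4)$$

$$\hat{\Sigma}_\varepsilon = \hat{\Sigma}_{yy} - \hat{\mathcal{B}}^{T}\,\hat{\Sigma}_{xx}\,\hat{\mathcal{B}} \qquad (5)$$

The expressions (3), (4) and (5) depend directly on the estimates of the location vector and scatter matrix of the response and predictor variables. Unfortunately, classical estimators are not robust to the presence of outliers, which are observations that appear to be inconsistent with the remainder of the data set [1]. Consequently, classical regression techniques are extremely sensitive to outliers and can provide misleading results. As a solution to this problem, one may replace the classical estimates of location and scatter by highly robust estimates, which are less sensitive to outliers, and then perform a robust analysis. Many robust estimates with various properties have been proposed over the years [2].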
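The plug-in structure of the estimators can be illustrated with a short NumPy sketch. It assumes the standard partitioned form (slope from the xx and xy blocks, intercept from the partitioned location, error scatter from the yy block); the function name `regression_from_location_scatter` and the toy data are hypothetical and introduced only for illustration. Any location/scatter pair, classical or robust, can be passed in.

```python
import numpy as np

def regression_from_location_scatter(mu, sigma, p):
    """Plug-in estimates (B, alpha, Sigma_eps) from a joint location and scatter.

    mu    : (p+q,) location of (x, y); the first p entries belong to x
    sigma : (p+q, p+q) scatter of (x, y), partitioned conformably
    """
    mu_x, mu_y = mu[:p], mu[p:]
    s_xx, s_xy, s_yy = sigma[:p, :p], sigma[:p, p:], sigma[p:, p:]
    B = np.linalg.solve(s_xx, s_xy)        # (p, q) slope: S_xx^{-1} S_xy
    alpha = mu_y - B.T @ mu_x              # (q,) intercept
    sigma_eps = s_yy - B.T @ s_xx @ B      # (q, q) error scatter
    return B, alpha, sigma_eps

# Toy data (assumed for illustration): y = B0^T x + a0 + small noise
rng = np.random.default_rng(0)
n, p, q = 500, 3, 2
B0 = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])
a0 = np.array([1.0, -1.0])
X = rng.normal(size=(n, p))
Y = X @ B0 + a0 + 0.1 * rng.normal(size=(n, q))
Z = np.hstack([X, Y])

mu_hat = Z.mean(axis=0)
sigma_hat = np.cov(Z, rowvar=False, bias=True)   # ML covariance divides by n
B_hat, alpha_hat, sigma_eps_hat = regression_from_location_scatter(mu_hat, sigma_hat, p)
```

With the sample mean and the ML covariance as inputs, the result coincides with ordinary least squares with an intercept, which is why a single outlier in the mean or covariance propagates directly into the regression estimates.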
An overview of robust multivariate regression techniques in the context of simultaneous equation models is given in [3]. The application of M-estimators to each coordinate of the responses was investigated, and minimizing the sum of the Euclidean norms of the residuals was suggested [4,5]. In Section II, the robust method adapted for multivariate regression estimation, with a suitable threshold function, is described. Section III contains the simulation results in terms of finite-sample efficiency. Section IV presents the simulated robustness properties of the proposed method. The application of the proposed method to a real-life dataset is explained in Section V. The conclusion is presented in the last section.

II. MATERIALS AND METHODS
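Consider the data set $Z = \{z_i;\ i = 1, 2, \dots, n\} \in \mathbb{R}^{p+q}$ consisting of q response variables and p predictor variables, each with sample size n. Let $Z_j$ denote the j-th variable (column) of Z and let $\mathrm{MAD}(Z_j) = \operatorname{med}|Z_j - \operatorname{med}(Z_j)|$ be its median absolute deviation. Then the comedian matrix COM(Z) is defined as

$$\mathrm{COM}(Z) = \big(\mathrm{COM}(Z_j, Z_k)\big)_{j,k}, \qquad \mathrm{COM}(Z_j, Z_k) = \operatorname{med}\big\{(Z_j - \operatorname{med}(Z_j))\,(Z_k - \operatorname{med}(Z_k))\big\}.$$

Similarly, the multivariate correlation median matrix δ is defined as

$$\delta(Z) = D\,\mathrm{COM}(Z)\,D^{T},$$

where D is a diagonal matrix with diagonal elements $1/\mathrm{MAD}(Z_i)$ (i = 1, …, p).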
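As a minimal sketch of these two matrices, assuming the usual definitions COM(Z_j, Z_k) = med{(Z_j − med Z_j)(Z_k − med Z_k)} and δ(Z) = D COM(Z) Dᵀ with D = diag(1/MAD(Z_j)), the computation can be written in NumPy as follows. The function names and the toy data are hypothetical; the MAD is used without a consistency factor, which is an assumption.

```python
import numpy as np

def mad(Z):
    """Column-wise median absolute deviation (no consistency factor)."""
    return np.median(np.abs(Z - np.median(Z, axis=0)), axis=0)

def comedian_matrix(Z):
    """COM(Z)[j, k] = med_i{ (z_ij - med Z_j) * (z_ik - med Z_k) }."""
    C = Z - np.median(Z, axis=0)
    # products[i, j, k] = C[i, j] * C[i, k]; take the median over observations i
    return np.median(C[:, :, None] * C[:, None, :], axis=0)

def correlation_median(Z):
    """delta(Z) = D COM(Z) D^T with D = diag(1 / MAD(Z_j))."""
    D = np.diag(1.0 / mad(Z))
    return D @ comedian_matrix(Z) @ D.T

# Toy data (assumed): an odd sample size keeps the sample median a single order statistic
rng = np.random.default_rng(0)
Z = rng.normal(size=(201, 4))
delta = correlation_median(Z)
```

For an odd sample size the diagonal of COM(Z) equals MAD(Z_j)² exactly, so δ(Z) has a unit diagonal, in analogy with an ordinary correlation matrix.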
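Consider a p×p matrix E whose columns are the eigenvectors of δ(Z). Let $Q = D(Z)^{-1} E$ and

$$w_i = Q^{-1} z_i, \qquad i = 1, \dots, n.$$

Then W is an orthogonalized matrix with rows $w_i^{T}$ (i = 1, …, n) and columns $W_j$ (j = 1, …, p). The resulting robust estimates for location μ and scatter Σ are then defined as

$$\mu_R(Z) = Q\,l, \qquad \Sigma_R(Z) = Q\,\Gamma\,Q^{T}, \qquad (8)$$

where $\Gamma = \operatorname{diag}\big(\mathrm{MAD}(W_1)^2, \dots, \mathrm{MAD}(W_p)^2\big)$ and $l = \big(\operatorname{med}(W_1), \dots, \operatorname{med}(W_p)\big)^{T}$. The procedure can be iterated, computing $\Sigma_R$ and $\mu_R$ for W and then expressing them in the original coordinate system.

These estimates can be improved by a reweighting step based on the robust Mahalanobis distance

$$d_i = (z_i - \mu_R)^{T}\,\Sigma_R^{-1}\,(z_i - \mu_R),$$

where $\Sigma_R$ and $\mu_R$ are defined in (8). Let M be a weight function, and define $\Sigma_{RW}$ and $\mu_{RW}$ as the weighted covariance matrix and weighted mean, where each $z_i$ has weight $M(d_i)$. The simplest weight function M is "hard rejection", with $M(d) = I(d \le cv)$, where I(.) is the indicator function. We consider a cutoff value cv built from a $\chi^2$ quantile and $\operatorname{median}(d_1, \dots, d_n)$. It has been shown that the reweighted comedian estimates are positive definite, possess a high breakdown value and are approximately affine equivariant [9].

The robust Comedian estimators for ℬ, α and $\Sigma_\varepsilon$ are then obtained by partitioning $\mu_{RW}$ and $\Sigma_{RW}$ in the same way as μ and Σ and substituting them into the expressions (3), (4) and (5):

$$\hat{\mathcal{B}}_{RW} = \Sigma_{RW,xx}^{-1}\,\Sigma_{RW,xy}, \qquad \hat{\alpha}_{RW} = \mu_{RW,y} - \hat{\mathcal{B}}_{RW}^{T}\,\mu_{RW,x}, \qquad \hat{\Sigma}_{\varepsilon,RW} = \Sigma_{RW,yy} - \hat{\mathcal{B}}_{RW}^{T}\,\Sigma_{RW,xx}\,\hat{\mathcal{B}}_{RW}.$$

The efficiency of the proposed method is analyzed and evaluated through simulation.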
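The orthogonalization and reweighting steps can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the cutoff cv = χ²(0.95) · median(d) / χ²(0.5) follows the OGK-style convention and the 0.95 level is assumed, the chi-square quantile uses the Wilson-Hilferty approximation to avoid extra dependencies, the MAD carries no consistency factor, and all function names and the toy data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(prob, df):
    """Wilson-Hilferty approximation to the chi-square quantile (assumption)."""
    z = NormalDist().inv_cdf(prob)
    return df * (1 - 2 / (9 * df) + z * np.sqrt(2 / (9 * df))) ** 3

def mad(A):
    """Column-wise median absolute deviation, without consistency factor."""
    return np.median(np.abs(A - np.median(A, axis=0)), axis=0)

def comedian_location_scatter(Z):
    """Initial estimates mu_R = Q l and Sigma_R = Q Gamma Q^T."""
    C = Z - np.median(Z, axis=0)
    com = np.median(C[:, :, None] * C[:, None, :], axis=0)   # COM(Z)
    s = mad(Z)
    delta = com / np.outer(s, s)                             # D COM(Z) D^T
    _, E = np.linalg.eigh(delta)                             # eigenvectors of delta
    Q = np.diag(s) @ E                                       # Q = D^{-1} E
    W = np.linalg.solve(Q, Z.T).T                            # rows w_i = Q^{-1} z_i
    l = np.median(W, axis=0)
    Gamma = np.diag(mad(W) ** 2)
    return Q @ l, Q @ Gamma @ Q.T

def reweight(Z, mu, sigma, prob=0.95):
    """Hard-rejection reweighting; prob=0.95 is an assumed quantile level."""
    R = Z - mu
    d = np.einsum("ij,ij->i", R @ np.linalg.inv(sigma), R)   # squared distances d_i
    df = Z.shape[1]
    cv = chi2_quantile(prob, df) * np.median(d) / chi2_quantile(0.5, df)
    keep = d <= cv                                           # M(d_i) = I(d_i <= cv)
    mu_rw = Z[keep].mean(axis=0)
    sigma_rw = np.cov(Z[keep], rowvar=False, bias=True)
    return mu_rw, sigma_rw, keep

# Toy data (assumed): 270 clean points plus 30 tightly clustered outliers
rng = np.random.default_rng(0)
Z = rng.standard_normal((300, 4))
Z[:30] = rng.normal(10.0, 0.1, size=(30, 4))
mu_r, sigma_r = comedian_location_scatter(Z)
mu_rw, sigma_rw, keep = reweight(Z, mu_r, sigma_r)
```

Scaling the cutoff by the median of the distances makes the rejection rule insensitive to the overall scale of the initial scatter estimate, which matters here because the unscaled MAD underestimates the standard deviation of Gaussian data.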

III. EMPIRICAL RESULTS
To investigate the finite-sample efficiency of Comedian multivariate regression, the following simulation study is performed. For various sample sizes n and for different choices of p and q, m datasets of size n are simulated from the multivariate standard Gaussian distribution $N(0, I_{p+q})$, which corresponds to putting ℬ = 0 and α = 0. For each dataset $Z^{(k)}$, k = 1, …, m, Comedian regression is carried out, yielding the (p × q) slope matrix estimate $\hat{\mathcal{B}}^{(k)}$, the intercept vector $\hat{\alpha}^{(k)}$, and the (q × q) covariance matrix estimate $\hat{\Sigma}_\varepsilon^{(k)}$ of the errors.
To measure sample efficiency, the mean squared error (MSE) of the proposed estimators is used. As commonly defined, the MSE of a univariate component T is given by

$$\mathrm{MSE}(T) = E\big((T - \theta)^2\big),$$

where θ is the true value of the parameter; in the simulation the expectation is estimated by averaging over the m replications. The MSE of the slope matrix is defined as

$$\mathrm{MSE}(\hat{\mathcal{B}}) = \operatorname{ave}_{j,k}\big(\mathrm{MSE}(\hat{\mathcal{B}}_{jk})\big),$$

and similarly for the intercept $\hat{\alpha}$ and for the diagonal and off-diagonal elements of $\hat{\Sigma}_\varepsilon$.

To study robustness, multivariate data sets contaminated by different types of outliers are simulated. A point $(x_i, y_i)$ that does not follow the linear pattern of the majority of the data but whose $x_i$ is not outlying is called a vertical outlier. A point $(x_i, y_i)$ whose $x_i$ is outlying is called a leverage point. A leverage point that does not follow the pattern of the remaining data is termed a bad leverage point; otherwise, it is a good leverage point. The data sets are generated with both types of outliers because regression estimators are often inefficient in the presence of vertical outliers or bad leverage points.

Starting from the data described at the beginning of this section, 10% of the data is replaced as follows. To include vertical outliers, the $x_i$ are kept the same and the q response variables are generated from $N\big(2\sqrt{\chi^2_{p+q,\,0.99}},\ 0.1\big)$, so that only the response variables are outlying. Further, 10% of the data is replaced with bad leverage points, for which the p independent variables are generated according to $N\big(2\sqrt{\chi^2_{p,\,0.99}},\ 0.1\big)$ and the q dependent variables according to $N\big(2\sqrt{\chi^2_{q,\,0.99}},\ 0.1\big)$.
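A small Monte Carlo sketch of this design is given below. It is not the paper's simulation: the estimator is ordinary least squares (standing in for any of the compared methods), the 0.1 in N(·, 0.1) is assumed to be a variance, the chi-square quantile is the Wilson-Hilferty approximation, and the sizes n, p, q, m are chosen small for speed. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(prob, df):
    """Wilson-Hilferty approximation to the chi-square quantile (assumption)."""
    z = NormalDist().inv_cdf(prob)
    return df * (1 - 2 / (9 * df) + z * np.sqrt(2 / (9 * df))) ** 3

def contaminate(X, Y, frac, kind, rng):
    """Replace the first round(frac * n) rows by outliers of the given kind."""
    n, p = X.shape
    q = Y.shape[1]
    k = int(round(frac * n))
    X, Y = X.copy(), Y.copy()
    sd = np.sqrt(0.1)   # assumption: 0.1 in N(., 0.1) is a variance
    if kind == "vertical":
        Y[:k] = rng.normal(2 * np.sqrt(chi2_quantile(0.99, p + q)), sd, size=(k, q))
    elif kind == "bad_leverage":
        X[:k] = rng.normal(2 * np.sqrt(chi2_quantile(0.99, p)), sd, size=(k, p))
        Y[:k] = rng.normal(2 * np.sqrt(chi2_quantile(0.99, q)), sd, size=(k, q))
    else:
        raise ValueError(f"unknown contamination kind: {kind}")
    return X, Y

def ls_slope(X, Y):
    """Least-squares slope matrix (p, q) after centering, i.e. with intercept."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    return np.linalg.lstsq(Xc, Yc, rcond=None)[0]

def slope_mse(n, p, q, m, rng, frac=0.0, kind="vertical"):
    """Average over entries (j, k) of the Monte Carlo MSE of the slope estimate."""
    sq_err = np.zeros((p, q))
    for _ in range(m):
        Z = rng.standard_normal((n, p + q))        # true slope and intercept are 0
        X, Y = contaminate(Z[:, :p], Z[:, p:], frac, kind, rng)
        sq_err += ls_slope(X, Y) ** 2
    return float((sq_err / m).mean())

rng = np.random.default_rng(0)
mse_clean = slope_mse(100, 2, 2, 200, rng)
mse_vertical = slope_mse(100, 2, 2, 200, rng, frac=0.10, kind="vertical")
```

Swapping `ls_slope` for a robust slope estimator leaves the rest of the harness unchanged, which is the sense in which the MSE comparison across methods is like for like.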
Efficiency comparison results of the Comedian regression method for 10% contaminated data are shown in Table 2.

IV. ROBUSTNESS PROPERTIES
The robustness properties of the estimator are studied in terms of breakdown and affine equivariance. These properties support the finite-sample results of the previous section. The breakdown point of an estimator is the largest proportion of outliers the estimator can handle before giving an arbitrarily incorrect result. An empirical method for finding the breakdown value was discussed in [9]. To find the breakdown value of the Comedian regression method, samples of size n are generated from the multivariate standard Gaussian distribution $N(0, I_{p+q})$. The efficiency of the estimates from data sets with and without outliers is compared to find the maximum proportion of contamination tolerable by the Comedian regression method. The study considers two kinds of contamination: vertical outliers and bad leverage points, as described in the previous section. Various sample sizes (n = 100, 1000), contamination percentages γ (γ = 10, 20, 30, 40, 45, 48) and different combinations of p and q were selected to identify the empirical breakdown values of the Comedian regression method.

Generalized versions of regression, scale and affine equivariance, and of the robustness of multiple regression estimators, were developed in [11]. Consider $T(X, Y) = (\hat{\mathcal{B}}^{t}, \hat{\alpha})^{t}$, where X is an (n×p) matrix and Y is an (n×q) matrix. Regression equivariance means that adding a linear transformation of the predictor variables to the response variables is equivalent to adding the coefficients of that transformation to the estimator. The estimator T is said to be regression equivariant if

$$T(X,\ Y + XC + 1_n V^{t}) = T(X, Y) + (C^{t}, V)^{t}. \qquad (14)$$

Here C is any (p×q) matrix, V is any (q×1) vector, and $1_n = (1, \dots, 1)^{t}$ is the (n×1) vector of ones. The y-affine equivariance of T means that a linear transformation of the response variables transforms the estimator in the same manner. The estimator T is said to be y-affine equivariant if

$$T(X,\ YM + 1_n P^{t}) = T(X, Y)\,M + (O_{pq}^{t}, P)^{t}. \qquad (15)$$

Here M is any nonsingular (q×q) matrix, P is any (q×1) vector, and $O_{pq}$ is the (p×q) zero matrix. The estimator T is said to be x-affine equivariant if

$$T(XN^{t} + 1_n D^{t},\ Y) = \big(\hat{\mathcal{B}}^{t} N^{-1},\ \hat{\alpha} - \hat{\mathcal{B}}^{t} N^{-1} D\big)^{t}. \qquad (16)$$

Here N is any nonsingular (p×p) matrix and D is any (p×1) vector. If the predictor variables are transformed linearly, then x-affine equivariance says that the estimator T transforms accordingly.
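The three equivariance properties can be checked numerically. The sketch below does so for the ordinary least-squares estimator, which satisfies them exactly; it is a stand-in chosen for illustration, not the Comedian estimator, and it assumes the standard forms of the properties: T(X, Y + XC + 1Vᵗ) = T + (Cᵗ, V)ᵗ, T(X, YM + 1Pᵗ) = TM + (Oᵗ, P)ᵗ, and T(XNᵗ + 1Dᵗ, Y) = (ℬ̂ᵗN⁻¹, α̂ − ℬ̂ᵗN⁻¹D)ᵗ. The data and transformation matrices are randomly generated toy values.

```python
import numpy as np

def T(X, Y):
    """Least-squares estimate stacked as (B^t, alpha)^t, shape (p + 1, q)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.linalg.lstsq(A, Y, rcond=None)[0]

rng = np.random.default_rng(1)
n, p, q = 200, 3, 2
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))
ones = np.ones((n, 1))

T0 = T(X, Y)
B, alpha = T0[:p], T0[p]          # slope (p, q) and intercept (q,)

# Regression equivariance: add X C + 1 V^t to the responses
C = rng.normal(size=(p, q))
V = rng.normal(size=(q, 1))
reg_ok = np.allclose(T(X, Y + X @ C + ones @ V.T), T0 + np.vstack([C, V.T]))

# y-affine equivariance: transform the responses by M and shift by P
M = rng.normal(size=(q, q)) + 3 * np.eye(q)
P = rng.normal(size=(q, 1))
y_ok = np.allclose(T(X, Y @ M + ones @ P.T),
                   T0 @ M + np.vstack([np.zeros((p, q)), P.T]))

# x-affine equivariance: transform the predictors by N and shift by D
N = rng.normal(size=(p, p)) + 3 * np.eye(p)
D = rng.normal(size=(p, 1))
Ninv = np.linalg.inv(N)
expected = np.vstack([(B.T @ Ninv).T,
                      (alpha - (B.T @ Ninv @ D).ravel())[None, :]])
x_ok = np.allclose(T(X @ N.T + ones @ D.T, Y), expected)
```

For an estimator that is only approximately equivariant, the same comparison can be made with a tolerance, or summarized through MSE values computed from transformed data and from transformed estimates.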
The three equivariance properties are verified empirically with simulated samples, varying the parameters and the contamination levels. Table 6 gives empirical evidence for the equivariance expressions (14), (15) and (16). The table contains the MSE of the estimates obtained from transformed data and the MSE of the transformed estimates obtained from untransformed data. The MSE values are equal whether the transformation is applied to the data or to the estimate, which indicates that the Comedian regression method is affine equivariant. The result is similar when affine equivariance is tested at different contamination levels.
One of the important advantages of Comedian regression is that its computation time is relatively low compared with the MLE, OGK and MCD methods. A simulation study is performed to compare the time efficiency of the proposed method. The simulation covers different sample sizes n (50, 200, 500, 1000) with different combinations of p and q, each with m = 1000 replications. The average computation times of the different methods are tabulated in Table 7. The Comedian regression method requires less time for estimation than the other methods.

V. ILLUSTRATION EXAMPLES
Consider the dataset consisting of measurements of the properties of pulp fibers and of the paper made from them [8]. The dataset comprises n = 62 observations with p = 4 predictor variables and q = 4 response variables. The predictor variables describe four pulp fiber characteristics: arithmetic fiber length, long fiber fraction, fine fiber fraction and zero span tensile. The response variables measure four properties of the paper made from the fibers: breaking length, elastic modulus, stress at failure and burst strength. The objective is to establish a relationship between the pulp fiber properties and the resulting paper properties.
Figure 1 shows the diagnostic plot of the Pulp-Fiber data (robust residual distances versus robust distances of the predictors). The vertical and horizontal cutoff lines in Figure 1 are at $\sqrt{\chi^2_{4,\,0.975}} = 3.34$. Observations 56, 58, 59, 60, 61 and 62 lie beyond both cutoff lines; these six observations are therefore classified as outliers (bad leverage points). Observations 28, 51 and 52 lie above the horizontal cutoff line only; these are vertical outliers, because their residual distances are large while their robust distances are small. Considering that the efficiency of Comedian multivariate regression is relatively high, the suspicious outliers are observations 56, 58, 59, 60, 61 and 62. This computation took only 0.42 seconds in R.
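The cutoff value and the classification rule behind such a diagnostic plot can be sketched as follows. The chi-square quantile is computed with the closed-form CDF that holds for even degrees of freedom, solved by bisection, so no statistics library is needed. The function names are hypothetical and the distance values passed to `classify` are made-up inputs, not values from the Pulp-Fiber data.

```python
import math

def chi2_cdf_even(x, df):
    """Exact chi-square CDF for even df: 1 - exp(-x/2) * sum_{j < df/2} (x/2)^j / j!."""
    half = x / 2.0
    series = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return 1.0 - math.exp(-half) * series

def chi2_quantile_even(prob, df, lo=0.0, hi=200.0):
    """Quantile by bisection on the CDF (even df only)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def classify(robust_dist, resid_dist, cut_x, cut_r):
    """Label a point from its robust distance in x and its residual distance."""
    if resid_dist > cut_r:
        return "bad leverage point" if robust_dist > cut_x else "vertical outlier"
    return "good leverage point" if robust_dist > cut_x else "regular observation"

# p = q = 4, so both cutoff lines use 4 degrees of freedom
cutoff = math.sqrt(chi2_quantile_even(0.975, 4))
print(round(cutoff, 2))   # 3.34

# Hypothetical distances, for illustration only
print(classify(robust_dist=6.0, resid_dist=8.0, cut_x=cutoff, cut_r=cutoff))
print(classify(robust_dist=1.2, resid_dist=5.0, cut_x=cutoff, cut_r=cutoff))
```

The rule makes explicit that a vertical outlier is flagged by a large residual distance combined with a small distance in the predictor space, while a bad leverage point exceeds both cutoffs.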

Figure 1: Plot of Robust Residuals versus Robust Distances for the Pulp-Fiber data

Table 1. Finite sample comparison of Comedian, MLE, MCD and OGK estimations based on Mean Square Error (MSE) for p=q=6

Table 2. Finite sample comparison of Comedian, MLE, MCD and OGK estimations based on Mean Square Error (MSE) for p=q=6, when the data contain 10% outliers.

Table 1 gives the efficiency comparison results of the Comedian regression method for simulated data. The proposed Comedian regression is compared with MLE, MCD and OGK based on the mean squared error (MSE). The MSEs of the slope matrix, intercept vector and error covariance matrix obtained with the different methods are tabulated. All simulations were done with m = 1000 replications, for sample sizes between 50 and 1000. In Table 1, the MSE of the Comedian regression estimates equals the MSE of the MLE regression estimates, while the MSEs obtained from Comedian regression are much lower than those obtained from MCD regression and OGK regression. Simulations for other sample sizes n and different dimensions p and q gave similar results.

For the 10% contaminated data in Table 2, the proposed Comedian regression is again compared with MLE, MCD and OGK based on the mean squared error (MSE). From the table, one can see that the MSEs obtained from Comedian regression are much lower than those obtained from MLE regression, MCD regression and OGK regression. When the contamination level is increased to 20% and 40%, the efficiency values are as shown in Table 3 and Table 4, respectively. For 20% contaminated data, the efficiency of the proposed method is similar to that under 10% contamination. For 40% contaminated data, the Comedian regression MSEs are larger for small sample sizes and decrease gradually as the sample size grows. Simulations for other sample sizes n and different dimensions p and q gave similar results. The efficiency of Comedian regression is also compared on correlated samples by generating correlated multivariate Gaussian responses with correlation $r_{jk} = 0.5$.

Table 4. Finite sample comparison of Comedian, MLE, MCD and OGK estimations based on Mean Square Error (MSE) for p=q=6, when the data contain 40% outliers.

VI. CONCLUSION
[...] MCD and OGK methods. The finite-sample efficiency and robustness properties are examined with simulated samples, and the MSE results are presented in the tables. The proposed approach gave the best finite-sample performance in the simulations and the highest efficiency among the compared methods. Moreover, the robustness properties of the proposed approach are retained in simulations with contaminated data sets. The simulations indicate that the proposed method satisfies robustness properties such as a high breakdown value and affine equivariance. The time efficiency of the Comedian regression method is remarkably better than that of the other two methods, as shown by the average time spent on estimation over 1000 simulation replications: the Comedian regression method requires almost half the time required by the OGK and MCD methods. Comedian multivariate regression has been illustrated on a real data application with the help of diagnostic plots constructed from robust residual distances. With the proposed estimators, it is easy to identify possible outliers in the data. The proposed robust regression estimator is suitable for multivariate regression estimation and further regression analysis in real data sets with multiple outliers.

Table 5. Efficiency comparison of breakdown value

Table 6: MSE comparison for the different affine equivariance properties. The sample size is n=1000.

Table 7: Average time consumption of the different methods in R.