In parametric regression analysis, we assume a parametric form for the association between the outcome variable, \(Y\), and the features, \(X\)s:
\[\begin{equation} y=f(x|\Theta)+\epsilon \end{equation}\]
One of the approaches to estimating the parameters of the model is to minimize the expected value of a loss function:
\[\begin{equation} \min_{\theta \in \Theta} E[L(Y,f(X))] \end{equation}\]
In the Ordinary Least Squares (OLS) regression model, the loss function is the squared error, and we estimate the parameters of the model to minimize this expected loss:
\[\begin{equation} \min_{\alpha,\beta} E[(y-\alpha-\beta X)^2] \end{equation}\]
Our statistical model will essentially look something like the following:
\[ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{bmatrix}_{N \times 1} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1K} \\ 1 & x_{21} & \dots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \dots & x_{NK} \end{bmatrix}_{N \times (K+1)} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{bmatrix}_{(K+1) \times 1} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}_{N \times 1} \]
or, we can write
\[\begin{equation} Y=X\beta+\epsilon \end{equation}\]
It is assumed that this is a simplified reflection of the world that we want to model. The model has a systematic component (\(X\beta\)) and a stochastic component (\(\epsilon\)). Our goal is to obtain estimates of the population parameters in the \(\beta\) vector. The sum of squared residuals (SSR) in matrix format is \(\epsilon'\epsilon\):
\[\begin{equation} \begin{bmatrix} e_1 & e_2 & \dots & e_N \end{bmatrix}_{1 \times N} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}_{N \times 1} = \begin{bmatrix} e_1^2+e_2^2+\dots+e_N^2 \end{bmatrix}_{1 \times 1} \end{equation}\]
We can also re-write the SSR as follows: \[\begin{align} \epsilon' \epsilon &= (y-X\hat{\beta})'(y-X\hat{\beta}) \nonumber \\ &= y'y-\hat{\beta}'X'y-y'X\hat{\beta}+\hat{\beta}'X'X\hat{\beta} \nonumber \\ &= y'y-2\hat{\beta}'X'y+\hat{\beta}'X'X\hat{\beta} \end{align}\]
where we use the fact that the transpose of a scalar is the scalar itself, i.e., \(y'X\hat{\beta}=(y'X\hat{\beta})'=\hat{\beta}'X'y\).
To find the OLS estimator, we solve the following optimization problem: \[\begin{eqnarray} \min_{\hat{\beta}} \epsilon' \epsilon \end{eqnarray}\] Taking the derivative of the SSR with respect to \(\hat{\beta}\) and setting it to zero gives the first-order condition: \[\begin{equation} \frac{\partial \epsilon' \epsilon}{\partial \hat{\beta}}=-2X'y+2X'X\hat{\beta}=0 \end{equation}\] To check that this is a minimum, not a maximum, we take the second derivative with respect to \(\hat{\beta}\), which gives \(2X'X\), a positive definite matrix. Solving the first-order condition for \(\hat{\beta}\): \[\begin{align} (X'X)\hat{\beta}&=X'y \\ (X'X)^{-1}(X'X)\hat{\beta}&=(X'X)^{-1}X'y \nonumber \\ I\hat{\beta}&=(X'X)^{-1}X'y \nonumber \\ \hat{\beta}&=(X'X)^{-1}X'y \end{align}\]
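As a quick numerical check, here is a minimal sketch in R that computes \(\hat{\beta}=(X'X)^{-1}X'y\) by hand and compares it with lm(). The data are simulated and all names are illustrative:
# Minimal check that the closed-form OLS solution matches lm()
set.seed(1)
N <- 200
X <- cbind(1, runif(N), runif(N))             # design matrix with a constant column
beta_true <- c(1, 2, -3)
y <- X %*% beta_true + rnorm(N)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'y
drop(beta_hat)                                # by-hand OLS estimates
coef(lm(y ~ X[, 2] + X[, 3]))                 # the same estimates from lm()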
The OLS estimator \(\hat{\beta}\) has several algebraic properties.
1. The observed values of \(X\) are uncorrelated with the residuals.
\[\begin{align} (X'X)\hat{\beta}&=X'y \nonumber \\ (X'X)\hat{\beta}&=X'(X\hat{\beta}+e) \nonumber \\ (X'X)\hat{\beta}&=X'X\hat{\beta}+X'e \nonumber \\ X'e&=0 \end{align}\]
2. The sum of the residuals is zero.
If there is a constant, then the first column in \(X\) will be a column of ones. This means that for the first element of the \(X'e\) vector (i.e. \(x_{11}e_1+x_{21}e_2+\dots+x_{N1}e_N\)) to be zero, it must be the case that \(\sum_{i=1}^N e_i=0\).
3. The sample mean of the residuals is zero.
This follows straightforwardly from the previous property: \(\bar{e}=\frac{\sum_{i=1}^N e_i}{N}=0\).
4. The regression hyperplane (a line in the bivariate case) passes through the means of the observed values, \(\bar{x}\) and \(\bar{y}\).
\[ e=y-X\hat{\beta} \;\text{ and }\; \bar{e}=0 \Rightarrow \bar{e}=\bar{y}-\bar{x}\hat{\beta} \Rightarrow 0=\bar{y}-\bar{x}\hat{\beta} \Rightarrow \bar{y}=\bar{x}\hat{\beta} \]
5. The predicted values of \(y\) are uncorrelated with the residuals.
Exercise: Prove this property of OLS.
These properties always hold true, but you should be careful not to infer anything from the residuals about the disturbances. Note that we know nothing about \(\hat{\beta}\) except that it satisfies all of the properties discussed above. We need to make some assumptions about the true model in order to make any inferences regarding \(\beta\) (the true population parameters) from \(\hat{\beta}\) (our estimator of the true parameters). Recall that \(\hat{\beta}\) comes from our sample, but we want to learn about the true parameters.
This assumption states that there is no perfect multicollinearity. In other words, the columns of \(X\) are linearly independent. This assumption is known as the identification condition.
This assumption - the zero conditional mean assumption - states that the disturbances average out to \(0\) for any value of \(X\). Put differently, no observations of the independent variables convey any information about the expected value of the disturbance. The assumption implies that \(E(y) =X\beta\). This is important since it essentially says that we get the mean function right.
This captures the familiar assumption of homoskedasticity and no autocorrelation.
\(X\) may be fixed or random, but it must be generated by a mechanism that is unrelated to \(\epsilon\).
This assumption is not actually required for the Gauss-Markov Theorem. However, we often assume it to make hypothesis testing easier. The Central Limit Theorem is typically invoked to justify this assumption.
The Gauss-Markov Theorem
The Gauss-Markov Theorem states that, conditional on assumptions 1-5, the OLS estimator is the Best Linear Unbiased Estimator (BLUE): among all linear unbiased estimators, it is the most efficient, i.e., it has the smallest variance.
Let’s load some data to explore the properties of least squares regression for forecasting problems.
# Load the simulated regression data and summarize it
yx <- read.csv("https://raw.githubusercontent.com/babakrezaee/MethodsCourses/master/DataSets/sim-reg-data.csv")
print(summary(yx))
## y x1 x2
## Min. :-2.2770 Min. :0.00348 Min. : 0.004425
## 1st Qu.: 0.4562 1st Qu.:2.33474 1st Qu.: 2.398081
## Median : 1.2172 Median :4.83693 Median : 4.775174
## Mean : 1.2110 Mean :4.90519 Mean : 4.920732
## 3rd Qu.: 2.0133 3rd Qu.:7.48680 3rd Qu.: 7.440223
## Max. : 5.2494 Max. :9.99582 Max. : 9.999662
## x3 x4 x5
## Min. :0.0001646 Min. :0.001629 Min. :0.001979
## 1st Qu.:0.2444854 1st Qu.:0.247012 1st Qu.:0.248805
## Median :0.5096922 Median :0.514615 Median :0.515297
## Mean :0.5017638 Mean :0.506709 Mean :0.506733
## 3rd Qu.:0.7567860 3rd Qu.:0.761047 3rd Qu.:0.761768
## Max. :0.9999581 Max. :1.006039 Max. :1.005441
There is one outcome variable, \(y\), and five predictors/features, \(x_1,\dots,x_5\). Below, a linear model estimated on these data is reported:
# Regress y on all available predictors
lmf <- lm(y ~ ., data = yx)
print(summary(lmf))
##
## Call:
## lm(formula = y ~ ., data = yx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4575 -0.6490 0.0287 0.6639 4.0229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.024640 0.089157 -0.276 0.782
## x1 0.025323 0.007686 3.295 0.001 **
## x2 0.035590 0.007742 4.597 4.55e-06 ***
## x3 4.248433 10.888697 0.390 0.696
## x4 -5.126662 7.773133 -0.660 0.510
## x5 2.767445 7.835586 0.353 0.724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 1994 degrees of freedom
## Multiple R-squared: 0.2434, Adjusted R-squared: 0.2415
## F-statistic: 128.3 on 5 and 1994 DF, p-value: < 2.2e-16
\(x_1\) and \(x_2\) are statistically significant at the 1% level. But the question is whether, and how much, they contribute to the explanatory power of the model and to the prediction of the outcome variable. To explore this, I repeatedly divide the sample into a training and a test subsample, and then estimate four models: (1) \(x_1\) and \(x_2\) as predictors; (2) \(x_3\) as the only predictor; (3) \(x_4\) and \(x_5\); and (4) \(x_1\) and \(x_4\).
n <- nrow(yx)
nd <- 100                                  # number of random train/test splits
set.seed(99)
rmse <- function(y, yhat) {sqrt(mean((y - yhat)^2))}
ntrain <- floor(n * .75)                   # 75% of the observations for training
resM <- matrix(0.0, nd, 4)                 # out-of-sample RMSE for each model and split
for(i in 1:nd) {
  ii <- sample(1:n, ntrain)
  dftrain <- yx[ii, ]; dftest <- yx[-ii, ]
  lm12 <- lm(y ~ x1 + x2, dftrain)
  resM[i, 1] <- rmse(dftest$y, predict(lm12, dftest))
  lm3 <- lm(y ~ x3, dftrain)
  resM[i, 2] <- rmse(dftest$y, predict(lm3, dftest))
  lm4 <- lm(y ~ x4 + x5, dftrain)
  resM[i, 3] <- rmse(dftest$y, predict(lm4, dftest))
  lm5 <- lm(y ~ x1 + x4, dftrain)
  resM[i, 4] <- rmse(dftest$y, predict(lm5, dftest))
}
Now, we can compare the RMSE of the estimated models:
colnames(resM) <- c("x12", "x3", "x45", "x14")   # label each model by the predictors used
boxplot(resM)
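Besides the boxplot, the average out-of-sample RMSE of each model can be compared directly with a small addition to the code above:
# Average out-of-sample RMSE per model across the nd random splits
round(colMeans(resM), 4)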
Due to the information and computational revolutions, we can collect more and more information. Consequently, scholars have a relatively large number of potential regressors/predictors/explanatory variables to include in their statistical models. This is not a new concern among political science scholars. For example, one of the most cited issues of Conflict Management and Peace Science (Volume 22, Issue 4, September 2005) presents a set of papers on how many regressors should be added to statistical models.
We can “throw in a ton of x’s” into our model. However, such a model becomes complex, computationally intensive, and prone to overfitting. Assuming that we have \(p\) features, we can in theory estimate the following model, conditional on having enough data points.
\[\begin{equation} y_i=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\dots+\beta_p x_{ip}+\epsilon_i \end{equation}\]
Deciding to remove one of the \(x\)'s is equivalent to assuming that its coefficient is zero. In fact, from a prediction standpoint, the variable selection problem amounts to asking which coefficient(s) we set to zero. How do we decide which variables should be included? In statistical learning, we always want to strike a balance between bias and variance, so we apply this idea to select the variables that should be included in the estimated model:
If we set too many coefficients to zero, we might be throwing out some of the variables that are important for modeling the variation in the outcome variable; this leads to large bias.
If we keep too many variables, the model becomes too sensitive to small variations in the features; this leads to large variance.
Though picking a model that balances bias and variance is straightforward and intuitive in principle, the main problem is that there are a lot of possible models to pick from!
Example: If the dataset includes 16 variables (1 outcome and 15 features), how many models can we estimate by choosing 10 from 15 features?
Solution:
\[\begin{equation} \left(\begin{array}{cc} p\\ k \end{array}\right) = \frac{p!}{k!(p-k)!}=\frac{15!}{10!(15-10)!}=3,003 \end{equation}\]
Also, we need to sum over all possible models with different \(k=0,1,2,\dots,p\):
\[\begin{equation} \left(\begin{array}{cc} p\\ 0 \end{array}\right) + \left(\begin{array}{cc} p\\ 1 \end{array}\right) + \left(\begin{array}{cc} p\\ 2 \end{array}\right) +\dots+ \left(\begin{array}{cc} p\\ p \end{array}\right) \end{equation}\]
so in total there are \(2^p\) possible models. Similar to the \(kNN\) model, \(k\) represents the complexity of the model. In variable selection problems, \(k\) specifies the number of variables used in the model, that is, \(k \in \{0,1,\dots,p\}\). Therefore, a big \(k\) leads to a complex model, and a small \(k\) results in a simple one. For each given \(k\), we will choose a single regression model from the \(\left(\begin{array}{cc} p\\ k \end{array}\right)\) possible models.
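A quick check of these counts in R:
choose(15, 10)          # 3003 models with exactly 10 of the 15 features
sum(choose(15, 0:15))   # 2^15 = 32768 models over all subset sizes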
There are two strategies for choosing a subset of features (a model) given \(k\):
Small p: For \(p\) less than about 40, we can run all the possible regressions. Given the number of variables \(k\), we will pick the subset of variables of size \(k\) with the lowest RMSE.
Big p: We adopt a Forward Stepwise Selection:
Start with \(k=0\), no variables selected. (What does this mean?) Given the current \(k\) and corresponding subset, add the new variable that gives the largest decrease in RMSE. This method is a greedy search; a minimal sketch follows below.
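Here is a minimal sketch of this greedy search on the yx data loaded earlier, reusing the rmse() helper defined above. The in-sample RMSE is used to rank candidates at each step; in practice you would pick the final \(k\) with a validation set or cross-validation:
# Greedy forward selection: at each step, add the variable that lowers RMSE the most
selected   <- c()                          # variables chosen so far
candidates <- setdiff(names(yx), "y")      # variables not yet in the model
while (length(candidates) > 0) {
  scores <- sapply(candidates, function(v) {
    f <- reformulate(c(selected, v), response = "y")
    rmse(yx$y, predict(lm(f, data = yx)))
  })
  best       <- names(which.min(scores))
  selected   <- c(selected, best)
  candidates <- setdiff(candidates, best)
  cat("step", length(selected), ": added", best,
      ", RMSE =", round(min(scores), 4), "\n")
}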
Now we know how to pick \(k\), the number of variables, and the associated variables, similar to the way we chose \(k\) in \(kNN\). You can split the sample into a training and a test set and see which \(k\) gives the best prediction; you can also use cross-validation.
Each observation in this dataset corresponds to a baseball player. We are interested in predicting the salary of baseball players.
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
## [1] 322 20
## number missing for Salary: 59
## number missing for Salary after drop: 0
## dim after dropping missing:
## [1] 263 20
## (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
## 1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 7 TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
## CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors
## 1 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## 4 FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 5 FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 6 FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 7 TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## NewLeagueN
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
## 7 FALSE
## (Intercept) Hits CRBI
## -47.9559022 3.3008446 0.6898994
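The output above is of the kind produced by best-subset selection with the leaps package. Here is a sketch of how it might be generated, assuming the data are the Hitters data from the ISLR package (the exact code used above is not shown):
# Sketch: best-subset selection on the (assumed) ISLR Hitters data with leaps
library(ISLR)    # Hitters data
library(leaps)   # regsubsets()
Hitters <- na.omit(Hitters)                        # drop players with missing Salary
regfit  <- regsubsets(Salary ~ ., data = Hitters, nvmax = 7)
summary(regfit)$which                              # TRUE/FALSE table of selected variables
coef(regfit, 2)                                    # coefficients of the best 2-variable model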
\(R^2\) is a well-known, and somewhat infamous, measure of how much of the variation in the outcome variable, \(y\), is explained by the estimated model:
\[\begin{equation} R^2=\text{cor}(\hat{y},y)^2=\frac{\sum (\hat{y}_i-\bar{y})^2}{\sum (y_i-\bar{y})^2}=1-\frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2} \end{equation}\]
\(R^2\) is a very intuitive concept: how much of the variance in \(y\) is explained by the correlation between \(x\) and \(y\). However, \(R^2\) has several critical problems that we will return to later. One suggested solution is an adjusted version of \(R^2\), known as adjusted \(R^2\):
\[\begin{equation} R_{adj}^2=1-\frac{\sum (y_i-\hat{y}_i)^2/(N-k)}{\sum (y_i-\bar{y})^2/(N-1)} \end{equation}\]
where \(k\) is the number of estimated parameters, including the intercept. Let's return to our RMSE measure:
##
## Call:
## lm(formula = Salary ~ ., data = ddfsub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -873.11 -181.72 -25.91 141.77 2040.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.51180 65.00006 1.408 0.160382
## AtBat -1.86859 0.52742 -3.543 0.000470 ***
## Hits 7.60440 1.66254 4.574 7.46e-06 ***
## Walks 3.69765 1.21036 3.055 0.002488 **
## CRBI 0.64302 0.06443 9.979 < 2e-16 ***
## DivisionW -122.95153 39.82029 -3.088 0.002239 **
## PutOuts 0.26431 0.07477 3.535 0.000484 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 319.9 on 256 degrees of freedom
## Multiple R-squared: 0.5087, Adjusted R-squared: 0.4972
## F-statistic: 44.18 on 6 and 256 DF, p-value: < 2.2e-16
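As a quick check, \(R^2\) and adjusted \(R^2\) can be computed by hand from any fitted model; here is a sketch using the lmf model estimated earlier on the yx data:
# Hand-computed R^2 and adjusted R^2 for lmf; compare with summary(lmf)
yhat <- fitted(lmf)
y    <- yx$y
N    <- length(y)
k    <- length(coef(lmf))    # number of estimated parameters, including the intercept
r2     <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
r2_adj <- 1 - (sum((y - yhat)^2) / (N - k)) / (sum((y - mean(y))^2) / (N - 1))
c(R2 = r2, R2_adj = r2_adj)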
Besides the above approach of keeping some variables out of the model, we can keep our model simple by shrinking the coefficients towards zero. In this case, a coefficient stays large only if it contributes to the predictive power of the model given the available data: large \(\beta\)s are rewarded if they improve the fit of the statistical model and penalized simply for being large. Therefore, the loss function that we aim to minimize is:
\[\begin{equation} \min_{\beta} \sum_{i=1}^n (y_i-\beta_0-\sum_{j=1}^p \beta_j x_{ij})^2+\lambda \sum_{j=1}^p \beta_j^2 \end{equation}\]
This is the ridge regression loss function, and \(\lambda\) is the penalty parameter. A small \(\lambda\) means that coefficients can be large, that is, the model will be complex. On the other hand, a large \(\lambda\) pushes many coefficients towards zero, that is, the model will be simple. Therefore, for any given \(\lambda\), we get a set of coefficients, \(\hat{\beta}_{\lambda}\). Now, the question is how we can choose \(\lambda\). The answer is clear: cross-validation, or any other type of out-of-sample prediction criterion (but preferably cross-validation!). Before solving a real problem in \(R\), it is important to note that \(\lambda\) penalizes all the coefficients equally, so make sure that the \(x\)'s are in the same units and on a similar range. Therefore, do not forget to standardize your \(x\)'s.
Let’s repeat our analysis of Baseball players’ salary.
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
## [1] 263 20
## number missing for Salary: 0
## number missing for Salary after drop: 0
## [1] 263 19
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "LeagueN" "DivisionW"
## [16] "PutOuts" "Assists" "Errors" "NewLeagueN"
## [1] 9.765625e+06 8.034825e+06 6.610781e+06 5.439126e+06 4.475128e+06
## [6] 3.681984e+06 3.029411e+06 2.492497e+06 2.050742e+06 1.687281e+06
## [11] 1.388237e+06 1.142194e+06 9.397588e+05 7.732016e+05 6.361641e+05
## [16] 5.234142e+05 4.306474e+05 3.543221e+05 2.915242e+05 2.398562e+05
## [21] 1.973455e+05 1.623692e+05 1.335919e+05 1.099149e+05 9.043421e+04
## [26] 7.440620e+04 6.121889e+04 5.036883e+04 4.144176e+04 3.409687e+04
## [31] 2.805375e+04 2.308167e+04 1.899082e+04 1.562500e+04 1.285572e+04
## [36] 1.057725e+04 8.702602e+03 7.160205e+03 5.891174e+03 4.847058e+03
## [41] 3.987995e+03 3.281187e+03 2.699649e+03 2.221180e+03 1.827511e+03
## [46] 1.503614e+03 1.237123e+03 1.017862e+03 8.374627e+02 6.890359e+02
## [51] 5.669153e+02 4.664387e+02 3.837699e+02 3.157529e+02 2.597907e+02
## [56] 2.137470e+02 1.758638e+02 1.446947e+02 1.190499e+02 9.795023e+01
## [61] 8.059013e+01 6.630682e+01 5.455500e+01 4.488600e+01 3.693068e+01
## [66] 3.038531e+01 2.500000e+01 2.056915e+01 1.692360e+01 1.392416e+01
## [71] 1.145633e+01 9.425878e+00 7.755293e+00 6.380791e+00 5.249899e+00
## [76] 4.319438e+00 3.553887e+00 2.924018e+00 2.405783e+00 1.979396e+00
## [81] 1.628580e+00 1.339940e+00 1.102457e+00 9.070645e-01 7.463019e-01
## [86] 6.140319e-01 5.052046e-01 4.156652e-01 3.419952e-01 2.813820e-01
## [91] 2.315116e-01 1.904799e-01 1.567204e-01 1.289442e-01 1.060909e-01
## [96] 8.728800e-02 7.181760e-02 5.908909e-02 4.861649e-02 4.000000e-02
## lambda 50: 689.0359
## coefficients:
## (Intercept) AtBat Hits HmRun Runs RBI
## 535.925882 16.482095 29.814164 10.233845 24.035517 21.959591
## Walks Years CAtBat CHits CHmRun CRuns
## 28.827807 12.315813 24.794917 30.459885 27.919028 31.187830
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 31.831127 18.947100 6.943116 -27.682942 33.513426 2.360476
## Errors NewLeagueN
## -4.737679 4.333144
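A ridge path like the one above can be produced with the glmnet package. Here is a sketch, again assuming the ISLR Hitters data; the exact \(\lambda\) grid and preprocessing used above are assumptions:
# Sketch: ridge regression with glmnet (alpha = 0 selects the ridge penalty)
library(glmnet)   # glmnet(), cv.glmnet()
library(ISLR)     # Hitters data (assumed source of the output above)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]   # predictors as a numeric matrix
y <- Hitters$Salary
grid  <- 10^seq(7, -2, length = 100)           # an illustrative decreasing lambda grid
ridge <- glmnet(x, y, alpha = 0, lambda = grid, standardize = TRUE)
ridge$lambda[50]                               # the 50th value on the grid
coef(ridge)[, 50]                              # coefficients at that lambda
cv <- cv.glmnet(x, y, alpha = 0)               # pick lambda by cross-validation
coef(cv, s = "lambda.min")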
The Lasso (least absolute shrinkage and selection operator) changes the form of the penalty; the loss function becomes:
\[\begin{equation} \min_{\beta} \sum_{i=1}^n (y_i-\beta_0-\sum_{j=1}^p \beta_j x_{ij})^2+\lambda \sum_{j=1}^p |\beta_j| \end{equation}\]
You might think that the results from the Lasso model are not much different from the Ridge model. However, unlike Ridge, the Lasso can set a \(\beta_j\) exactly to zero, so it performs variable selection as well as shrinkage. Here are the Lasso solutions as \(\lambda\) varies.
## (Intercept) AtBat Hits HmRun Runs RBI
## 535.925882 -289.109678 314.374161 13.916703 -34.239618 0.000000
## Walks Years CAtBat CHits CHmRun CRuns
## 124.971379 -33.041086 -191.538897 0.000000 10.072782 407.092162
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 201.114077 -199.619610 25.169164 -58.149563 79.100366 44.502170
## Errors NewLeagueN
## -19.688952 -6.704629
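Assuming the same glmnet setup as in the ridge sketch above, only the mixing parameter changes:
# Lasso: same x and y as the ridge sketch, but alpha = 1 selects the L1 penalty
lasso    <- glmnet(x, y, alpha = 1, standardize = TRUE)
cv_lasso <- cv.glmnet(x, y, alpha = 1)
coef(cv_lasso, s = "lambda.min")   # some coefficients are set exactly to zero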
Copyright 2019. This is an in-progress project; please do not cite or reproduce without my permission.