In parametric regression analysis, we assume a parametric form for the association between the outcome variable, \(Y\), and the features, \(X\)s:
\[\begin{equation} y=f(x|\Theta)+\epsilon \end{equation}\]
One of the approaches to estimating the parameters of the model is to minimize the expected value of a loss function:
\[\begin{equation} \min_{\theta \in \Theta} E[L(Y,f(X))] \end{equation}\]
In the Ordinary Least Squares (OLS) regression model, the loss function is the squared error, and we estimate the parameters of the model to minimize this expected loss:
\[\begin{equation} \min_{\alpha,\beta} E[(y-\alpha-\beta X)^2] \end{equation}\]
Our statistical model will essentially look something like the following:
\[ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{bmatrix}_{N \times 1} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1K} \\ 1 & x_{21} & \dots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & \dots & x_{NK} \end{bmatrix}_{N \times (K+1)} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{bmatrix}_{(K+1) \times 1} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}_{N \times 1} \]
or, we can write
\[\begin{equation} Y=X\beta+\epsilon \end{equation}\]
It is assumed that this is a simplified reflection of the world that we want to model. The model has a systematic component (\(X\beta\)) and a stochastic component (\(\epsilon\)). Our goal is to obtain estimates of the population parameters in the \(\beta\) vector. The sum of squared residuals (SSR) in matrix format is \(\epsilon'\epsilon\):
\[\begin{equation} \begin{bmatrix} e_1 & e_2 & \dots & e_N \end{bmatrix}_{1 \times N} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}_{N \times 1} = \begin{bmatrix} e_1^2+e_2^2+\dots+e_N^2 \end{bmatrix}_{1 \times 1} \end{equation}\]
We can also re-write the SSR as follows: \[\begin{align} \epsilon' \epsilon &= (y-X\hat{\beta})'(y-X\hat{\beta}) \nonumber \\ &= y'y-\hat{\beta}'X'y-y'X\hat{\beta}+\hat{\beta}'X'X\hat{\beta} \nonumber \\ &= y'y-2\hat{\beta}'X'y+\hat{\beta}'X'X\hat{\beta} \end{align}\]
where we use the fact that the transpose of a scalar is the scalar itself, i.e., \(y'X\hat{\beta}=(y'X\hat{\beta})'=\hat{\beta}'X'y\).
To find the OLS estimator, we solve the following optimization problem: \[\begin{eqnarray} \min_{\hat{\beta}} \epsilon' \epsilon \end{eqnarray}\] Taking the derivative of the SSR with respect to \(\hat{\beta}\) and setting it to zero gives the first-order condition: \[\begin{equation} \frac{\partial \epsilon' \epsilon}{\partial \hat{\beta}}=-2X'y+2X'X\hat{\beta}=0 \end{equation}\] To check that this is a minimum, not a maximum, we take the second derivative with respect to \(\hat{\beta}\), which gives \(2X'X\), a positive definite matrix. Solving the first-order condition for \(\hat{\beta}\): \[\begin{align} (X'X)\hat{\beta}&=X'y \\ (X'X)^{-1}(X'X)\hat{\beta}&=(X'X)^{-1}X'y \nonumber \\ I\hat{\beta}&=(X'X)^{-1}X'y \nonumber \\ \hat{\beta}&=(X'X)^{-1}X'y \end{align}\]
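As a quick numerical check, here is a minimal sketch in R that computes \(\hat{\beta}=(X'X)^{-1}X'y\) by hand and compares it with lm(). The data are simulated and all names are illustrative:
# Minimal check that the closed-form OLS solution matches lm()
set.seed(1)
N <- 200
X <- cbind(1, runif(N), runif(N))             # design matrix with a constant column
beta_true <- c(1, 2, -3)
y <- X %*% beta_true + rnorm(N)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'y
drop(beta_hat)                                # by-hand OLS estimates
coef(lm(y ~ X[, 2] + X[, 3]))                 # the same estimates from lm()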
The OLS estimator \(\hat{\beta}\) has several algebraic properties.
1. The observed values of \(X\) are uncorrelated with the residuals.
\[\begin{align} (X'X)\hat{\beta}&=X'y \nonumber \\ (X'X)\hat{\beta}&=X'(X\hat{\beta}+e) \nonumber \\ (X'X)\hat{\beta}&=X'X\hat{\beta}+X'e \nonumber \\ X'e&=0 \end{align}\]
2. The sum of the residuals is zero.
If there is a constant, then the first column in \(X\) will be a column of ones. This means that for the first element of the \(X'e\) vector (i.e. \(x_{11}e_1+x_{21}e_2+\dots+x_{N1}e_N\)) to be zero, it must be the case that \(\sum_{i=1}^N e_i=0\).
3. The sample mean of the residuals is zero.
This follows straightforwardly from the previous property: \(\bar{e}=\frac{\sum_{i=1}^N e_i}{N}=0\).
4. The regression hyperplane (a line in the bivariate case) passes through the means of the observed values, \(\bar{x}\) and \(\bar{y}\).
\[ e=y-X\hat{\beta} \;\text{ and }\; \bar{e}=0 \Rightarrow \bar{e}=\bar{y}-\bar{x}\hat{\beta} \Rightarrow 0=\bar{y}-\bar{x}\hat{\beta} \Rightarrow \bar{y}=\bar{x}\hat{\beta} \]
5. The predicted values of \(y\) are uncorrelated with the residuals.
Exercise: Prove this property of OLS.
These properties always hold true, but you should be careful not to infer anything from the residuals about the disturbances. Note that we know nothing about \(\hat{\beta}\) except that it satisfies all of the properties discussed above. We need to make some assumptions about the true model in order to make any inferences regarding \(\beta\) (the true population parameters) from \(\hat{\beta}\) (our estimator of the true parameters). Recall that \(\hat{\beta}\) comes from our sample, but we want to learn about the true parameters.
This assumption states that there is no perfect multicollinearity. In other words, the columns of \(X\) are linearly independent. This assumption is known as the identification condition.
This assumption - the zero conditional mean assumption - states that the disturbances average out to \(0\) for any value of \(X\). Put differently, no observations of the independent variables convey any information about the expected value of the disturbance. The assumption implies that \(E(y) =X\beta\). This is important since it essentially says that we get the mean function right.
This captures the familiar assumption of homoskedasticity and no autocorrelation.
\(X\) may be fixed or random, but it must be generated by a mechanism that is unrelated to \(\epsilon\).
This assumption is not actually required for the Gauss-Markov Theorem. However, we often assume it to make hypothesis testing easier. The Central Limit Theorem is typically invoked to justify this assumption.
The Gauss-Markov Theorem
The Gauss-Markov Theorem states that, conditional on assumptions 1-5, the OLS estimator is the Best Linear Unbiased Estimator (BLUE): among all linear unbiased estimators, it is the most efficient, i.e., it has the smallest variance.
Let’s load some data to explore the properties of least squares regression for forecasting problems.
# Load the simulated regression data and summarize it
yx <- read.csv("https://raw.githubusercontent.com/babakrezaee/MethodsCourses/master/DataSets/sim-reg-data.csv")
print(summary(yx))
## y x1 x2
## Min. :-2.2770 Min. :0.00348 Min. : 0.004425
## 1st Qu.: 0.4562 1st Qu.:2.33474 1st Qu.: 2.398081
## Median : 1.2172 Median :4.83693 Median : 4.775174
## Mean : 1.2110 Mean :4.90519 Mean : 4.920732
## 3rd Qu.: 2.0133 3rd Qu.:7.48680 3rd Qu.: 7.440223
## Max. : 5.2494 Max. :9.99582 Max. : 9.999662
## x3 x4 x5
## Min. :0.0001646 Min. :0.001629 Min. :0.001979
## 1st Qu.:0.2444854 1st Qu.:0.247012 1st Qu.:0.248805
## Median :0.5096922 Median :0.514615 Median :0.515297
## Mean :0.5017638 Mean :0.506709 Mean :0.506733
## 3rd Qu.:0.7567860 3rd Qu.:0.761047 3rd Qu.:0.761768
## Max. :0.9999581 Max. :1.006039 Max. :1.005441
There is one outcome variable, \(y\), and five predictors/features, \(x_1,\dots,x_5\). Below, a linear model estimated on these data is reported:
# Regress y on all available predictors
lmf <- lm(y ~ ., data = yx)
print(summary(lmf))
##
## Call:
## lm(formula = y ~ ., data = yx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4575 -0.6490 0.0287 0.6639 4.0229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.024640 0.089157 -0.276 0.782
## x1 0.025323 0.007686 3.295 0.001 **
## x2 0.035590 0.007742 4.597 4.55e-06 ***
## x3 4.248433 10.888697 0.390 0.696
## x4 -5.126662 7.773133 -0.660 0.510
## x5 2.767445 7.835586 0.353 0.724
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.007 on 1994 degrees of freedom
## Multiple R-squared: 0.2434, Adjusted R-squared: 0.2415
## F-statistic: 128.3 on 5 and 1994 DF, p-value: < 2.2e-16
\(x_1\) and \(x_2\) are statistically significant at the 1% level. But the question is whether, and how much, they contribute to the explanatory power of the model and to the prediction of the outcome variable. To explore this, I repeatedly divide the sample into a training and a test subsample, and then estimate four models: (1) \(x_1\) and \(x_2\) as predictors; (2) \(x_3\) as the only predictor; (3) \(x_4\) and \(x_5\); and (4) \(x_1\) and \(x_4\).
n <- nrow(yx)
nd <- 100                                  # number of random train/test splits
set.seed(99)
rmse <- function(y, yhat) {sqrt(mean((y - yhat)^2))}
ntrain <- floor(n * .75)                   # 75% of the observations for training
resM <- matrix(0.0, nd, 4)                 # out-of-sample RMSE for each model and split
for(i in 1:nd) {
  ii <- sample(1:n, ntrain)
  dftrain <- yx[ii, ]; dftest <- yx[-ii, ]
  lm12 <- lm(y ~ x1 + x2, dftrain)
  resM[i, 1] <- rmse(dftest$y, predict(lm12, dftest))
  lm3 <- lm(y ~ x3, dftrain)
  resM[i, 2] <- rmse(dftest$y, predict(lm3, dftest))
  lm4 <- lm(y ~ x4 + x5, dftrain)
  resM[i, 3] <- rmse(dftest$y, predict(lm4, dftest))
  lm5 <- lm(y ~ x1 + x4, dftrain)
  resM[i, 4] <- rmse(dftest$y, predict(lm5, dftest))
}
Now, we can compare the RMSE of the estimated models:
colnames(resM) <- c("x12", "x3", "x45", "x14")   # label each model by the predictors used
boxplot(resM)
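Besides the boxplot, the average out-of-sample RMSE of each model can be compared directly with a small addition to the code above:
# Average out-of-sample RMSE per model across the nd random splits
round(colMeans(resM), 4)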
Due to the information and computational revolutions, we can collect more and more information. Consequently, scholars have a relatively large number of potential regressors/predictors/explanatory variables to include in their statistical models. This is not a new concern among political science scholars. For example, one of the most cited issues of Conflict Management and Peace Science (Volume 22, Issue 4, September 2005) presents a set of papers on how many regressors should be added to statistical models.
We can “throw in a ton of x’s” into our model. However, such a model becomes complex, computationally intensive, and prone to overfitting. Assuming that we have \(p\) features, we can in theory estimate the following model, conditional on having enough data points.
\[\begin{equation} y_i=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\dots+\beta_p x_{ip}+\epsilon_i \end{equation}\]
Deciding to remove one of the \(x\)'s is equivalent to assuming that its coefficient is zero. In fact, from a prediction standpoint, the variable selection problem amounts to asking which coefficient(s) we set to zero. How do we decide which variables should be included? In statistical learning, we always want to strike a balance between bias and variance, so we apply this idea to select the variables that should be included in the estimated model:
If we set too many coefficients to zero, we might be throwing out some of the variables that are important for modeling the variation in the outcome variable; this leads to large bias.
If we keep too many variables, the model becomes too sensitive to small variations in the features; this leads to large variance.
Though picking a model that balances bias and variance is straightforward and intuitive in principle, the main problem is that there are a lot of possible models to pick from!
Example: If the dataset includes 16 variables (1 outcome and 15 features), how many models can we estimate by choosing 10 from 15 features?
Solution:
\[\begin{equation} \left(\begin{array}{cc} p\\ k \end{array}\right) = \frac{p!}{k!(p-k)!}=\frac{15!}{10!(15-10)!}=3,003 \end{equation}\]
Also, we need to sum over all possible models with different \(k=0,1,2,\dots,p\):
\[\begin{equation} \left(\begin{array}{cc} p\\ 0 \end{array}\right) + \left(\begin{array}{cc} p\\ 1 \end{array}\right) + \left(\begin{array}{cc} p\\ 2 \end{array}\right) +\dots+ \left(\begin{array}{cc} p\\ p \end{array}\right) \end{equation}\]
so in total there are \(2^p\) possible models. Similar to the \(kNN\) model, \(k\) represents the complexity of the model. In variable selection problems, \(k\) specifies the number of variables used in the model, that is, \(k \in \{0,1,\dots,p\}\). Therefore, a big \(k\) leads to a complex model, and a small \(k\) results in a simple one. For each given \(k\), we will choose a single regression model from the \(\left(\begin{array}{cc} p\\ k \end{array}\right)\) possible models.
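A quick check of these counts in R:
choose(15, 10)          # 3003 models with exactly 10 of the 15 features
sum(choose(15, 0:15))   # 2^15 = 32768 models over all subset sizes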
There are two strategies for choosing a subset of features (a model) given \(k\):
Small p: For \(p\) less than about 40, we can run all the possible regressions. Given the number of variables \(k\), we will pick the subset of variables of size \(k\) with the lowest RMSE.
Big p: We adopt a Forward Stepwise Selection:
Start with \(k=0\), no variables selected. (What does this mean?) Given the current \(k\) and corresponding subset, add the new variable that gives the largest decrease in RMSE. This method is a greedy search; a minimal sketch follows below.
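Here is a minimal sketch of this greedy search on the yx data loaded earlier, reusing the rmse() helper defined above. The in-sample RMSE is used to rank candidates at each step; in practice you would pick the final \(k\) with a validation set or cross-validation:
# Greedy forward selection: at each step, add the variable that lowers RMSE the most
selected   <- c()                          # variables chosen so far
candidates <- setdiff(names(yx), "y")      # variables not yet in the model
while (length(candidates) > 0) {
  scores <- sapply(candidates, function(v) {
    f <- reformulate(c(selected, v), response = "y")
    rmse(yx$y, predict(lm(f, data = yx)))
  })
  best       <- names(which.min(scores))
  selected   <- c(selected, best)
  candidates <- setdiff(candidates, best)
  cat("step", length(selected), ": added", best,
      ", RMSE =", round(min(scores), 4), "\n")
}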
Now we know how to pick \(k\), the number of variables, and the associated variables, similar to the way we chose \(k\) in \(kNN\). You can split the sample into a training and a test set and see which \(k\) gives the best prediction; you can also use cross-validation.
Each observation in this dataset corresponds to a baseball player. We are interested in predicting the salary of baseball players.
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
## [1] 322 20
## number missing for Salary: 59
## number missing for Salary after drop: 0
## dim after dropping missing:
## [1] 263 20
## (Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits
## 1 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 7 TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
## CHmRun CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors
## 1 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## 4 FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 5 FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 6 FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 7 TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## NewLeagueN
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
## 7 FALSE
## (Intercept) Hits CRBI
## -47.9559022 3.3008446 0.6898994
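The output above is of the kind produced by best-subset selection with the leaps package. Here is a sketch of how it might be generated, assuming the data are the Hitters data from the ISLR package (the exact code used above is not shown):
# Sketch: best-subset selection on the (assumed) ISLR Hitters data with leaps
library(ISLR)    # Hitters data
library(leaps)   # regsubsets()
Hitters <- na.omit(Hitters)                        # drop players with missing Salary
regfit  <- regsubsets(Salary ~ ., data = Hitters, nvmax = 7)
summary(regfit)$which                              # TRUE/FALSE table of selected variables
coef(regfit, 2)                                    # coefficients of the best 2-variable model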
\(R^2\) is a well-known, and somewhat infamous, measure of how much of the variation in the outcome variable, \(y\), is explained by the estimated model:
\[\begin{equation} R^2=\text{cor}(\hat{y},y)^2=\frac{\sum (\hat{y}_i-\bar{y})^2}{\sum (y_i-\bar{y})^2}=1-\frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2} \end{equation}\]
\(R^2\) is a very intuitive concept: how much of the variance in \(y\) is explained by the correlation between \(x\) and \(y\). However, \(R^2\) has several critical problems that we will return to later. One suggested solution is an adjusted version of \(R^2\), known as adjusted \(R^2\):
\[\begin{equation} R_{adj}^2=1-\frac{\sum (y_i-\hat{y}_i)^2/(N-k)}{\sum (y_i-\bar{y})^2/(N-1)} \end{equation}\]
where \(k\) is the number of estimated parameters, including the intercept. Let's return to our RMSE measure:
##
## Call:
## lm(formula = Salary ~ ., data = ddfsub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -873.11 -181.72 -25.91 141.77 2040.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.51180 65.00006 1.408 0.160382
## AtBat -1.86859 0.52742 -3.543 0.000470 ***
## Hits 7.60440 1.66254 4.574 7.46e-06 ***
## Walks 3.69765 1.21036 3.055 0.002488 **
## CRBI 0.64302 0.06443 9.979 < 2e-16 ***
## DivisionW -122.95153 39.82029 -3.088 0.002239 **
## PutOuts 0.26431 0.07477 3.535 0.000484 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 319.9 on 256 degrees of freedom
## Multiple R-squared: 0.5087, Adjusted R-squared: 0.4972
## F-statistic: 44.18 on 6 and 256 DF, p-value: < 2.2e-16
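As a quick check, \(R^2\) and adjusted \(R^2\) can be computed by hand from any fitted model; here is a sketch using the lmf model estimated earlier on the yx data:
# Hand-computed R^2 and adjusted R^2 for lmf; compare with summary(lmf)
yhat <- fitted(lmf)
y    <- yx$y
N    <- length(y)
k    <- length(coef(lmf))    # number of estimated parameters, including the intercept
r2     <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
r2_adj <- 1 - (sum((y - yhat)^2) / (N - k)) / (sum((y - mean(y))^2) / (N - 1))
c(R2 = r2, R2_adj = r2_adj)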
Besides the above approach of keeping some variables out of the model, we can keep our model simple by shrinking the coefficients towards zero. In this case, a coefficient stays large only if it contributes to the predictive power of the model given the available data: large \(\beta\)s are rewarded if they improve the fit of the statistical model and penalized simply for being large. Therefore, the loss function that we aim to minimize is:
\[\begin{equation} \min_{\beta} \sum_{i=1}^n (y_i-\beta_0-\sum_{j=1}^p \beta_j x_{ij})^2+\lambda \sum_{j=1}^p \beta_j^2 \end{equation}\]
This is the ridge regression loss function, and \(\lambda\) is the penalty parameter. A small \(\lambda\) means that coefficients can be large, that is, the model will be complex. On the other hand, a large \(\lambda\) pushes many coefficients towards zero, that is, the model will be simple. Therefore, for any given \(\lambda\), we get a set of coefficients, \(\hat{\beta}_{\lambda}\). Now, the question is how we can choose \(\lambda\). The answer is clear: cross-validation, or any other type of out-of-sample prediction criterion (but preferably cross-validation!). Before solving a real problem in \(R\), it is important to note that \(\lambda\) penalizes all the coefficients equally, so make sure that the \(x\)'s are in the same units and on a similar range. Therefore, do not forget to standardize your \(x\)'s.
Let’s repeat our analysis of Baseball players’ salary.
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "League" "Division"
## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague"
## [1] 263 20
## number missing for Salary: 0
## number missing for Salary after drop: 0
## [1] 263 19
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI"
## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun"
## [11] "CRuns" "CRBI" "CWalks" "LeagueN" "DivisionW"
## [16] "PutOuts" "Assists" "Errors" "NewLeagueN"
## [1] 9.765625e+06 8.034825e+06 6.610781e+06 5.439126e+06 4.475128e+06
## [6] 3.681984e+06 3.029411e+06 2.492497e+06 2.050742e+06 1.687281e+06
## [11] 1.388237e+06 1.142194e+06 9.397588e+05 7.732016e+05 6.361641e+05
## [16] 5.234142e+05 4.306474e+05 3.543221e+05 2.915242e+05 2.398562e+05
## [21] 1.973455e+05 1.623692e+05 1.335919e+05 1.099149e+05 9.043421e+04
## [26] 7.440620e+04 6.121889e+04 5.036883e+04 4.144176e+04 3.409687e+04
## [31] 2.805375e+04 2.308167e+04 1.899082e+04 1.562500e+04 1.285572e+04
## [36] 1.057725e+04 8.702602e+03 7.160205e+03 5.891174e+03 4.847058e+03
## [41] 3.987995e+03 3.281187e+03 2.699649e+03 2.221180e+03 1.827511e+03
## [46] 1.503614e+03 1.237123e+03 1.017862e+03 8.374627e+02 6.890359e+02
## [51] 5.669153e+02 4.664387e+02 3.837699e+02 3.157529e+02 2.597907e+02
## [56] 2.137470e+02 1.758638e+02 1.446947e+02 1.190499e+02 9.795023e+01
## [61] 8.059013e+01 6.630682e+01 5.455500e+01 4.488600e+01 3.693068e+01
## [66] 3.038531e+01 2.500000e+01 2.056915e+01 1.692360e+01 1.392416e+01
## [71] 1.145633e+01 9.425878e+00 7.755293e+00 6.380791e+00 5.249899e+00
## [76] 4.319438e+00 3.553887e+00 2.924018e+00 2.405783e+00 1.979396e+00
## [81] 1.628580e+00 1.339940e+00 1.102457e+00 9.070645e-01 7.463019e-01
## [86] 6.140319e-01 5.052046e-01 4.156652e-01 3.419952e-01 2.813820e-01
## [91] 2.315116e-01 1.904799e-01 1.567204e-01 1.289442e-01 1.060909e-01
## [96] 8.728800e-02 7.181760e-02 5.908909e-02 4.861649e-02 4.000000e-02
## lambda 50: 689.0359
## coefficients:
## (Intercept) AtBat Hits HmRun Runs RBI
## 535.925882 16.482095 29.814164 10.233845 24.035517 21.959591
## Walks Years CAtBat CHits CHmRun CRuns
## 28.827807 12.315813 24.794917 30.459885 27.919028 31.187830
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 31.831127 18.947100 6.943116 -27.682942 33.513426 2.360476
## Errors NewLeagueN
## -4.737679 4.333144
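A ridge path like the one above can be produced with the glmnet package. Here is a sketch, again assuming the ISLR Hitters data; the exact \(\lambda\) grid and preprocessing used above are assumptions:
# Sketch: ridge regression with glmnet (alpha = 0 selects the ridge penalty)
library(glmnet)   # glmnet(), cv.glmnet()
library(ISLR)     # Hitters data (assumed source of the output above)
Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]   # predictors as a numeric matrix
y <- Hitters$Salary
grid  <- 10^seq(7, -2, length = 100)           # an illustrative decreasing lambda grid
ridge <- glmnet(x, y, alpha = 0, lambda = grid, standardize = TRUE)
ridge$lambda[50]                               # the 50th value on the grid
coef(ridge)[, 50]                              # coefficients at that lambda
cv <- cv.glmnet(x, y, alpha = 0)               # pick lambda by cross-validation
coef(cv, s = "lambda.min")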
The Lasso (least absolute shrinkage and selection operator) changes the form of the penalty; the loss function becomes:
\[\begin{equation} \min_{\beta} \sum_{i=1}^n (y_i-\beta_0-\sum_{j=1}^p \beta_j x_{ij})^2+\lambda \sum_{j=1}^p |\beta_j| \end{equation}\]
You might think that the results from the Lasso model are not much different from the Ridge model. However, unlike Ridge, the Lasso can set a \(\beta_j\) exactly to zero, so it performs variable selection as well as shrinkage. Here are the Lasso solutions as \(\lambda\) varies.
## (Intercept) AtBat Hits HmRun Runs RBI
## 535.925882 -289.109678 314.374161 13.916703 -34.239618 0.000000
## Walks Years CAtBat CHits CHmRun CRuns
## 124.971379 -33.041086 -191.538897 0.000000 10.072782 407.092162
## CRBI CWalks LeagueN DivisionW PutOuts Assists
## 201.114077 -199.619610 25.169164 -58.149563 79.100366 44.502170
## Errors NewLeagueN
## -19.688952 -6.704629
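Assuming the same glmnet setup as in the ridge sketch above, only the mixing parameter changes:
# Lasso: same x and y as the ridge sketch, but alpha = 1 selects the L1 penalty
lasso    <- glmnet(x, y, alpha = 1, standardize = TRUE)
cv_lasso <- cv.glmnet(x, y, alpha = 1)
coef(cv_lasso, s = "lambda.min")   # some coefficients are set exactly to zero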
Copyright 2019. This is an in-progress project; please do not cite or reproduce without my permission.