In the social sciences, we are often interested in studying the variation in an outcome variable, \(Y\) (also known as the dependent variable), conditional on a set of independent variables, \(X\) (also known as predictors or features). This is usually written formally as:
\[\begin{equation} P(Y=y|X=x) \end{equation}\]One very common strategy for modeling this conditional probability is to make some assumptions about the function, \(f(.)\), that maps \(X\) to \(Y\), and then estimate its parameters. Ordinary Least Squares (OLS) is one of the oldest and most important tools for modeling the association between two variables. Under the classic assumptions of linear regression, the OLS estimator is the Best Linear Unbiased Estimator (BLUE). However, there are situations in which these assumptions do not hold. We will return to addressing the violations of some of these assumptions in the following weeks. Here, we study the \(k\)-NN algorithm as a model-free method, which imposes fewer assumptions on the prediction problem. Although \(k\)-NN does not offer useful information about the quality and quantity of the association between the outcome variable and the predictors, it is often an effective prediction algorithm.
The Boston Housing dataset contains information on housing in the Boston, Massachusetts area in the US. Assume that we want to predict the median value of a house based on the percent of lower-status population in the sub-area. The scatter plot below shows how these two variables are associated:
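A minimal sketch of how such a scatter plot could be drawn, assuming the `Boston` data that ships with the `MASS` package (where the two variables are named `lstat` and `medv`):

```r
# Load the Boston Housing data that ships with the MASS package
library(MASS)
data(Boston)

# Scatter plot of median house value against the percent of lower-status population
plot(Boston$lstat, Boston$medv,
     pch = 19, col = "grey40",
     xlab = "Percent of lower-status population (lstat)",
     ylab = "Median house value in $1000s (medv)")
```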
We need a function to estimate how the housing value is associated with the percent of lower-status population in the neighborhood. A linear function can be one of our first candidates. That is, we assume that \(Y_i=\alpha+\beta X_i+\epsilon_i\). After estimating the parameters of this model, i.e. \(\hat{\alpha}\) and \(\hat{\beta}\), the OLS predicted values are \(\hat{Y}=\hat \alpha +\hat \beta X\). Not surprisingly, since we assumed that the association between \(X\) and \(Y\) is linear, the line fitted to the data is linear (the blue line in the plot below).
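As a sketch, the OLS fit and the fitted line could be produced as follows (still assuming the `MASS::Boston` data; the object name `ols_fit` is mine):

```r
# OLS fit of medv on lstat
ols_fit <- lm(medv ~ lstat, data = Boston)
coef(ols_fit)        # alpha-hat (intercept) and beta-hat (slope)

# Scatter plot with the fitted (blue) linear prediction line
plot(Boston$lstat, Boston$medv, pch = 19, col = "grey40",
     xlab = "lstat", ylab = "medv")
abline(ols_fit, col = "blue", lwd = 2)
```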
To predict using OLS, we make several assumptions about the form of \(Y=f(X)\). The same applies to non-linear models such as Logit and Probit. However, there are non-parametric approaches that impose fewer assumptions on the form of \(f(X)\). \(k\)-NN is one of the most commonly used non-parametric prediction algorithms. Before discussing this method, take a look at how this algorithm fits the function \(f(.)\) to the data:
## kNN15: the median value of a house in a neighborhood with 20% low status residents is 14.37
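The chunk that produced the output above is not shown; below is a sketch of how a comparable prediction could be computed with the `FNN` package (the package choice and object names are my assumptions):

```r
library(FNN)   # knn.reg() performs k-NN regression

# k-NN with k = 15, using lstat as the single feature,
# evaluated at a neighborhood with 20% lower-status population
knn15 <- knn.reg(train = Boston["lstat"],
                 test  = data.frame(lstat = 20),
                 y     = Boston$medv,
                 k     = 15)
knn15$pred   # predicted median house value at lstat = 20
```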
We first need to define a training dataset based on our main sample. The training set for a \(k\)-NN algorithm consists of the \(n\) pairs of the feature (\(x\)) and the outcome (\(y\)), as follows:
\[\begin{equation} \{(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n) \} \end{equation}\]We start with a one-feature model, but later in this handout we will extend \(k\)-NN to cases with more than one feature, i.e. \(p>1\).
\(k\)-NN calculates \(f(x_i)\) by taking the average of outcome values for the \(k\) nearest \(x\)’s to \(x_i\).
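In symbols, writing \(N_k(x_i)\) for the set of the \(k\) training points nearest to \(x_i\) (this neighborhood notation is introduced here for convenience):

\[\begin{equation} \hat{f}(x_i)=\frac{1}{k}\sum_{x_j \in N_k(x_i)} y_j \end{equation}\]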
We have learned how \(k\)-NN works, but there is an issue here: \(k\) is chosen at the analyst's discretion, and for different values of \(k\) we can get (significantly) different results.
If we choose a large \(k\), then we get a simple line; for \(k=500\), the line is almost the mean of the median house price. On the other hand, choosing a small \(k\) gives a volatile estimate that captures the complexity of the data very well, yet is very sensitive to any small change in the data. Can we find an intermediate value for \(k\)? This is roughly the idea behind using cross-validation to handle the Bias-Variance Trade-Off. To discuss this idea in more detail, we need to learn how to measure the prediction accuracy of an estimated function.
For any given feature (\(x\in X\)), or set of features, we can find the predicted value of the outcome variable, \(\hat{y}\), using the estimated function \(\hat{f}(x_i)\). Given this information, we can find the prediction error for each observation as \(y_i-\hat{y}_i\). The prediction error of a statistical model can be summarized using different indices based on these individual errors. One of the most common is the Root Mean Squared Error (RMSE).
\[\begin{equation} RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^n [y_i-\hat{f}(x_i)]^2} \end{equation}\]RMSE is the square root of the average squared prediction error, so it measures the typical size of the prediction errors.
Now, let's calculate the RMSE of predicting the median house value in the Boston Housing data when we use the \(k\)-NN model at different values of \(k\).
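A sketch of how these RMSEs might be computed on a simple train/test split (the 80/20 split, the seed, the grid of \(k\) values, and the use of `FNN::knn.reg` are my assumptions, not the author's exact setup):

```r
set.seed(123)
train_idx <- sample(nrow(Boston), size = floor(0.8 * nrow(Boston)))
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Test-sample RMSE of the one-feature k-NN model for several values of k
ks <- c(1, 5, 10, 25, 50, 100, 250)
sapply(ks, function(k) {
  pred <- FNN::knn.reg(train = train["lstat"], test = test["lstat"],
                       y = train$medv, k = k)$pred
  rmse(test$medv, pred)
})
```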
The figure below presents one of the core ideas of statistical learning. When the complexity of the model is low, the trained model has low variance but suffers from high bias, which leads to a high prediction error. As the complexity of the model increases, the error on the training sample decreases. However, since a highly complex model is trained to capture the idiosyncrasies of the training sample, its predictive power on the test sample decreases, leading to a higher prediction error.
We can model the association between \(X\) and \(Y\) as follows:
\[\begin{equation} Y= \underbrace{f(X)}_\text{Signal}+\underbrace{\epsilon}_\text{Noise} \end{equation}\]\(\epsilon\) is the shock component of the process, also known as noise. The goal of statistical modeling is to capture the changes caused by \(f(X)\), since the shock/noise component is a random process. Thus, we do not want to mistakenly attribute changes caused by the noise part of the model to the signal. If we fit a highly complex model that captures/models all changes in the outcome, the risk of estimating a noisy model increases.
Test and training error as a function of model complexity. (Copied from The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman)
Estimating a simple model to explain a complex process leads to a biased estimate of the function, though this estimate has a low level of variance. On the other hand, a complex model is sensitive to small changes in the data, which leads to high variance, though the bias is low.
Dividing the sample into a training sample and a test sample allows us to evaluate the performance of a predictive model. However, there are a few challenges with this approach:
How can we be sure that the prediction results are independent of the particular training set we drew? How should we draw the samples to address this concern?
Setting aside part of the dataset to validate the trained model means that we forgo the information contained in the validation sub-sample. Because this information is not used to train the model, we might end up with a higher level of error.
The solution suggested in the statistical learning literature to address these issues is Cross-Validation.
In this approach, we divide the sample into training and test samples. Assuming that the size of the training sample is \(n\), we leave one observation out, say the \(i^{th}\) observation, and train the model using the \(n-1\) remaining observations. Then, we predict the outcome of the \(i^{th}\) observation and compute the prediction error: \(RMSE_i=\sqrt{(Y_i-\hat{Y}_i)^2}\). We repeat this procedure for all \(n\) observations and compute the mean Cross-Validation RMSE:
\[\begin{equation} RMSE_{LOCV}=\frac{1}{n}\sum_{i=1}^n RMSE_i \end{equation}\]Leave-One-Out Cross-Validation is computationally intensive, since we need to repeat the training and evaluation \(n\) times. An alternative approach is to increase the number of observations that are left out at each stage. To do so, we can divide the sample randomly into \(k\) groups of roughly equal size; then we leave one of these groups out for validation and use the remaining \(k-1\) groups to train the model. Using this trained model, we predict the outcome for the observations in the validation subsample and calculate the prediction error for each of those observations. We iterate the training and validation steps over each group (\(k\) times). The figure below shows a schematic view of \(k\)-fold Cross-Validation (KCV).
The photo is borrowed from: https://www.researchgate.net/figure/K-fold-cross-validation-E-is-the-overall-error-estimate_fig3_320270458
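Here is a sketch of the \(k\)-fold procedure just described, written out by hand for the one-feature \(k\)-NN model (10 folds, a fixed seed, and \(k=15\) neighbours are illustrative assumptions):

```r
set.seed(123)
n     <- nrow(Boston)
folds <- sample(rep(1:10, length.out = n))   # randomly assign each observation to one of 10 folds

fold_rmse <- sapply(1:10, function(f) {
  train_f <- Boston[folds != f, ]   # the remaining folds are used for training
  valid_f <- Boston[folds == f, ]   # the left-out fold is used for validation
  pred <- FNN::knn.reg(train = train_f["lstat"], test = valid_f["lstat"],
                       y = train_f$medv, k = 15)$pred
  sqrt(mean((valid_f$medv - pred)^2))
})

mean(fold_rmse)   # average of the per-fold RMSEs for k = 15 neighbours
```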
Considering the calculated error terms for each observation, we can compute \(RMSE^{KCV}\) as follows:
\[\begin{equation} RMSE^{KCV}=\sqrt{\frac{1}{n}\sum_{i=1}^n(Y_i-\hat{Y}_i)^2} \end{equation}\]
## Lets use the average of 5-, 10-, and 15-fold cross-validation RMSEs
## The best K is: 34
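The author's chunk is hidden; one common way to automate such a search over \(k\) is the `caret` package. The sketch below runs a single 10-fold cross-validation over a grid of \(k\) values (the grid and the seed are assumptions, so the selected \(k\) need not match the value reported above):

```r
library(caret)

set.seed(123)
knn_cv <- train(medv ~ lstat, data = Boston,
                method    = "knn",
                trControl = trainControl(method = "cv", number = 10),
                tuneGrid  = data.frame(k = 1:100))

knn_cv$bestTune   # the number of neighbours with the lowest cross-validated RMSE
plot(knn_cv)      # cross-validated RMSE as a function of k
```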
In the previous sections, we studied the \(kNN\) prediction algorithm for the case where there is only one feature. Finding the \(k\) nearest points for a problem with one feature/independent variable is straightforward, but how does the algorithm change if we want to use more than one feature/independent variable?
The idea of \(kNN\) is to predict based on the most similar, i.e. closest, observations in the training subsample. When \(p>1\), the problem is finding the predicted value of \(Y_f\) given a set of features/covariates \(X_f=(x_{f1},x_{f2},\dots,x_{fp})\).
\(kNN\) works as follows:
To predict \(\hat{Y}_f\) using \(X_f=(x_{f1},x_{f2},\dots,x_{fp})\), we define a neighborhood around \(X_f\) that includes the \(k\) nearest \(X_i\)'s to \(X_f\) in the training dataset; then, we calculate the average of the \(y_i\)'s corresponding to these \(k\) smallest distances \(d_{if}\).
There are different distance metrics. Here, we use one of the most well-known, the Euclidean distance:
\[\begin{equation} d_{if}=\sqrt{\sum_{j=1}^p (x_{fj}-x_{ij})^2}, \quad X_i~\text{in the training data} \end{equation}\]The second issue that we should address when employing \(kNN\) for problems with more than one feature is rescaling and normalizing the features.
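As a quick illustration of the distance computation, here is a sketch that finds the \(k\) nearest training points to a query point \(X_f\) in two dimensions (the query values, the two features, and \(k=5\) are illustrative assumptions):

```r
# A hypothetical query point with lstat = 20 and dis = 3
x_f <- c(lstat = 20, dis = 3)

# Euclidean distance from every training observation to the query point
d_if <- sqrt((Boston$lstat - x_f["lstat"])^2 + (Boston$dis - x_f["dis"])^2)

# k-NN prediction: average medv over the k = 5 nearest observations
k <- 5
nearest <- order(d_if)[1:k]
mean(Boston$medv[nearest])
```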
Let's take a look at the scatter plots below of the median house price versus the percentage of lower-status population, lstat, and the weighted distance to five Boston employment centers, dis.
As you can see in the plots above, the unit of lstat is different from the unit of dis. This difference between the units of the features affects how we find the nearest points around \(X_f\). One common solution is rescaling/normalizing the features before applying the \(kNN\) algorithm. There are two common rescaling approaches:
This min–max rescaling formula, \(x^{new}=\frac{x-\min(x)}{\max(x)-\min(x)}\), limits the \(X\)'s to values between \(0\) and \(1\).
Let’s see how this rescaling changes the correlation between dis and medv:
As you can see in the above graphs, rescaling features does not change the association between the features and the outcome variable.
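A sketch of the min–max rescaling and a check that it leaves the correlation with `medv` unchanged (the helper name `rescale01` is mine):

```r
# Min-max rescaling: maps a variable onto the [0, 1] interval
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

dis_scaled <- rescale01(Boston$dis)
range(dis_scaled)   # 0 and 1 after rescaling

# The correlation with the outcome is unchanged by this linear transformation
cor(Boston$dis, Boston$medv)
cor(dis_scaled, Boston$medv)
```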
Now, let's predict the median price of houses in the Boston data using several features. The features that I picked to predict the housing price, medv, are crim, indus, age, dis, black, and lstat. Before estimating the prediction model, it can be helpful to draw scatter plots showing the association between the outcome and each of the features.
## The min value of RMSE for 5-fold cross-validation is associated with k= 7
## The min value of RMSE for 10-fold cross-validation is associated with k= 8
## The RMSE for 5-fold kNN is 4.01538
## The RMSE for 10-fold kNN is 4.040219
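The chunk that produced the output above is not shown; below is a sketch of how the multi-feature model and its cross-validated RMSE could be set up with `caret` (the preprocessing choice, tuning grid, and seed are my assumptions):

```r
library(caret)

features <- c("crim", "indus", "age", "dis", "black", "lstat")

set.seed(123)
knn_multi <- train(x = Boston[, features], y = Boston$medv,
                   method     = "knn",
                   preProcess = "range",   # min-max rescaling of the features
                   trControl  = trainControl(method = "cv", number = 10),
                   tuneGrid   = data.frame(k = 1:30))

knn_multi$bestTune              # neighbourhood size with the lowest 10-fold CV RMSE
min(knn_multi$results$RMSE)     # the corresponding cross-validated RMSE
```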
## y linear kNN5 kNN10
## y 1.0000000 0.7715306 0.9015274 0.9006329
## linear 0.7715306 1.0000000 0.8534194 0.8559228
## kNN5 0.9015274 0.8534194 1.0000000 0.9962098
## kNN10 0.9006329 0.8559228 0.9962098 1.0000000
Copyright 2019. This is an in-progress project; please do not cite or reproduce without my permission.