The OLS estimator usually works very well for estimating a linear association between a dependent and an independent variable. However, the world is not necessarily linear. The figure below shows a non-linear association between two variables.
library(latex2exp)  # provides TeX() for the math labels in the legends below
# Simulate a non-linear (quadratic) association between x and y
x=runif(200,min=0,max=10)
e=runif(200,min=0,max=8)
y=x^2+e
plot(x,y, col='maroon', pch=19, main='A non-linear association between x and y')
plot(x,y, col='deepskyblue3', pch=19,
main='Estimating non-linear association between x and y, using
different transformation functions.')
# Fit 1: simple linear model, y ~ x
lm=lm(y~x)
abline(lm$coef, col="navy", lwd=4)
# Fit 2: add a quadratic term, y ~ x + x^2
x2=x*x
lm2=lm(y~x+x2)
xbar=sort(x)
xbar2=xbar*xbar
ybar=lm2$coefficients[1]+xbar*lm2$coefficients[2]+xbar2*lm2$coefficients[3]
lines(y=ybar, x=xbar, pch=19, col="orange", lwd=4)
# Fit 3: add a log term, y ~ x + log10(x)
xlog=log10(x)
lm3=lm(y~x+xlog)
xbar=sort(x)
xbarLog=log10(xbar)
ybar=lm3$coefficients[1]+xbar*lm3$coefficients[2]+xbarLog*lm3$coefficients[3]
lines(y=ybar, x=xbar, pch=19, col="maroon", lwd=4)
legend("topleft", legend=c(TeX('$y=\\beta_0+\\beta_1 x+\\epsilon$'),
TeX('$y=\\beta_0+\\beta_1 x+\\beta_2 x^2+\\epsilon$'),
TeX('$y=\\beta_0+\\beta_1 x+\\beta_2 log(x)+\\epsilon$')),
col=c("navy","orange","maroon"),
lty=1, box.lty=0, cex=1)
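One simple way to compare the three transformations numerically is their adjusted \(R^2\). This is only a rough sketch; it assumes the fitted objects lm, lm2, and lm3 created above are still in memory.
# Adjusted R-squared of the linear, quadratic, and log-augmented fits above
c(linear    = summary(lm)$adj.r.squared,
  quadratic = summary(lm2)$adj.r.squared,
  log       = summary(lm3)$adj.r.squared)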
It is common for political scientists to work on research questions in which the outcome variable is binary or categorical. Do we observe civil war in country \(x\)? Does person \(i\) vote in an election? Will the UN Security Council impose sanctions on country \(x\)?
The simplest case of a categorical outcome is a binary variable with two possible values. Let’s look at the scatter plot below of the association between \(x\) and a binary outcome \(y\). The graph also shows the fitted OLS models for the association between \(x\) and \(y\). What does the graph tell us about the estimated models?
# Simulate a binary outcome by thresholding a noisy linear index of x
x=runif(200,min=0,max=1)
e=runif(200,min=0,max=.3)
y=.5*x+e
y=ifelse(y<.5, 0,1)
plot(x,y, col='deepskyblue3', pch=19, main='Binary outcome and OLS fit.')
# Fit the same three specifications (linear, quadratic, and log) to the binary outcome
lm=lm(y~x)
abline(lm$coef, col="navy", lwd=4)
x2=x*x
lm2=lm(y~x+x2)
xbar=sort(x)
xbar2=xbar*xbar
ybar=lm2$coefficients[1]+xbar*lm2$coefficients[2]+xbar2*lm2$coefficients[3]
lines(y=ybar, x=xbar, pch=19, col="orange", lwd=4)
xlog=log10(x)
lm3=lm(y~x+xlog)
xbar=sort(x)
xbarLog=log10(xbar)
ybar=lm3$coefficients[1]+xbar*lm3$coefficients[2]+xbarLog*lm3$coefficients[3]
lines(y=ybar, x=xbar, pch=19, col="maroon", lwd=4)
legend("topleft", legend=c(TeX('$y=\\beta_0+\\beta_1 x+\\epsilon$'),
TeX('$y=\\beta_0+\\beta_1 x+\\beta_2 x^2+\\epsilon$'),
TeX('$y=\\beta_0+\\beta_1 x+\\beta_2 log(x)+\\epsilon$')),
col=c("navy","orange","maroon"),
lty=1, box.lty=0, cex=1)
Estimating a model with a binary outcome using OLS violates three of the Gauss-Markov assumptions:
In a well-specified OLS model, the residuals-versus-fitted plot should look like a random scatter, because \(E(\epsilon|x)=0\). With a binary outcome, however, the errors are heteroskedastic: \(Var(\epsilon_i)=\sigma_i^2\) varies across observations, so \(Var(\epsilon)\neq \sigma^2 I\).
A formal argument: let \(x_i\in\{0,1\}\) with \(p_i=Pr(x_i=1)\); then \(x_i^2=x_i\). Thus, \(V(x_i)=E(x_i^2)-E(x_i)^2=E(x_i)-E(x_i)^2=p_i-p_i^2=p_i(1-p_i)\). Is \(p_i\) constant across all individuals? Although heteroskedasticity is a violation of the classic OLS assumptions, it can be addressed using robust standard errors.
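A quick numerical sketch of this point (the probabilities below are made up purely for illustration): the variance of a binary variable is \(p_i(1-p_i)\), so unless \(p_i\) is the same for every observation, the error variance cannot be constant.
# Var(Bernoulli(p)) = p(1 - p), so the variance changes with p
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)   # hypothetical success probabilities
p * (1 - p)                        # theoretical variances: clearly not constant
# empirical check by simulation
sapply(p, function(pp) var(rbinom(1e4, size = 1, prob = pp)))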
x=runif(200,min=0,max=1)
e=runif(200,min=0,max=.3)
y=.6*x+e
y=ifelse(y<.5, 0,1)
lm=lm(y~x)
# OLS residuals, paired with the original (unsorted) x values
ybar=lm$coefficients[1]+x*lm$coefficients[2]
u=y-ybar
plot(x,u, col='maroon', pch=19, main=TeX('Non-random association between $\\hat{e}$ and $x$'))
fmat=cbind(x,u)
# Custom panel function for pairs(): plot the points and overlay a reference line
my_line <- function(x,y,...){
  points(x,y,...)
  segments(min(x), min(y), max(x), max(y),...)
}
pairs(fmat,lower.panel = my_line, upper.panel = my_line, col='maroon')
print(cor(fmat))
## x u
## x 1.000000 0.617287
## u 0.617287 1.000000
# Distribution of the residuals, with a kernel density overlay
hist(u, freq = FALSE, col = "lightblue", border = "pink")
dens <- density(u)
lines(dens)
# Formal test of normality of the residuals
shapiro.test(u)
##
## Shapiro-Wilk normality test
##
## data: u
## W = 0.96508, p-value = 7.316e-05
A better approach is to model the probability of the outcome directly, \(p_i\equiv Pr(y_i=1|x_i)=F(x_i'\beta)\), where \(F(\cdot)\) is a specified function. If we assume that \(F(\cdot)\) is the cdf of the logistic distribution, the estimated model is called logit. Similarly, if we assume that \(F(\cdot)\) is the standard normal cdf, then the estimated model is probit. Note that \(F(\cdot)\) is used only to model the parameter \(p\); it is not the cdf of \(y\) itself. Maximum likelihood (ML) estimation is a common method for estimating the parameters of such non-linear models. Before introducing the ML estimator, it is useful to have a brief review of optimization.
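In practice, both models can be estimated in R with the built-in glm() function, which fits them by maximum likelihood. A minimal sketch on freshly simulated data (the simulation here is my own illustration, not the exact draws used above):
# Minimal sketch: logit and probit fits via glm()
set.seed(3)
x <- runif(500)
y <- rbinom(500, size = 1, prob = plogis(-2 + 4*x))   # true model is a logit
logit_fit  <- glm(y ~ x, family = binomial(link = "logit"))
probit_fit <- glm(y ~ x, family = binomial(link = "probit"))
summary(logit_fit)$coefficients
summary(probit_fit)$coefficients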
One of the major uses of calculus in mathematical models is to find and characterize maxima and minima of functions.
A function \(f\) has a local (or relative) maximum at \(x_0\) if \(f(x)\leq f(x_0)\) for all \(x\) in some open interval containing \(x_0\); \(f\) has a global (or absolute) maximum at \(x_0\) if \(f(x)\leq f(x_0)\) for all \(x\) in the domain of \(f\). The function \(f\) has a local (or relative) minimum at \(x_0\) if \(f(x)\geq f(x_0)\) for all \(x\) in some open interval containing \(x_0\); \(f\) has a global (or absolute) minimum at \(x_0\) if \(f(x)\geq f(x_0)\) for all \(x\) in the domain of \(f\). If \(f\) has a local maximum (minimum) at \(x_0\), we will simply say that \(x_0\) is a max (min) of \(f\). If we want to emphasize that \(f\) has a global maximum (minimum) at \(x_0\), we will say that \(x_0\) is a global max (global min) of \(f\).
A max or min of a function can occur either at an endpoint of the domain of \(f\) or at an interior point of the domain.
Theorem: If \(x_0\) is an interior max or min of \(f\), then \(x_0\) is a critical point of \(f\), that is, a point where \(f'(x_0)=0\).
If \(x_0\) is a critical point of a function \(f\), how can we use calculus to decide whether critical point \(x_0\) is a max, a min, or neither? The answer to this question lies in the second derivative of \(f\) at \(x_0\).
Theorem: Suppose \(f'(x_0)=0\). If \(f''(x_0)<0\), then \(x_0\) is a local max of \(f\); if \(f''(x_0)>0\), then \(x_0\) is a local min of \(f\); if \(f''(x_0)=0\), the test is inconclusive.
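A quick numerical illustration of the second-derivative test (the function \(f(x)=x^3-3x\) below is my own example):
# f(x) = x^3 - 3x has critical points where f'(x) = 3x^2 - 3 = 0, i.e. at x = -1 and x = 1
f   <- function(x) x^3 - 3*x
fpp <- function(x) 6*x          # second derivative, f''(x) = 6x
fpp(-1)                         # negative -> local max at x = -1
fpp(1)                          # positive -> local min at x =  1
# Numerical confirmation on a neighbourhood of each critical point
optimize(f, interval = c(-2, 0), maximum = TRUE)$maximum  # approximately -1
optimize(f, interval = c(0, 2))$minimum                   # approximately  1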
In general, it is difficult to find the global max of a function, or even to prove that a given local max is a global max. There are, however, three situations in which this problem is somewhat easier.
Consider the sample \(\{(y_i,x_i), i=1,...,N \}\). The maximum likelihood (ML) estimator maximizes the likelihood function defined below. The likelihood function is the joint density, which, given independent observations, is the product \(\prod_i f(y_i|x_i,\beta)\) of the individual densities, where we have conditioned on the regressors.
By Bayes’ rule, \[\begin{equation} P(\beta|y,X)=\frac{f(y,X|\beta)P(\beta)}{P(y,X)} \end{equation}\] and the likelihood function is \[\begin{equation} \mathcal{L}(\beta|y,x)\equiv \prod_i f(y_i|x_i,\beta) \end{equation}\] The question is how we can estimate the parameters of this model such that the likelihood \(\mathcal{L}(\beta|y,x)\) is maximized.
\[\begin{equation} \max_{\beta} \mathcal{L}(\beta|y,x)= \prod_i f(y_i|x_i,\beta) \end{equation}\] Assume \(y_i\sim N(\mu,\sigma^2)\), so:
\[\begin{equation}\label{eq:norm_dis} f(y_i|\mu,\sigma^2)=(2\pi \sigma^2)^{-\frac{1}{2}}e^{-\frac{(y_i-\mu)^2}{2 \sigma^2}} \end{equation}\] Assuming that \((y_1, y_2,...,y_n)\) are distributed independently, we have:
\[\begin{align}\label{eq:MLE} f(y_1,...,y_n|\mu,\sigma)&= f(y_1|\mu,\sigma) f(y_2|\mu,\sigma)\cdots f(y_n|\mu,\sigma) \nonumber \\ &=\prod_i f(y_i|\mu,\sigma) \end{align}\] Substituting \(f(y_i|\mu,\sigma^2)=(2\pi \sigma^2)^{-\frac{1}{2}}e^{-\frac{(y_i-\mu)^2}{2 \sigma^2}}\) into the function above gives:
\[\begin{align}\label{eq:MLE_normal} \prod_i f(y_i|\mu,\sigma) = \left(\frac{1}{2\pi \sigma^2}\right)^{\frac{n}{2}} exp\left(-\frac{\sum_{i=1}^n (y_i-\mu)^2}{2\sigma^2}\right) \end{align}\] This function is easier to work with if we transform it into a log-likelihood; the transformation does not change the location of the optima, because the logarithm is strictly increasing.
\[\begin{align} log(\mathcal{L}(\mu,\sigma))=-\frac{n}{2}log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2 \end{align}\] We can now find the ML estimates by setting the derivatives of the log-likelihood function to zero, as follows:
F.O.C.:
\[\begin{align} \frac{\partial}{\partial \mu}log(\mathcal{L}(\mu,\sigma))=\frac{n(\bar{y}-\mu)}{\sigma^2}=0 \end{align}\] \[\begin{align} \hat{\mu}=\bar{y}=\sum_{i=1}^n \frac{y_i}{n} \end{align}\] \[\begin{align} \frac{\partial}{\partial \sigma} log\left(\left(\frac{1}{2\pi \sigma^2}\right)^{\frac{n}{2}} exp\left(-\frac{\sum_{i=1}^n (y_i-\bar{y})^2+n(\bar{y}-\mu)^2}{2\sigma^2}\right)\right)=0 \Rightarrow \nonumber \\ \frac{\partial}{\partial \sigma}\left(\frac{n}{2}log\left(\frac{1}{2\pi \sigma^2}\right)-\frac{\sum_{i=1}^n (y_i-\bar{y})^2+n(\bar{y}-\mu)^2}{2\sigma^2}\right)= \nonumber \\ -\frac{n}{\sigma}+\frac{\sum_{i=1}^n (y_i-\bar{y})^2+n(\bar{y}-\mu)^2}{\sigma^3}=0 \end{align}\] \[\begin{align} \hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n (y_i-\hat{\mu})^2 \end{align}\] Suppose we have three data points: 2, 2.5, and 3. Assume that these numbers are drawn from a normal distribution, but that we do not know the parameters of that distribution. The example below shows the calculation of the maximum likelihood.
\[\begin{equation} L(\Theta|X)= \prod_{i=1}^n f(X_i|\Theta) \end{equation}\] Let’s calculate the likelihood under \(N(\mu=2,\sigma^2=1)\):
\[\begin{align} L(\mu=2,\sigma^2=1|X=2,2.5,3)&=f(x=2|2,1) \times f(x=2.5|2,1) \times f(x=3|2,1) \\ L(\mu=2,\sigma^2=1|X=2,2.5,3)&=\frac{1}{\sqrt{2 (\pi) (1)}}e^{-\frac{(2-2)^2}{(2)(1)}} \times \frac{1}{\sqrt{2 (\pi) (1)}}e^{-\frac{(2.5-2)^2}{(2)(1)}} \times \frac{1}{\sqrt{2 (\pi) (1)}}e^{-\frac{(3-2)^2}{(2)(1)}} \\ L(\mu=2,\sigma^2=1|X=2,2.5,3)&=.39 \times .35 \times .24=.0328 \end{align}\] The likelihood of observing these numbers when the distribution is \(N(\mu=4,\sigma^2=2)\) is:
\[\begin{align} L(\mu=4,\sigma^2=2|X=2,2.5,3)&=f(x=2|4,2) \times f(x=2.5|4,2) \times f(x=3|4,2) \\ L(\mu=4,\sigma^2=2|X=2,2.5,3)&=\frac{1}{\sqrt{2 (\pi) (2)}}e^{-\frac{(2-4)^2}{(2)(2)}} \times \frac{1}{\sqrt{2 (\pi) (2)}}e^{-\frac{(2.5-4)^2}{(2)(2)}} \times \frac{1}{\sqrt{2 (\pi) (2)}}e^{-\frac{(3-4)^2}{(2)(2)}} \\ L(\mu=4,\sigma^2=2|X=2,2.5,3)&=.1 \times .16 \times .21=.0034 \end{align}\] So, a normal distribution with \(\mu=2\) and \(\sigma^2=1\) is a better fit to the data.
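These two likelihood values can be verified in R with dnorm(); any small differences from the hand calculation come from rounding the individual densities above:
obs <- c(2, 2.5, 3)
prod(dnorm(obs, mean = 2, sd = 1))        # likelihood under N(mu = 2, sigma^2 = 1)
prod(dnorm(obs, mean = 4, sd = sqrt(2)))  # likelihood under N(mu = 4, sigma^2 = 2)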
Now, how can we find the values of \(\mu\) and \(\sigma^2\) that best fit the data?
\[\begin{align} L(\mu,\sigma^2|X=2,2.5,3)&=\prod_{i=1}^3 f(x_i|\mu,\sigma^2) \\ &=\frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(2-\mu)^2}{2\sigma^2}} \times \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(2.5-\mu)^2}{2\sigma^2}} \times \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(3-\mu)^2}{2\sigma^2}} \\ &=\left(\frac{1}{\sqrt{2 \pi \sigma^2}}\right)^3 e^{-\frac{(2-\mu)^2}{2\sigma^2}} \times e^{-\frac{(2.5-\mu)^2}{2\sigma^2}} \times e^{-\frac{(3-\mu)^2}{2\sigma^2}} \end{align}\] Now, we take the natural logarithm of both sides of the equation:
\[\begin{align} ln(L(\mu,\sigma^2|X=2,2.5,3))&=ln\left(\frac{1}{\sqrt{2 \pi \sigma^2}}\right)^3+ln(e^{-\frac{(2-\mu)^2}{2\sigma^2}} )+ ln(e^{-\frac{(2.5-\mu)^2}{2\sigma^2}})+ln(e^{-\frac{(3-\mu)^2}{2\sigma^2}}) \\ &=3ln(1)-3ln(\sqrt{2\pi \sigma^2})-\frac{(2-\mu)^2}{2\sigma^2}-\frac{(2.5-\mu)^2}{2\sigma^2}-\frac{(3-\mu)^2}{2\sigma^2} \\ &=-3ln(\sqrt{2\pi \sigma^2})-\frac{1}{2\sigma^2}(19.25-15\mu+3\mu^2) \end{align}\] Now we take the derivative of this log-likelihood function with respect to \(\mu\) and \(\sigma^2\):
\[\begin{align} \frac{\partial ln(L(\mu,\sigma^2|X))}{\partial \mu}&=-\frac{1}{2\sigma^2}(-15+6\mu)=0\\ \frac{\partial ln(L(\mu,\sigma^2|X))}{\partial \sigma^2}&=0 \end{align}\] Thus, \(\hat{\mu}=2.5\). What is \(\hat{\sigma}^2\)?
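Applying the closed-form MLEs derived earlier, \(\hat{\mu}=\bar{y}\) and \(\hat{\sigma}^2=\frac{1}{n}\sum_i(y_i-\hat{\mu})^2\), to these three points gives a quick numerical answer:
obs <- c(2, 2.5, 3)
mu_hat     <- mean(obs)               # 2.5
sigma2_hat <- mean((obs - mu_hat)^2)  # 1/6, approximately 0.167
c(mu_hat, sigma2_hat)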
We start with a simple example so that we can cross-check the result. Suppose the observations \(X_1, X_2, \dots, X_n\) are from a \(N(\mu,\sigma^2)\) distribution. As I showed above, the log-likelihood function is
\[\begin{align} \sum_i \left\{ -\frac{(X_i-\mu)^2}{2\sigma^2}-\frac{1}{2}log(2\pi) -\frac{1}{2}log(\sigma^2) \right\} \end{align}\]
X=c(2,5,3,7,10,6,8) # some arbitrary observations
# Now, define the negative of the log-likelihood function
# (the constant -0.5*log(2*pi) per observation is dropped, since it does not
#  affect the optimum; theta[1] is mu and theta[2] is sigma^2)
fn <- function(theta) {
  sum( 0.5*(X - theta[1])^2/theta[2] + 0.5*log(theta[2]) )
}
Here are two ways to maximize the likelihood in R, by numerically minimizing the negative log-likelihood:
nlm(fn, p = c(0,1), hessian=TRUE)
## $minimum
## [1] 10.15418
##
## $estimate
## [1] 5.857140 6.693874
##
## $gradient
## [1] 2.820509e-08 -6.634263e-09
##
## $hessian
## [,1] [,2]
## [1,] 1.045732e+00 -4.529359e-05
## [2,] -4.529359e-05 7.807998e-02
##
## $code
## [1] 1
##
## $iterations
## [1] 28
optim(par = c(0,1), fn, hessian=TRUE)
## $par
## [1] 5.854323 6.688254
##
## $value
## [1] 10.15418
##
## $counts
## function gradient
## 89 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2]
## [1,] 1.0466109215 0.0004412741
## [2,] 0.0004412741 0.0783742222
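As a cross-check, the closed-form MLEs \(\hat{\mu}=\bar{X}\) and \(\hat{\sigma}^2=\frac{1}{n}\sum_i(X_i-\bar{X})^2\) can be computed directly (assuming the vector X defined above is still in memory); they agree with the nlm() and optim() estimates of roughly 5.857 and 6.694.
c(mu_hat = mean(X), sigma2_hat = mean((X - mean(X))^2))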
For continuous outcomes and linear models, we used the root mean square error (RMSE) to evaluate the performance of our statistical models. We also used cross-validation techniques to achieve a balance between variance and bias. For example, we learned how to use cross-validation to pick the optimal \(k\) in the \(kNN\) algorithm. How can we achieve this when we estimate a statistical model using MLE? The AIC and BIC are widely used tools for this purpose.
The BIC is the Bayesian information criterion.
The BIC is an approximation to the Bayesian approach to model selection, which gives you an automatic penalty for complex models (ESL, section 7.7).
Assume that we have a set of candidate models: \(\mathcal{M}=\{\mathcal{M}_1, \mathcal{M}_2, \dots, \mathcal{M}_M\}\).
Model \(\mathcal{M}_m\) has an associated parameter vector \(\theta_m\) and a likelihood \(p(Z|\theta_m,\mathcal{M}_m)\), which represents the probability of the data \(Z\) under model \(\mathcal{M}_m\).
Let \(\hat{\theta}_m\) be the MLE of parameters under model \(\mathcal{M}_m\):
\[\begin{align} \hat{L}_m=p(Z|\hat{\theta}_m, \mathcal{M}_m) \end{align}\] where \(\hat{L}_m\) is the maximized likelihood under model \(\mathcal{M}_m\).
Then, the deviance is
\[\begin{align} D_m=-2log(\hat{L}_m ) \end{align}\]and the BIC is
\[\begin{align} BIC_m=D_m+log(n)d_m \end{align}\]where \(d_m\) is the dimension of \(\theta_m\) and n is the sample size.
The deviance, the first term of the BIC equation, measures in-sample fit; a smaller deviance indicates a better fit. The term \(log(n)d\) is a “complexity penalty”: each variable added to the model increases the number of estimated parameters by one, and each parameter is weighted by \(log(n)\).
As you add parameters, the deviance goes down but the complexity penalty goes up, giving you a U-shaped curve. Again, the bias-variance trade-off!
The AIC is “an information criterion” or “the Akaike information criterion”.
\[\begin{align} AIC=D+2d=-2log(\hat{L}_m )+2d \end{align}\]Choose the model with the smallest AIC.
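In R, both criteria can be obtained from any fitted model that has a logLik() method, either by hand from the deviance-plus-penalty formulas above or with the built-in AIC() and BIC() functions. A minimal sketch on simulated data (the two logit models below are hypothetical candidates, not models from the text):
# Compare two candidate logit models with AIC and BIC
set.seed(1)
x  <- runif(200)
y  <- rbinom(200, size = 1, prob = plogis(-1 + 2*x))
m1 <- glm(y ~ x,          family = binomial)
m2 <- glm(y ~ x + I(x^2), family = binomial)

n <- length(y)
D <- -2 * as.numeric(logLik(m1))        # deviance, D = -2 log L-hat
d <- attr(logLik(m1), "df")             # number of estimated parameters
c(manual_BIC = D + log(n)*d, builtin_BIC = BIC(m1))  # identical by construction
AIC(m1, m2)   # smaller is better
BIC(m1, m2)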
You can re-write the BIC formula for a linear regression:
If \(RSS=\sum_{i=1}^N e_i^2\), then
\[\begin{align} BIC= n\,log(RSS/n)+log(n)\,d \end{align}\] up to an additive constant that is the same for every model fit to the same data.
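A minimal sketch of this RSS-based version for two nested linear models (simulated data; since the omitted constant is identical across models, only differences in this quantity matter for ranking):
# RSS-based BIC (up to a model-independent constant) for two nested linear fits
rss_bic <- function(fit) {
  n   <- length(residuals(fit))
  rss <- sum(residuals(fit)^2)
  d   <- length(coef(fit))      # number of estimated regression coefficients
  n*log(rss/n) + log(n)*d
}
set.seed(2)
x  <- runif(100, 0, 10)
y  <- x^2 + rnorm(100, sd = 4)
f1 <- lm(y ~ x)
f2 <- lm(y ~ x + I(x^2))
c(linear = rss_bic(f1), quadratic = rss_bic(f2))   # smaller is better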