The OLS estimator usually works very well for estimating a linear association between a dependent and an independent variable. However, the world is not necessarily linear. The figure below shows a non-linear association between two variables.
library(latex2exp)  # provides TeX() for the math labels in the legends below
# Simulate a non-linear (quadratic) association between x and y
x=runif(200,min=0,max=10)
e=runif(200,min=0,max=8)
y=x^2+e
plot(x,y, col='maroon', pch=19, main='A non-linear association between x and y')
plot(x,y, col='deepskyblue3', pch=19,
main='Estimating non-linear association between x and y, using
different transformation functions.')
# Fit 1: simple linear model, y ~ x
lm=lm(y~x)
abline(lm$coef, col="navy", lwd=4)
# Fit 2: add a quadratic term, y ~ x + x^2
x2=x*x
lm2=lm(y~x+x2)
xbar=sort(x)
xbar2=xbar*xbar
ybar=lm2$coefficients[1]+xbar*lm2$coefficients[2]+xbar2*lm2$coefficients[3]
lines(y=ybar, x=xbar, pch=19, col="orange", lwd=4)
# Fit 3: add a log term, y ~ x + log10(x)
xlog=log10(x)
lm3=lm(y~x+xlog)
xbar=sort(x)
xbarLog=log10(xbar)
ybar=lm3$coefficients[1]+xbar*lm3$coefficients[2]+xbarLog*lm3$coefficients[3]
lines(y=ybar, x=xbar, pch=19, col="maroon", lwd=4)
legend("topleft", legend=c(TeX('$y=\\beta_0+\\beta_1 x+\\epsilon$'),
TeX('$y=\\beta_0+\\beta_1 x+\\beta_2 x^2+\\epsilon$'),
TeX('$y=\\beta_0+\\beta_1 x+\\beta_2 log(x)+\\epsilon$')),
col=c("navy","orange","maroon"),
lty=1, box.lty=0, cex=1)
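One simple way to compare the three transformations numerically is their adjusted \(R^2\). This is only a rough sketch; it assumes the fitted objects lm, lm2, and lm3 created above are still in memory.
# Adjusted R-squared of the linear, quadratic, and log-augmented fits above
c(linear    = summary(lm)$adj.r.squared,
  quadratic = summary(lm2)$adj.r.squared,
  log       = summary(lm3)$adj.r.squared)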
It is common for political scientists to work on research questions in which the outcome variable is binary or categorical. Do we observe civil war in country \(x\)? Does person \(i\) vote in an election? Will the UN Security Council impose sanctions on country \(x\)?
The simplest case of a categorical outcome is a binary variable with two possible values. Let’s look at the scatter plot below of the association between \(x\) and a binary outcome \(y\). The graph also shows the fitted OLS models for the association between \(x\) and \(y\). What does the graph tell us about the estimated models?
# Simulate a binary outcome by thresholding a noisy linear index of x
x=runif(200,min=0,max=1)
e=runif(200,min=0,max=.3)
y=.5*x+e
y=ifelse(y<.5, 0,1)
plot(x,y, col='deepskyblue3', pch=19, main='Binary outcome and OLS fit.')
# Fit the same three specifications (linear, quadratic, and log) to the binary outcome
lm=lm(y~x)
abline(lm$coef, col="navy", lwd=4)
x2=x*x
lm2=lm(y~x+x2)
xbar=sort(x)
xbar2=xbar*xbar
ybar=lm2$coefficients[1]+xbar*lm2$coefficients[2]+xbar2*lm2$coefficients[3]
lines(y=ybar, x=xbar, pch=19, col="orange", lwd=4)
xlog=log10(x)
lm3=lm(y~x+xlog)
xbar=sort(x)
xbarLog=log10(xbar)
ybar=lm3$coefficients[1]+xbar*lm3$coefficients[2]+xbarLog*lm3$coefficients[3]
lines(y=ybar, x=xbar, pch=19, col="maroon", lwd=4)
legend("topleft", legend=c(TeX('$y=\\beta_0+\\beta_1 x+\\epsilon$'),
TeX('$y=\\beta_0+\\beta_1 x+\\beta_2 x^2+\\epsilon$'),
TeX('$y=\\beta_0+\\beta_1 x+\\beta_2 log(x)+\\epsilon$')),
col=c("navy","orange","maroon"),
lty=1, box.lty=0, cex=1)
Estimating a model with a binary outcome using OLS violates three of the Gauss-Markov assumptions:
In a well-specified OLS model, the residuals-versus-fitted plot should look like a random scatter, because \(E(\epsilon|x)=0\). With a binary outcome, however, the errors are heteroskedastic: \(Var(\epsilon_i)=\sigma_i^2\) varies across observations, so \(Var(\epsilon)\neq \sigma^2 I\).
A formal argument: let \(x_i\in\{0,1\}\) with \(p_i=Pr(x_i=1)\); then \(x_i^2=x_i\). Thus, \(V(x_i)=E(x_i^2)-E(x_i)^2=E(x_i)-E(x_i)^2=p_i-p_i^2=p_i(1-p_i)\). Is \(p_i\) constant across all individuals? Although heteroskedasticity is a violation of the classic OLS assumptions, it can be addressed using robust standard errors.
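A quick numerical sketch of this point (the probabilities below are made up purely for illustration): the variance of a binary variable is \(p_i(1-p_i)\), so unless \(p_i\) is the same for every observation, the error variance cannot be constant.
# Var(Bernoulli(p)) = p(1 - p), so the variance changes with p
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)   # hypothetical success probabilities
p * (1 - p)                        # theoretical variances: clearly not constant
# empirical check by simulation
sapply(p, function(pp) var(rbinom(1e4, size = 1, prob = pp)))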
x=runif(200,min=0,max=1)
e=runif(200,min=0,max=.3)
y=.6*x+e
y=ifelse(y<.5, 0,1)
lm=lm(y~x)
# OLS residuals, paired with the original (unsorted) x values
ybar=lm$coefficients[1]+x*lm$coefficients[2]
u=y-ybar
plot(x,u, col='maroon', pch=19, main=TeX('Non-random association between $\\hat{e}$ and $x$'))
fmat=cbind(x,u)
# Custom panel function for pairs(): plot the points and overlay a reference line
my_line <- function(x,y,...){
  points(x,y,...)
  segments(min(x), min(y), max(x), max(y),...)
}
pairs(fmat,lower.panel = my_line, upper.panel = my_line, col='maroon')
print(cor(fmat))
## x u
## x 1.000000 0.617287
## u 0.617287 1.000000
# Distribution of the residuals, with a kernel density overlay
hist(u, freq = FALSE, col = "lightblue", border = "pink")
dens <- density(u)
lines(dens)
# Formal test of normality of the residuals
shapiro.test(u)
##
## Shapiro-Wilk normality test
##
## data: u
## W = 0.96508, p-value = 7.316e-05
A better approach is to model the probability of the outcome directly, \(p_i\equiv Pr(y_i=1|x_i)=F(x_i'\beta)\), where \(F(\cdot)\) is a specified function. If we assume that \(F(\cdot)\) is the cdf of the logistic distribution, the estimated model is called logit. Similarly, if we assume that \(F(\cdot)\) is the standard normal cdf, then the estimated model is probit. Note that \(F(\cdot)\) is used only to model the parameter \(p\); it is not the cdf of \(y\) itself. Maximum likelihood (ML) estimation is a common method for estimating the parameters of such non-linear models. Before introducing the ML estimator, it is useful to have a brief review of optimization.
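In practice, both models can be estimated in R with the built-in glm() function, which fits them by maximum likelihood. A minimal sketch on freshly simulated data (the simulation here is my own illustration, not the exact draws used above):
# Minimal sketch: logit and probit fits via glm()
set.seed(3)
x <- runif(500)
y <- rbinom(500, size = 1, prob = plogis(-2 + 4*x))   # true model is a logit
logit_fit  <- glm(y ~ x, family = binomial(link = "logit"))
probit_fit <- glm(y ~ x, family = binomial(link = "probit"))
summary(logit_fit)$coefficients
summary(probit_fit)$coefficients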
One of the major uses of calculus in mathematical models is to find and characterize maxima and minima of functions.
A function \(f\) has a local (or relative) maximum at \(x_0\) if \(f(x)\leq f(x_0)\) for all \(x\) in some open interval containing \(x_0\); \(f\) has a global (or absolute) maximum at \(x_0\) if \(f(x)\leq f(x_0)\) for all \(x\) in the domain of \(f\). The function \(f\) has a local (or relative) minimum at \(x_0\) if \(f(x)\geq f(x_0)\) for all \(x\) in some open interval containing \(x_0\); \(f\) has a global (or absolute) minimum at \(x_0\) if \(f(x)\geq f(x_0)\) for all \(x\) in the domain of \(f\). If \(f\) has a local maximum (minimum) at \(x_0\), we will simply say that \(x_0\) is a max (min) of \(f\). If we want to emphasize that \(f\) has a global maximum (minimum) at \(x_0\), we will say that \(x_0\) is a global max (global min) of \(f\).
A max or min of a function can occur either at an endpoint of the domain of \(f\) or at an interior point of the domain.
Theorem: If \(x_0\) is an interior max or min of \(f\), then \(x_0\) is a critical point of \(f\), that is, a point where \(f'(x_0)=0\).
If \(x_0\) is a critical point of a function \(f\), how can we use calculus to decide whether critical point \(x_0\) is a max, a min, or neither? The answer to this question lies in the second derivative of \(f\) at \(x_0\).
Theorem: Suppose \(f'(x_0)=0\). If \(f''(x_0)<0\), then \(x_0\) is a local max of \(f\); if \(f''(x_0)>0\), then \(x_0\) is a local min of \(f\); if \(f''(x_0)=0\), the test is inconclusive.
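A quick numerical illustration of the second-derivative test (the function \(f(x)=x^3-3x\) below is my own example):
# f(x) = x^3 - 3x has critical points where f'(x) = 3x^2 - 3 = 0, i.e. at x = -1 and x = 1
f   <- function(x) x^3 - 3*x
fpp <- function(x) 6*x          # second derivative, f''(x) = 6x
fpp(-1)                         # negative -> local max at x = -1
fpp(1)                          # positive -> local min at x =  1
# Numerical confirmation on a neighbourhood of each critical point
optimize(f, interval = c(-2, 0), maximum = TRUE)$maximum  # approximately -1
optimize(f, interval = c(0, 2))$minimum                   # approximately  1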
In general, it is difficult to find the global max of a function, or even to prove that a given local max is a global max. There are, however, three situations in which this problem is somewhat easier.
Consider the sample \(\{(y_i,x_i), i=1,...,N \}\). The maximum likelihood (ML) estimator maximizes the likelihood function defined below. The likelihood function is the joint density, which, given independent observations, is the product \(\prod_i f(y_i|x_i,\beta)\) of the individual densities, where we have conditioned on the regressors.
By Bayes’ rule, \[\begin{equation} P(\beta|y,X)=\frac{f(y,X|\beta)P(\beta)}{P(y,X)} \end{equation}\] and the likelihood function is \[\begin{equation} \mathcal{L}(\beta|y,x)\equiv \prod_i f(y_i|x_i,\beta) \end{equation}\] The question is how we can estimate the parameters of this model such that the likelihood \(\mathcal{L}(\beta|y,x)\) is maximized.
\[\begin{equation} \max_{\beta} \mathcal{L}(\beta|y,x)= \prod_i f(y_i|x_i,\beta) \end{equation}\] Assume \(y_i\sim N(\mu,\sigma^2)\), so:
\[\begin{equation}\label{eq:norm_dis} f(y_i|\mu,\sigma^2)=(2\pi \sigma^2)^{-\frac{1}{2}}e^{-\frac{(y_i-\mu)^2}{2 \sigma^2}} \end{equation}\] Assuming that \((y_1, y_2,...,y_n)\) are distributed independently, we have:
\[\begin{align}\label{eq:MLE} f(y_1,...,y_n|\mu,\sigma)&= f(y_1|\mu,\sigma) f(y_2|\mu,\sigma)\cdots f(y_n|\mu,\sigma) \nonumber \\ &=\prod_i f(y_i|\mu,\sigma) \end{align}\] Substituting \(f(y_i|\mu,\sigma^2)=(2\pi \sigma^2)^{-\frac{1}{2}}e^{-\frac{(y_i-\mu)^2}{2 \sigma^2}}\) into the function above gives:
\[\begin{align}\label{eq:MLE_normal} \prod_i f(y_i|\mu,\sigma) = \left(\frac{1}{2\pi \sigma^2}\right)^{\frac{n}{2}} exp\left(-\frac{\sum_{i=1}^n (y_i-\mu)^2}{2\sigma^2}\right) \end{align}\] This function is easier to work with if we transform it into a log-likelihood; the transformation does not change the location of the optima, because the logarithm is strictly increasing.
\[\begin{align} log(\mathcal{L}(\mu,\sigma))=-\frac{n}{2}log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2 \end{align}\] We can now find the ML estimates by setting the derivatives of the log-likelihood function to zero, as follows:
F.O.C.:
\[\begin{align} \frac{\partial}{\partial \mu}log(\mathcal{L}(\mu,\sigma))=\frac{n(\bar{y}-\mu)}{\sigma^2}=0 \end{align}\] \[\begin{align} \hat{\mu}=\bar{y}=\sum_{i=1}^n \frac{y_i}{n} \end{align}\] \[\begin{align} \frac{\partial}{\partial \sigma} log\left(\left(\frac{1}{2\pi \sigma^2}\right)^{\frac{n}{2}} exp\left(-\frac{\sum_{i=1}^n (y_i-\bar{y})^2+n(\bar{y}-\mu)^2}{2\sigma^2}\right)\right)=0 \Rightarrow \nonumber \\ \frac{\partial}{\partial \sigma}\left(\frac{n}{2}log\left(\frac{1}{2\pi \sigma^2}\right)-\frac{\sum_{i=1}^n (y_i-\bar{y})^2+n(\bar{y}-\mu)^2}{2\sigma^2}\right)= \nonumber \\ -\frac{n}{\sigma}+\frac{\sum_{i=1}^n (y_i-\bar{y})^2+n(\bar{y}-\mu)^2}{\sigma^3}=0 \end{align}\] \[\begin{align} \hat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n (y_i-\hat{\mu})^2 \end{align}\] Suppose we have three data points: 2, 2.5, and 3. Assume that these numbers are drawn from a normal distribution, but that we do not know the parameters of that distribution. The example below shows the calculation of the maximum likelihood.
\[\begin{equation} L(\Theta|X)= \prod_{i=1}^n f(X_i|\Theta) \end{equation}\] Let’s calculate the likelihood under \(N(\mu=2,\sigma^2=1)\):
\[\begin{align} L(\mu=2,\sigma^2=1|X=2,2.5,3)&=f(x=2|2,1) \times f(x=2.5|2,1) \times f(x=3|2,1) \\ L(\mu=2,\sigma^2=1|X=2,2.5,3)&=\frac{1}{\sqrt{2 (\pi) (1)}}e^{-\frac{(2-2)^2}{(2)(1)}} \times \frac{1}{\sqrt{2 (\pi) (1)}}e^{-\frac{(2.5-2)^2}{(2)(1)}} \times \frac{1}{\sqrt{2 (\pi) (1)}}e^{-\frac{(3-2)^2}{(2)(1)}} \\ L(\mu=2,\sigma^2=1|X=2,2.5,3)&=.39 \times .35 \times .24=.0328 \end{align}\] The likelihood of observing these numbers when the distribution is \(N(\mu=4,\sigma^2=2)\) is:
\[\begin{align} L(\mu=4,\sigma^2=2|X=2,2.5,3)&=f(x=2|4,2) \times f(x=2.5|4,2) \times f(x=3|4,2) \\ L(\mu=4,\sigma^2=2|X=2,2.5,3)&=\frac{1}{\sqrt{2 (\pi) (2)}}e^{-\frac{(2-4)^2}{(2)(2)}} \times \frac{1}{\sqrt{2 (\pi) (2)}}e^{-\frac{(2.5-4)^2}{(2)(2)}} \times \frac{1}{\sqrt{2 (\pi) (2)}}e^{-\frac{(3-4)^2}{(2)(2)}} \\ L(\mu=4,\sigma^2=2|X=2,2.5,3)&=.1 \times .16 \times .21=.0034 \end{align}\] So, a normal distribution with \(\mu=2\) and \(\sigma^2=1\) is a better fit to the data.
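These two likelihood values can be verified in R with dnorm(); any small differences from the hand calculation come from rounding the individual densities above:
obs <- c(2, 2.5, 3)
prod(dnorm(obs, mean = 2, sd = 1))        # likelihood under N(mu = 2, sigma^2 = 1)
prod(dnorm(obs, mean = 4, sd = sqrt(2)))  # likelihood under N(mu = 4, sigma^2 = 2)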
Now, how can we find the values of \(\mu\) and \(\sigma^2\) that best fit the data?
\[\begin{align} L(\mu,\sigma^2|X=2,2.5,3)&=\prod_{i=1}^3 f(x_i|\mu,\sigma^2) \\ &=\frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(2-\mu)^2}{2\sigma^2}} \times \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(2.5-\mu)^2}{2\sigma^2}} \times \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(3-\mu)^2}{2\sigma^2}} \\ &=\left(\frac{1}{\sqrt{2 \pi \sigma^2}}\right)^3 e^{-\frac{(2-\mu)^2}{2\sigma^2}} \times e^{-\frac{(2.5-\mu)^2}{2\sigma^2}} \times e^{-\frac{(3-\mu)^2}{2\sigma^2}} \end{align}\] Now, we take the natural logarithm of both sides of the equation:
\[\begin{align} ln(L(\mu,\sigma^2|X=2,2.5,3))&=ln\left(\frac{1}{\sqrt{2 \pi \sigma^2}}\right)^3+ln(e^{-\frac{(2-\mu)^2}{2\sigma^2}} )+ ln(e^{-\frac{(2.5-\mu)^2}{2\sigma^2}})+ln(e^{-\frac{(3-\mu)^2}{2\sigma^2}}) \\ &=3ln(1)-3ln(\sqrt{2\pi \sigma^2})-\frac{(2-\mu)^2}{2\sigma^2}-\frac{(2.5-\mu)^2}{2\sigma^2}-\frac{(3-\mu)^2}{2\sigma^2} \\ &=-3ln(\sqrt{2\pi \sigma^2})-\frac{1}{2\sigma^2}(19.25-15\mu+3\mu^2) \end{align}\] Now we take the derivative of this log-likelihood function with respect to \(\mu\) and \(\sigma^2\):
\[\begin{align} \frac{\partial ln(L(\mu,\sigma^2|X))}{\partial \mu}&=-\frac{1}{2\sigma^2}(-15+6\mu)=0\\ \frac{\partial ln(L(\mu,\sigma^2|X))}{\partial \sigma^2}&=0 \end{align}\] Thus, \(\hat{\mu}=2.5\). What is \(\hat{\sigma}^2\)?
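Applying the closed-form MLEs derived earlier, \(\hat{\mu}=\bar{y}\) and \(\hat{\sigma}^2=\frac{1}{n}\sum_i(y_i-\hat{\mu})^2\), to these three points gives a quick numerical answer:
obs <- c(2, 2.5, 3)
mu_hat     <- mean(obs)               # 2.5
sigma2_hat <- mean((obs - mu_hat)^2)  # 1/6, approximately 0.167
c(mu_hat, sigma2_hat)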
We start with a simple example so that we can cross-check the result. Suppose the observations \(X_1, X_2, \dots, X_n\) are from a \(N(\mu,\sigma^2)\) distribution. As I showed above, the log-likelihood function is
\[\begin{align} \sum_i \left\{ -\frac{(X_i-\mu)^2}{2\sigma^2}-\frac{1}{2}log(2\pi) -\frac{1}{2}log(\sigma^2) \right\} \end{align}\]
X=c(2,5,3,7,10,6,8) # some arbitrary observations
# Now, define the negative of the log-likelihood function
# (the constant -0.5*log(2*pi) per observation is dropped, since it does not
#  affect the optimum; theta[1] is mu and theta[2] is sigma^2)
fn <- function(theta) {
  sum( 0.5*(X - theta[1])^2/theta[2] + 0.5*log(theta[2]) )
}
Here are two ways to maximize the likelihood in R, by numerically minimizing the negative log-likelihood:
nlm(fn, p = c(0,1), hessian=TRUE)
## $minimum
## [1] 10.15418
##
## $estimate
## [1] 5.857140 6.693874
##
## $gradient
## [1] 2.820509e-08 -6.634263e-09
##
## $hessian
## [,1] [,2]
## [1,] 1.045732e+00 -4.529359e-05
## [2,] -4.529359e-05 7.807998e-02
##
## $code
## [1] 1
##
## $iterations
## [1] 28
optim(par = c(0,1), fn, hessian=TRUE)
## $par
## [1] 5.854323 6.688254
##
## $value
## [1] 10.15418
##
## $counts
## function gradient
## 89 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2]
## [1,] 1.0466109215 0.0004412741
## [2,] 0.0004412741 0.0783742222
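As a cross-check, the closed-form MLEs \(\hat{\mu}=\bar{X}\) and \(\hat{\sigma}^2=\frac{1}{n}\sum_i(X_i-\bar{X})^2\) can be computed directly (assuming the vector X defined above is still in memory); they agree with the nlm() and optim() estimates of roughly 5.857 and 6.694.
c(mu_hat = mean(X), sigma2_hat = mean((X - mean(X))^2))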
For continuous outcomes and linear models, we used the root mean square error (RMSE) to evaluate the performance of our statistical models. We also used cross-validation techniques to achieve a balance between variance and bias. For example, we learned how to use cross-validation to pick the optimal \(k\) in the \(kNN\) algorithm. How can we achieve this when we estimate a statistical model using MLE? The AIC and BIC are widely used tools for this purpose.
The BIC is the Bayesian information criterion.
The BIC is an approximation to the Bayesian approach to model selection, which gives you an automatic penalty for complex models (ESL, section 7.7).
Assume that we have a set of candidate models: \(\mathcal{M}=\{\mathcal{M}_1, \mathcal{M}_2, \dots, \mathcal{M}_M\}\).
Model \(\mathcal{M}_m\) has an associated parameter vector \(\theta_m\) and a likelihood \(p(Z|\theta_m,\mathcal{M}_m)\), which represents the probability of the data \(Z\) under model \(\mathcal{M}_m\).
Let \(\hat{\theta}_m\) be the MLE of parameters under model \(\mathcal{M}_m\):
\[\begin{align} \hat{L}_m=p(Z|\hat{\theta}_m, \mathcal{M}_m) \end{align}\] where \(\hat{L}_m\) is the maximized likelihood under model \(\mathcal{M}_m\).
Then, the deviance is
\[\begin{align} D_m=-2log(\hat{L}_m ) \end{align}\]and the BIC is
\[\begin{align} BIC_m=D_m+log(n)d_m \end{align}\]where \(d_m\) is the dimension of \(\theta_m\) and n is the sample size.
The deviance, the first term of the BIC equation, measures in-sample fit; a smaller deviance indicates a better fit. The term \(log(n)d\) is a “complexity penalty”: each variable added to the model increases the number of estimated parameters by one, and each parameter is weighted by \(log(n)\).
As you add parameters, the deviance goes down but the complexity penalty goes up, giving you a U-shaped curve. Again, the bias-variance trade-off!
The AIC is “an information criterion” or “the Akaike information criterion”.
\[\begin{align} AIC=D+2d=-2log(\hat{L}_m )+2d \end{align}\]Choose the model with the smallest AIC.
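In R, both criteria can be obtained from any fitted model that has a logLik() method, either by hand from the deviance-plus-penalty formulas above or with the built-in AIC() and BIC() functions. A minimal sketch on simulated data (the two logit models below are hypothetical candidates, not models from the text):
# Compare two candidate logit models with AIC and BIC
set.seed(1)
x  <- runif(200)
y  <- rbinom(200, size = 1, prob = plogis(-1 + 2*x))
m1 <- glm(y ~ x,          family = binomial)
m2 <- glm(y ~ x + I(x^2), family = binomial)

n <- length(y)
D <- -2 * as.numeric(logLik(m1))        # deviance, D = -2 log L-hat
d <- attr(logLik(m1), "df")             # number of estimated parameters
c(manual_BIC = D + log(n)*d, builtin_BIC = BIC(m1))  # identical by construction
AIC(m1, m2)   # smaller is better
BIC(m1, m2)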
You can re-write the BIC formula for a linear regression:
If \(RSS=\sum_{i=1}^N e_i^2\), then
\[\begin{align} BIC= n\,log(RSS/n)+log(n)\,d \end{align}\] up to an additive constant that is the same for every model fit to the same data.
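A minimal sketch of this RSS-based version for two nested linear models (simulated data; since the omitted constant is identical across models, only differences in this quantity matter for ranking):
# RSS-based BIC (up to a model-independent constant) for two nested linear fits
rss_bic <- function(fit) {
  n   <- length(residuals(fit))
  rss <- sum(residuals(fit)^2)
  d   <- length(coef(fit))      # number of estimated regression coefficients
  n*log(rss/n) + log(n)*d
}
set.seed(2)
x  <- runif(100, 0, 10)
y  <- x^2 + rnorm(100, sd = 4)
f1 <- lm(y ~ x)
f2 <- lm(y ~ x + I(x^2))
c(linear = rss_bic(f1), quadratic = rss_bic(f2))   # smaller is better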