In social science, we are interested in explaining whether and how two factors (variables) are associated, that is, how variable \(x\) affects variable \(y\), assuming that the relationship is causal. For example: How does democracy, \(x\), affect the economy, \(y\)? How does ethnic diversity, \(x\), affect the risk of civil war, \(y\)? How does education affect support for a United States of Europe? How does time affect philosophers’ opinions about freedom? We can show this as:
\[\begin{equation*} x \overset{?:+/-}{\longrightarrow} y \end{equation*}\]
One of our main concerns in this class, and in almost all social science metrics classes, is modeling this association. You are probably familiar with Acemoglu, Johnson, and Robinson (2001), who estimate the effect of institutions on economic performance.
The scatter diagram of protection against expropriation and log GDP per capita
This plot presents two variables:
Outcome variable / dependent variable / endogenous variable (\(y\)): log GDP per capita, as a proxy for economic performance.
Independent variable / regressor / explanatory variable / exogenous variable (\(x\)): protection against expropriation, as a proxy for institutional quality.
We need a function to model this association. Formally,
\[\begin{equation} y=f(x) \end{equation}\]
Function: a relation from a set of inputs to a set of possible outputs in which each input is related to exactly one output. Roughly speaking, a function is like a machine that receives an input \(x\), processes it according to a plan \(f(\cdot)\), and returns an outcome \(y\). Your oven is a function! This class is a function too: it receives a list of students, applies a process (teaching quantitative methods) to those inputs, and returns a set of outcomes, namely trained students!
There are literally infinitely many functions that we could nominate to model the association between \(x\) and \(y\) (see Figure 3). The question, however, is which one is the best, or at least among the best options.
The scatter diagram of protection against expropriation and log GDP per capita as well as different fits
In this course, you will learn about the Ordinary Least Squares (OLS) method. OLS gives a linear fit to the association between \(x\) and \(y\) (see the figure below). Why do we use OLS? Why is it called the OLS method?
The scatter diagram of protection against expropriation and log GDP per capita as well as an OLS fit
Our statistical model for a linear bivariate model is
\[\begin{equation} y=\beta_0+\beta_1x+\epsilon \end{equation}\]
where \(y\) is the vector of outcomes, \(x\) is the vector of the independent variable, \(\epsilon\) is the disturbance (error term), \(\beta_0\) is the intercept, and \(\beta_1\) is the slope of this linear model.
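To make the role of each symbol concrete, the following is a minimal sketch that simulates data from the bivariate model. Everything in it is an assumption made for illustration: the function name `simulate_linear_model`, the parameter values, the uniform range for \(x\), and the normally distributed disturbance are hypothetical choices, not part of the model definition above.

```python
import random


def simulate_linear_model(n, beta0, beta1, sigma, seed=0):
    """Draw n observations from y = beta0 + beta1 * x + epsilon.

    Assumptions (illustrative only): x is uniform on [0, 10] and
    epsilon is normal with mean 0 and standard deviation sigma.
    """
    rng = random.Random(seed)                               # fixed seed for reproducibility
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]         # independent variable
    eps = [rng.gauss(0.0, sigma) for _ in range(n)]         # disturbance
    ys = [beta0 + beta1 * x + e for x, e in zip(xs, eps)]   # outcome
    return xs, ys, eps


# Hypothetical "true" parameters; in real data these are unknown.
xs, ys, eps = simulate_linear_model(n=5, beta0=2.0, beta1=0.5, sigma=1.0)
for x, y, e in zip(xs, ys, eps):
    print(f"x={x:6.3f}  y={y:6.3f}  epsilon={e:6.3f}")
```

In a simulation we choose \(\beta_0\) and \(\beta_1\) ourselves; with real data only \(x\) and \(y\) are observed, and the parameters must be estimated.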
\(y\) and \(x\) are given, meaning we have data on them. We need to estimate the parameters of this model: \(\beta_0\) and \(\beta_1\). But, how?
There are different methods for estimating the parameters of a statistical model; the best known is OLS. This method estimates the parameters by minimizing the sum of squared errors.
The scatter diagram of protection against expropriation and log GDP per capita as well as an OLS fit
Ordinary Least Squares estimates the parameters by minimizing the sum of squared residuals (SSR):
\[\begin{eqnarray} \min_{\hat{\beta_0},\hat{\beta_1}} \sum_{i=1}^N e_i^2 \end{eqnarray}\] where \(e_i=y_i-\hat{y_i}=y_i-\hat{\beta_0}-\hat{\beta_1}x_i\) is the residual for observation \(i\).
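The minimization can be illustrated with a small sketch that computes the SSR for a few candidate lines. The data below are made up for illustration (they are not from any real study), and the function name `ssr` and the candidate labels are hypothetical. The third candidate, \((2.2, 0.6)\), is the pair that the OLS formulas give for these particular data.

```python
def ssr(b0, b1, xs, ys):
    """Sum of squared residuals for the candidate line y_hat = b0 + b1 * x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (an assumption, not real observations).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Candidate (intercept, slope) pairs; the labels are hypothetical.
candidates = {
    "flat line": (4.0, 0.0),
    "steep line": (0.0, 1.5),
    "OLS line": (2.2, 0.6),
}

for name, (b0, b1) in candidates.items():
    print(f"{name:10s} SSR = {ssr(b0, b1, xs, ys):.3f}")
```

Of the three candidates, the OLS line has the smallest SSR, and nudging its intercept or slope in either direction only increases the SSR.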
How do you interpret \(\hat{\beta_0}\) and \(\hat{\beta_1}\)? Tip: look at the plot! Can you identify them on the plot?
Exercise: Show that the OLS estimators for a bivariate model are as follows:
\[\begin{equation} \hat{\beta_1}=\frac{\sum_{i=1}^N(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^N(x_i-\bar{x})^2} \end{equation}\]
and
\[\begin{equation} \hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x} \end{equation}\]
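The two formulas translate almost line by line into code. The following is a minimal sketch in plain Python; the function name `ols_bivariate` and the toy data are assumptions made for illustration, not part of the original material.

```python
def ols_bivariate(xs, ys):
    """Return (beta0_hat, beta1_hat) from the closed-form bivariate OLS formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of cross-deviations of x and y from their means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of x from its mean
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("x has no variation, so the slope cannot be estimated")
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Made-up illustrative data (an assumption, not real observations).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = ols_bivariate(xs, ys)
residuals = [y - b0 - b1 * x for x, y in zip(xs, ys)]
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

With an intercept in the model, the OLS residuals sum to zero up to floating-point error, which is a quick sanity check on any implementation.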
In a bivariate regression, we assume that we can, or want to, model the effect of only one independent variable on the outcome; all remaining factors are absorbed into the error term because we do not model them. Often, however, we want to model the association of the outcome variable with more than one regressor (also called an exogenous variable or covariate). This is called multivariate regression. The regressor of interest is called the independent variable and the other regressors are called control variables, which we will discuss in more detail later. We formally write a multivariate regression as follows:
\[\begin{equation} y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\epsilon \end{equation}\]
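The same least-squares idea carries over to several regressors. The sketch below is one possible way to compute the estimates, by solving the so-called normal equations \((X'X)b = X'y\) with Gaussian elimination. This matrix formulation is a standard result that has not been introduced above, so treat it as an assumption here. The function name `ols_multivariate` and the noise-free toy data are hypothetical.

```python
def ols_multivariate(rows, ys):
    """Estimate [b0, b1, ..., bk] by solving the normal equations (X'X) b = X'y.

    rows: one list of regressor values [x1, ..., xk] per observation.
    """
    X = [[1.0] + list(r) for r in rows]   # prepend a constant for the intercept
    p = len(X[0])
    # Build A = X'X and c = X'y
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    c = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(p)]
    # Gaussian elimination with partial pivoting on the augmented matrix [A | c]
    M = [A[i] + [c[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("regressors are perfectly collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, p):
            factor = M[r][col] / M[col][col]
            for k in range(col, p + 1):
                M[r][k] -= factor * M[col][k]
    # Back substitution
    b = [0.0] * p
    for i in range(p - 1, -1, -1):
        tail = sum(M[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (M[i][p] - tail) / M[i][i]
    return b


# Made-up data generated without noise from y = 1 + 2*x1 - 3*x2 (an assumption),
# so the estimates should recover these coefficients.
rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
ys = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in rows]

coefs = ols_multivariate(rows, ys)
print([round(v, 6) for v in coefs])
```

In practice one would normally rely on an established statistics package rather than hand-written elimination; the point of the sketch is only that adding control variables changes the size of the problem, not the principle of minimizing the SSR.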
Daron Acemoglu, Simon Johnson, and James A. Robinson. The colonial origins of comparative development: an empirical investigation. The American Economic Review, 91(5):1369-1401, 2001. http://www.jstor.org/stable/26779306