Count data

Some political outcomes only take non-negative integer values, \(Y \in \{0,1,2,\dots\}\). For example, the number of civilians killed daily/weekly/monthly/yearly in a civil war; the number of female candidates at each election; or the number of terrorist attacks within a time period. King (1988) shows that estimating a model with count outcomes using OLS leads to inefficient estimates, inconsistent standard errors, and nonsensical (negative) predicted counts.

Thus, do not use OLS for count models, though labor economists have a different view (Angrist & Pischke 2009).

These types of outcomes are often modeled using a Poisson distribution, \(Y \sim Poisson(\lambda)\). The parameter \(\lambda\) here simply represents the expected number of occurrences. Does \(\lambda\) need to be an integer value?

The probability mass function of the Poisson distribution is:

\[\begin{equation} P(Y=k \mid \lambda)=\frac{\lambda^k e^{-\lambda}}{k!}, \quad k=0,1,2,\dots \end{equation}\]

The mean and variance of this distribution are both equal to \(\lambda>0\). There is a straightforward interpretation of \(\lambda\): the expected number of events in any particular time period \(t\). The Poisson distribution is memoryless; that is, occurrences of events are independent of one another.

library(latex2exp)  # for TeX() labels

x <- 0:20

# Common y-axis range across all four lambda values
yrange <- range(dpois(x, 2), dpois(x, 4), dpois(x, 7), dpois(x, 10))

plot(range(x), yrange, type="n", xlab="k", ylab="P(Y=k)",
     main=TeX('Poisson Distributions for different values of $\\lambda$'))

lines(x, dpois(x, 2), type="b", lwd=2, col='blue')

lines(x, dpois(x, 4), type="b", lwd=2, col='orange')

lines(x, dpois(x, 7), type="b", lwd=2, col='maroon')

lines(x, dpois(x, 10), type="b", lwd=2, col='green')

legend("topright", legend=c(TeX('$\\lambda$=2'),
                            TeX('$\\lambda$=4'),
                            TeX('$\\lambda$=7'),
                            TeX('$\\lambda$=10')),
       col=c("blue","orange","maroon","green"), 
       lty=1, box.lty=0, cex=1, lwd=2)

Generalized Linear Models (GLM)

The GLM framework extends linear estimation to outcome variables that are not normally distributed. Although GLMs let us model non-linear outcomes, they are built around a linear predictor:

\[\begin{equation} \eta_i=\beta_0+\beta_1x_{1i}+\dots+\beta_px_{pi} \end{equation}\]

and two functions:

\[\begin{align} g(\lambda_i)&=\eta_i \\ E(Y_i)&=\lambda_i=g^{-1}(\eta_i) \end{align}\] \[\begin{equation} var(Y_i)=\phi V(\lambda_i)=\phi V(g^{-1}(\eta_i)) \end{equation}\]

where \(\phi\) is a constant parameter that measures dispersion.
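
For example, in the Poisson case the link \(g(\cdot)\) is the natural log and its inverse is the exponential. A minimal sketch using base R's family object (no extra packages needed) shows the mapping between \(\eta\) and \(\lambda\):

# The family object stores the link g() and its inverse g^{-1}()
fam <- poisson(link = "log")

eta <- fam$linkfun(4)       # g(lambda) = log(4)
lambda <- fam$linkinv(eta)  # g^{-1}(eta) recovers 4

c(eta = eta, lambda = lambda)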

Estimation using MLE

The likelihood function of the Poisson model, assuming that \(\log(\lambda_i)=x_i' \beta\), or equivalently \(\lambda_i=\exp(x_i' \beta)\), can be written as follows:

\[\begin{align} L(\beta|Y)&=\prod_{i=1}^{N}\frac{e^{-\lambda_i} \lambda_i^{Y_i}}{Y_i!} \\ \ln L(\beta|Y)&=\ln\Big[\prod_{i=1}^{N} \frac{e^{-e^{x_i'\beta}}e^{(x_i'\beta) Y_i}}{Y_i!}\Big] \end{align}\]

We can re-arrange the above equation:

\[\begin{align} \ln L(\beta|Y)=\sum_{i=1}^N \big[-e^{x_i'\beta}+Y_i(x_i'\beta)-\ln(Y_i!)\big] \end{align}\]

This model can be estimated fairly easily by numerical optimization of the above log-likelihood.
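
As a sketch of that idea (not part of the replication code below; y and X are simulated placeholders), the log-likelihood can be maximized directly with optim():

# Sketch: direct numerical maximization of the Poisson log-likelihood
set.seed(1)
X <- cbind(1, rnorm(500))                        # design matrix with an intercept column
y <- rpois(500, lambda = exp(0.5 + 0.8*X[, 2]))  # simulated counts

negLogLik <- function(beta, y, X) {
  eta <- X %*% beta
  -sum(-exp(eta) + y*eta - lgamma(y + 1))  # minus the log-likelihood derived above
}

fit <- optim(par = c(0, 0), fn = negLogLik, y = y, X = X,
             method = "BFGS", hessian = TRUE)

fit$par                         # MLE of beta
sqrt(diag(solve(fit$hessian)))  # standard errors from the Hessian
# Compare with: glm(y ~ X[, 2], family = "poisson")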

The signs of the coefficient vector \(\beta\) indicate the direction of the effect of \(X\) on \(\lambda\): a positive coefficient increases the expected count, and a negative coefficient decreases it.

The above link function/log-transformation resolves the problem of negative predicted values. Cameron and Trivedi (2005) also recommend using robust standard errors to correct for heteroskedasticity when we use the Poisson model for estimation.

Estimation with no predictor

Here, we use the data from “Armed intervention and civilian victimization in intrastate conflicts” by Wood, Kathman, and Gent. The full replication data is available here: https://www.prio.org/JPR/Datasets

# Load the Wood, Kathman & Gent replication data
WKGdata=read.csv("https://raw.githubusercontent.com/babakrezaee/MethodsCourses/master/DataSets/WKG_JPR_2012.csv")

# Histogram of the count outcome (rebel violence against civilians)
hist(WKGdata$rebbest2010,  col="grey", border="maroon",
     main="Histogram for Conflict Deaths (SCAD data set: modified)",
     xlab="Deaths",
     breaks = 0:summary(WKGdata$rebbest2010)[6])  # one bin per count, from 0 to the sample maximum

# Intercept-only Poisson model; glm() is part of base R's stats package
poisson_ml <- glm(rebbest2010 ~ 1, family="poisson", data=WKGdata)

summary(poisson_ml)
## 
## Call:
## glm(formula = rebbest2010 ~ 1, family = "poisson", data = WKGdata)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -12.77  -12.77  -12.77   -7.27  514.44  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 4.401508   0.003252    1353   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 596617  on 1158  degrees of freedom
## Residual deviance: 596617  on 1158  degrees of freedom
##   (7 observations deleted due to missingness)
## AIC: 598550
## 
## Number of Fisher Scoring iterations: 8
exp(poisson_ml$coefficients[[1]]+0)  # with no predictors, exp(intercept) equals the sample mean of the outcome
## [1] 81.57377
poissonModel <- glm(rebbest2010 ~ lnintv_ratiolag, family="poisson", data=WKGdata)  # add lnintv_ratiolag as a predictor

summary(poissonModel)
## 
## Call:
## glm(formula = rebbest2010 ~ lnintv_ratiolag, family = "poisson", 
##     data = WKGdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -15.646  -11.286  -11.286   -3.631  254.864  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      4.153921   0.004455  932.50   <2e-16 ***
## lnintv_ratiolag -0.126376   0.003725  -33.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 230394  on 824  degrees of freedom
## Residual deviance: 229372  on 823  degrees of freedom
##   (341 observations deleted due to missingness)
## AIC: 231038
## 
## Number of Fisher Scoring iterations: 7
exp(poissonModel$coefficients[[1]]+0)  # expected count when lnintv_ratiolag = 0
## [1] 63.68323
exp(poissonModel$coefficients[[1]]+poissonModel$coefficients[[2]]*1)  # expected count when lnintv_ratiolag = 1
## [1] 56.12299

The results show that increasing the (lagged, logged) intervention ratio from 0 to 1 decreases the expected count of rebel violence against civilians from about 63.68 to about 56.12, a decrease of roughly 7.56. Because the model is multiplicative, this difference is not constant: each one-unit increase multiplies the expected count by \(e^{-0.126}\approx 0.88\).

Here, \(\beta=\log(\lambda(x+1))-\log(\lambda(x))\): the coefficient is the change in the log of the expected count for a one-unit increase in \(x\).
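
A quick check with the fitted coefficients reported above makes both interpretations concrete:

b <- poissonModel$coefficients

exp(b[[2]])                                   # ~0.88: the factor by which one unit of the predictor multiplies the expected count
log(exp(b[[1]] + b[[2]])) - log(exp(b[[1]]))  # equals b[[2]], the difference in log expected counts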

Predicting from the estimated model

After estimating your Poisson regression, you can predict outcomes using any dataset containing the features used in the estimated model.

newData=data.frame(lnintv_ratiolag=3)

predict(poissonModel, newdata = newData, type='response')  # expected count at lnintv_ratiolag = 3
##        1 
## 43.58852

Visualizing findings

library(jtools)

plot_summs(poissonModel, scale=TRUE, exp = FALSE,
           model.names = c("Poisson"),
           coefs = c("log(Intervention)" = "lnintv_ratiolag"))

We can also compare the coefficients of two models:

linearModel <- lm(rebbest2010 ~ lnintv_ratiolag, data=WKGdata)


plot_summs(poissonModel,linearModel, scale=TRUE, exp = FALSE,
           model.names = c("Poisson","OLS"),
           coefs = c("log(Intervention)" = "lnintv_ratiolag"))

Marginal effects in count models

library(sjPlot)
library(ggplot2)

plot_model(poissonModel, type = "pred", terms = "lnintv_ratiolag")+xlab('log(Intervention)')+ylab('Rebel violence')

See here for more info on plotting marginal effects: https://cran.r-project.org/web/packages/sjPlot/vignettes/plot_marginal_effects.html

To compute robust standard errors, as suggested by Cameron and Trivedi (2005), we can use the \({\tt sandwich}\) library:

poissonModel <- glm(rebbest2010 ~ lnintv_ratiolag, family="poisson", data=WKGdata)

library(sandwich)

cov.pois <- vcovHC(poissonModel, type="HC0")  # HC0 is White's heteroskedasticity-consistent correction

se <- sqrt(diag(cov.pois))  # robust standard errors

robust <- cbind(estimate=poissonModel$coefficients,
                se,
                pvalue=round(2*(1-pnorm(abs(poissonModel$coefficients/se))), 5),
                lower=poissonModel$coefficients-1.96*se,
                upper=poissonModel$coefficients+1.96*se)

robust
##                   estimate        se  pvalue      lower     upper
## (Intercept)      4.1539213 0.1943300 0.00000  3.7730345 4.5348082
## lnintv_ratiolag -0.1263758 0.1750579 0.47035 -0.4694894 0.2167377
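
Equivalently, as a sketch, coeftest() from the \({\tt lmtest}\) package reports the same estimates together with the HC0 robust standard errors, without building the table by hand:

library(lmtest)

coeftest(poissonModel, vcov. = cov.pois)  # robust z-tests using the HC0 covariance matrix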

Negative Binomial Models

One of the major assumptions of the Poisson distribution/model is that the mean and variance of the data are equal: \(mean(Y)=Var(Y)=\lambda\).

However, when we evaluate this assumption in our data, it clearly fails:

mean(WKGdata$rebbest2010, na.rm=TRUE)
## [1] 81.57377
var(WKGdata$rebbest2010,na.rm=TRUE)
## [1] 751373.2

When the variance is greater than the mean of the distribution, there is overdispersion. The Poisson MLE is still consistent under overdispersion, but it is inefficient and the estimated standard errors are too small, so conventional inference can be misleading.

Thus, we use a Negative Binomial MLE for overdispersed count data.

Question: what is an underdispersed distribution?

How do we test for overdispersion?

We can test for overdispersion with the \({\tt dispersiontest()}\) function from the \(\mathcal{R}\) package \({\tt AER}\).

library(AER)

poissonModel <- glm(rebbest2010 ~ lnintv_ratiolag, family="poisson", data=WKGdata)

dispersiontest(poissonModel, trafo=1)  # H0: equidispersion vs. H1: overdispersion (alpha > 0)
## 
##  Overdispersion test
## 
## data:  poissonModel
## z = 1.3384, p-value = 0.09039
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 1762.699

In the Poisson regression, we assumed that \(\lambda=e^{X\beta}\). We can add a second component to \(X\beta\). This second component is a random-effect term that captures:

  • The combined effects of omitted variables, causing misspecification

  • Random errors associated with \(\lambda\), but uncorrelated with \(X\).

Thus, the link function of the negative binomial is as follows:

\[\begin{align} \tilde{\lambda}_i=e^{X_i\beta+\epsilon_i}=\lambda_i e^{\epsilon_i} \end{align}\]

Now, we need to specify the distribution of \(\epsilon\) to identify the above model. It is common to assume that \(e^{\epsilon}\) follows a Gamma distribution. There are two main reasons for this choice:

  • The distribution is defined over continuous and non-negative values.

  • This distribution leads to a nice closed form solution.

If we assume that \(e^{\epsilon}\) follows a one-parameter Gamma distribution with \(E(e^{\epsilon})=1\) and \(Var(e^{\epsilon})=\alpha\), then the marginal density of \(Y\), distributed as a Negative Binomial, is:

\[\begin{equation} P(Y|\lambda,\alpha)=[\frac{\Gamma(\alpha^{-1}+Y)}{\Gamma(\alpha^{-1})\Gamma(Y+1)}][\frac{\alpha^{-1}}{\alpha^{-1}+\lambda}]^{\alpha^{-1}}[\frac{\lambda}{\lambda+\alpha^{-1}}]^Y \end{equation}\]
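
A minimal sketch (with arbitrary values of \(y\), \(\lambda\), and \(\alpha\)) checks that this closed-form density matches R's built-in dnbinom() under the size \(=\alpha^{-1}\), mu \(=\lambda\) parameterization:

# Sketch: the closed-form NB density above vs. dnbinom() with size = 1/alpha, mu = lambda
nb_pmf <- function(y, lambda, alpha) {
  a_inv <- 1/alpha
  exp(lgamma(a_inv + y) - lgamma(a_inv) - lgamma(y + 1)) *
    (a_inv/(a_inv + lambda))^a_inv * (lambda/(lambda + a_inv))^y
}

y <- 0:5; lambda <- 3; alpha <- 0.5
rbind(closed_form = nb_pmf(y, lambda, alpha),
      dnbinom     = dnbinom(y, size = 1/alpha, mu = lambda))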

Under this assumption, \(E(Y)=\lambda\), but \(Var(Y)=\lambda(1+\alpha \lambda)\):

  • The variance is still proportional to \(\lambda\), but it now also depends on \(\alpha\), the overdispersion parameter.

  • If \(\alpha \rightarrow 0\), then \(Var(Y) \rightarrow \lambda\), so the model converges to the Poisson distribution (the simulation sketch below illustrates this).
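
As a quick illustration (a sketch with arbitrary values of \(\lambda\) and \(\alpha\), not part of the replication analysis), we can simulate the Gamma-Poisson mixture and check the variance formula:

# Sketch: simulate the Gamma-Poisson mixture and check Var(Y) = lambda*(1 + alpha*lambda)
set.seed(42)
lambda <- 5; alpha <- 0.8; n <- 1e5

nu <- rgamma(n, shape = 1/alpha, scale = alpha)  # E(nu)=1, Var(nu)=alpha
y  <- rpois(n, lambda*nu)

c(mean = mean(y), var = var(y),
  implied_var = lambda*(1 + alpha*lambda))  # mean ~ 5, both variances ~ 25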

library(MASS)

# Negative Binomial regression via MASS::glm.nb
negBinomModel=glm.nb(formula = rebbest2010 ~ lnratio_scale,  data = WKGdata)

summary(negBinomModel) # Model summary
## 
## Call:
## glm.nb(formula = rebbest2010 ~ lnratio_scale, data = WKGdata, 
##     init.theta = 0.05230984157, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9534  -0.8967  -0.8625  -0.1485   4.2675  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     5.1801     0.2478  20.902  < 2e-16 ***
## lnratio_scale   0.2766     0.0782   3.538 0.000404 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.0523) family taken to be 1)
## 
##     Null deviance: 578.51  on 932  degrees of freedom
## Residual deviance: 564.32  on 931  degrees of freedom
##   (233 observations deleted due to missingness)
## AIC: 5017
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.05231 
##           Std. Err.:  0.00339 
## 
##  2 x log-likelihood:  -5010.98200
y_hat=predict(negBinomModel, WKGdata, type="response") # predicted expected counts for the estimation data

We can also use a likelihood ratio test, \(LR = 2(\ln L_1-\ln L_2)\), for overdispersion. Estimate model 1 as Negative Binomial, then estimate model 2 as Poisson with the same specification as model 1 (though naturally without the dispersion parameter):

# Perform likelihood ratio test (lrtest() comes from lmtest, loaded with AER)

m1=glm(rebbest2010 ~ lnratio_scale, family="poisson", data=WKGdata)

m2=glm.nb(formula = rebbest2010 ~ lnratio_scale,  data = WKGdata)

lrtest(m2, m1)
## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was
## of class "negbin", updated model is of class "glm"
## Likelihood ratio test
## 
## Model 1: rebbest2010 ~ lnratio_scale
## Model 2: rebbest2010 ~ lnratio_scale
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   3   -2505                         
## 2   2 -265690 -1 526369  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Poisson is nested within the Negative Binomial (\(\alpha=0\)): the null hypothesis is that the restriction \(\alpha=0\) holds, i.e., that the restricted (Poisson) model is adequate; a large LR statistic rejects this in favor of the Negative Binomial.
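
As a sketch, the LR statistic can also be computed directly from the log-likelihoods of the m1 (Poisson) and m2 (Negative Binomial) models fitted above:

# LR = 2*(lnL_NB - lnL_Poisson), compared against a chi-squared distribution with 1 df
LR <- as.numeric(2*(logLik(m2) - logLik(m1)))

LR
pchisq(LR, df = 1, lower.tail = FALSE)  # p-value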

Zero Inflated Models: When zeros are different!

Sometimes count data will have a preponderance of zeros and very few observed counts. Zero-inflated models assume two distinct phases of the data-generating process: first, a binary stage that determines whether an observation can experience the event at all; second, a count stage (e.g., Poisson) that generates the number of events for the observations that can.

We can model these two stages as follows:

\[\begin{align} Pr(Y=0|X,Z)&=[1-\Phi(Z\gamma)]+\Phi(Z\gamma)e^{-\lambda} \\ Pr(Y=j|X,Z)&=\Phi(Z\gamma)\frac{e^{-\lambda} \lambda^j}{j!}, \quad j>0 \end{align}\]

where \(\Phi\) is either the logistic or the probit (standard normal) CDF.
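
As a sketch (not the \({\tt pscl}\) implementation used below), the two-stage probability function can be written directly as a small R function; plogis() plays the role of \(\Phi\) with a logit link, and the argument values are arbitrary:

# Sketch: zero-inflated Poisson pmf with a logit first stage
# zgamma is the linear predictor Z*gamma; lambda is the Poisson mean
dzip <- function(y, lambda, zgamma) {
  phi <- plogis(zgamma)                    # Pr(count regime)
  ifelse(y == 0,
         (1 - phi) + phi*exp(-lambda),     # structural zeros + Poisson zeros
         phi*dpois(y, lambda))             # counts from the Poisson stage
}

dzip(0:4, lambda = 2, zgamma = 0.5)
sum(dzip(0:100, lambda = 2, zgamma = 0.5))  # probabilities sum to one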

Estimation of zero-inflated models

The \({\tt pscl}\) package contains a command to estimate zero-inflated count models:

library(pscl)
# zeroinfl(formula, data, dist=c("poisson", "negbin", "geometric"), link=c("logit", "probit", "cloglog"))

ZIP=zeroinfl(formula = rebbest2010 ~ lnratio_scale,  data = WKGdata, dist=c("poisson"), link=c("logit"))

summary(ZIP)
## 
## Call:
## zeroinfl(formula = rebbest2010 ~ lnratio_scale, data = WKGdata, 
##     dist = c("poisson"), link = c("logit"))
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7576 -0.6734 -0.6371 -0.3330 97.9662 
## 
## Count model coefficients (poisson with log link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   6.651260   0.004435  1499.6   <2e-16 ***
## lnratio_scale 0.394998   0.001895   208.5   <2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.98692    0.12612   7.825 5.08e-15 ***
## lnratio_scale  0.06297    0.03901   1.614    0.106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 15 
## Log-likelihood: -1.476e+05 on 4 Df
vuong(m2, ZIP)
## NA or numerical zeros or ones encountered in fitted probabilities
## dropping these 13 cases, but proceed with caution
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                    11.86509 model1 > model2 < 2.22e-16
## AIC-corrected          11.86579 model1 > model2 < 2.22e-16
## BIC-corrected          11.86749 model1 > model2 < 2.22e-16

In the Vuong test, the null hypothesis is that the two models are indistinguishable, so rejecting it indicates that one model fits the data better than the other. Here the statistic favors model 1 (the Negative Binomial) over model 2 (the zero-inflated Poisson).


  1. Copyright 2019. This is an in-progress project, please do not cite or reproduce without my permission.