gllamm for complex problems
General notion
gllamm stands for Generalized Linear Latent And Mixed Models.
The primary aim of this software is to provide a maximum likelihood
framework for models with unobserved components, such as multilevel
models, certain latent variable models, panel data models, or models
with common factors.
The official website of the software is http://www.gllamm.org
maintained by one of the authors, Sophia Rabe-Hesketh. Together with Anders Skrondal, another author of gllamm,
she published the book "Generalized Latent Variable Modeling" (Skrondal and Rabe-Hesketh 2004); the third author of gllamm is Andrew Pickles.
gllamm is implemented as a system of Stata modules. To install it on your Stata,
type
net from http://www.gllamm.org
net install gllamm, replace
The website also indicates the current version of the software; the locally installed version can be checked with
Stata's which command.
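For example, after installation, typing
which gllamm
should display the location of the installed gllamm.ado along with its version line.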
Statistical background
Let us consider each of the letters of the word gllamm in turn.
The first letter g means the same as "g" in GLM, generalized linear models.
They extend the linear regression concept to other models, such as binary, multinomial,
or ordinal dependent variable, and Poisson regressions. This is done by noting that
all of them share a similar form of the density of the response variable:
f(y; θ, φ) = exp{ [yθ − b(θ)]/a(φ) + c(y, φ) }
There are many reasons why this is a nice form of the density, including existence
of the sufficient statistic, straightforward maximum likelihood estimation,
means and variances expressed as functions of θ and φ, and fit / outlier residual
diagnostics. The important concepts that the user of gllamm must know
to proceed further are link function and family of distributions.
The link function is the function that maps the mean μ of the response to the
linear predictor η=x'β: η = g(μ), and the inverse link is
μ = g⁻¹(η). It can be as simple as the identity link used in linear
regression, η = μ, or something more involved such as the logit link:
η = logit(μ) = ln[ μ/(1−μ) ],   with inverse   μ = exp(η)/(1+exp(η))
The family of distributions deals with the response variable, so it is Gaussian
distribution for a linear regression / continuous response, binomial distribution
for binary response (note that in this case a variety of links is possible: logit,
probit, complementary log-log / Gompertz / extreme value), and Poisson for count variables
(gllamm currently does not support the negative binomial family, which is another
popular choice for count outcomes). gllamm also supports multinomial logit and ordered
logit / probit distributions / links, which are not normally viewed as part of the GLM
framework, but share some similarities with it.
For further reading, see
McCullagh and Nelder (1989) as the basic classic textbook on the topic, and
Hardin and Hilbe (2001) for the Stata implementation.
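For comparison, a single-level GLM with no latent variables can be fit with Stata's built-in glm command. A minimal sketch, using the binary pass/fail outcome and the student covariates that appear in the examples below:
glm pass black other ses, family(binomial) link(logit)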
The second letter l stands for the linearity assumed in the relationship
between the (transformed) mean and the explanatory variables.
As was mentioned earlier, the linearity of the model enters through the linear predictor
η=x'β. That is not as restrictive as it
may seem, since nothing precludes the researcher from using semiparametric specifications
such as polynomial terms or splines. (For the multinomial model, the linear predictor
would be the mean utility associated with a particular outcome; it will also give the
predicted probability of that outcome through the link function.)
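For instance (a hypothetical sketch with a generic response y, regressor x, and cluster identifier id), polynomial or spline terms are created in advance and simply added to the variable list:
gen x2 = x^2                  // quadratic term
mkspline xsp 3 = x            // three-piece linear spline in x
gllamm y x x2, i(id)          // quadratic specification; use xsp1-xsp3 for the spline version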
The middle lam part of gllamm stands for Latent And Mixed.
The terminology of latent variables is more often used in the social sciences for models
with common factors, measurement models, and SEMs, while mixed models (or multilevel / hierarchical
models) is the terminology common in biostatistics. Either approach implies that there
are some unobserved variables that enter the linear predictor part additively. In the simplest
case of a single random effect, this can be represented as
η_ij = x_ij'β + u_i
where i indexes groups, or clusters, of observations, and j indexes the observations within
that cluster. The multilevel/hierarchical and mixed models can be written down as
η_ij = x_ij'β + z_ij'u_i
which may have a variety of meanings:
- multilevel / hierarchical models: z's are fixed and represent the design matrix of different levels;
- mixed / random coefficient models: z's include the constant as well as some of the variables
in the model.
gllamm is flexible enough to estimate all those (and some other) models;
for the measurement and factor models, some special data handling tricks may have to be used --- see below.
Unfortunately, since the u's are unobserved, their contribution to the likelihood
is represented as an integral over their distribution. The integrand terms are the conditional
probabilities / densities / likelihoods from the GLM model for a given u. Hence each
evaluation of the likelihood involves a lot of numerical integration steps, and during the
iterative maximization, there will be hundreds of such calls. This is the main reason
gllamm is so slow.
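In the simplest random-intercept case, for example, the likelihood contribution of cluster i is
L_i(β, σ_u) = ∫ ∏_j f(y_ij | x_ij'β + u_i) φ(u_i; 0, σ_u²) du_i
and the quadrature methods described next replace this integral with a weighted sum over a grid of points in u_i.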
The numeric integration is performed by Gaussian quadrature, adaptive quadrature
(Rabe-Hesketh, Skrondal and Pickles 2002), or
semi-parametric discrete distribution similar (up to certain scaling conventions) to
the discrete factor models in econometrics.
A number of options are available to tune the performance and speed of gllamm.
Data preparation, syntax and options
The most basic type of model to which gllamm is especially easy to apply is
the repeated measurement / multiple observations / panel format: several observations
on the same individual, or several individuals treated as a cluster.
Note that gllamm is indifferent to whether the
data set is balanced or not; this is essentially irrelevant to the estimation it performs.
Multilevel models
Suppose the data set consists of measurements of standardized test scores score of students.
The students are nested in classrooms class, the classrooms are in turn nested in schools
school, and schools are nested in districts distr. Suppose also that the student
characteristics available to the researcher are race black and other, and a measure
of socio-economic status ses. Two simple models a researcher can run are
gllamm score , i(class school distr)
for the basic ANOVA model with nested random factors, or
"unconditional hierarchical linear model" in Raudenbush and Bryk (2002) terminology.
The output will show the variances of effects at
each level (so that the researcher can see whether classrooms within schools are indeed different
from one another, the schools within a district are similar to one another, and districts themselves
are similar to one another); or a regression model with multilevel random effects
gllamm score black other ses, i(class school distr)
We have omitted the default options for family and link, which are Gaussian response and identity
link, respectively. If the only data that we had were whether the student passed or failed the test
pass, then the call would look like
gllamm pass black other ses, i(class school distr) link(logit) family(bin)
Random coefficients
Let us now ignore for simplicity all the levels except the students and schools.
A simple enough 2-level model that is easy to estimate with gllamm is the
random coefficient model that relaxes the assumption that the effects of race
and socio-economic status are the same for all schools. They can be allowed to vary,
and the way to indicate that is through the level-2 equations
(i.e. equations for the second level units, here, schools; the students are level-1 units).
In the gllamm framework, one can specify those
with the eq command, which has been considered obsolete since Stata 6 but still works:
gen byte one = 1
eq cons : one
eq black : black
eq other : other
eq ses : ses
gllamm score black other ses, i(school) nrf(4) eqs(black other ses cons)
The added options are nrf(), which says that we now have four random effects in the model,
and eqs(), which lists the variable that each random effect multiplies.
The first random effect multiplies the variable black, and thus it is the random
component of the black coefficient that is shared by all students in the same school.
The last random effect is a random intercept.
Note that it is now the responsibility of the researcher to indicate that the intercept can be random.
Once eqs() is specified, gllamm no longer thinks of the model in terms
of random intercepts (and there are good reasons for that behavior).
Note that the variances and covariances of the random effects are not as straightforward
to interpret as they were in the random-intercept-only models. It is easy to see (see Fig. 3.1
of Skrondal and Rabe-Hesketh (2004)) that they depend on such rather arbitrary things as
the centering of the explanatory variables.
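For example, group-mean centering of a level-1 variable is done before the gllamm call; a sketch using the school data above (the constructed variables mses and cses mirror those used in the worked examples below):
egen mses = mean(ses), by(school)    // school mean of ses
gen cses = ses - mses                // deviation from the school mean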
Now, what
would be the ways to incorporate school-level characteristics into analysis? Suppose we have
a measure like the number of computers per student comp. A straightforward way
would be to include this variable directly into the linear predictor part, saying that
a given number of computers affects the score directly:
gllamm score black other ses comp, i(school)
On the other hand, since it is a school-level characteristic, it may affect the outcome
only through school-level quantities.
In the current model, such a quantity is the random effect of the school. The interpretation
that can be suggested is that the number of computers per student is a proxy for the overall
quality of the school.
eq f1: comp one
gllamm score black other ses, i(school) geqs(f1) nocons
The option geqs shows the "generalized" equations, i.e., the regressions
for latent variables. The interpretation of equation f1 is now different;
comp and one are explanatory variables in that regression.
Here we only have one random variable, the random effect of the school, and the
second symbol in the equation name shows that it refers to the 1st latent
variable in the model. The option nocons tells gllamm that
the constant term should not be estimated in the main regression: it enters the model through
the constant term of equation f1 (so an alternative way to identify
the constant is to omit one from equation f1 and keep the constant in the main regression).
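That alternative specification would look as follows (a sketch; the overall constant is then estimated as the usual _cons of the main equation):
eq f1: comp
gllamm score black other ses, i(school) geqs(f1)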
We can also think of a two-level model where the random coefficients may depend
on the number of computers:
eq f1: comp
eq f2: comp
eq f3: comp
eq f4: comp
gllamm score black other ses, i(school) ///
nrf(4) geqs(f1 f2 f3 f4) eqs(black other ses one)
Note again that there are two ways to deal with the fixed part of the model, and here
we have chosen to specify it through the main regression equation. If we did
otherwise (including one into each of the four "generalized" equations),
we would have to specify the nocons option, as well as omit the
variables from the main regression equation:
eq f1: comp one
eq f2: comp one
eq f3: comp one
eq f4: comp one
gllamm score , i(school) nocons ///
nrf(4) geqs(f1 f2 f3 f4) eqs(black other ses one)
Multiple equations
An important thing to remember about gllamm is that it assumes
a single response variable. This works perfectly for hierarchical linear models, but
other types of models with several response variables require reshaping
the data so that the model looks like one with a single dependent variable, however complicated
that model may be. This is typically done with Stata's reshape command, or some
manual work that is roughly equivalent to it. A part of that work is to create a variable
that identifies equations; dummy variables corresponding to it; and interactions of actual
regressors with those dummies.
Suppose the model consists of three equations:
y_1 = β_01 + β_11 x_1 + β_21 x_2 + λ_1 u_i + ε_1
y_2 = β_02 + β_12 x_1 + β_13 x_3 + λ_2 u_i + ε_2
y_3 ~ Bernoulli( Φ( β_03 + β_14 x_4 ) )
Then the appropriate steps in Stata may look as follows:
gen long id = _n
expand 3
bysort id : gen v = _n
g resp = .
replace resp = y1 if v==1
replace resp = y2 if v==2
replace resp = y3 if v==3
tab v , gen(s)
foreach x of varlist x1-x4 {
forvalues i=1/3 {
g `x'_`i' = s`i'*`x'
}
}
Let us now incorporate the random structure into our gllamm model.
The following would be a good starting point for your gllamm adventure:
eq het : s1 s2
eq load: s1 s2 s3
gllamm resp x1_1 x2_1 x1_2 x3_2 x4_3 s*, ///
link(id id probit) lv(v) fam(gauss gauss bin) fv(v) ///
s(het) i(id) eq(load) nocons
The first part specifies the regression part of the model, i.e. the correspondence between
the response and the explanatory variables. The lv() and fv() options
indicate to gllamm that different observations may have different links
and families depending on the value of the variable v. So, when v==1,
the link is identity, and the family is Gaussian. Note that at
the same time resp is effectively regressed on x1_1, x2_1
and s1, which plays the role of the constant. All other explanatory variables
are zero when v==1.
Also note that the variance (scale) of the response is modelled by the equation
het. Since the s* variables are dummies, this equation allows the two continuous
responses to have different variances. (The scale of the probit equation
is fixed at 1, so there is no need to include s3 in the het equation.)
The three combinations
of s1, s2 and s3 correspond to the three values of v.
Note also that the het equation does not contain an overall constant because the s* dummies
already provide a separate constant for each response.
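As a reading aid, assuming that the equation supplied through s() models the log of the residual standard deviation (an assumption about gllamm's parameterization, not something shown here), the estimated variance of the first continuous response would be recovered as exp(2·δ_1), where δ_1 is the coefficient on s1 in the het equation, and similarly for s2.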
Integration options
There are three integration methods provided: Gaussian quadrature, adaptive quadrature,
and discrete distribution with freely estimated point masses.
The degree of accuracy is controlled by the number of integration
points, which is specified as nip(#, #, ...). Each number corresponds to the particular
latent variable / random effect. It should be kept in mind that the computation time is
roughly proportional to the product of those numbers of integration points, so highly
multidimensional problems should be first tackled with a very low number of points, say
4 for each factor. If convergence can be achieved with that, the coefficient estimates
should be saved, and gllamm can be restarted with more integration points:
gllamm ... , nrf(4) nip(4, 4, 4, 4) adapt
mat bb = e(b)
gllamm ... , nrf(4) nip(8, 8, 8, 8) adapt from(bb) copy
It almost always seems to be a good idea to specify adaptive integration rather
than straightforward Gaussian integration. Even though some of the computational time
is spent on finding the optimal reconfiguration of the integration points, the integrals
come out to be more accurate.
The freely estimated point masses can be requested by ip(f). It is not quite
the same as discrete factor models popular in econometrics. gllamm ..., ip(f)
allocates masses in such a way that their mean is zero, and their variance is free, whilst
the discrete factor models restrict the lowest value to be put at zero, and the largest
value, at 1. Thus the two methods would have different means and variances which translates
directly into the scales of the coefficient estimates.
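One way to see the correspondence in the two-point case: if gllamm places the masses at locations e_1 < e_2 with probabilities p_1 and p_2 (so that p_1·e_1 + p_2·e_2 = 0), then writing u = e_1 + (e_2 − e_1)·u*, with u* taking the values 0 and 1, recovers the discrete-factor parameterization: the "loading" becomes e_2 − e_1, and e_1 is absorbed into the intercept.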
High performance computing
As far as gllamm is so time consuming, the researcher needs to make sure
that he uses the best computational resources available. Fortunately, UNC has
a strong support for statistical and scientific computation.
The three most powerful machines are StatApps Solaris cluster
(24 SUN 1.05GHz processors, 48 GB of RAM),
Baobab Linux cluster (352 processors , at least 2 Gb RAM per each processor),
Yatta (IBM P690 with 32 CPUs and 128GB of Memory) and Chastity IRIX cluster
(64 processors MIPS R14000 Processors @ 500MHz, 64 Gb of RAM).
See High Performance Computing web-page
for a description of the procedures required to obtain an account, upload your
data and submit your job for processing.
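Whatever the queueing system, a long gllamm run is best executed in Stata's batch mode rather than interactively; a minimal sketch (the do-file name is hypothetical, and the exact submission command depends on the cluster):
stata -b do mymodel.do
The output is then written to mymodel.log in the working directory.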
Examples
Caution: the examples here are not meant to represent reasonable analyses in terms of
the substantive area; they are merely examples of gllamm syntax that were verified to
run without software errors.
Multilevel models with educational data
The analysis follows examples in Raudenbush and Bryk (2002) textbook.
The data set contains math achievement results for 7185 students
nested in 160 schools, along with their SES, and school-averaged SES.
use http://www.ats.ucla.edu/stat/paperexamples/singer/hsb12.dta
The first analysis is an ANOVA aimed at finding out whether there are differences
between students within a school, and between schools.
mat b1 = (12, 2, 8)
gllamm math , i(school) from(b1) copy
The average score is 12.84, with a substantial variation within a school (s.d.=6.27)
and between schools (s.d.=2.62). Can this variation be explained in any manner?
The first argument that can be put forward is that the performance difference is due
to the fact that the quality of schools differs because they serve
different neighborhoods. The latter can be measured by the
average socio-economics status, and the way it is introduced into the model
at the school level is through the school means (means-as-outcomes model):
g byte one = 1
eq f1 : one meanses
mat b2 = (2, 6, 12, 6)
constr 77 [mathach]_cons=0
gllamm math , i(school) geqs(f1) diff adapt constr(77) from(b2) copy
The peculiar constraint given to gllamm serves identification purposes. Without it,
gllamm would try to estimate both the grand mean and the constant in the regression
for the latent variable, which in this case is the school mean of the test.
The estimation results:
Notice that the individual student variance remained essentially the same, but the
variance between schools has decreased dramatically, so that the R² in the regression
for the latent variable can be found as 1 − 2.593/6.841 = 0.62. The mean is slightly different
at 12.64, and the slope for SES in the level-2 regression is 5.86. The formal test
of the difference between this model and the means-only model is either a t-test
on the regression slope or a likelihood ratio test comparing the two models.
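The likelihood ratio test can be carried out with Stata's standard machinery; a sketch, assuming the two gllamm fits are stored under the (hypothetical) names anova and meanses:
estimates store anova          // after fitting the ANOVA-type model
* ... fit the means-as-outcomes model ...
estimates store meanses
lrtest anova meanses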
Alternatively, one can argue that the difference in performance can be explained by
student-level characteristics: students with different socio-demographic
characteristics enjoy different returns to those characteristics in different schools;
in other words, there are between-school differences in the slopes of individual
student characteristics. SES is recorded in this data set in deviations
from the school mean, so with the cses variable, only students within the same
school can be compared.
* Random coefficients model
eq cses: cses
eq one: one
gllamm math cses, i(school) eqs(one cses) diff adapt nrf(2)
Due to the way the data are centered, the constant in the regression is the grand mean
that we know to be about 12.64. The individual SES is on average strongly associated with higher
math grades. The cses measurements are roughly standardized, and the standard
deviation within schools is usually between 0.6 and 0.7.
The variability in this coefficient between schools is given by
the variance of the second latent variable (note the reverse order of coefficients
in the level-1 and level-2 regressions... that was an omission on the author's side!).
Thus the confidence interval for the cses slope in a school from this population of
schools can be constructed as 2.193 ± 1.96·(0.6782 + 0.1279²)^(1/2) = 2.193 ± 1.633 = (0.56, 3.826).
Even in the most equitable schools, there still
would be some differences between high and low SES students (about 2 points on the math score scale).
In general, the variances and covariances of the random effects should be interpreted with
great caution because they are not invariant to changes of scale and centering. In this case, due
to the centering of cses, the entries of the variance-covariance matrix are informative of
the following. The variance of the slopes is given by var(2), and we have already used it
in the above calculations. The variance of the intercepts conditional on a zero value of cses
is given by var(1), and in fact it is greater than what was observed before.
The correlation between the random intercepts and the random slopes is insignificantly different
from zero. If it were significantly positive, for instance, then we could say that the schools that
are better in terms of the average score also tend to have higher returns to individual SES
(which would probably be an undesirable result for educational policy: better schools seem
to be better because they are providing a better training to their top students only).
The computational time for this whole example was about 5 hours on Baobab cluster.
Selectivity model: treatreg by gllamm
The analysis is based on the part of the NLSY data available from Stata Press through
use http://www.stata-press.com/data/r8/nlswork.dta
This is a panel data set on 4148 individuals observed in 15 time periods
between 1968 and 1988.
The main preparation steps include
keep if year == 88
xi i.race
The first analysis is a straightforward use of Stata's existing treatreg command. It took Stata 10 seconds
to find the MLE.
Note that the link between the two equations (lambda) is not statistically significant,
and the treatment variable (collgrad) is not significant in the main regression, either.
We shall see how this may pose difficulties for gllamm.
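For concreteness, a treatreg call matching the variable lists used in the gllamm specification below would be (a reconstruction, not necessarily the exact call used):
treatreg ln_wage tenure _Irace_2 _Irace_3 grade, treat(collgrad = _Irace_2 _Irace_3 south)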
Preparing the data for gllamm:
* preparing for gllamm
mat bb = e(b)
* from treatreg estimation
expand 2
bysort id year : g byte var = _n
g resp = ln_wage if var == 1
replace resp = coll if var == 2
foreach x of varlist age grade south _Irace_2 _Irace_3 tenure coll {
g v1_`x' = `x'*(var == 1)
g v2_`x' = `x'*(var == 2)
}
g v1 = (var == 1)
g v2 = (var == 2)
Call to gllamm:
gllamm resp v1_tenure v1__Irace_2 v1__Irace_3 v1_grade v1_coll v1 ///
v2__Irace_2 v2__Irace_3 v2_south v2 , ///
fv(var) lv(var) link(id probit) fam(gaus bin) nip(4) adapt ///
i(id) from(bb) copy nocons difficult
The main regression estimates are identical, but the selection equation estimates are a little different.
Note that the variance of the random effect is insignificantly different from zero, which is
an alternative indication of the independence of the two equations. This time, the MLE took a bit over 10 minutes
(606 seconds).
On a second run, we shall use more integration points, and use the current estimates as the
starting values:
mat bb1 = e(b)
gllamm resp v1_tenure v1__Irace_2 v1__Irace_3 v1_grade v1_coll v1 ///
v2__Irace_2 v2__Irace_3 v2_south v2 , ///
fv(var) lv(var) link(id probit) fam(gaus bin) nip(8) copy ///
adapt i(id) from(bb1) nocons difficult
The estimates have hardly changed. It took Stata and gllamm 384 sec. to figure it out.
The final model in this example is the discrete factor model. It was tried with a number of settings,
and the best performing one was found to be the model with just two mass points.
gllamm resp v1_tenure v1__Irace_2 v1__Irace_3 v1_grade v1_coll v1 ///
v2__Irace_2 v2__Irace_3 v2_south v2 , ///
fv(var) lv(var) link(id probit) fam(gaus bin) ip(f) nip(2) ///
i(id) nocons difficult
Computation time was 230 seconds.
Note that of the two mass points specified, the bulk of the probability mass is given
to only one of them, which again is evidence that the two equations (selection and regression)
are not particularly strongly related. Sometimes it is argued that the discrete factor model
can serve as a source of identification when regressors are the same in both equations,
but that's a very weak form of identification that cannot be relied upon exclusively.
Note also that the model is parameterized differently from the one in treatreg.
The variance identification condition imposed by treatreg is that the overall
variance of the error term in the selection equation is equal to one, while gllamm
has to assume that the variance of the unique part of that error term is equal to one,
and hence the total variance is 1 + Var(u).
Sensitivity analysis for the number and location of the point masses
can be performed with gllamm's gateaux() option (based on the Gateaux derivative).
References
Hardin, J., and Hilbe, J. (2001) Generalized Linear Models and Extensions. Stata Press,
College Station, TX.
McCullagh, P., and Nelder, J. A. (1989) Generalized Linear Models. 2nd ed. Chapman and Hall,
London.
Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2002) Reliable estimation of generalized linear mixed
models using adaptive quadrature. The Stata Journal, 2 (1), 1--21.
Skrondal, A., and Rabe-Hesketh, S. (2004) Generalized Latent Variable Modeling: Multilevel, Longitudinal,
and Structural Equation Models. Chapman & Hall/CRC, Boca Raton, FL.
Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2004)
GLLAMM Manual. U.C. Berkeley Division of Biostatistics Working Paper Series, No. 160.
Raudenbush, S., and Bryk, A. (2002) Hierarchical Linear Models, 2nd edition. Sage.
Questions? Comments?
Send an email to Stas Kolenikov, skolenik at unc.edu.