The problem

A crucial problem for inference about non-linear models with random effects is that the likelihood function for such models involves integrals for which no analytical solution exists.

For given values 𝐛\boldsymbol{b} of the random effects the likelihood function of a conditional logit model (and therefore also of a baseline-logit model) can be written in the form

β„’cpl(𝐲,𝐛)=exp(β„“cpl(𝐲,𝐛))=exp(β„“(𝐲|𝐛;𝛂)βˆ’12lndet(𝚺)βˆ’12π›β€²πšΊβˆ’1𝐛) \mathcal{L}_{\text{cpl}}(\boldsymbol{y},\boldsymbol{b}) = \exp\left(\ell_{\text{cpl}}(\boldsymbol{y},\boldsymbol{b})\right) =\exp \left( \ell(\boldsymbol{y}|\boldsymbol{b};\boldsymbol{\alpha}) -\frac12\ln\det(\boldsymbol{\Sigma}) -\frac12\boldsymbol{b}'\boldsymbol{\Sigma}^{-1}\boldsymbol{b} \right)

However, this β€œcomplete data” likelihood function cannot be used for inference, because it depends on the unobserved random effects. To arrive at a likelihood function that depends only on observed data, one needs to used the following integrated likelihood function:

β„’obs(𝐲)=∫exp(β„“cpl(𝐲,𝐛))βˆ‚π›=∫exp(β„“(𝐲|𝐛;𝛂)βˆ’12lndet(𝚺)βˆ’12π›β€²πšΊβˆ’1𝐛)βˆ‚π› \mathcal{L}_{\text{obs}}(\boldsymbol{y}) = \int \exp\left(\ell_{\text{cpl}}(\boldsymbol{y},\boldsymbol{b})\right) \partial \boldsymbol{b} = \int \exp \left( \ell(\boldsymbol{y}|\boldsymbol{b};\boldsymbol{\alpha}) -\frac12\ln\det(\boldsymbol{\Sigma}) -\frac12\boldsymbol{b}'\boldsymbol{\Sigma}^{-1}\boldsymbol{b} \right) \partial \boldsymbol{b}

In general, this integral cannot be β€œsolved”, i.e.Β eliminated from the formula by analytic means (it is β€œanalytically untractable”). Instead, one will compute it either using numeric techniques (e.g.Β using numerical quadrature) or approximate it using analytical techniques. Unless there is only a single level of random effects numerical quadrature can become computationally be demanding, that is, the computation of the (log-)likelihood function and its derivatives can take a lot of time even on modern, state-of-the-art computer hardware. Yet approximations based on analytical techniques hand may lead to biased estimates in particular in samples where the number of observations relative to the number of random offects is small, but at least they are much easier to compute and sometimes making inference possible after all.

The package β€œmclogit” supports to kinds of analytical approximations, the Laplace approximation and what one may call the Solomon-Cox appoximation. Both approximations are based on a quadratic expansion of the integrand so that the thus modified integral does have a closed-form solution, i.e.Β is analytically tractable.

The Laplace approximation and PQL

Laplace approximation

The (first-order) Laplace approximation is based on the quadratic expansion the logarithm of the integrand, the complete-data log-likelihood

β„“cpl(𝐲,𝐛)β‰ˆβ„“(𝐲|𝐛̃;𝛂)βˆ’12(π›βˆ’π›Μƒ)′𝐇̃(π›βˆ’π›Μƒ)βˆ’12lndet(𝚺)βˆ’12(π›βˆ’π›Μƒ)β€²πšΊβˆ’1(π›βˆ’π›Μƒ) \ell_{\text{cpl}}(\boldsymbol{y},\boldsymbol{b})\approx \ell(\boldsymbol{y}|\tilde{\boldsymbol{b}};\boldsymbol{\alpha}) - \frac12 (\boldsymbol{b}-\tilde{\boldsymbol{b}})' \tilde{\boldsymbol{H}} (\boldsymbol{b}-\tilde{\boldsymbol{b}}) -\frac12\ln\det(\boldsymbol{\Sigma}) -\frac12(\boldsymbol{b}-\tilde{\boldsymbol{b}})'\boldsymbol{\Sigma}^{-1}(\boldsymbol{b}-\tilde{\boldsymbol{b}})

where 𝐛̃\tilde{\boldsymbol{b}} is the solution to

βˆ‚β„“cpl(𝐲,𝐛)βˆ‚π›=0 \frac{\partial\ell_{\text{cpl}}(\boldsymbol{y},\boldsymbol{b})}{\partial\boldsymbol{b}} = 0

and 𝐇̃=𝐇(𝐛̃)\tilde{\boldsymbol{H}}=\boldsymbol{H}(\tilde{\boldsymbol{b}}) is the value of the negative Hessian with respect to 𝐛\boldsymbol{b}

𝐇(𝐛)=βˆ’βˆ‚2β„“(𝐲|𝐛;𝛂)βˆ‚π›βˆ‚π›β€² \boldsymbol{H}(\boldsymbol{b})=-\frac{\partial^2\ell(\boldsymbol{y}|\boldsymbol{b};\boldsymbol{\alpha})}{\partial\boldsymbol{b}\partial\boldsymbol{b}'}

for 𝐛=𝐛̃\boldsymbol{b}=\tilde{\boldsymbol{b}}.

Since this quadratic expansionβ€”let us call it β„“Lapl*(𝐲,𝐛)\ell^*_{\text{Lapl}}(\boldsymbol{y},\boldsymbol{b})β€”is a (multivariate) quadratic function of 𝐛\boldsymbol{b}, the integral of its exponential does have a closed-form solution (the relevant formula can be found in Harville (1997)).

For purposes of estimation, the resulting approximate log-likelihood is more useful:

β„“Lapl*=ln∫exp(β„“Lapl(𝐲,𝐛))βˆ‚π›=β„“(𝐲|𝐛̃;𝛂)βˆ’12π›Μƒβ€²πšΊβˆ’1π›Μƒβˆ’12lndet(𝚺)βˆ’12lndet(𝐇̃+πšΊβˆ’1). \ell^*_{\text{Lapl}} = \ln\int \exp(\ell_{\text{Lapl}}(\boldsymbol{y},\boldsymbol{b})) \partial\boldsymbol{b} = \ell(\boldsymbol{y}|\tilde{\boldsymbol{b}};\boldsymbol{\alpha}) - \frac12\tilde{\boldsymbol{b}}'\boldsymbol{\Sigma}^{-1}\tilde{\boldsymbol{b}} - \frac12\ln\det(\boldsymbol{\Sigma}) - \frac12\ln\det\left(\tilde{\boldsymbol{H}}+\boldsymbol{\Sigma}^{-1}\right).

Penalized quasi-likelihood (PQL)

If one disregards the dependence of 𝐇̃\tilde{\boldsymbol{H}} on 𝛂\boldsymbol{\alpha} and 𝐛\boldsymbol{b}, then 𝐛̃\tilde{\boldsymbol{b}} maximizes not only β„“cpl(𝐲,𝐛)\ell_{\text{cpl}}(\boldsymbol{y},\boldsymbol{b}) but also β„“Lapl*\ell^*_{\text{Lapl}}. This motivates the following IWLS/Fisher scoring equations for 𝛂̂\hat{\boldsymbol{\alpha}} and 𝐛̃\tilde{\boldsymbol{b}} (see Breslow and Clayton (1993) and this page):

[𝐗′𝐖𝐗𝐗′𝐖𝐙𝐙′𝐖𝐗𝐙′𝐖𝐙+πšΊβˆ’1][𝛂̂𝐛̃]=[𝐗′𝐖𝐲*𝐙′𝐖𝐲*] \begin{aligned} \begin{bmatrix} \boldsymbol{X}'\boldsymbol{W}\boldsymbol{X} & \boldsymbol{X}'\boldsymbol{W}\boldsymbol{Z} \\ \boldsymbol{Z}'\boldsymbol{W}\boldsymbol{X} & \boldsymbol{Z}'\boldsymbol{W}\boldsymbol{Z} + \boldsymbol{\Sigma}^{-1}\\ \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\alpha}}\\ \tilde{\boldsymbol{b}}\\ \end{bmatrix} = \begin{bmatrix} \boldsymbol{X}'\boldsymbol{W}\boldsymbol{y}^*\\ \boldsymbol{Z}'\boldsymbol{W}\boldsymbol{y}^* \end{bmatrix} \end{aligned}

where

𝐲*=𝐗𝛂+𝐙𝐛+π–βˆ’(π²βˆ’π›‘) \boldsymbol{y}^* = \boldsymbol{X}\boldsymbol{\alpha} + \boldsymbol{Z}\boldsymbol{b} + \boldsymbol{W}^{-}(\boldsymbol{y}-\boldsymbol{\pi})

is the IWLS β€œworking dependend variable” with 𝛂\boldsymbol{\alpha}, 𝐛\boldsymbol{b}, 𝐖\boldsymbol{W}, and 𝛑\boldsymbol{\pi} computed in an earlier iteration.

Substitutions lead to the equations:

(π—π•βˆ’π—)𝛂̂=π—π•βˆ’π²* (\boldsymbol{X}\boldsymbol{V}^-\boldsymbol{X})\hat{\boldsymbol{\alpha}} = \boldsymbol{X}\boldsymbol{V}^-\boldsymbol{y}^*

and

(𝐙′𝐖𝐙+πšΊβˆ’1)𝐛=𝐙′𝐖(𝐲*βˆ’π—π›‚) (\boldsymbol{Z}'\boldsymbol{W}\boldsymbol{Z} + \boldsymbol{\Sigma}^{-1})\boldsymbol{b} = \boldsymbol{Z}'\boldsymbol{W}(\boldsymbol{y}^*-\boldsymbol{X}\boldsymbol{\alpha})

which can be solved to compute 𝛂̂\hat{\boldsymbol{\alpha}} and 𝐛̃\tilde{\boldsymbol{b}} (for given 𝚺\boldsymbol{\Sigma})

Here

𝐕=π–βˆ’+π™πšΊπ™β€² \boldsymbol{V} = \boldsymbol{W}^-+\boldsymbol{Z}\boldsymbol{\Sigma}\boldsymbol{Z}'

and

π•βˆ’=π–βˆ’π–π™β€²(𝐙′𝐖𝐙+πšΊβˆ’1)βˆ’1𝐙𝐖 \boldsymbol{V}^- = \boldsymbol{W}- \boldsymbol{W}\boldsymbol{Z}'\left(\boldsymbol{Z}'\boldsymbol{W}\boldsymbol{Z}+\boldsymbol{\Sigma}^{-1}\right)^{-1}\boldsymbol{Z}\boldsymbol{W}

Following Breslow and Clayton (1993) the variance parameters in 𝚺\boldsymbol{\Sigma} are estimated by minimizing

q1=det(𝐕)+(𝐲*βˆ’π—π›‚)π•βˆ’(𝐲*βˆ’π—π›‚) q_1 = \det(\boldsymbol{V})+(\boldsymbol{y}^*-\boldsymbol{X}\boldsymbol{\alpha})\boldsymbol{V}^-(\boldsymbol{y}^*-\boldsymbol{X}\boldsymbol{\alpha})

or the β€œREML” variant:

q2=det(𝐕)+(𝐲*βˆ’π—π›‚)π•βˆ’(𝐲*βˆ’π—π›‚)+det(π—β€²π•βˆ’π—) q_2 = \det(\boldsymbol{V})+(\boldsymbol{y}^*-\boldsymbol{X}\boldsymbol{\alpha})\boldsymbol{V}^-(\boldsymbol{y}^*-\boldsymbol{X}\boldsymbol{\alpha})+\det(\boldsymbol{X}'\boldsymbol{V}^{-}\boldsymbol{X})

This motivates the following algorithm, which is strongly inspired by the glmmPQL() function in Brian Ripley’s R package MASS (Venables and Ripley 2002):

  1. Create some suitable starting values for 𝛑\boldsymbol{\pi}, 𝐖\boldsymbol{W}, and 𝐲*\boldsymbol{y}^*
  2. Construct the β€œworking dependent variable” 𝐲*\boldsymbol{y}^*
  3. Minimize q1q_1 (quasi-ML) or q2q_2 (quasi-REML) iteratively (inner loop), to obtain an estimate of 𝚺\boldsymbol{\Sigma}
  4. Obtain hat𝛂hat{\boldsymbol{\alpha}} and 𝐛̃\tilde{\boldsymbol{b}} based on the current estimate of 𝚺\boldsymbol{\Sigma}
  5. Compute updated π›ˆ=𝐗𝛂+𝐙𝐛\boldsymbol{\eta}=\boldsymbol{X}\boldsymbol{\alpha} + \boldsymbol{Z}\boldsymbol{b}, 𝛑\boldsymbol{\pi}, 𝐖\boldsymbol{W}.
  6. If the change in π›ˆ\boldsymbol{\eta} is smaller than a given tolerance criterion stop the algorighm and declare it as converged. Otherwise go back to step 2 with the updated values of 𝛂̂\hat{\boldsymbol{\alpha}} and 𝐛̃\tilde{\boldsymbol{b}}.

This algorithm is a modification of the IWLS algorithm used to fit conditional logit models without random effects. Instead of just solving a linear requatoin in step 3, it estimates a weighted linear mixed-effects model. In contrast to glmmPQL() it does not use the lme() function from package nlme (Pinheiro and Bates 2000) for this, because the weighting matrix 𝐖\boldsymbol{W} is non-diagonal. Instead, q1q_1 or q2q_2 are minimized using the function nlminb from the standard R package β€œstats” or some other optimizer chosen by the user.

The Solomon-Cox approximation and MQL

The Solomon-Cox approximation

The (first-order) Solomon approximation (Solomon and Cox 1992) is based on the quadratic expansion the integrand

β„“cpl(𝐲,𝐛)β‰ˆβ„“(𝐲|𝟎;𝛂)+𝐠0β€²π›βˆ’12𝐛′𝐇0π›βˆ’12lndet(𝚺)βˆ’12π›β€²πšΊβˆ’1𝐛 \ell_{\text{cpl}}(\boldsymbol{y},\boldsymbol{b})\approx \ell(\boldsymbol{y}|\boldsymbol{0};\boldsymbol{\alpha}) + \boldsymbol{g}_0' \boldsymbol{b} - \frac12 \boldsymbol{b}' \boldsymbol{H}_0 \boldsymbol{b} -\frac12\ln\det(\boldsymbol{\Sigma}) -\frac12\boldsymbol{b}'\boldsymbol{\Sigma}^{-1}\boldsymbol{b}

where 𝐠_0=𝐠(𝟎)\boldsymbol{g}\_0=\boldsymbol{g}(\boldsymbol{0}) is the gradient of β„“(𝐲βˆ₯𝐛;𝛂)\ell(\boldsymbol{y}\|\boldsymbol{b};\boldsymbol{\alpha})

𝐠(𝐛)=βˆ’βˆ‚β„“(𝐲|𝐛;𝛂)βˆ‚π› \boldsymbol{g}(\boldsymbol{b})=-\frac{\partial\ell(\boldsymbol{y}|\boldsymbol{b};\boldsymbol{\alpha})}{\partial\boldsymbol{b}}

at 𝐛=𝟎\boldsymbol{b}=\boldsymbol{0}, while 𝐇_0=𝐇(𝟎)\boldsymbol{H}\_0=\boldsymbol{H}(\boldsymbol{0}) is the negative Hessian at 𝐛=𝟎\boldsymbol{b}=\boldsymbol{0}.

Like before, the integral of the exponential this quadratic expansion (which we refer to as β„“SC(𝐲,𝐛)\ell_{\text{SC}}(\boldsymbol{y},\boldsymbol{b})) has a closed-form solution, as does its logarithm, which is:

ln∫exp(β„“SC(𝐲,𝐛))βˆ‚π›=β„“(𝐲|𝟎;𝛂)βˆ’12𝐠0β€²(𝐇0+πšΊβˆ’1)βˆ’1𝐠0βˆ’12lndet(𝚺)βˆ’12lndet(𝐇0+πšΊβˆ’1). \ln\int \exp(\ell_{\text{SC}}(\boldsymbol{y},\boldsymbol{b})) \partial\boldsymbol{b} = \ell(\boldsymbol{y}|\boldsymbol{0};\boldsymbol{\alpha}) - \frac12\boldsymbol{g}_0'\left(\boldsymbol{H}_0+\boldsymbol{\Sigma}^{-1}\right)^{-1}\boldsymbol{g}_0 - \frac12\ln\det(\boldsymbol{\Sigma}) - \frac12\ln\det\left(\boldsymbol{H}_0+\boldsymbol{\Sigma}^{-1}\right).

Marginal quasi-likelhood (MQL)

The resulting estimation technique is very similar to PQL (again, see Breslow and Clayton 1993 for a discussion). The only difference is the construction of the β€œworking dependent” variable 𝐲*\boldsymbol{y}^*. With PQL it is constructed as 𝐲*=𝐗𝛂+𝐙𝐛+π–βˆ’(π²βˆ’π›‘)\boldsymbol{y}^* = \boldsymbol{X}\boldsymbol{\alpha} + \boldsymbol{Z}\boldsymbol{b} + \boldsymbol{W}^{-}(\boldsymbol{y}-\boldsymbol{\pi}) while the MQL working dependent variable is just

𝐲*=𝐗𝛂+π–βˆ’(π²βˆ’π›‘) \boldsymbol{y}^* = \boldsymbol{X}\boldsymbol{\alpha} + \boldsymbol{W}^{-}(\boldsymbol{y}-\boldsymbol{\pi})

so that the algorithm has the following steps:

  1. Create some suitable starting values for 𝛑\boldsymbol{\pi}, 𝐖\boldsymbol{W}, and 𝐲*\boldsymbol{y}^*
  2. Construct the β€œworking dependent variable” 𝐲*\boldsymbol{y}^*
  3. Minimize q1q_1 (quasi-ML) or q2q_2 (quasi-REML) iteratively (inner loop), to obtain an estimate of 𝚺\boldsymbol{\Sigma}
  4. Obtain 𝛂̂\hat{\boldsymbol{\alpha}} based on the current estimate of 𝚺\boldsymbol{\Sigma}
  5. Compute updated π›ˆ=𝐗𝛂\boldsymbol{\eta}=\boldsymbol{X}\boldsymbol{\alpha}, 𝛑\boldsymbol{\pi}, 𝐖\boldsymbol{W}.
  6. If the change in π›ˆ\boldsymbol{\eta} is smaller than a given tolerance criterion stop the algorighm and declare it as converged. Otherwise go back to step 2 with the updated values of 𝛂̂\hat{\boldsymbol{\alpha}}.

References

Breslow, Norman E., and David G. Clayton. 1993. β€œApproximate Inference in Generalized Linear Mixed Models.” Journal of the American Statistical Association 88 (421): 9–25.
Harville, David A. 1997. Matrix Algebra from a Statistician’s Perspective. New York: Springer.
Pinheiro, JosΓ© C., and Douglas M. Bates. 2000. Mixed-Effects Models in s and s-PLUS. New York: Springer. https://doi.org/10.1007/b98882.
Solomon, P. J., and D. R. Cox. 1992. β€œNonlinear Component of Variance Models.” Biometrika 79 (1): 1–11. https://doi.org/10.1093/biomet/79.1.1.
Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with s. Fourth. New York: Springer. https://www.stats.ox.ac.uk/pub/MASS4/.