
Fits a generalized linear model (GLM) accounting for mismatch errors using a mixture model framework in the secondary analysis setting. The variance-covariance matrix is estimated using the sandwich formula.

Usage

glmMixture(
  x,
  y,
  family,
  z = cbind(rep(1, nrow(x))),
  m.rate = NULL,
  safe.matches = NULL,
  control = list(),
  ...
)

Arguments

x

Design matrix for the primary outcome model (numeric matrix or data frame).

y

Response vector for the primary outcome model.

family

A family object (e.g., gaussian, binomial) specifying the error distribution and link function. Can be a character string or a function.
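The three accepted forms of family (string, generator function, family object) can be normalized with the idiom that stats::glm uses. The sketch below reproduces that base R idiom; resolve_family is a hypothetical helper name, and whether glmMixture resolves the argument in exactly this way is an assumption.

```r
# Hypothetical helper mirroring the family handling in stats::glm().
resolve_family <- function(family) {
  if (is.character(family)) {
    # "binomial" -> the generator function stats::binomial
    family <- get(family, mode = "function", envir = parent.frame())
  }
  if (is.function(family)) {
    # binomial -> binomial(), a family object with its default link
    family <- family()
  }
  if (is.null(family$family)) {
    stop("'family' not recognized")
  }
  family
}

fam_chr <- resolve_family("binomial")
fam_fun <- resolve_family(binomial)
fam_obj <- resolve_family(binomial(link = "logit"))
```

All three calls return the same kind of object, so downstream code can rely on components such as linkfun, linkinv, and variance.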

z

Design matrix for the mismatch indicator model (mismatch covariates). By default this is a single column of ones, i.e., an intercept-only model; an intercept-only model is also assumed if NULL is supplied.
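As a minimal sketch of how such a design matrix might be assembled, the block below builds an intercept-only matrix and one with covariates using base R. The data frame and its column names (name_agree, score) are hypothetical and only serve as an illustration.

```r
# Synthetic linkage-quality covariates (hypothetical names).
set.seed(1)
d <- data.frame(
  name_agree = rbinom(8, 1, 0.7),  # 1 if the linked names agree exactly
  score      = runif(8)            # a linkage score in [0, 1]
)

# Intercept-only design: one column of ones, the same shape as
# cbind(rep(1, nrow(x))) in the Usage section.
z0 <- cbind(rep(1, nrow(d)))

# Design with an intercept column plus both covariates.
z1 <- model.matrix(~ name_agree + score, data = d)
```

model.matrix adds the intercept column automatically, which avoids binding a column of ones by hand.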

m.rate

The assumed overall mismatch rate (a proportion between 0 and 1). If provided, it imposes a constraint on the mismatch model intercept.
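One plausible reading of this constraint can be sketched as follows. The sketch assumes that the mismatch probability is modeled as a logistic function of z and that the constraint fixes the average mismatch probability at m.rate; neither the link nor the exact form of the constraint is stated on this page, so both are assumptions.

```r
m.rate <- 0.05

# Intercept-only case: the rate determines the intercept directly.
gamma0 <- qlogis(m.rate)

# With covariates: for fixed (hypothetical) slopes, find the intercept
# whose average mismatch probability equals m.rate.
set.seed(2)
z_cov  <- cbind(rnorm(200), rbinom(200, 1, 0.5))
slopes <- c(0.8, -0.5)

avg_rate_gap <- function(a) mean(plogis(a + z_cov %*% slopes)) - m.rate
gamma0_cov <- uniroot(avg_rate_gap, interval = c(-20, 20), tol = 1e-10)$root
```

In the intercept-only case the constraint leaves nothing to estimate in the mismatch model; with covariates it removes one degree of freedom by tying the intercept to the slopes.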

safe.matches

Logical vector; TRUE indicates a "safe match" (treated as definitely correct), FALSE indicates a potential mismatch.

control

An optional list of control parameters. Arguments passed via ... will override values in this list.

  • max.iter: Maximum EM iterations (default: 1000).

  • cmax.iter: Maximum iterations for the subroutine in the constrained logistic regression function (default: 1000).

  • tol: Convergence tolerance (default: 1e-4).

  • init.beta: Initial parameter estimates for the outcome model.

  • init.gamma: Initial parameter estimates for the mismatch indicator model.

  • fy: Estimated marginal density of the response. If NULL, it is estimated using kernel density estimation or a parametric assumption.
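A marginal density estimate for the response could be computed in base R along the following lines. The expected format of fy is not stated here; the sketch assumes a numeric vector of density values evaluated at each observed response, and the simulated y is purely illustrative.

```r
set.seed(3)
y <- rnorm(500, mean = 70, sd = 10)

# Kernel density estimate on a grid (Gaussian kernel by default).
kde <- density(y, n = 1024)

# Interpolate the grid estimate back to the observed responses.
fy_kde <- approx(kde$x, kde$y, xout = y)$y

# Parametric alternative: plug-in normal density.
fy_par <- dnorm(y, mean = mean(y), sd = sd(y))
```

The kernel estimate adapts to skewed or multimodal responses, while the parametric version is smoother when the distributional assumption is reasonable.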

...

Additional control parameters. Values supplied here override the corresponding entries in control.
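The override order described for control and ... can be illustrated with utils::modifyList. This is a sketch of the documented precedence only; merge_control is a hypothetical helper, and the defaults listed are the ones documented above.

```r
# Defaults as documented for the control list.
defaults <- list(max.iter = 1000, cmax.iter = 1000, tol = 1e-4)

# Hypothetical helper: start from defaults, apply `control`, then `...`.
merge_control <- function(control = list(), ...) {
  out <- utils::modifyList(defaults, control)
  utils::modifyList(out, list(...))
}

# tol appears in both places; the value passed via ... wins.
ctrl <- merge_control(control = list(max.iter = 200, tol = 1e-6), tol = 1e-8)
```

Here ctrl keeps cmax.iter at its default, takes max.iter from control, and takes tol from the argument passed through the dots.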

Value

A list of results:

coefficients

A named vector of coefficients for the outcome model.

m.coefficients

A named vector of coefficients for the mismatch indicator model (gamma).

match.prob

The posterior correct match probabilities (weights) for each observation.
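Under the two-component mixture described in the references, a correctly matched record contributes the conditional density f(y | x) and a mismatched record contributes the marginal density f(y). The posterior match probability then follows from Bayes' rule. The sketch below assumes a Gaussian outcome with known parameters and a constant prior mismatch probability h; setting the weight to 1 for safe matches is an assumed handling, not a documented one.

```r
set.seed(4)
n <- 300
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Permute the responses of 60 records to mimic mismatches.
mis <- sample(n, size = 60)
y[mis] <- y[sample(mis)]

beta  <- c(1, 2)      # outcome coefficients, taken as known here
sigma <- 1            # residual standard deviation, taken as known
h     <- rep(0.2, n)  # prior mismatch probability for each record

f_match    <- dnorm(y, mean = beta[1] + beta[2] * x, sd = sigma)  # f(y | x)
f_mismatch <- dnorm(y, mean = mean(y), sd = sd(y))                # f(y)

# Posterior probability that each record is a correct match.
w_raw <- (1 - h) * f_match / ((1 - h) * f_match + h * f_mismatch)

# Records flagged as safe matches get weight 1 (assumed handling).
safe <- rep(c(TRUE, FALSE), length.out = n)
w <- ifelse(safe, 1, w_raw)
```

Records whose response is implausible given x receive small weights and therefore contribute little to the weighted outcome model.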

residuals

The residuals on the response scale, defined as y - fitted.values (these coincide with the working residuals only under the identity link).

fitted.values

The fitted mean values of the outcome model, obtained by transforming the linear predictors by the inverse of the link function.

linear.predictors

The linear fit on the link scale.
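The relationship between linear.predictors and fitted.values can be shown with a base R family object. The design matrix and coefficients below are made up for illustration.

```r
fam  <- binomial(link = "logit")
X    <- cbind(1, c(-1, 0, 1))   # intercept plus one covariate
beta <- c(0.5, 1)

eta <- drop(X %*% beta)   # linear predictors (link scale)
mu  <- fam$linkinv(eta)   # fitted means (response scale)
```

Applying fam$linkfun to mu recovers eta, which is a quick consistency check on a fitted object.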

deviance

The deviance of the weighted outcome model at convergence.

null.deviance

The deviance of the weighted null outcome model.

var

The estimated variance-covariance matrix of the parameters (sandwich estimator).
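The general shape of a sandwich estimator is "bread, meat, bread". The sketch below shows that shape for ordinary least squares with heteroscedastic errors; it is a generic illustration and not the package's estimator, which concerns the parameters of the mixture pseudo-likelihood.

```r
set.seed(5)
n <- 200
X <- cbind(1, rnorm(n))
y <- drop(X %*% c(1, 2)) + rnorm(n, sd = 1 + abs(X[, 2]))

beta_hat <- solve(crossprod(X), crossprod(X, y))
r <- drop(y - X %*% beta_hat)   # residuals

bread <- solve(crossprod(X))    # inverse of X'X
meat  <- crossprod(X * r)       # sum of outer products of x_i * r_i
V     <- bread %*% meat %*% bread

se <- sqrt(diag(V))             # robust standard errors
```

Standard errors for the fitted model are obtained in the same way, as the square roots of the diagonal of var.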

dispersion

The estimated dispersion parameter (e.g., variance for Gaussian, 1/shape for Gamma).

objective

A vector tracking the negative log pseudo-likelihood at each iteration of the EM algorithm.
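For intuition, a negative log pseudo-likelihood of the mixture form used in the references can be written in a few lines. The sketch assumes a Gaussian outcome, a constant mismatch probability h, and a plug-in normal estimate of the marginal density; neg_log_pl is a hypothetical function, not part of the package.

```r
set.seed(6)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Plug-in estimate of the marginal density of y.
fy <- dnorm(y, mean = mean(y), sd = sd(y))

# Hypothetical objective: mixture of f(y | x) and f(y) with weight h.
neg_log_pl <- function(beta, sigma, h) {
  f_match <- dnorm(y, mean = beta[1] + beta[2] * x, sd = sigma)
  -sum(log((1 - h) * f_match + h * fy))
}

obj_true <- neg_log_pl(c(1, 2), sigma = 1, h = 0.05)
obj_off  <- neg_log_pl(c(0, 0), sigma = 1, h = 0.05)
```

Parameter values close to the data-generating ones give a smaller objective, and plotting the objective component of a fit against the iteration index is a simple way to inspect how the algorithm progressed.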

converged

Logical indicating whether the EM algorithm converged within max.iter iterations.

rank

The numeric rank of the fitted linear model.

df.residual

The residual degrees of freedom.

df.null

The residual degrees of freedom for the null model.

family

The family object used.

call

The matched call.

References

Slawski, M., West, B. T., Bukke, P., Wang, Z., Diao, G., & Ben-David, E. (2025). A general framework for regression with mismatched data based on mixture modelling. Journal of the Royal Statistical Society Series A: Statistics in Society, 188(3), 896-919. doi:10.1093/jrsssa/qnae083

Slawski, M., Diao, G., & Ben-David, E. (2021). A pseudo-likelihood approach to linear regression with partially shuffled data. Journal of Computational and Graphical Statistics, 30(4), 991-1003. doi:10.1080/10618600.2020.1870482

Examples

data(lifem)

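# Outcome model: intercept plus a raw cubic polynomial in unit_yob
x <- cbind(1, poly(lifem$unit_yob, 3, raw = TRUE))
y <- lifem$age_at_death

# Mismatch indicator model: intercept plus commf and comml
z <- cbind(1, lifem$commf, lifem$comml)

# Pass z by name so it cannot be confused with another argument
fit <- glmMixture(x, y, family = "gaussian", z = z,
                  m.rate = 0.05, safe.matches = lifem$hndlnk)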