Fits a Cox proportional hazards regression adjusting for mismatched data using a mixture modeling framework in the secondary analysis setting. The method relies on a two-component mixture model where true matches follow the Cox model and mismatches follow the marginal distribution of the survival outcome. Variance estimates are obtained via Louis' method.
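As a rough sketch of the model described above, write $m_i$ for the latent mismatch indicator of record $i$, $\delta_i$ for its event indicator, and $\pi_i = P(m_i = 1 \mid z_i)$ for its mismatch probability. This notation is introduced here for illustration and is not taken from the package. Under these assumptions the two-component mixture and the resulting posterior match probability take the form:

```latex
\begin{aligned}
f(y_i, \delta_i \mid x_i, z_i)
  &= (1 - \pi_i)\, f_{\mathrm{Cox}}(y_i, \delta_i \mid x_i; \beta, \Lambda_0)
   + \pi_i\, f_{\mathrm{marg}}(y_i, \delta_i), \\
f_{\mathrm{Cox}}(y_i, \delta_i \mid x_i; \beta, \Lambda_0)
  &= \bigl[\lambda_0(y_i)\, e^{x_i^\top \beta}\bigr]^{\delta_i}
     \exp\!\bigl\{-\Lambda_0(y_i)\, e^{x_i^\top \beta}\bigr\}, \\
P(m_i = 0 \mid y_i, \delta_i, x_i, z_i)
  &= \frac{(1 - \pi_i)\, f_{\mathrm{Cox}}(y_i, \delta_i \mid x_i; \beta, \Lambda_0)}
          {f(y_i, \delta_i \mid x_i, z_i)}.
\end{aligned}
```

Here $f_{\mathrm{marg}}$ does not depend on $x_i$, which is what lets mismatched records carry no information about $\beta$.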

Usage

coxphMixture(
  x,
  y,
  cens,
  z,
  m.rate = NULL,
  safe.matches = NULL,
  control = list(),
  ...
)

Arguments

x

A matrix or data.frame of covariates (design matrix).

y

A numeric vector of observed time-to-event outcomes.

cens

A numeric vector indicating censoring status (1 = censored, 0 = event). Note: This is the reverse of the standard Surv object convention where 1 usually indicates an event.
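Because this coding is the reverse of the Surv convention, a status vector coded for Surv has to be flipped before it is passed as cens. A minimal sketch, where status is a hypothetical 0/1 event indicator:

```r
# Hypothetical event indicator in the Surv convention (1 = event, 0 = censored)
status <- c(1, 0, 1, 1, 0)

# Flip to the coding expected by the cens argument (1 = censored, 0 = event)
cens <- 1 - status
cens
```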

z

A matrix or data.frame of mismatch covariates (e.g., match scores, blocking variables). Used to model the probability of a mismatch.

m.rate

An optional numeric value between 0 and 1 specifying an assumed upper bound on the overall mismatch rate. If provided, the mismatch indicator model is constrained so that the average estimated mismatch rate does not exceed this bound.

safe.matches

A logical vector indicating which records are known to be correct matches (TRUE). These records are fixed as matches (probability 1) during estimation. If NULL (the default), no record is treated as a safe match (all FALSE).
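One way to build safe.matches is to threshold a match score. The sketch below is illustrative only: score and the 0.99 cutoff are hypothetical, and whether a high score justifies treating a record as a certain match depends on the linkage procedure.

```r
# Hypothetical match scores in (0, 1); higher means more confidence in the link
score <- c(0.999, 0.62, 0.995, 0.40, 0.97)

# Flag records whose score exceeds an arbitrary, illustrative cutoff
safe <- score > 0.99
safe

# The flags would then be supplied alongside the other arguments, e.g.
# coxphMixture(x, y, cens, z, safe.matches = safe)
```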

control

An optional list of control parameters. Parameters can also be passed directly via ....

  • louis.k: Number of Monte Carlo iterations for variance estimation (default: 1000).

  • max.iter: Maximum EM iterations (default: 1000).

  • cmax.iter: Maximum iterations for the constrained optimization subroutine (default: 1000).

  • tol: Convergence tolerance (default: 1e-4).

  • init.beta: Initial estimates for outcome model coefficients.

  • init.gamma: Initial estimates for mismatch model coefficients.

  • fy: Pre-calculated marginal density of the response. If NULL, estimated non-parametrically.

...

Additional arguments passed to control.
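To illustrate how a control list and ... can feed the same set of parameters, the sketch below merges user input over the documented defaults. merge_control is a hypothetical helper written for this example, not a function of the package, and letting ... override control is an assumption about precedence.

```r
# Hypothetical helper: combine documented defaults, a control list, and ...
merge_control <- function(control = list(), ...) {
  defaults <- list(louis.k = 1000, max.iter = 1000, cmax.iter = 1000, tol = 1e-4)
  # Later lists win: defaults < control < ... (assumed precedence)
  utils::modifyList(utils::modifyList(defaults, control), list(...))
}

# max.iter comes from the control list, tol from a directly named argument
ctrl <- merge_control(control = list(max.iter = 50), tol = 1e-6)
str(ctrl)
```

Under this reading, control = list(max.iter = 50) and a direct max.iter = 50 argument would be interchangeable.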

Value

A list of results:

coefficients

Estimated coefficients for the outcome model (beta).

m.coefficients

Estimated coefficients for the mismatch model (gamma).

var

Variance-covariance matrix of the estimates.

linear.predictors

Linear predictors for the outcome model.

means

Column means of the covariate matrix x.

n

Number of observations.

nevent

Number of events.

match.prob

Posterior probabilities that each observation is a correct match.
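These probabilities can be used to single out records that are likely mismatches. A minimal sketch with made-up values: match_prob stands in for the match.prob element of a fitted object, and the 0.5 cutoff is an arbitrary choice.

```r
# Made-up posterior match probabilities for five records
match_prob <- c(0.98, 0.12, 0.91, 0.47, 0.99)

# Indices of records that are more likely mismatched than matched
likely_mismatch <- which(match_prob < 0.5)
likely_mismatch

# Average of the posterior match probabilities across records
mean(match_prob)
```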

objective

Value of the negative log pseudo-likelihood at each iteration.

converged

Logical indicating if the algorithm converged.

Lambdahat0

Estimated baseline cumulative hazard.

gLambdahat0

Estimated baseline cumulative hazard for the marginal distribution of the response variable, obtained with the Nelson-Aalen estimator.
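For reference, the Nelson-Aalen estimator accumulates, over the distinct event times, the number of events divided by the number of records still at risk. The base R sketch below uses this function's cens coding; nelson_aalen is a helper written for this example and is not the package's internal implementation.

```r
# Illustrative Nelson-Aalen estimator (cens: 1 = censored, 0 = event)
nelson_aalen <- function(y, cens) {
  event_times <- sort(unique(y[cens == 0]))
  # Number of events at each distinct event time
  d <- vapply(event_times, function(t) sum(y == t & cens == 0), numeric(1))
  # Number of records still at risk just before each event time
  r <- vapply(event_times, function(t) sum(y >= t), numeric(1))
  data.frame(time = event_times, cumhaz = cumsum(d / r))
}

# Toy data: events at times 2, 3, and 5; censored records at times 3 and 8
y_toy    <- c(2, 3, 3, 5, 8)
cens_toy <- c(0, 0, 1, 0, 1)
na_fit <- nelson_aalen(y_toy, cens_toy)
na_fit
```

In the toy data the increments are 1/5, 1/4, and 1/2, so the cumulative hazard steps through 0.20, 0.45, and 0.95.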

References

Bukke, P., Ben-David, E., Diao, G., Slawski, M., & West, B. T. (2025). Cox Proportional Hazards Regression Using Linked Data: An Approach Based on Mixture Modelling.

Examples

library(survival)
set.seed(123)
n <- 200
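# Generate covariates: a continuous covariate, a binary group indicator,
# and their interaction (the third column of X is left unnamed)
x_cov <- seq(-3, 3, length.out = n)
d_cov <- rep(0:1, each = n / 2)
X <- cbind(d_cov, x_cov, x_cov * d_cov)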

# True parameters
b <- c(-1.5, 1, 0.5)
sigma <- 0.25
mu <- X %*% b
y <- exp(drop(mu)) * rweibull(n, shape = 1/sigma)

# Administrative censoring at t = 1.5 (cens is TRUE for censored records)
cens <- (y >= 1.5)
y[cens] <- 1.5

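# Generate mismatch errors
ps <- rbeta(n, 4.5, 0.5)  # match scores in (0, 1)
logit_ps <- log(ps / (1 - ps))
mp <- cbind(1, logit_ps)
gamma_true <- c(-0.5, 1)
# m = 1 flags a mismatched record; plogis(...) is the probability of a correct match
m <- 1 - rbinom(n, prob = plogis(mp %*% gamma_true), size = 1)
# Permute the outcomes among the mismatched records
yperm <- y
shuffled_ix <- sample(which(m == 1))
yperm[shuffled_ix] <- yperm[sample(shuffled_ix)]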

# Fit model
fit <- coxphMixture(x = X, y = yperm, cens = as.numeric(cens),
                    z = matrix(logit_ps, ncol = 1),
                    control = list(max.iter = 50))

print(fit)
#> Call:
#> NULL
#> 
#> Outcome Model Coefficients:
#>  d_cov  x_cov        
#>  7.005 -4.377 -2.899 
#> 
#> Mismatch Model Coefficients:
#> [1] 0.9644
#> 
#> Likelihood ratio test (model=outcome) not available due to pseudo-likelihood.
#> n= 200 , number of events= 144 
summary(fit)
#> 
#> Call:
#> NULL
#> 
#> --- Outcome Model (Cox PH) ---
#>             coef  exp(coef)   se(coef)       z Pr(>|z|)    
#> d_cov    7.00547 1102.65118    0.66757  10.494  < 2e-16 ***
#> x_cov   -4.37692    0.01256    0.37763 -11.590  < 2e-16 ***
#>         -2.89911    0.05507    0.61015  -4.751 2.02e-06 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> --- Hazard Ratios & Confidence Intervals ---
#>       exp(coef) exp(-coef) lower .95 upper .95
#> d_cov 1.103e+03  9.069e-04 2.980e+02 4.080e+03
#> x_cov 1.256e-02  7.959e+01 5.994e-03 2.634e-02
#>       5.507e-02  1.816e+01 1.666e-02 1.821e-01
#> 
#> --- Mismatch Indicator Model ---
#>      Estimate Std. Error z value Pr(>|z|)    
#> [1,]   0.9644     0.1566   6.158 7.37e-10 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Average Estimated Correct Match Rate: 0.8789 
#> Events: 144  / Total: 200 
#> Iterations: 24 
#>
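The interval limits in the hazard ratio table are consistent with Wald intervals formed on the log scale, exp(coef ± z * se). The sketch below recomputes them from the coefficients and standard errors printed in the summary; it does not call the package, the label "interaction" for the unnamed third coefficient is added here for readability, and reading the table as Wald intervals is an inference from the numbers, not a documented fact.

```r
# Coefficients and standard errors copied from the summary output
coef_hat <- c(d_cov = 7.00547, x_cov = -4.37692, interaction = -2.89911)
se_hat   <- c(0.66757, 0.37763, 0.61015)

# Two-sided 95% Wald interval on the log scale, then exponentiated
z <- qnorm(0.975)
ci <- cbind(
  hr    = exp(coef_hat),
  lower = exp(coef_hat - z * se_hat),
  upper = exp(coef_hat + z * se_hat)
)
signif(ci, 4)
```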