Missing data and measurement error problems
My second primary area of interest is the development of methods for data that
contain either missing observations or observations that are affected by some form
of measurement error. My first contribution in this area examines the most commonly used class of methods for missing observations: the multiple imputation approaches of researchers such as Donald Rubin and Joseph Schafer. In this work, I show that one can improve the quality of the common imputation approach by using a mixture of Normal distributions, rather than a single t-distribution, to summarize imputation results. I also examine the validity of certain choices that are made in the imputation process, such as using only a small number of imputed data sets in summaries of the data. This work was done in collaboration with my Ph.D. supervisor (Adrian Raftery) and Naisyin Wang at Texas A&M.
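To make the contrast concrete, the following is a minimal sketch, not the method from the work described above. It assumes each imputed data set yields a point estimate and a variance, pools them with the standard combining rules (which lead to a t reference distribution), and compares that with an equal-weight mixture of Normal distributions centred at the individual estimates. The function names and the equal-weight form of the mixture are illustrative assumptions.

```python
import math
import numpy as np


def rubin_pool(estimates, variances):
    """Standard combining rules: pooled estimate, total variance, t degrees of freedom."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                      # pooled point estimate
    u_bar = u.mean()                      # within-imputation variance
    b = q.var(ddof=1)                     # between-imputation variance
    total = u_bar + (1 + 1 / m) * b       # total variance
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, total, df


def mixture_cdf(x, estimates, variances):
    """CDF of an equal-weight Normal mixture (assumed form, for illustration only)."""
    q = np.asarray(estimates, dtype=float)
    s = np.sqrt(np.asarray(variances, dtype=float))
    z = (x - q) / (s * math.sqrt(2))
    return float(np.mean([0.5 * (1 + math.erf(v)) for v in z]))


def mixture_quantile(p, estimates, variances, tol=1e-10):
    """Invert the mixture CDF by bisection to obtain interval endpoints."""
    q = np.asarray(estimates, dtype=float)
    s = np.sqrt(np.asarray(variances, dtype=float))
    lo = q.min() - 10 * s.max()
    hi = q.max() + 10 * s.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, estimates, variances) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling `mixture_quantile(0.025, ...)` and `mixture_quantile(0.975, ...)` gives a 95% interval from the mixture, which can be set beside the t-based interval built from the output of `rubin_pool`. The difference between the two summaries is most visible when the number of imputed data sets is small.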
I have also continued work in multiple imputation through an undergraduate
research project that I supervised with Anne-Sophie Charest, an honours statistics
student. We developed a new method for performing imputations for categorical
data. The method uses a recently published technique for clustering categorical
data to reduce the dimensionality of the categorical data space, which greatly
reduces the computational difficulties associated with imputing categorical data
for large datasets.
Finally, I am working with two groups of researchers on model selection
approaches (see previous section) in situations where there are missing data. First,
I am working with Antonio Ciampi (Dept. of Epidemiology and Biostatistics, McGill
University) on developing new approaches for model selection in dynamic latent
class models when there are missing data. Currently, only heuristic approaches to
model selection exist for researchers who want to use multiple imputation to
correct for problems with missing data. We have implemented an approach for a
complex latent class problem where one is trying to choose the number of latent
states for longitudinal data. Second, I have done some work on model selection for
causal inference problems with Robert Platt and Ian Shrier (published in a letter to
the American Journal of Epidemiology). The issues that arise with respect to model
selection for causal inference problems are quite similar to those that arise in the
missing data problem, and we look forward to pursuing these connections in the
future.
I have also worked with Michelle Ross (graduate student in statistics) and Robert
Platt (a biostatistics faculty member in the Department of Epidemiology and
Biostatistics at McGill) on implementing semi-parametric and functional data
approaches to covariate measurement error when the measurement error distribution
can be considered to be a mixture. The semi-parametric approach contained in the
paper discretizes the measurements (appropriate for the gestational age problem
that we are working on) and assumes a hidden Markov model for the unknown true
covariate measurements. The functional data approach uses a parametric model for
the covariate measurement error, but then uses a functional data approach for
modelling the mean response function.