Missing data and measurement error problems

 

My second primary area of interest is the development of methods for data that contain either missing observations or observations affected by some form of measurement error. My first contribution in this area examines the most commonly used class of methods for missing observations (the multiple imputation approaches of researchers such as Donald Rubin and Joseph Schafer). In this work, I show that one can improve the quality of the common imputation approach by using a mixture of Normal distributions, rather than a single t-distribution, to summarize imputation results. I also examine the validity of certain choices made in the imputation process, such as using only a small number of imputed data sets in summaries of the data. This work was done in collaboration with my Ph.D. supervisor (Adrian Raftery) and Naisyin Wang at Texas A&M.
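
As a point of reference, the standard combining rules that lead to the single t-distribution summary can be sketched as follows. This is a minimal illustration and not the method developed in the work described above: the function names `pool_rubin` and `mixture_density` are hypothetical, and the equal-weight mixture of Normals shown here is only an assumed, simple form that a mixture summary could take.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m completed-data results with Rubin's combining rules.

    estimates: point estimates, one per imputed data set
    variances: the corresponding within-imputation variances
    Returns (pooled estimate, total variance, t degrees of freedom).
    """
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate
    u_bar = mean(variances)      # average within-imputation variance
    b = variance(estimates)      # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b
    if b == 0:
        df = math.inf            # no between-imputation variability
    else:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    return q_bar, total, df


def mixture_density(x, estimates, variances):
    """Density at x of an equal-weight mixture of Normal(q_i, u_i) components.

    Assumption: equal weights, one component per imputed data set.
    """
    parts = [
        math.exp(-((x - q) ** 2) / (2 * u)) / math.sqrt(2 * math.pi * u)
        for q, u in zip(estimates, variances)
    ]
    return sum(parts) / len(parts)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    q, t, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(q, t, df)
    print(mixture_density(2.0, [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

With a small number of imputed data sets, the t summary collapses the imputations into one center and one spread, whereas a mixture keeps one component per imputation, which is the kind of contrast the comparison above is concerned with.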


I have also continued work in multiple imputation through a project with an undergraduate honours statistics student, Anne-Sophie Charest. In this research project, which I supervised, we developed a new method for performing imputations for categorical data. The method uses a recently published technique for clustering categorical data to reduce the dimensionality of the categorical data space, which greatly reduces the computational difficulties associated with imputing categorical data for large datasets.


Finally, I am working with two groups of researchers on model selection approaches (see previous section) in situations where there is missing data. First, I am working with Antonio Ciampi (Dept. of Epidemiology and Biostatistics, McGill University) on developing new approaches for model selection in dynamic latent class models when there is missing data. Currently, only heuristic approaches for model selection exist for researchers who want to use multiple imputation to correct for problems with missing data. We have implemented an approach for a complex latent class problem where one is trying to choose the number of latent states for longitudinal data. Second, I have done some work in the area of model selection for causal inference problems with Robert Platt and Ian Shrier (published in a letter to the American Journal of Epidemiology). The issues that arise with respect to model selection for causal inference problems are quite similar to those that arise in the missing data problem, and we look forward to pursuing these connections in the future.


I have also worked with Michelle Ross (graduate student in statistics) and Robert Platt (a biostatistics faculty member in the Department of Epidemiology and Biostatistics at McGill) on implementing semi-parametric and functional data approaches to covariate measurement error when the measurement error distribution can be considered to be a mixture. The semi-parametric approach contained in the paper discretizes the measurements (appropriate for the gestational age problem that we are working on) and assumes a hidden Markov model for the unknown true covariate measurements. The functional data approach uses a parametric model for the covariate measurement error and then models the mean response function using functional data methods.