Cluster analysis of genetic sequence data via the Gap Procedure

02/16/2016 - 15:30
02/16/2016 - 16:30
Irene Vrbik, PhD
Purvis Hall, 1020 Pine Ave. West, Room 24

Phylogenetic clustering typically involves estimating a phylogenetic tree and identifying groups of sequences that have small pairwise genetic distances and sufficiently high clade support (either bootstrap values or posterior probabilities). In this talk, we explore a simple distance-based clustering algorithm, called the Gap Procedure, which uses gaps in sorted pairwise distances to suggest a natural divide between group members and non-members. We show that the clusters found using the Gap Procedure agree closely with those from computationally expensive gold-standard techniques on well-separated groups of HIV DNA sequence data. Simulation studies are also presented to illustrate the scenarios in which this fast and easy-to-implement algorithm may be employed and, more importantly, when more sophisticated methods are required.
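
The idea of cutting at a gap in sorted pairwise distances can be illustrated with a minimal Python sketch. This is not the speaker's implementation: the function names `gap_neighbours` and `gap_clusters`, the choice to cut at the single largest gap, and the rule that any two items linked by a neighbour relation share a cluster are all assumptions made for illustration. The toy distance matrix is made up and is not HIV data.

```python
def gap_neighbours(row, i):
    """Indices of items lying before the largest gap in item i's sorted distances.

    Assumption: the cut is placed at the single largest gap.
    """
    # Rank the other items by distance from item i; item i anchors the list at 0.
    ranked = sorted((d, j) for j, d in enumerate(row) if j != i)
    dists = [0.0] + [d for d, _ in ranked]
    gaps = [dists[k + 1] - dists[k] for k in range(len(dists) - 1)]
    if not gaps:
        return set()
    cut = gaps.index(max(gaps))  # first largest gap if there are ties
    # Items ranked before the cut are treated as group members of item i.
    return {j for _, j in ranked[:cut]}


def gap_clusters(dist):
    """Group items that are linked, directly or indirectly, by gap neighbours.

    Assumption: clusters are the connected components of the neighbour relation.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in gap_neighbours(dist[i], i):
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy example: two well-separated groups on a line (made-up values).
positions = [0.0, 0.1, 0.2, 5.0, 5.1]
dist = [[abs(a - b) for b in positions] for a in positions]
clusters = gap_clusters(dist)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

On well-separated toy data like this, the largest gap in each sorted row falls between within-group and between-group distances, so the two groups are recovered. When groups overlap, the largest gap need not mark a meaningful boundary, which is the kind of scenario where more sophisticated methods may be needed.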

