Home

Andrew D Smith, Principal Investigator

We are a research group in the Quantitative and Computational Biology Department at the University of Southern California.

Our research deals with uncovering principles of gene regulation. We accomplish this by applying computational methods to analyze large-scale genomic data sets. We also design the analytic technology required to leverage these massive and complex data sets. Underlying all our work is the premise that a greater understanding of genomic information will ultimately impact biomedical science. Current projects in our group focus on the following three general areas.

DNA methylation

This fascinating epigenomic modification has multiple essential functions in mammals. DNA methylation changes as developmental regulatory programs are executed and can be replicated along with the DNA when cells divide. DNA methylation can also be influenced by environmental exposure, diet and aging, and aberrant methylation is a hallmark of cancers. We are studying how methylation patterns along the genome evolve in tumorigenesis, how the methylation patterns in germ cells influence the earliest stages of embryogenesis, and how methylation patterns can be used as diagnostic markers.

We’ve brought hundreds of methylomes from different cell types and species together into MethBase, the first comprehensive, curated database for visualization and analysis of DNA methylation data and features.

Protein-RNA interactions

RNA-binding proteins function to regulate all aspects of RNA processing, from splicing and transport to translation and degradation. Our interest lies in deciphering how various RNA-binding proteins recognize their target RNAs. Technologies like CLIP-seq can interrogate millions of protein-RNA interactions in a single experiment. We are designing algorithms to mine those interactions and learn the characteristic sequence and structure patterns of RNAs that identify them for regulation by specific proteins in specific contexts.

Capture-recapture statistics in DNA sequencing

Current analysis of genomic sequencing interprets the observed data but ignores the possibility that molecules were missed by random chance in the sequencing process. We are developing techniques to model DNA sequencing as random sampling from a population of molecules, borrowing methodology from capture-recapture statistics. With these techniques, we can predict how much new information deeper sequencing would yield, which allows researchers to optimize their experiments by weighing cost against information gained.

We’ve developed a software package called preseq for these purposes. For people who are not familiar with the command line, we provide a convenient R package called preseqR, which makes the functionality of preseq available in the R statistical computing environment.
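
To make the sampling view concrete, here is a minimal Python sketch of the classical Good-Toulmin estimator from the capture-recapture literature. It is an illustration only and is not the implementation used in preseq or preseqR. The function names `counts_histogram` and `good_toulmin` and the toy read identifiers are hypothetical. Given n_j, the number of distinct molecules observed exactly j times, the estimator predicts how many new distinct molecules would appear if t times as many additional reads were sequenced, and in this simple form it is only reliable for t between 0 and 1.

```python
from collections import Counter


def counts_histogram(read_ids):
    """Return {j: n_j}, where n_j is the number of distinct molecules
    that were observed exactly j times in the sequenced reads."""
    per_molecule = Counter(read_ids)        # molecule -> times observed
    return Counter(per_molecule.values())   # times observed -> number of molecules


def good_toulmin(hist, t):
    """Classical Good-Toulmin estimate of the expected number of new
    distinct molecules after sequencing t times as many additional reads.

    The alternating series is only stable for 0 <= t <= 1, so larger
    values are rejected here (an assumption of this sketch).
    """
    if not 0 <= t <= 1:
        raise ValueError("this simple estimator requires 0 <= t <= 1")
    return sum((-1) ** (j + 1) * t ** j * n_j for j, n_j in hist.items())


# Toy example: each string stands for the molecule a read came from.
reads = ["a", "a", "b", "c", "c", "c", "d"]
hist = counts_histogram(reads)   # two molecules seen once, one twice, one three times
print(good_toulmin(hist, 1.0))   # expected new molecules if sequencing depth is doubled
print(good_toulmin(hist, 0.5))   # expected new molecules with 50% more reads
```

Molecules seen only once dominate the estimate: many singletons suggest that much of the library is still unseen, while few singletons suggest that deeper sequencing would mostly return duplicates.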