Developing computational and statistical tools for complex trait genetic analysis

We are interested in a variety of methodological issues that arise in complex trait genetic studies, both in the statistical framing of research questions and in their implementation in freely available software that the community can use. Below, we give some illustrative examples of these.


GWAS and polygenes

How can we efficiently analyse genome-wide genetic variation data to identify specific single nucleotide polymorphism (SNP) risk alleles? How should we control for likely sources of false positive and false negative errors? As well as implicating specific loci, can we ask broader questions about the role of common variation en masse, that may arise from a very large number of alleles, most of which are individually unlikely to be detected? How best can we characterize the genetic architecture of a common, polygenic disease such as schizophrenia, and what are the implications for future genetic studies?

Previous work: PLINK toolset for GWAS analysis (PubMed | PDF | code); initial demonstration of high polygenicity in schizophrenia (PubMed | PDF); controlling for subtle population substructure via a family-based polygenic analysis (PubMed | PDF); evaluating alternative synthetic association models (PubMed | PDF).
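
As an illustration of asking about common variation en masse, a polygenic score sums many small per-allele effects into one value per individual. The sketch below is a minimal, generic version and not the PLINK implementation; the function name, the 0/1/2 dosage coding, the use of None for missing genotypes and the example weights are all assumptions made for illustration.

```python
def polygenic_score(dosages, weights):
    """Average weighted risk-allele dosage across non-missing SNPs.

    dosages: risk-allele counts per SNP (0, 1 or 2), or None if missing.
    weights: per-SNP effect sizes (for example, log odds ratios from a
             discovery sample); values here are purely illustrative.
    """
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    total = 0.0
    n_used = 0
    for dosage, weight in zip(dosages, weights):
        if dosage is None:
            continue  # skip missing genotypes rather than imputing them
        total += dosage * weight
        n_used += 1
    # Dividing by the number of SNPs used keeps scores comparable
    # between individuals with different amounts of missing data.
    return total / n_used if n_used else float("nan")


# Hypothetical effect sizes for three SNPs and two individuals.
weights = [0.10, 0.05, 0.20]
score_a = polygenic_score([2, 1, 0], weights)
score_b = polygenic_score([0, None, 2], weights)
```

In practice the weights would be estimated in an independent discovery sample, and the score would then be tested for association with disease in a separate target sample.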




Copy number variation

Deletions and duplications of genetic material (sometimes involving dozens of genes) are rare events, although very many people will carry at least one such copy number variant somewhere in their genome (sometimes inherited, sometimes arising de novo). CNVs such as the 22q11.2 deletion have long been known to cause multiorgan syndromes that include psychiatric symptoms; more recently, the broader role of such mutations in conferring risk for common psychiatric diseases has been documented, including work in autism and schizophrenia. How can CNVs (that is, ploidy or copy number status) be accurately called, from SNP microarrays or sequence data? How should CNVs be classified according to likely pathogenicity? What is the best way to find specific loci that harbor risk CNVs? How should one establish genome-wide burdens of (rare) CNVs? Do common variants and CNVs tend to impact the same genes or genetic pathways?

Previous work: calling CNVs from exome sequencing data (XHMM) (PubMed | PDF | code); application to International Schizophrenia Consortium dataset (PubMed | PDF); robust methods to establish that classes of gene show enriched CNV burden (PubMed | PDF).
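
One simple way to approach the genome-wide burden question is a label-permutation test: compare the mean number of rare CNVs per individual in cases and controls, then shuffle case/control labels to obtain an empirical p-value. The sketch below assumes per-individual CNV counts have already been called and filtered; the function name, the one-sided test and the toy data are illustrative assumptions, not the method used in the work cited above.

```python
import random


def burden_permutation_test(cnv_counts, is_case, n_perm=2000, seed=1):
    """One-sided permutation test for excess CNV burden in cases.

    cnv_counts: number of rare CNVs carried by each individual.
    is_case:    True for cases, False for controls (same order).
    Returns (observed difference in mean burden, empirical p-value).
    """
    n_case = sum(is_case)
    n_control = len(is_case) - n_case
    if n_case == 0 or n_control == 0:
        raise ValueError("need at least one case and one control")
    grand_total = sum(cnv_counts)

    def mean_difference(labels):
        case_total = sum(c for c, lab in zip(cnv_counts, labels) if lab)
        return case_total / n_case - (grand_total - case_total) / n_control

    observed = mean_difference(is_case)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if mean_difference(labels) >= observed:
            hits += 1
    # Adding 1 to numerator and denominator avoids reporting p = 0.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: four cases followed by four controls.
counts = [3, 2, 4, 3, 0, 1, 0, 1]
status = [True] * 4 + [False] * 4
observed_diff, p_value = burden_permutation_test(counts, status)
```

Permuting labels keeps each individual's CNV count intact, so the test makes no distributional assumption about burden. It does not by itself correct for differences in data quality or ancestry between cases and controls.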


Rare variants and sequencing studies

We are primarily engaged in exome sequencing studies of schizophrenia and bipolar disorder. A large number of methodological issues arise in such studies. What is the optimal way to test genes or groups of genes for association with disease? What are the pros and cons of different design strategies? How should we test for enrichment of genes or sets of genes from case-only trio studies of de novo mutation? We also need tools to handle the large, complex variation datasets that emerge from next-generation sequencing pipelines.

Current projects: development of the PLINK/Seq software package (code; manuscript in preparation); the dnenrich tool for the analysis of de novo mutation (available on request).




Geneset enrichment and integrative analyses

No gene acts alone. Consequently, analysing patterns of variation across gene networks and pathways can help us to better understand the biological processes that underlie risk. How should one test entire genes and networks for association or enrichment? How can different types of prior knowledge be incorporated in gene-mapping, including gene sets, protein-protein interaction and co-expression networks? How best can we integrate genetic association data across common SNPs, CNVs, rare variants and de novo mutation?

Previous work: interval-based geneset enrichment analysis (PubMed | PDF | code); combining GWAS summary statistics (PubMed | PDF).

Current projects: PPI network based analyses of rare variant data; Bayesian networks for cross-modality analysis.
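
For readers unfamiliar with combining GWAS summary statistics, a standard starting point is fixed-effect, inverse-variance-weighted meta-analysis of per-study effect estimates for a single SNP. The sketch below shows that generic technique only; it is not necessarily the method of the paper cited above, and the function name and example numbers are assumptions.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis for one SNP.

    betas: per-study effect estimates (e.g. log odds ratios), all
           expressed for the same reference allele.
    ses:   matching standard errors.
    Returns (pooled beta, pooled standard error, z, two-sided p-value).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("need matching, non-empty betas and ses")
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Two hypothetical studies with equal precision.
pooled_beta, pooled_se, z_score, p_value = fixed_effect_meta(
    [0.10, 0.20], [0.05, 0.05]
)
```

The fixed-effect model assumes every study estimates the same underlying effect; when that is doubtful, a random-effects model or a heterogeneity test would be a natural addition.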


Ancestry and identity-by-descent

Individuals of different genetic ancestry can often differ in their risk of certain diseases and exposures to environmental risk factors. Separating out the potential confounding influences of ancestry in genetic association studies, as well as looking for population-specific genetic risks, poses a number of methodological questions, especially for studies of very rare variation (which is more likely to be population specific). Furthermore, just as some individuals in a sample may be less similar to each other than expected by chance (implying they likely come from different ancestral populations), some individuals may be more similar than expected by chance (implying they are related, sharing genetic material identical-by-descent (IBD) from a relatively recent common ancestor). Can we leverage information on IBD to augment genetic association studies? What advantages are there to studying families versus populations (that is, closely-related versus only distantly-related or unrelated individuals)?

Previous work: multidimensional scaling to infer ancestry, complete linkage hierarchical clustering and the pairwise population concordance test (PubMed | PDF | code); latent class analysis applied to multilocus genotype data to infer ancestry (PubMed | PDF | code); how ascertainment on family history of disease impacts the power of family-based association studies (PubMed | PDF).
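
Pairwise genetic similarity is the raw material for both ancestry inference and relatedness checks. The sketch below computes the proportion of alleles two individuals share identical-by-state (IBS) across SNPs. This is an observed similarity measure, not an IBD estimate: inferring IBD additionally requires allele frequencies and a statistical model. The function name, the 0/1/2 genotype coding and the use of None for missing calls are illustrative assumptions.

```python
def ibs_proportion(geno_a, geno_b):
    """Proportion of alleles shared identical-by-state by two individuals.

    geno_a, geno_b: genotypes at the same SNPs, coded as the count of a
    reference allele (0, 1 or 2), with None for a missing call.
    """
    if len(geno_a) != len(geno_b):
        raise ValueError("genotype vectors must have the same length")
    shared = 0
    n_snps = 0
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue  # only use SNPs called in both individuals
        # Identical genotypes share 2 alleles, opposite homozygotes 0.
        shared += 2 - abs(a - b)
        n_snps += 1
    return shared / (2 * n_snps) if n_snps else float("nan")


# Two hypothetical individuals genotyped at four SNPs.
similarity = ibs_proportion([0, 1, 2, 2], [0, 1, 0, 2])
```

A matrix of such pairwise values (or the corresponding distances, one minus the proportion) is the kind of input that multidimensional scaling or hierarchical clustering would take.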


Pleiotropy in complex trait genetics

Just as common diseases exhibit many-to-one relationships in terms of genotype-to-phenotype association, the converse one-to-many relationship, in which a single allele or gene influences multiple phenotypic outcomes, is also repeatedly observed. How can we best leverage these cross-disorder or pleiotropic relations in gene discovery? Does the sharing occur at the level of the same allele, the same gene, or only broader gene networks? To what extent can genetic information be used to define genetically and clinically more homogeneous subsets of patients within a single diagnostic category, such as schizophrenia? Can we also focus on what distinguishes two related disorders genetically?

Previous work: review of methodological issues in studying pleiotropy (PubMed | PDF); methods for testing case-subtype association models (PubMed | PDF).


Genetic risk prediction

In what scenarios might genetic information be potentially relevant for clinical or reproductive decision-making? How can we make optimal use of available environmental and family-history information in addition to genotypes at risk alleles? How can very rare and de novo variants in risk genes be integrated into calculations for liability to common diseases?

Previous work: incorporating family history into genetic risk prediction (PubMed | PDF).

© Purcell Lab 2012-2015