Errors in sample annotation or labeling occur frequently in large-scale genetic and genomic studies and are difficult to avoid completely during data generation and management. Identifying and correcting these errors is critical for integrative genomic studies. Different types of genetic and genomic data are interconnected by cis-regulation. Based on these cis-regulatory relationships among data types, we developed a computational approach, named Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors in multiple types of molecular data, so that the corrected data can be used in subsequent integrative analyses. Our results indicate that inspecting sample annotation and labeling errors is an indispensable data quality assurance step. Applied to a large lung genomic study, MODMatcher increased the number of statistically significant genetic associations and genomic correlations more than two-fold. A simulation study shows that MODMatcher is more robust when using three types of omics data than when using two. Details are described in Yoo et al. (PLoS Comp. Biol., 2014).
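The matching idea can be illustrated with a minimal sketch. It assumes two data types measured on the same samples, where feature i in one type is cis-linked and positively correlated with feature i in the other (negatively correlated pairs would need a sign flip). The function names, the toy values, and the reciprocal-best-match rule are illustrative assumptions, not the published MODMatcher procedure.

```python
from statistics import mean, pstdev


def zscore_columns(profiles):
    """Standardize each feature across samples.

    profiles maps a sample label to a list of feature values.
    """
    labels = list(profiles)
    n_features = len(profiles[labels[0]])
    z = {label: [] for label in labels}
    for j in range(n_features):
        column = [profiles[label][j] for label in labels]
        mu, sd = mean(column), pstdev(column)
        for label in labels:
            z[label].append((profiles[label][j] - mu) / sd if sd > 0 else 0.0)
    return z


def similarity(za, zb):
    """Mean product of z-scores over the cis-linked feature pairs."""
    return mean(x * y for x, y in zip(za, zb))


def match_samples(type_a, type_b):
    """Return reciprocal best matches from labels in type_a to labels in type_b."""
    za, zb = zscore_columns(type_a), zscore_columns(type_b)
    best_for_a = {a: max(zb, key=lambda b: similarity(za[a], zb[b])) for a in za}
    best_for_b = {b: max(za, key=lambda a: similarity(za[a], zb[b])) for b in zb}
    # Keep a pair only if each sample is the other's best match.
    return {a: b for a, b in best_for_a.items() if best_for_b[b] == a}


# Hypothetical toy data: five cis-linked features in four samples.
# In type_b the labels of S3 and S4 have been swapped on purpose.
type_a = {
    "S1": [1.0, 5.0, 2.0, 8.0, 3.0],
    "S2": [4.0, 1.0, 7.0, 2.0, 6.0],
    "S3": [9.0, 3.0, 1.0, 5.0, 8.0],
    "S4": [2.0, 8.0, 6.0, 1.0, 1.0],
}
type_b = {
    "S1": [1.1, 4.9, 2.2, 7.8, 3.1],
    "S2": [4.2, 1.1, 6.9, 2.1, 5.8],
    "S3": [2.1, 7.9, 6.1, 1.2, 0.9],
    "S4": [8.8, 3.2, 0.9, 5.1, 8.1],
}

matches = match_samples(type_a, type_b)
# Samples whose best match carries a different label are candidate labeling errors.
mislabeled = {a: b for a, b in matches.items() if a != b}
print(mislabeled)
```

In this toy case the two swapped samples are the only ones whose best match carries a different label. A real analysis would first select significantly correlated cis pairs and would have to handle samples that have no counterpart in the other data type.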
(Activity of miRNAs):
MicroRNAs post-transcriptionally regulate a large number of mRNAs and play a key role in regulating cell growth, differentiation, and apoptosis. However, a miRNA's expression level is not equivalent to its functional activity (Mullokandov et al., 2012). We developed a computational approach to explicitly infer the activity of miRNAs from changes in the expression levels of their target genes. We showed in multiple cancer types (such as breast cancer, ovarian cancer, and GBM) that our estimated miRNA activities were consistently associated with clinical data across multiple independent data sets, whereas associations based on miRNA expression level itself could not be replicated. The results are published in Lee et al. (Bioinformatics, 2015).
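As a rough illustration of inferring activity from target genes, the sketch below scores a miRNA in one sample by how much its targets are repressed relative to non-target genes. This is a simplified stand-in, not the published model: the scoring rule, the function name `mirna_activity`, the gene names, and all values are hypothetical.

```python
from statistics import mean


def mirna_activity(sample_expr, reference_expr, targets):
    """Score activity as background change minus target change.

    A miRNA represses its targets, so targets that drop relative to the
    reference while non-targets stay put give a positive score.
    """
    change = {gene: sample_expr[gene] - reference_expr[gene] for gene in sample_expr}
    target_change = mean(change[g] for g in change if g in targets)
    background_change = mean(change[g] for g in change if g not in targets)
    return background_change - target_change


# Hypothetical log-scale expression values; G1 and G2 are the assumed targets.
reference = {"G1": 5.0, "G2": 6.0, "G3": 4.0, "G4": 7.0}
targets = {"G1", "G2"}

repressed_sample = {"G1": 4.0, "G2": 5.0, "G3": 4.0, "G4": 7.0}
derepressed_sample = {"G1": 5.5, "G2": 6.5, "G3": 4.0, "G4": 7.0}

high_activity = mirna_activity(repressed_sample, reference, targets)
low_activity = mirna_activity(derepressed_sample, reference, targets)
print(high_activity, low_activity)
```

The point of the sketch is that the score depends only on the target genes' behavior, so two samples with the same measured miRNA level can still receive different activity estimates.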
DDSClassifier (Deconvoluted Disease-Specific
Diagnostic and prognostic models based on peripheral blood gene expression have been reported for various types of disease. However, whole blood gene expression represents a mixture of hematopoietic cells and is greatly influenced by cell type frequency. Multiple common pathological and physiological changes result in similar changes in blood cell type frequency, which limits the specificity of blood-based biomarkers. To address these issues, we carried out a meta-analysis of 46 whole blood gene expression datasets covering a wide range of diseases and physiological conditions. Our analysis shows a striking overlap of signature genes shared by multiple diseases, driven by underlying common patterns of cell component change. These observations suggest the need for disease-specific classifiers that can distinguish different disease types from each other as well as from normal controls. To build such models, we developed a new classification strategy that takes into consideration both cell component changes and cell molecular state changes. In particular, we deconvoluted the original gene expression profile into a cell component profile and a residual expression profile for each sample, and built classifiers based on these deconvoluted features. Testing on independent datasets, we show that classifiers incorporating cell component profiles and residual expression profiles performed significantly better than those without. Both the assembled datasets and the algorithms developed can be found in the R package. A detailed document can be found here. The results are published in Wang et al. (Scientific Reports, 2016).
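The deconvolution step can be sketched as follows, assuming a known signature matrix of cell-type-specific expression (genes by cell types). Ordinary least squares is used here as a stand-in for whatever estimator the R package applies, and the matrix, the values, and the function name `deconvolute` are hypothetical.

```python
import numpy as np


def deconvolute(expression, signatures):
    """Split one sample's expression into cell proportions and a residual.

    expression: array of shape (genes,)
    signatures: array of shape (genes, cell_types), assumed to be known
    """
    proportions, *_ = np.linalg.lstsq(signatures, expression, rcond=None)
    residual = expression - signatures @ proportions
    return proportions, residual


# Hypothetical signature matrix: four genes, two cell types.
signatures = np.array([
    [10.0, 1.0],
    [8.0, 2.0],
    [1.0, 9.0],
    [2.0, 7.0],
])

# A sample that is a pure mixture: 70% of cell type 1, 30% of cell type 2.
mixture = signatures @ np.array([0.7, 0.3])
proportions, residual = deconvolute(mixture, signatures)

# The same mixture with an extra shift in the last gene that is not
# explained by cell composition.
shifted = mixture + np.array([0.0, 0.0, 0.0, 1.5])
shifted_proportions, shifted_residual = deconvolute(shifted, signatures)

# Deconvoluted feature vector for a downstream classifier:
# cell component profile followed by residual expression profile.
features = np.concatenate([shifted_proportions, shifted_residual])
print(features)
```

For the pure mixture the residual is zero and the proportions are recovered. For the shifted sample, part of the shift is left in the residual, which is the kind of signal a classifier can use in addition to the cell proportions. Unconstrained least squares can return negative proportions on real data, so a constrained solver may be preferable.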
Multi-polynomial Temporal Genetic Association (MPTGA) and Temporal Genetic Causality Test (TGCT)
Methods that leverage both temporal and genetic information in association and causality tests.