Errors in sample annotation or labeling occur frequently in large-scale genetic or genomic studies and are difficult to avoid completely during data generation and management. Identifying and correcting these errors is critical for integrative genomic studies. Different types of genetic and genomic data are interconnected by cis-regulation. Based on these cis-regulatory relationships among data types, we developed a computational approach, named Multi-Omics Data Matcher (MODMatcher), to identify and correct sample labeling errors in multiple types of molecular data, so that the corrected data can be used in subsequent integrative analysis. Our results indicate that inspecting sample annotation and labeling errors is an indispensable data quality assurance step. When applied to a large lung genomic study, MODMatcher increased the number of statistically significant genetic associations and genomic correlations more than two-fold. A simulation study shows that MODMatcher is more robust with three types of omics data than with two. Details are described in Yoo et al. (PLoS Comp. Biol., 2014).
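The idea of matching samples through cis-linked features can be illustrated with a small sketch. This is not the MODMatcher implementation described in Yoo et al.; it is a simplified illustration under stated assumptions: two data types are given as matrices whose rows are already paired by a cis relationship, similarity between samples is a plain correlation over standardized features, and a pair is accepted only when it is a reciprocal best match. The names `zscore_rows`, `match_samples`, and `simulate`, and all parameter values, are hypothetical.

```python
import numpy as np

def zscore_rows(m):
    """Standardize each feature (row) across samples (columns)."""
    m = np.asarray(m, dtype=float)
    return (m - m.mean(axis=1, keepdims=True)) / m.std(axis=1, keepdims=True)

def match_samples(a, b):
    """Match samples between two data types.

    a, b: arrays of shape (n_pairs, n_samples). Row k of `a` and row k of `b`
    are assumed to be a cis-linked feature pair (hypothetical input format).
    Returns the sample-by-sample similarity matrix and a dict of reciprocal
    best matches {sample index in a: sample index in b}.
    """
    za, zb = zscore_rows(a), zscore_rows(b)
    # sim[i, j]: average product of standardized values between
    # sample i of `a` and sample j of `b` over all cis pairs.
    sim = za.T @ zb / za.shape[0]
    best_b_for_a = sim.argmax(axis=1)
    best_a_for_b = sim.argmax(axis=0)
    matches = {
        i: int(j)
        for i, j in enumerate(best_b_for_a)
        if best_a_for_b[j] == i  # keep reciprocal best matches only
    }
    return sim, matches

def simulate(n_pairs=500, n_samples=20, noise=0.5, seed=0):
    """Toy data: `b` is a noisy copy of `a`, with the labels of two samples swapped."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_pairs, n_samples))
    b = a + noise * rng.normal(size=a.shape)
    b[:, [3, 7]] = b[:, [7, 3]]  # simulated labeling error
    return a, b

a, b = simulate()
sim, matches = match_samples(a, b)
# Samples whose best match in the other data type carries a different label.
mislabeled = {i: j for i, j in matches.items() if i != j}
print(mislabeled)
```

In this toy setting, a sample whose reciprocal best match carries a different label is a candidate labeling error. Real data would additionally need a significance threshold on the similarity scores and handling of samples that are missing from one data type.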
(Activity of miRNAs):
MicroRNAs (miRNAs) post-transcriptionally regulate a large number of mRNAs and play a key role in regulating cell growth, differentiation, and apoptosis. However, a miRNA's expression level is not equivalent to its functional activity (Mullokandov et al., 2012). We developed a computational approach to explicitly infer miRNA activity from changes in the expression levels of its target genes. In multiple cancer types (such as breast cancer, ovarian cancer, and GBM), our estimated miRNA activities were consistently associated with clinical data across multiple independent data sets, whereas associations based on miRNA expression levels alone could not be replicated. The results are published in Lee et al. (Bioinformatics, 2015).
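As a rough illustration of inferring activity from target genes, the sketch below scores a miRNA in each sample by how much lower its targets are expressed than non-target genes. This is not the method of Lee et al.; it is a deliberately simple stand-in that assumes a known binary target list and uses a difference of mean standardized expression. The function name `mirna_activity` and the simulated data are hypothetical.

```python
import numpy as np

def mirna_activity(expr, is_target):
    """Per-sample activity score for one miRNA (illustrative only).

    expr: array of shape (n_genes, n_samples) with mRNA expression values.
    is_target: boolean array of length n_genes marking predicted targets.
    A higher score means targets are more repressed relative to non-targets.
    """
    expr = np.asarray(expr, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    # Standardize each gene across samples so genes are comparable.
    z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
    # miRNAs repress their targets, so activity is the negated difference
    # between mean target expression and mean non-target expression.
    return -(z[is_target].mean(axis=0) - z[~is_target].mean(axis=0))

# Toy data (assumed): 200 genes, 30 samples, the first 40 genes are targets.
rng = np.random.default_rng(1)
n_genes, n_samples, n_targets = 200, 30, 40
is_target = np.zeros(n_genes, dtype=bool)
is_target[:n_targets] = True
true_activity = rng.normal(size=n_samples)
expr = rng.normal(size=(n_genes, n_samples))
expr[is_target] -= true_activity  # targets are repressed in proportion to activity

inferred = mirna_activity(expr, is_target)
corr = np.corrcoef(inferred, true_activity)[0, 1]
print(round(corr, 3))
```

The score depends only on mRNA data and a target list, which is why it can differ from the measured miRNA expression level in the same sample.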