Splicing defects play important roles in many aspects of human cancers. However, the complex relationship between somatic variants and splicing alterations has limited the opportunities for systematic analysis of the extent and consequences of splicing-associated variants (SAVs).
We have developed a novel approach, SAVNet (splicing-associated variant detection by network modeling, https://github.com/friend1ws/SAVNet), for detecting SAVs from a list of somatic mutations in a cohort and its matched RNA sequencing (RNA-seq) data based on a rigorous Bayesian statistical framework. Through this approach, we performed a comprehensive analysis of 8,976 primary cancer samples across 31 cancer types deposited in The Cancer Genome Atlas (TCGA), constructing a catalogue of 14,438 SAVs. Through downstream analyses of collected SAVs, we have characterized a number of properties such as positional differences, genomic features and mutational processes. The proposed approach will be helpful to collect all driver variants including previously overlooked SAVs in cancer patients, which will be an important resource in the era of precision medicine.
The increasing amount of whole exome or genome sequencing data brings forth the challenge of analyzing the association of rare variants that have extremely small minor allele frequencies. Various statistical tests have been proposed that are specifically configured to increase power for rare variants by conducting the test within a certain bin, such as a gene or a pathway. However, a gene may contain from several to thousands of markers, and not all of them are related to the phenotype. Combining functional and non-functional variants in an arbitrary genomic region can impair the testing power. We propose a Fuzzy Zoom-Focus algorithm (fZFA) to locate the optimal testing region within a given genomic region. It can be applied as a wrapper function around existing rare variant association tests to increase testing power. The algorithm is very efficient, with complexity linear in the number of variants. Simulation studies showed that fZFA substantially increased the statistical power of rare variant tests, including the burden test, SKAT, SKAT-O, and the W-test. The algorithm was applied to real exome sequencing data of hypertensive disorder and identified genetic markers biologically relevant to metabolic disorder that were undiscoverable by gene-based methods. The proposed algorithm is an efficient and powerful tool to enhance the power of association studies for whole exome or genome sequencing data.
In this paper, we show that Sobel's test is overly conservative for the detection of mediation effects in genome-wide epigenetic studies. We emphasize that the null hypothesis of mediation effect testing is composite, and we propose a divide-aggregate test (DAT) for this composite null hypothesis. We further show that the DAT can outperform both Sobel's test and the joint significance test in this setting. An application to the Normative Aging Study identified putative DNA methylation CpG sites as mediators in the causal pathway from smoking behaviour to lung function.
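To make the two baseline tests concrete, the following is a minimal Python sketch of Sobel's test and the joint significance test for a single candidate mediator. It assumes the usual setup in which `a` is the estimated exposure-to-mediator effect and `b` is the estimated mediator-to-outcome effect, each with a nonzero standard error and an approximately normal sampling distribution. The function names and the example numbers are illustrative assumptions, and the DAT itself is not implemented here.

```python
import math


def normal_two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def sobel_test(a, se_a, b, se_b):
    """Sobel's test for the product a*b (first-order delta-method variance).

    Assumes the denominator is nonzero, i.e. a and b are not both zero.
    """
    se_ab = math.sqrt(a * a * se_b * se_b + b * b * se_a * se_a)
    z = a * b / se_ab
    return z, normal_two_sided_p(z)


def joint_significance_test(a, se_a, b, se_b):
    """Joint significance (max-P) test: both paths must be significant."""
    p_a = normal_two_sided_p(a / se_a)
    p_b = normal_two_sided_p(b / se_b)
    return max(p_a, p_b)


# Hypothetical estimates, for illustration only.
z, p_sobel = sobel_test(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
p_joint = joint_significance_test(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
print(f"Sobel z = {z:.3f}, Sobel p = {p_sobel:.2e}, joint p = {p_joint:.2e}")
```

Writing z_a = a / se_a and z_b = b / se_b, the Sobel statistic equals z_a z_b / sqrt(z_a^2 + z_b^2), whose magnitude never exceeds min(|z_a|, |z_b|). On the same inputs, the Sobel p-value is therefore never smaller than the joint significance p-value.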
Enhancers control spatiotemporal and cell type-specific patterns of gene expression, but it remains unclear whether enhancers contribute to gene-expression variability within the same cell type. Enhancer activity can be inferred by measuring enhancer RNA (eRNA) transcription. To simultaneously profile gene expression and enhancer activity, we developed a pipeline to detect mRNA and eRNA from single-cell RNA-seq data, and applied it to RamDA-seq data of mouse embryonic stem cells (mESCs) under differentiation. We showed that the proposed pipeline can detect cell-type-specific eRNAs, and found that some eRNAs showed cell-to-cell variability even within the same cell-cycle phase and were correlated with nearby genes. In the presentation, we will discuss implications of the results and bioinformatic perspectives regarding enhancer identification.
Single-cell analysis has rapidly become an indispensable tool for biologists, as many biological questions can only be addressed by single-cell resolution profiling of rare or heterogeneous populations of cells (e.g., stem cells). Owing to its growing popularity, numerous diverse methodologies are now available for performing single-cell analysis. In my work, I systematically evaluate the sensitivity, precision, and accuracy of various approaches to single-cell RNA-seq. We also apply single-cell RNA-seq to various biological applications, including investigation of the developing mesoderm in embryonic mice to elucidate the mechanism of lineage commitment in vivo. From this time-course study of development at single-cell resolution, we can construct a new model of the lineage trajectory for the muscle/brown adipose/dermal lineage of the embryonic mesoderm.
Drastic improvements in sequencing technologies have recently enabled us to analyze large-scale sequence data from multiple types of cancers. We have been developing data analysis methods to investigate immunological profiles from cancer genomic data, e.g., a precise Bayesian HLA genotyping method for whole-genome sequence data and a TCR repertoire analysis pipeline. In this presentation, I will explain the algorithms behind these methods and show results from a pan-cancer data set.
Single-molecule sequencers such as the PacBio Sequel and Oxford Nanopore MinION are becoming more common for genome analysis because of their rapidly increasing read length and throughput, which demands that long reads be processed significantly faster than before. To this end, we developed a fast alignment tool, minialign, which is derived from a pseudo-aligner, minimap. We reformulated the Smith-Waterman-Gotoh algorithm in a way that enables four times the degree of parallelism of the original formulation with typical single-instruction-multiple-data instructions. We present the algorithmic details of the difference recurrence and its relation to other string comparison algorithms, as well as an overview of the entire algorithm.
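For readers unfamiliar with the baseline, the following is a plain scalar sketch of the Smith-Waterman-Gotoh recurrences (local alignment with affine gap penalties) in Python. It is not minialign's difference recurrence or its SIMD formulation. The function name, the scoring parameters, and the gap-cost convention (a gap of length k costs `gap_open + k * gap_extend`) are assumptions chosen for illustration.

```python
def swg_score(query, target, match=2, mismatch=-3, gap_open=5, gap_extend=2):
    """Best local alignment score under affine gap penalties (Gotoh).

    H: best score of an alignment ending at (i, j)
    E: best score ending at (i, j) with a gap that consumes target[j-1]
    F: best score ending at (i, j) with a gap that consumes query[i-1]
    Only the previous row is kept, so memory is linear in len(target).
    """
    neg_inf = float("-inf")
    n = len(target)
    h_prev = [0] * (n + 1)        # H[i-1][*]
    f_prev = [neg_inf] * (n + 1)  # F[i-1][*]
    best = 0
    for i in range(1, len(query) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [neg_inf] * (n + 1)
        e = neg_inf               # E[i][j-1], updated in place to E[i][j]
        for j in range(1, n + 1):
            # Open a new gap from H, or extend the existing gap.
            e = max(h_cur[j - 1] - gap_open - gap_extend, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open - gap_extend,
                           f_prev[j] - gap_extend)
            s = match if query[i - 1] == target[j - 1] else mismatch
            # Local alignment: scores are clamped at zero.
            h = max(0, h_prev[j - 1] + s, e, f_cur[j])
            h_cur[j] = h
            best = max(best, h)
        h_prev, f_prev = h_cur, f_cur
    return best


print(swg_score("ACGT", "ACGT"))
print(swg_score("AAAACCCC", "AAAAGCCCC", gap_open=1, gap_extend=1))
```

Each cell depends on its left, upper, and upper-left neighbours. Vectorized implementations therefore usually process cells along anti-diagonals or in striped layouts rather than in this row-by-row order.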
In this talk, I will present my group’s recent advances in bioinformatics. In particular, we will focus on two pattern recognition tasks in genomics. The first task is to elucidate DNA motif patterns from big sequence data while the second task is to predict the off-targets of CRISPR-Cas9 gene editing using deep learning.
[1] Ka-Chun Wong. (*Sole Authorship 2017). “MotifHyades: expectation maximization for de novo DNA motif pair discovery on paired sequences.” Bioinformatics 33.19 (2017): 3028-3035.
[2] Jiecong Lin and Ka-Chun Wong*. (*Corresponding Authorship 2018). “Off-target predictions in CRISPR-Cas9 gene editing using deep learning.” Bioinformatics (ECCB 2018 Proceedings Special Issue).
Logical relationship analysis is a data mining method to detect gene triplets (A, B and C) that satisfy certain logical relationships (e.g. C ⇔ (¬A) ∧ B) in a data matrix. To comprehensively detect significant logical relationships in a dataset, we have to conduct hypothesis testing for all combinations of gene triplets. However, the Bonferroni-adjusted significance level becomes too low because the number of tests is huge. Here, we developed Logicome Profiler, which detects significant logical relationships based on Tarone's trick. This trick improves the statistical power by ignoring untestable hypotheses, which can never become statistically significant. We verified that Logicome Profiler can detect more logical relationships than Bonferroni's correction.
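The idea behind Tarone's trick can be sketched in a few lines of Python. The sketch assumes that each hypothesis is tested with a one-sided Fisher's exact test on a 2x2 table, so that the smallest attainable p-value is fixed by the table margins. This is an illustrative assumption and not necessarily the test used by Logicome Profiler. The function names and the example margins are hypothetical.

```python
from math import comb


def min_fisher_p(n, n1, x):
    """Smallest one-sided Fisher p-value attainable for a 2x2 table with
    n samples, row margin n1 and column margin x (the most extreme table)."""
    a = min(n1, x)
    return comb(n1, a) * comb(n - n1, x - a) / comb(n, x)


def tarone_level(min_pvalues, alpha=0.05):
    """Return (k, alpha / k) for the smallest k such that at most k
    hypotheses are testable at level alpha / k.

    A hypothesis is testable if its minimum attainable p-value does not
    exceed the corrected level. The loop terminates because k eventually
    reaches the total number of hypotheses.
    """
    k = 1
    while True:
        testable = sum(1 for p in min_pvalues if p <= alpha / k)
        if testable <= k:
            return k, alpha / k
        k += 1


# Hypothetical margins: 20 samples, 10 in the positive class,
# and six candidate patterns with different numbers of occurrences.
margins = [1, 2, 3, 5, 8, 10]
min_ps = [min_fisher_p(20, 10, x) for x in margins]
k, level = tarone_level(min_ps, alpha=0.05)
print(f"Bonferroni factor: {len(min_ps)}, Tarone factor: {k}, level: {level:.4f}")
```

Hypotheses whose minimum attainable p-value exceeds the corrected level cannot be rejected whatever the data show. Excluding them from the correction factor leaves the family-wise error rate controlled while raising the per-test threshold.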
Mining valid scientific discoveries from genomic data is always hampered by technical artifacts and inherent biological heterogeneity. The former are usually termed “batch effects,” and the latter is often modeled by subtypes. Existing methods either correct batch effects provided that subtypes are known or cluster subtypes assuming that batch effects are absent. Consequently, there is a lack of research on correcting batch effects in the presence of unknown subtypes. Here, we propose a novel BUS model to tackle the problem. We provide conditions on study designs under which batch effects can be corrected. We applied BUS to a breast cancer dataset and obtained much better biological insights than with existing methods.
Technologies for mapping the spatial patterns of neural activity have enabled the discovery of next-generation neurotherapeutics for neurological disorders lacking effective treatments. We describe an in vivo drug screening strategy that combines high-throughput technology for generating brain activity maps (BAMs) at large scale with machine learning for predictive analysis. This platform enables evaluation of a compound’s mechanisms of action and potential therapeutic uses based on information-rich BAMs derived from drug-treated zebrafish larvae. Using this machine learning strategy, we successfully predicted compounds with anti-epileptic activity in zebrafish behavioral models from a library of 121 non-clinical compounds, which may facilitate development of next-generation anti-epileptic agents.
Cancers consist of heterogeneous subclones rather than a single type of homogeneous clonal cells. This phenomenon, intratumor heterogeneity (ITH), has been a major obstacle to cancer screening and treatment. Thus, understanding how tumors evolve and accumulate mutations is essential for early detection and treatment decisions.
We developed a comprehensive and flexible framework, tumopp, to simulate the spatio-temporal development of various solid tumors, and found that different assumptions can produce a huge variety of morphology and intratumor heterogeneity patterns even under neutrality. Here we give an overview of the parameter space of the simulation and discuss the implications for medical treatment.
Advanced colorectal cancer harbors extensive intratumor heterogeneity shaped by neutral evolution; however, intratumor heterogeneity in colorectal precancerous lesions has been poorly studied. We performed multiregion whole-exome sequencing on ten early colorectal tumors, which contained adenoma and carcinoma in situ. By comparing with sequencing data from advanced colorectal tumors, we show that the early tumors accumulate a higher proportion of subclonal driver mutations than the advanced tumors, as highlighted by subclonal mutations in KRAS and APC. We also demonstrate that variant allele frequencies of subclonal mutations tend to be higher in early tumors, suggesting that the subclonal mutations are subject to selective sweeps in early tumorigenesis, while neutral evolution is dominant in advanced ones. This study establishes that the evolutionary principle underlying intratumor heterogeneity shifts from Darwinian to neutral evolution during colorectal tumor progression.