At CICEKLAB, we design and develop machine learning algorithms for the bioinformatics pipeline to improve our understanding of genetic diseases and disorders. Below are some of the recent projects we work on:

Generative models to design RNA sequences with desired properties 

The system model for RNAGEN

RNA-protein binding plays an important role in regulating protein activity by affecting localization and stability. While proteins are usually targeted via small molecules or other proteins, small RNAs, which are easy to design and synthesize, are a promising and largely unexplored avenue. The problem is the lack of methods to generate RNA molecules that have the potential to bind to certain proteins. We research methods (RNAGEN and UTRGAN) based on generative adversarial networks (GANs) that learn to generate RNA sequences with natural RNA-like properties such as secondary structure and free energy. Using an optimization technique, we fine-tune these sequences to give them desired features such as binding to a target protein.

Learning to improve the quality of raw sequencing data as well as variant calls

Fig. 1

The system model for ECOLE

Copy number variation is a natural source of genetic diversity, but it can also be damaging when it affects disease risk genes. Accurate and efficient detection of copy number variants (CNVs) is of critical importance due to this association with complex genetic diseases. Although algorithms working on whole genome sequencing (WGS) data provide stable results with mostly valid statistical assumptions, copy number detection on whole exome sequencing (WES) data has lagged behind and suffers from extremely high false discovery rates. This is unfortunate, as WES data is cost-efficient, compact, and relatively ubiquitous. We research interpretable models (DECoNT and ECOLE) that use matched WES and WGS data to learn to call CNVs on WES data.

Fig. 4

Attention maps for ECOLE show which patterns in the read depth signal result in a CNV call.
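To make the notion of a read depth signal concrete, the following is a minimal sketch of a naive depth-ratio caller. It is not DECoNT or ECOLE; the function name, the pseudocount, and both thresholds are illustrative assumptions, and simple rules like this are the kind of baseline that learned models aim to improve on.

```python
import math


def call_cnv(sample_depth, reference_depth, del_threshold=-0.6, dup_threshold=0.4):
    """Label each exon DEL, DUP, or NO-CALL from a log2 depth ratio.

    The thresholds are hypothetical values chosen for illustration only.
    """
    calls = []
    for sample, reference in zip(sample_depth, reference_depth):
        # A pseudocount of 1 keeps the ratio finite when a depth is zero.
        ratio = math.log2((sample + 1) / (reference + 1))
        if ratio <= del_threshold:
            calls.append("DEL")
        elif ratio >= dup_threshold:
            calls.append("DUP")
        else:
            calls.append("NO-CALL")
    return calls


# Four exons: normal, roughly half depth, roughly 1.5x depth, normal.
calls = call_cnv([100, 48, 155, 98], [100, 100, 100, 100])
print(calls)  # ['NO-CALL', 'DEL', 'DUP', 'NO-CALL']
```

Fixed thresholds of this kind ignore capture bias and noise in exome data, which is one reason depth-ratio rules alone tend to produce many false calls.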

Collaborators: Can Alkan (Bilkent)

Ensuring genome data is shared in a privacy preserving way among life scientists

Sharing genome data in a privacy-preserving way stands as a major bottleneck in front of the scientific progress promised by the big data era in genomics. A community-driven protocol, the genomic data-sharing beacon protocol, has been widely adopted for sharing genomic data. The system aims to provide a secure, easy-to-implement, and standardized interface for data sharing by allowing only yes/no queries on the presence of specific alleles in the dataset. However, the beacon protocol was recently shown to be vulnerable to membership inference attacks. In this line of work, we investigate the privacy threats against beacons and propose countermeasures.
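The yes/no query model can be sketched in a few lines. This is a toy in-memory stand-in, assuming a hypothetical `Beacon` class and a `(chromosome, position, allele)` tuple encoding; it does not reproduce the real protocol's network interface.

```python
class Beacon:
    """Toy beacon: answers only whether an allele exists in the pooled dataset."""

    def __init__(self, genomes):
        # genomes maps an individual ID to a set of (chrom, pos, allele) tuples.
        # Only the pooled union is kept, so a response never names an individual.
        self._alleles = set()
        for variants in genomes.values():
            self._alleles |= set(variants)

    def query(self, chrom, pos, allele):
        """Return True if at least one individual carries the allele."""
        return (chrom, pos, allele) in self._alleles


beacon = Beacon({
    "p1": {("1", 12345, "A")},
    "p2": {("2", 500, "T")},
})
print(beacon.query("1", 12345, "A"))  # True
print(beacon.query("1", 12345, "C"))  # False
```

Each response reveals a single bit, which is why attacks on beacons rely on combining the answers to many queries.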

Others and we have shown that the presence (membership) of an individual in a genome-sharing beacon can be inferred by repeatedly querying the beacon. We have recently identified and analyzed a novel vulnerability of genomic data-sharing beacons: genome reconstruction. We show that it is possible to successfully reconstruct a substantial part of the genome of a victim when the attacker knows the victim has been added to the beacon in a recent update. We also show that even if multiple individuals are added to the beacon during the same update, it is possible to identify the victim's genome with high confidence using traits that are easily accessible to the attacker (e.g., eye and hair color). Moreover, we show how a genome reconstructed using a beacon that is not associated with a sensitive phenotype can be used for membership inference attacks on beacons with sensitive phenotypes (e.g., HIV+).
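The intuition behind exploiting a beacon update can be illustrated with a toy example: alleles whose answer flips from "no" to "yes" across an update must have come from a newly added individual. This is only a sketch of that intuition under simplified assumptions (a single added individual, responses modeled as set membership, hypothetical names throughout), not the published attack.

```python
def newly_answered_yes(before, after, candidates):
    """Return candidate alleles whose response flipped from no to yes."""
    return {allele for allele in candidates if not before(allele) and after(allele)}


# Hypothetical beacon contents before the update, and the victim's alleles.
old_snapshot = {("1", 100, "A"), ("1", 200, "G")}
victim = {("1", 200, "G"), ("1", 300, "T"), ("2", 50, "C")}
new_snapshot = old_snapshot | victim

# Alleles the attacker chooses to probe.
candidates = [
    ("1", 100, "A"),
    ("1", 200, "G"),
    ("1", 300, "T"),
    ("2", 50, "C"),
    ("2", 75, "G"),
]

recovered = newly_answered_yes(
    old_snapshot.__contains__, new_snapshot.__contains__, candidates
)
print(sorted(recovered))  # [('1', 300, 'T'), ('2', 50, 'C')]
```

Note that the victim's allele `("1", 200, "G")` is not recovered, because the beacon already answered "yes" for it before the update. Direct differencing alone therefore reveals only alleles that are new to the beacon.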

Collaborators: Erman Ayday (CWRU)
Funding: NIH R01LM013429 (CO-I)

Predicting risk genes for neuropsychiatric disorders such as autism spectrum disorder

Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder whose genetic architecture is estimated to contain 1,000 risk genes. It affects around 1.5% of children in the US and around the world. Despite large-scale sequencing studies, only a fraction of the risk genes have been identified to date. This is due to the heterogeneity of the genetic profiles of children with autism. Finding the missing pieces in this challenging puzzle is the first step toward understanding the problem and looking for solutions.

We work on machine learning algorithms to predict gene risk and discover new susceptibility genes. For instance, we developed a method named ST-STEINER, which detects a Steiner tree of genes to find a functionally related cluster that might confer autism risk. Recently, we developed DeepND, which performs cross-disorder analysis to improve gene risk prediction power by exploiting the comorbidity of autism and intellectual disability via multitask learning. This model uses graph convolutional neural networks to leverage information from gene co-expression networks that model human brain development, and it learns which spatio-temporal neurodevelopmental windows are important for disorder etiologies.

Funding: Simons Foundation Autism Research Initiative - Pilot Award (PI - #640935) and Explorer Award (PI - #416835)

Analyzing and understanding tumor metabolism for surgical applications

Complete resection of the tumor is important for survival in glioma patients. Even when gross total resection is achieved, micro-scale tissue left in the excision cavity risks recurrence. The High Resolution Magic Angle Spinning Nuclear Magnetic Resonance (HRMAS NMR) technique can efficiently distinguish healthy from malignant tissue using peak intensities of biomarker metabolites. The method is fast and sensitive and can work with small, unprocessed samples, which makes it a good fit for real-time analysis and feedback to surgeons during surgery. However, this analysis can be inconclusive due to inherent noise and shifts in the NMR signal. The signal is analyzed by a human expert, which makes the analysis subjective and, given the time constraint, quite limited. Moreover, such experts might not be available during surgery. We work on machine learning techniques that predict the presence of residual tumor tissue in the excision cavity to guide surgeons. We also predict tumor malignancy and patient survival to give surgeons further feedback in real time. These systems will be implemented at Hautepierre Hospital of the University of Strasbourg.

Collaborators: Izzie Jacques Namer (University of Strasbourg). 

Drug effect and side-effect prediction

Drug failures due to unforeseen adverse effects in clinical trials pose health risks for participants and lead to substantial financial losses. Computational methods hold great promise for mitigating the health and financial risks of drug development by predicting possible side effects before a drug enters clinical trials. We work on neural-network-based architectures that use a diverse set of drug structure and experimental condition data to perform in silico drug side effect predictions.

We are also working on machine learning approaches to predict drug combination synergies. Drug combination therapies have been a viable strategy for the treatment of complex diseases such as cancer due to increased efficacy and reduced side effects. However, experimentally validating all possible combinations for synergistic interaction is intractable, even with high-throughput screens, due to the vast combinatorial search space. Our goal is to provide a smaller search space in which the most likely candidate combinations can be tested in vitro and in vivo. Currently, these predictions are restricted to data obtained from cell lines. Our ultimate aim is to personalize these predictions for each patient. Even though only a few such human samples are available, we work on specialized model architectures and techniques to enable this.

Collaborators: Oznur Tastan (Sabanci U.)

Predicting disease risk conferring genetic variant combinations

Phenotypic heritability of complex traits and diseases is seldom explained by individual genetic variants identified in genome-wide association studies (GWAS). Many methods have been developed to select a subset of variant loci that are associated with or predictive of the phenotype. We work on feature selection methods that use a biological network of genetic variants to select a diverse subset of causative single nucleotide polymorphisms (SNPs) that lets users predict the phenotype.

Even detection of pairs of SNPs that are synergistic (epistatic) is a computationally challenging task. We work on models (model 1 and model 2) that can efficiently discover such pairs while providing biological interpretations. 
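A quick calculation shows why exhaustive pair testing is costly: the number of unordered SNP pairs grows quadratically with the number of SNPs. The SNP counts below are assumed round figures for illustration, not numbers from a specific study.

```python
import math


def pairwise_tests(n_snps):
    """Number of unordered SNP pairs an exhaustive epistasis scan would test."""
    return math.comb(n_snps, 2)


for n_snps in (1_000, 100_000, 500_000):
    print(f"{n_snps:>7} SNPs -> {pairwise_tests(n_snps):,} pairs")
# prints:
#    1000 SNPs -> 499,500 pairs
#  100000 SNPs -> 4,999,950,000 pairs
#  500000 SNPs -> 124,999,750,000 pairs
```

At half a million SNPs, an exhaustive scan would need roughly 125 billion statistical tests, before any correction for multiple testing.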

Collaborators: Oznur Tastan (Sabanci), Serhan Yilmaz (CWRU), Mehmet Koyuturk (CWRU)