Our paper on integrative analysis of multiple RNA-binding proteins has just appeared in Bioinformatics. RNA-binding proteins (RBPs) are important for many cellular processes, including post-transcriptional control of gene expression, splicing, transport, polyadenylation and RNA stability. To better understand RBP mechanisms, we aimed to integrate the rapidly growing body of RBP experimental data with the latest genome annotation, gene function, RNA sequence and structure.
We have developed an integrative orthogonality-regularized nonnegative matrix factorization that can integrate multiple data sets and discover non-overlapping and class-specific RNA binding patterns of varying strengths. The orthogonality constraint is important here because it enables us to substantially reduce the effective size of inferred factor models.
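To give a feel for what an orthogonality penalty does in nonnegative matrix factorization, here is a minimal NumPy sketch. It is not the algorithm from the paper: it factorizes a single matrix rather than integrating several data sets, and the objective, the multiplicative update rules, and the names `orthogonal_nmf` and `alpha` are assumptions chosen for illustration. It minimizes 0.5·||X − WH||² + 0.5·alpha·||HHᵀ − I||² with W, H ≥ 0, using heuristic updates that keep the factors nonnegative but are not guaranteed to decrease the objective at every step.

```python
import numpy as np


def orthogonal_nmf(X, rank, alpha=1.0, n_iter=200, eps=1e-9, seed=0):
    """Illustrative NMF with a soft orthogonality penalty on the rows of H.

    Assumed objective (not taken from the paper):
        0.5 * ||X - W H||_F^2 + 0.5 * alpha * ||H H^T - I||_F^2,  W, H >= 0
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Standard multiplicative update for W (H held fixed).
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
        # Heuristic multiplicative update for H: the positive part of the
        # gradient goes in the denominator, the negative part in the numerator.
        numer = W.T @ X + 2.0 * alpha * H
        denom = W.T @ W @ H + 2.0 * alpha * (H @ H.T) @ H + eps
        H *= numer / denom
    return W, H


def offdiag_overlap(H):
    """Sum of off-diagonal entries of H H^T; zero means non-overlapping rows."""
    G = H @ H.T
    return float(G.sum() - np.trace(G))


# Toy nonnegative data (random, purely for demonstration).
X_demo = np.random.default_rng(1).random((20, 30))
W_demo, H_demo = orthogonal_nmf(X_demo, rank=4, alpha=1.0)
```

Raising `alpha` pushes the rows of `H` toward non-overlapping supports, which is the sense in which an orthogonality constraint yields patterns that are easier to tell apart; `offdiag_overlap` is one simple way to monitor that.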
The new models have proved powerful in predicting RBP interaction sites on RNA. We also showed that joint analysis of multiple data sets can boost retrieval accuracy of RNA binding sites, which we studied using the largest RBP data compendium to date.
The Winter issue of ACM XRDS is here! This issue discusses the Internet of Things (IoT), a collection of emerging technologies that promise to seamlessly expand our sensing capabilities across the globe with imagination as our only limit. In the issue you can read about the prospects for the IoT as seen by leaders in the field, the challenges of building network awareness, and the trends in IoT platforms. You will also find columns about ontology-supported stream reasoning for querying flying robots, the importance of encrypted data in IoT systems, and ways of managing droughts using tech.
My department contributed a column on predicting activities of daily living (ADL) from sensor activation profiles. Together with Lara Zupan we used an open-source data mining tool for visual programming called Orange to analyze patterns of how humans interact with household devices. These interaction patterns provided powerful clues that helped us recognize various activities that take place in home environments.
Our paper on Collective Pairwise Classification for Multi-Way Analysis has been published in the Proceedings of the 21st Pacific Symposium on Biocomputing. We will present the work at the PSB conference in January 2016.
In the paper, we develop a collective pairwise classification approach for multi-way data analysis. The approach leverages the superiority of latent factor models for analyzing large heterogeneous relational data sets and provides probabilistic estimates of relationships by optimizing a pairwise ranking loss. Although the method bears correspondence with the maximization of a non-differentiable area under the receiver operating characteristic curve, we were able to design a learning algorithm that scales well on large multi-relational data.
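The link between a pairwise ranking loss and the area under the ROC curve can be made concrete with a small sketch. The code below is not the learning algorithm from the paper; it only contrasts the exact AUC, which counts correctly ordered positive-negative pairs and is therefore a step function of the scores, with a smooth logistic surrogate over the same pairs. The function names and the choice of the logistic surrogate are assumptions for illustration.

```python
import numpy as np


def auc(pos_scores, neg_scores):
    """Exact AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This is piecewise constant in the scores, so it
    cannot be optimized directly with gradient methods.
    """
    diff = np.subtract.outer(np.asarray(pos_scores, dtype=float),
                             np.asarray(neg_scores, dtype=float))
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def pairwise_logistic_loss(pos_scores, neg_scores):
    """Smooth surrogate: mean of log(1 + exp(-(s_pos - s_neg))) over all pairs."""
    diff = np.subtract.outer(np.asarray(pos_scores, dtype=float),
                             np.asarray(neg_scores, dtype=float))
    # logaddexp(0, -d) is a numerically stable log(1 + exp(-d)).
    return float(np.mean(np.logaddexp(0.0, -diff)))


# Toy scores: three of the four positive-negative pairs are ordered correctly.
example_auc = auc([0.9, 0.8], [0.1, 0.85])
```

Pushing the surrogate loss down widens the score gap between related and unrelated pairs, which tends to raise the AUC while remaining differentiable.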
We used the method to infer relationships from multiplex drug data and to predict connections between clinical manifestations of diseases and their underlying molecular signatures. An appealing property of the method is its ability to make category-jumping inferences, such as predictions about diseases based solely on genomic and clinical data generated far outside the molecular context.
The Fall issue of ACM XRDS is here! In this issue we write about virtual reality. Among others, you can read about the virtual reality revolution and ways to bring virtual reality home. The issue also discusses how to use your own muscles to achieve realistic physical experience, how to manage cybersickness in virtual reality, and how to avoid danger with mine disaster simulations.
My department contributed a column on mining the Marvel comic book universe. Together with Lara Zupan we scraped Wikipedia to obtain information on the Marvel comics characters and then analyzed the structure of the Marvel multiverse network, where two characters were considered linked if they shared a skill set. Here, the analysis of complex networks allowed us to better understand how properties of fictitious networks emerge from non-trivial interactions between characters.
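The linking rule can be sketched in a few lines of Python. The character names and skills below are made-up placeholders, not data from the column, and reading "shared a skill set" as "have at least one skill in common" is an assumption.

```python
from itertools import combinations


def build_network(skills):
    """Return the set of undirected edges between characters sharing a skill.

    `skills` maps a character name to a set of skill labels. Two characters
    are linked if their skill sets intersect (an assumed reading of the rule).
    """
    edges = set()
    for a, b in combinations(sorted(skills), 2):
        if skills[a] & skills[b]:
            edges.add((a, b))
    return edges


def degree(edges):
    """Count the number of links per character."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg


# Hypothetical toy input, for illustration only.
toy_skills = {
    "character_a": {"flight", "strength"},
    "character_b": {"strength", "healing"},
    "character_c": {"telepathy"},
    "character_d": {"healing", "telepathy"},
}
toy_edges = build_network(toy_skills)
toy_degree = degree(toy_edges)
```

From an edge set like this, standard network statistics such as degree distributions can be computed directly.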
Our paper on Sieve-based relation extraction of gene regulatory networks from biological literature has been published in BMC Bioinformatics.
In the paper, we describe a network extraction algorithm that improves on our winning submission to BioNLP 2013. Our method is designed as a sieve-based system and uses linear-chain conditional random fields and rules for relation extraction. To enable extraction of distant relations, we transform the data into skip-mention sequences. We then infer multiple models, each of which is able to extract a particular relationship type (e.g., inhibition, activation, binding). Further analysis following the challenge showed that all relation extraction sieves contribute to the predictive performance of the proposed approach. Features built from mention words and their prefixes and suffixes contributed most to extraction accuracy. The analysis also showed that transforming the data into skip-mention sequences is appropriate for detecting relations between distant mentions.
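The idea behind skip-mention sequences can be illustrated with a short sketch. It assumes the following reading of the transformation: a k-skip-mention sequence keeps every (k+1)-th mention, so mentions that were k+1 positions apart become neighbours and fall within reach of a linear-chain model. The function name and the placeholder mention labels are hypothetical, and the paper's actual preprocessing may differ in detail.

```python
def skip_mention_sequences(mentions, k):
    """Split an ordered list of mentions into k + 1 skip-mention sequences.

    Sequence i contains mentions i, i + (k + 1), i + 2(k + 1), ...
    With k = 0 the original sequence is returned unchanged.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    step = k + 1
    return [mentions[start::step] for start in range(step)]


# Placeholder mentions in document order.
mentions = ["m1", "m2", "m3", "m4", "m5"]
one_skip = skip_mention_sequences(mentions, 1)
two_skip = skip_mention_sequences(mentions, 2)
```

In `one_skip`, for instance, `m1` and `m3` are adjacent even though another mention separates them in the text, so a model that only scores neighbouring pairs can still propose a relation between them.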
Our paper on Gene prioritization by compressive data fusion and chaining has been published in PLoS Computational Biology.
In the paper, we present Collage, a new data fusion approach to gene prioritization. Together with collaborators from Baylor College of Medicine, we tested Collage by prioritizing bacterial response genes in Dictyostelium as a novel model system for prokaryote-eukaryote interactions. We started from four bacterial response genes and 14 different data sets ranging from gene expression to pathway and literature information. Collage proposed eight candidate genes that were tested in the wet laboratory. Mutations in all eight candidates reduced the ability of the amoebae to grow on Gram-negative bacteria. Furthermore, five out of the eight candidate genes were required for growth on Gram-negative bacteria but had no discernible effect on growth on Gram-positive bacteria. This is a remarkably accurate result since only about a hundred of the 12,000 Dictyostelium genes are estimated to be responsible for bacterial response.
Together with Blaz Zupan we are organizing a tutorial on data fusion at the International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).
In the tutorial, we will explore latent factor models, a popular class of approaches that have in recent years seen many successful applications in integrative data analysis. We will describe the intuition behind matrix factorization and explain why latent factor models are suitable when collectively analyzing many heterogeneous data sets. To practice data fusion, we will construct visual data fusion workflows using Orange and its Data Fusion add-on.
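As a rough illustration of why latent factor models suit collective analysis, the sketch below factorizes two data matrices that describe the same row objects (say, genes) while sharing a single row factor between them. It uses plain NumPy and alternating ridge-regularized least squares; it does not use Orange or its Data Fusion add-on, and the function name, the parameters, and the choice of solver are assumptions for illustration rather than the tutorial's method.

```python
import numpy as np


def fit_shared_factors(X1, X2, rank, n_iter=20, lam=1e-6, seed=0):
    """Fit X1 ~ U V1^T and X2 ~ U V2^T with a shared row factor U.

    Alternating least squares with a small ridge term `lam` for stability.
    """
    rng = np.random.default_rng(seed)
    V1 = rng.standard_normal((X1.shape[1], rank))
    V2 = rng.standard_normal((X2.shape[1], rank))
    ridge = lam * np.eye(rank)
    for _ in range(n_iter):
        # U sees evidence from both data sets at once.
        A = V1.T @ V1 + V2.T @ V2 + ridge
        U = np.linalg.solve(A, (X1 @ V1 + X2 @ V2).T).T
        # Each column factor is refit against its own data set.
        G = U.T @ U + ridge
        V1 = np.linalg.solve(G, (X1.T @ U).T).T
        V2 = np.linalg.solve(G, (X2.T @ U).T).T
    return U, V1, V2


# Synthetic data: two matrices generated from the same hidden row factor.
rng = np.random.default_rng(42)
U_true = rng.standard_normal((30, 2))
X1 = U_true @ rng.standard_normal((2, 15))
X2 = U_true @ rng.standard_normal((2, 10))

U, V1, V2 = fit_shared_factors(X1, X2, rank=2)
rel_err_1 = np.linalg.norm(X1 - U @ V1.T) / np.linalg.norm(X1)
rel_err_2 = np.linalg.norm(X2 - U @ V2.T) / np.linalg.norm(X2)
```

Because `U` is estimated from both matrices, structure that is weak in one data set can be reinforced by the other, which is the basic intuition behind fusing heterogeneous sources through shared latent factors.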
This tutorial would not be possible without the great support of the Bioinformatics Laboratory at the University of Ljubljana.
The Summer issue of ACM XRDS is here! In this issue we write about computational biology. Our features and interviews present different perspectives about some of the most recent advances of computational biology. You can read about personalized medicine and the use of genetic data to improve drug treatment, pharmacogenetics, machine learning techniques for mapping genetic differences to phenotypes in large-scale genome-wide association studies, computational approaches towards prediction of patient outcomes based on electronic health records, statistical techniques for drug discovery, etc. This issue also includes discussions on cutting-edge techniques, such as the analysis of single cell measurement data.
My department contributed a column on mining cancer data with matrix factorization, an established class of algorithms that has proved useful in many bioinformatics studies. The diversity and abundance of data provided by cancer projects such as the International Cancer Genome Consortium challenge computer scientists of all kinds to develop innovative software, hardware, and analytic solutions for data analysis. We expect that with computationally and statistically stronger approaches, such as factorization models, we will one day be able to reveal biological features that drive cancer development, define cancer types relevant for prognosis, and, ultimately, enable the development of new cancer therapies.
Our paper at ISMB 2015 addresses the challenging task of inferring gene networks from potentially many data sets. Importantly, these data sets might be nonidentically distributed and can follow any combination of exponential family distributions. To tackle this challenge, we develop an efficient Markov network model that achieves fusion by reusing latent model parameters.
Empirical studies on cancer genome data sets show an advantage of joint inference over separate network inference and the merits of incorporating information about the underlying data distribution into inference.
The slides of the talk are available.
Our poster at ISMB 2015 is concerned with data set selection and sensitivity estimation in collective factor models.
Molecular biology data is rich in volume as well as heterogeneity. We can view individual data sets as relations between objects of different types; for example, function annotations describe relationships between genes and functions. We represent a large data compendium with a multiscale and multiplex relation graph. Recently, latent factor models were developed to fuse such representations and collectively infer accurate prediction models (Zitnik & Zupan, IEEE TPAMI 2015). Here, we are interested in how changes in one relation (data set) affect the latent model of another relation in the context of a given collective latent factor model. For example, in a user-movie recommendation system, how would a change of casting affect a user's movie preferences? In bioinformatics, how would a change in gene expression data influence prediction of gene-disease associations?
We address this challenge by developing an approach that estimates the dependence between any two relations within a single run of the inference algorithm. The approach, called Forensic, derives from the theory of Fréchet derivatives and matrix conditioning and can be used with any collective matrix factorization.
See our poster for more details.