Our paper on Jumping across biomedical contexts using compressive data fusion has just appeared in Bioinformatics. We will present the paper at ISMB 2016 in July 2016.
The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects—such as a gene and a disease—can be related in different ways, for example, directly via gene–disease associations or indirectly via functional annotations, chemicals and pathways. In this paper, we show that different ways of relating these objects carry different semantic meanings that are largely ignored by established computational methods.
We present an approach that operates on large-scale heterogeneous data collections and explicitly distinguishes between diverse data semantics. The approach detects size-k modules of objects that, taken together, appear most significant to another set of objects. The method builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program.
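To give a feel for what growing a size-k module by submodular optimization can look like, here is a minimal sketch of the generic greedy algorithm for a monotone submodular objective under a cardinality constraint. It is not the method from the paper: the objective (a facility-location style coverage score), the `scores` matrix and the function names are all assumptions made for illustration.

```python
def coverage_value(scores, selected):
    """f(S): for every target, take the best score offered by any selected object.

    `scores[i][j]` is a hypothetical non-negative relevance of candidate
    object i to target object j. This objective is monotone and submodular.
    """
    if not selected:
        return 0.0
    n_targets = len(scores[0])
    return float(sum(max(scores[i][j] for i in selected)
                     for j in range(n_targets)))


def greedy_module(scores, k):
    """Grow a module of at most k objects, adding the best marginal gain each step."""
    selected = []
    remaining = set(range(len(scores)))
    for _ in range(min(k, len(scores))):
        current = coverage_value(scores, selected)
        # Candidate with the largest marginal gain; ties go to the lowest index.
        best = max(sorted(remaining),
                   key=lambda i: coverage_value(scores, selected + [i]) - current)
        selected.append(best)
        remaining.remove(best)
    return selected, coverage_value(scores, selected)


# Toy relevance matrix: 4 candidate objects (rows) and 3 target objects (columns).
toy_scores = [
    [3, 0, 0],
    [0, 2, 0],
    [2, 2, 0],
    [0, 0, 1],
]
module, value = greedy_module(toy_scores, k=2)
```

For monotone submodular objectives, this greedy rule is known to reach at least a (1 - 1/e) fraction of the optimal value, which is one reason such formulations are attractive for module detection.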
In a systematic study on more than three hundred complex diseases, we show the effectiveness of the approach in associating genes with diseases and detecting disease modules.
The Summer issue of ACM XRDS is here! The issue is centered around computing, culture, postcoloniality and questions of power. In it, many fascinating authors ask whether an Anglo-European culture of computing could be made more aware of its politics and what alternative cultures of computing could be realized. Our amazing issue editors, Ahmed Ansari (CMU) and Raghavendra Kandala (CMU), have tried to give the readers a slice of the incredible heterogeneity and plurality of critical scholarship and practice around the world.
The issue provides a brief introduction to decolonial computing and raises various issues around design and innovation in China, the participation of Africans in the global HCI community, life at the forefront of Indonesia's tech emancipation, and plans to develop hundreds of smart cities in India, revealing the complex politics of technological development and class.
Jennifer Jacobs (MIT) and I served as co-editors for the issue.
The Spring issue of ACM XRDS is here! This issue is centered around digital fabrication, which in many ways highlights the expanded role of computers in today's society. Digital fabrication is not merely about 3D printing knickknacks; rather, it enables individuals to create their own systems and devices using new technologies.
My department contributed a column on the Brownian wanderlust of things. Consider a gambler who starts with an initial fortune and plays the following simple coin-tossing game. At each turn, the dealer tosses an unbiased coin. If the coin comes up heads, the gambler wins a unit; if it comes up tails, the gambler loses a unit. The gambler continues to play until he either goes bankrupt or his holdings reach some fixed desired amount.
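The game is easy to simulate. The sketch below is not taken from the column, and its function names and parameters are illustrative. It compares a Monte Carlo estimate with the classical result that, with a fair coin, a gambler starting with i units reaches the target of N units before going bankrupt with probability i/N.

```python
import random


def win_probability(start, target):
    """Exact chance of reaching `target` before 0 with a fair coin: start / target."""
    return start / target


def simulate_game(start, target, rng):
    """Play one game; return True if the gambler reaches the target."""
    fortune = start
    while 0 < fortune < target:
        fortune += 1 if rng.random() < 0.5 else -1
    return fortune == target


def estimate_win_probability(start, target, n_games=20000, seed=0):
    """Monte Carlo estimate of the winning probability."""
    rng = random.Random(seed)
    wins = sum(simulate_game(start, target, rng) for _ in range(n_games))
    return wins / n_games


exact = win_probability(3, 10)
estimate = estimate_win_probability(3, 10)
```

With a few thousand simulated games, the estimate should settle close to the exact value; the same random-walk machinery underlies the stochastic models of information spread.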
Stochastic models of this kind can have much wider implications than just estimating the fortune of a gambler flipping a coin. For example, the way in which information flows within social media outlets can affect mobilization and strategic interactions between participants of mass social movements, such as protests. While traditionally social movements have spread through on-the-ground unions, the use of communication platforms—such as Twitter and Facebook—has offered alternative ways for organizing such events. As we see in the column, to truly capture propagation in such environments, we need to take into consideration the stochastic nature of information propagation.
Our paper on integrative analysis of multiple RNA-binding proteins has just appeared in Bioinformatics. RNA-binding proteins (RBPs) are important for many cellular processes, including post-transcriptional control of gene expression, splicing, transport, polyadenylation and RNA stability. To better understand RBP mechanisms, we aimed to integrate the rapidly growing body of RBP experimental data with the latest genome annotation, gene function, RNA sequence and structure.
We have developed an integrative orthogonality-regularized nonnegative matrix factorization that can integrate multiple data sets and discover non-overlapping and class-specific RNA binding patterns of varying strengths. The orthogonality constraint is important here because it enables us to substantially reduce the effective size of inferred factor models.
The new models have proved powerful in predicting RBP interaction sites on RNA. We also showed that joint analysis of multiple data sets can boost retrieval accuracy of RNA binding sites, which we studied using the largest RBP data compendium to date.
The Winter issue of ACM XRDS is here! This issue discusses the Internet of Things (IoT), a collection of emerging technologies that promise to seamlessly expand our sensing capabilities across the globe with imagination as our only limit. In the issue you can read about the prospects for the IoT as seen by leaders in the field, the challenges of building network awareness, and the trends in IoT platforms. You will also find columns about ontology-supported stream reasoning for querying flying robots, the importance of encrypted data in IoT systems, and ways of managing droughts using tech.
My department contributed a column on predicting activities of daily living (ADL) from sensor activation profiles. Together with Lara Zupan, we used Orange, an open-source data mining tool for visual programming, to analyze patterns of how humans interact with household devices. These interaction patterns provided powerful clues that helped us recognize various activities that take place in home environments.
Our paper on Collective Pairwise Classification for Multi-Way Analysis has been published in the Proceedings of the 21st Pacific Symposium on Biocomputing. We will present the work at the PSB conference in January 2016.
In the paper, we develop a collective pairwise classification approach for multi-way data analysis. The approach leverages the superiority of latent factor models for analyzing large heterogeneous relational data sets and provides probabilistic estimates of relationships by optimizing a pairwise ranking loss. Although the method bears correspondence with the maximization of a non-differentiable area under the receiver operating characteristic curve, we were able to design a learning algorithm that scales well on large multi-relational data.
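The link between a pairwise ranking loss and the area under the ROC curve can be illustrated in a few lines. The sketch below is a generic illustration, not the learning algorithm from the paper: AUC counts correctly ordered (positive, negative) pairs, which is a step function of the scores, while the logistic surrogate used here (an assumption for this example) is smooth and can be minimized by gradient methods.

```python
import math


def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 1/2."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))


def pairwise_logistic_loss(pos_scores, neg_scores):
    """Smooth surrogate: mean of log(1 + exp(-(p - n))) over all pairs."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            d = p - n
            # Numerically stable form of log(1 + exp(-d)).
            total += max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
    return total / (len(pos_scores) * len(neg_scores))


# Hypothetical scores a model assigned to known and unknown relationships.
positives = [0.9, 0.8]
negatives = [0.1, 0.85]
ranking_quality = auc(positives, negatives)
surrogate = pairwise_logistic_loss(positives, negatives)
```

Pushing positive scores above negative ones lowers the surrogate and, in turn, tends to raise the AUC, even though the AUC itself offers no useful gradient.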
We used the method to infer relationships from multiplex drug data and to predict connections between clinical manifestations of diseases and their underlying molecular signatures. An appealing property of the method is its ability to make category-jumping inferences, such as predictions about diseases based solely on genomic and clinical data generated far outside the molecular context.
The Fall issue of ACM XRDS is here! In this issue we write about virtual reality. Among other things, you can read about the virtual reality revolution and ways to bring virtual reality home. The issue also discusses how to use your own muscles to achieve realistic physical experience, how to manage cybersickness in virtual reality, and how to avoid danger with mine disaster simulations.
My department contributed a column on mining the Marvel comic book universe. Together with Lara Zupan we scraped Wikipedia to obtain information on the Marvel comics characters and then analyzed the structure of the Marvel multiverse network, where two characters were considered linked if they shared a skill set. Here, the analysis of complex networks allowed us to better understand how properties of fictitious networks emerge from non-trivial interactions between characters.
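A network of this kind is simple to build once each character's skills are known. The sketch below is not the code from the column: it assumes that two characters are linked when they share at least one skill, and the character names and skills are invented placeholders rather than scraped data.

```python
from itertools import combinations


def build_network(skills_by_character):
    """Return the set of undirected edges between characters sharing a skill."""
    edges = set()
    for a, b in combinations(sorted(skills_by_character), 2):
        if skills_by_character[a] & skills_by_character[b]:
            edges.add((a, b))
    return edges


def degrees(skills_by_character, edges):
    """Count how many neighbours each character has."""
    degree = {name: 0 for name in skills_by_character}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree


# Placeholder data, for illustration only.
toy_skills = {
    "hero_a": {"flight", "strength"},
    "hero_b": {"strength"},
    "hero_c": {"telepathy"},
    "hero_d": {"flight", "telepathy"},
}
toy_edges = build_network(toy_skills)
toy_degrees = degrees(toy_skills, toy_edges)
```

Degree counts like these are the starting point for the usual network statistics, such as degree distributions, hubs and clustering.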
Our paper on Sieve-based relation extraction of gene regulatory networks from biological literature has been published in BMC Bioinformatics.
In the paper, we describe a network extraction algorithm that improves on our winning submission to BioNLP 2013. Our method is designed as a sieve-based system and uses linear-chain conditional random fields and rules for relation extraction. To enable extraction of distant relations, we transform the data into skip-mention sequences. We then infer multiple models, each of which is able to extract a particular relationship type (e.g., inhibition, activation, binding). Further analysis following the challenge showed that all relation extraction sieves contribute to the predictive performance of the proposed approach. It also showed that features built from mention words and their prefixes and suffixes contribute most to extraction accuracy, and that transforming the data into skip-mention sequences is appropriate for detecting relations between distant mentions.
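The idea behind skip-mention sequences can be sketched as follows. This is one plausible reading of the transformation, not the code from the paper: a linear-chain model only relates neighbouring items, so taking every (skip + 1)-th mention turns distant mentions into neighbours. The function names and the `skip` parameter are illustrative.

```python
def skip_mention_sequences(mentions, skip):
    """Split an ordered list of mentions into sequences that skip `skip` mentions.

    skip=0 returns the original sequence; skip=1 returns two sequences made of
    every second mention, and so on.
    """
    step = skip + 1
    sequences = [mentions[offset::step] for offset in range(step)]
    return [seq for seq in sequences if seq]


def adjacent_pairs(sequence):
    """Pairs of neighbouring mentions, i.e. the pairs a linear chain can relate."""
    return list(zip(sequence, sequence[1:]))


mentions = ["m1", "m2", "m3", "m4", "m5"]
one_skip = skip_mention_sequences(mentions, skip=1)
one_skip_pairs = [pair for seq in one_skip for pair in adjacent_pairs(seq)]
```

In the one-skip sequences, mentions that were two positions apart in the text become adjacent, so a chain model trained on them can pick up relations that the original ordering would hide.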
Our paper on Gene prioritization by compressive data fusion and chaining has been published in PLoS Computational Biology.
In the paper, we present Collage, a new data fusion approach to gene prioritization. Together with collaborators from Baylor College of Medicine, we tested Collage by prioritizing bacterial response genes in Dictyostelium as a novel model system for prokaryote-eukaryote interactions.
We started from four bacterial response genes and 14 different data sets ranging from gene expression to pathway and literature information. Collage proposed eight candidate genes that were tested in the wet laboratory. Mutations in all eight candidates reduced the ability of the amoebae to grow on Gram-negative bacteria. Furthermore, five out of the eight candidate genes were required for growth on Gram-negative bacteria but had no discernible effect on growth on Gram-positive bacteria. This is a remarkably accurate result since only about a hundred of the 12,000 Dictyostelium genes are estimated to be responsible for bacterial response.
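A back-of-the-envelope calculation shows why this result stands out. The sketch below is not an analysis from the paper; it takes the rough figures quoted above (about 100 relevant genes among 12,000) at face value and asks how likely it would be to hit a relevant gene with every one of eight picks made uniformly at random.

```python
from math import comb


def chance_all_hits(population, relevant, picks):
    """Hypergeometric probability that all `picks` random draws are relevant."""
    return comb(relevant, picks) / comb(population, picks)


# Rough figures from the text; both are estimates, not exact counts.
single_pick = chance_all_hits(12000, 100, 1)
eight_of_eight = chance_all_hits(12000, 100, 8)
```

A single random pick succeeds less than one time in a hundred, and the chance of eight successes in a row is vanishingly small, so random selection cannot explain eight confirmed candidates out of eight.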
Together with Blaz Zupan, we are organizing a tutorial on data fusion at the International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).
In the tutorial, we will explore latent factor models, a popular class of approaches that have in recent years seen many successful applications in integrative data analysis. We will describe the intuition behind matrix factorization and explain why latent factor models are suitable when collectively analyzing many heterogeneous data sets. To practice data fusion, we will construct visual data fusion workflows using Orange and its Data Fusion add-on.
This tutorial would not be possible without the great support of the Bioinformatics Laboratory at the University of Ljubljana.
The Summer issue of ACM XRDS is here! In this issue we write about computational biology. Our features and interviews present different perspectives about some of the most recent advances of computational biology. You can read about personalized medicine and the use of genetic data to improve drug treatment, pharmacogenetics, machine learning techniques for mapping genetic differences to phenotypes in large-scale genome-wide association studies, computational approaches towards prediction of patient outcomes based on electronic health records, statistical techniques for drug discovery, etc. This issue also includes discussions on cutting-edge techniques, such as the analysis of single cell measurement data.
My department contributed a column on mining cancer data with matrix factorization, an established class of algorithms that has proved useful in many bioinformatics studies. The diversity and abundance of data provided by cancer projects such as the International Cancer Genome Consortium challenge computer scientists of all kinds to develop innovative software, hardware, and analytic solutions for data analysis. We expect that with computationally and statistically stronger approaches, such as factorization models, we will one day be able to reveal biological features that drive cancer development, define cancer types relevant for prognosis, and, ultimately, enable the development of new cancer therapies.