Marinka Zitnik

Fusing bits and DNA

ISMB 2015: Gene Network Inference via Data Fusion

Our paper at ISMB 2015 addresses the challenging task of inferring gene networks from potentially many data sets. Importantly, these data sets might be nonidentically distributed and can follow any combination of exponential family distributions. To tackle this challenge, we develop an efficient Markov network model that achieves fusion by reusing latent model parameters.

Empirical studies on cancer genome data sets show an advantage of joint inference over separate network inference and the merits of incorporating information about the underlying data distribution into inference.

The slides of the talk are available.

 

ISMB 2015: Integrate Everything but the Kitchen Sink

Our poster at ISMB 2015 is concerned with data set selection and sensitivity estimation in collective factor models.

Molecular biology data is rich in volume as well as heterogeneity. We can view individual data sets as relations between objects of different types; for example, function annotations describe relationships between genes and functions. We represent a large data compendium with a multiscale and multiplex relation graph. Recently, latent factor models were developed to fuse such representations and collectively infer accurate prediction models (Zitnik & Zupan, IEEE TPAMI 2015). Here, we are interested in how changes in one relation (data set) affect the latent model of another relation in the context of a given collective latent factor model. For example, in a user-movie recommendation system, how would a change in casting affect a user's movie preferences? In bioinformatics, how would a change in gene expression data influence the prediction of gene-disease associations?

We address this challenge by developing an approach that estimates the dependence between any two relations within a single run of the inference algorithm. The approach, called Forensic, derives from the theory of Fréchet derivatives and matrix conditioning and can be used with any collective matrix factorization.

See our poster for more details.

 

Compressive Data Fusion and Persistent Homology

My talk at the Summer School on Computational Topology in Ljubljana, Slovenia was about coupling compressive data fusion methods with algebraic topology, in particular persistent homology. There, I discussed how the latent data space obtained by fusion of heterogeneous biological data sets can be explored with topological methods.

In a case study from molecular biology, which included nearly two dozen data sets, we studied persistence (lifetime) of various topological features, e.g. connected components, loops, voids, tunnels, etc. We showed that significant topological features, i.e. features with long lifetime, also carry biologically relevant information. For example, gene modules with significant topology were enriched for cellular functions and biological processes, and, similarly, persistent drug modules captured the structural similarity between drugs.
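As a rough illustration of what "lifetime" means for the simplest kind of feature, connected components, here is a minimal sketch (not the code used in the case study). It assumes the latent profiles are given as points in Euclidean space and that a component "dies" at the length of the edge that merges it into another one; the function name `h0_persistence` is hypothetical.

```python
import numpy as np

def h0_persistence(points):
    """Death scales of connected components as the distance threshold grows.

    Every point starts as its own component (born at scale 0). Edges are
    added in order of increasing length; whenever an edge joins two
    different components, one of them dies at that edge length.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))

    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(float(d))
    # n - 1 finite deaths; the last surviving component never dies.
    return deaths
```

Components with large death scales are the long-lived ones. Loops and voids need a full persistent homology library and are outside the scope of this sketch.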

The slides of the talk are available.

Last Updated on Thursday, 18 February 2016 07:47
 

Invited Talk on Learning Latent Factor Models by Data Fusion

Our invited talk at the Workshop on Matrix Computations for Biomedical Informatics at the 15th Conference on Artificial Intelligence in Medicine, AIME in Pavia, Italy, discussed the use of collective latent factor models for various predictive modeling tasks in biomedicine, such as gene prioritization, gene function prediction, network inference and discovery of disease-disease associations.

In the talk given together with Blaz Zupan, we highlighted our recent developments of data fusion approaches via latent factor models.

The slides of the talk are available at Prezi.

Last Updated on Friday, 21 August 2015 16:05
 

Bioinformatics: Gene Network Inference by Fusing Diverse Distributions

Our paper on Gene network inference by fusing data from diverse distributions has been published in Bioinformatics. We will present it at ISMB 2015 in Dublin.

In the paper we describe FuseNet, a Markov network formulation that infers networks from a collection of potentially nonidentically distributed datasets.

Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as data from next-generation sequencing, often violate this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets.

FuseNet is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. We also demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies.

Last Updated on Friday, 21 August 2015 16:06
 

Poster Award at the Basel Computational Biology Conference

Our poster on Gene prioritization by compressive data fusion and chaining received the best poster award at the Basel Computational Biology Conference ([BC]^2).

The poster highlights our recent computational method that prioritizes genes by fusing heterogeneous data. An appealing property of our approach is its ability to consider data that might be provided in totally different input spaces, which is achieved through chaining of the latent data representation. We report on a very successful application in hunting genes responsible for bacterial resistance in Dictyostelium, where our predictions were validated in the wet lab.

Last Updated on Sunday, 14 June 2015 10:05
 

Data Fusion Tutorial at the Basel Computational Biology Conference

Together with Blaz Zupan we organize a tutorial on data fusion at the Basel Computational Biology Conference ([BC]^2). The tutorial is targeted at computational scientists, data mining researchers and molecular biologists interested in large-scale data integration and predictive modeling.

In the tutorial we focus on collective latent factor models, which have gained popularity in recent years through many successful applications in integrative predictive modeling. We have prepared a series of short lecture notes, which provide the intuition and mathematics behind the algorithms, explain why factorization approaches are suitable when collectively analyzing many heterogeneous data sets, and contain a number of case studies taken from recommendation systems, functional genomics, molecular and systems biology. We demonstrate several recent methodological advancements in hands-on sessions using Orange and its Data Fusion Add-on.

This tutorial would not be possible without the great support of the Bioinformatics Laboratory at the University of Ljubljana.

Last Updated on Friday, 21 August 2015 15:52
 

Syst Biomed: Survival Regression by Data Fusion

Our recent paper in Systems Biomedicine describes a new computational approach that predicts a patient’s survival time from a collection of heterogeneous data sets. This is the full paper of our award-winning entry at the CAMDA meeting at ISMB 2014, Boston, MA, USA.

The approach builds upon the recently proposed collective matrix factorization and Aalen’s well-known additive model for survival regression. Unlike existing methods for survival time prediction, we formulated a joint inference procedure that allows us to simultaneously infer the model parameters of collective matrix factorization and the regression coefficients of Aalen’s model. We demonstrated improved performance of our method over several baselines in case studies involving three cancer types from the International Cancer Genome Consortium and diverse data sets, such as gene and miRNA expression profiles, somatic mutation data, methylation and gene annotations from the Gene Ontology. We also showed that both the latent data representation and joint inference, the two features of our approach, contribute substantially to the accurate prediction of survival time. Our results point to the potential benefits of data fusion when inferring survival models that are predictive of clinical outcomes.
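For readers unfamiliar with it, Aalen's additive model in its standard textbook form writes the hazard of patient i at time t as a baseline plus time-varying contributions of the covariates. The notation below (p covariates x_{ij}, coefficient functions \beta_j) is ours for illustration; which covariates enter the model, and how they are tied to the latent factors, is described in the paper and not reproduced here.

```latex
h(t \mid x_i) \;=\; \beta_0(t) + \sum_{j=1}^{p} \beta_j(t)\, x_{ij}
```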

Last Updated on Friday, 21 August 2015 16:06
 

J Comp Biol: Network-Guided Matrix Completion

Our recent paper in Journal of Computational Biology introduces an interaction data imputation method called network-guided matrix completion (NG-MC). The core part of NG-MC is low-rank probabilistic matrix completion that incorporates prior knowledge presented as a collection of gene networks. NG-MC assumes that interactions are transitive, such that latent gene interaction profiles inferred by NG-MC depend on the profiles of their direct neighbors in gene networks. As the NG-MC inference algorithm progresses, it propagates latent interaction profiles through each of the networks and updates gene network weights toward improved prediction.
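The idea of letting a network guide a low-rank completion can be sketched in a few lines. This is a simplified stand-in and not the NG-MC algorithm itself: it assumes a single fixed network, plain gradient descent, and a graph-Laplacian penalty that pulls the latent profiles of neighboring genes together. The function name, parameters and default values are all hypothetical.

```python
import numpy as np

def complete_with_network(R, mask, A, k=3, lam=0.1, lr=0.002, n_iter=1000, seed=0):
    """Fill in missing entries of R with a low-rank model smoothed over a network.

    R    : (n, n) interaction scores (values at unobserved entries are ignored)
    mask : (n, n) boolean array, True where R was measured
    A    : (n, n) symmetric adjacency matrix of a gene network
    """
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    F = 0.1 * rng.standard_normal((n, k))   # latent profiles (rows of R)
    G = 0.1 * rng.standard_normal((n, k))   # latent profiles (columns of R)
    L = np.diag(A.sum(axis=1)) - A          # graph Laplacian of the network
    R_obs = np.where(mask, R, 0.0)          # also neutralizes NaNs at missing entries

    for _ in range(n_iter):
        E = mask * (R_obs - F @ G.T)        # residual on observed entries only
        # Gradients of sum(E**2) + lam * (tr(F'LF) + tr(G'LG)).
        grad_F = -2.0 * E @ G + 2.0 * lam * L @ F
        grad_G = -2.0 * E.T @ F + 2.0 * lam * L @ G
        F -= lr * grad_F
        G -= lr * grad_G
    return F @ G.T                          # dense matrix with imputed values
```

The step size is deliberately small and would need tuning on real data. Unlike this sketch, NG-MC works with a collection of networks and updates their weights during inference.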

In a study with four different E-MAP data sets, using protein–protein interaction and Gene Ontology similarity networks as prior knowledge, NG-MC significantly surpassed existing alternative techniques. Inclusion of information from gene networks also allowed NG-MC to predict interactions for genes that were not included in the original E-MAP assays, a task that current imputation approaches could not address.

Epistatic miniarray profile (E-MAP) is a popular large-scale genetic interaction discovery platform. E-MAPs benefit from quantitative output, which makes it possible to detect subtle interactions with greater precision. However, due to the limits of biotechnology, E-MAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, in this way completing the interaction profiles and enabling downstream analysis algorithms that could otherwise be sensitive to missing data values.

The conference paper is available from the RECOMB 2014 website. Supplementary material is available from the GitHub repository NG-MC.

Last Updated on Thursday, 18 February 2016 07:00
 

ACM XRDS: The Anatomy of a Human Disease Network

The Winter 2014 issue of ACM XRDS is here! This issue is on health informatics, which has received considerable attention in recent years, both in research and among the general public. You can read about the opportunities of social media in health and well-being, #engaging health initiatives and challenges in personal health tracking, among others.

My department contributed the column about the anatomy of a human disease network. In the column, we explore the human disease network and demonstrate how network-based tools can help us understand relations between diseases at a higher level of organismal organization without considering any prior biomedical knowledge.

Tools of network analysis have recently been applied to many complex systems, to both simplify and highlight their underlying structure and the relationships that they represent. The results obtained from network-based approaches provide not only insight into interactions between online users, but also new clues about how to improve our understanding of biological systems. Network medicine in particular, a network-based approach to studying human disease, has proven effective in studying interdependence between molecular components in cells, and in identifying disease modules and biological pathways.

Last Updated on Friday, 21 August 2015 15:01
 

IEEE TPAMI: Data Fusion by Matrix Factorization

We recently published a paper on a new data fusion method in IEEE Transactions on Pattern Analysis and Machine Intelligence.

For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system’s constraints. In the paper we describe a data fusion approach based on penalized matrix tri-factorization, called data fusion by matrix factorization (DFMF), that simultaneously factorizes many data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF on a gene function prediction task with eleven different data sources and on the prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
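To give a feel for simultaneous tri-factorization, here is a minimal sketch in which two relation matrices share one latent factor. It is not the DFMF algorithm: it drops the penalties and constraints and uses plain alternating least squares, and every name in it is hypothetical. Think of R12 as genes by functions and R13 as genes by diseases, each approximated as G1 S G^T with the gene factor G1 reused in both.

```python
import numpy as np

def fuse_two_relations(R12, R13, k1, k2, k3, n_iter=50, seed=0):
    """Alternating least squares for R12 ~ G1 S12 G2^T and R13 ~ G1 S13 G3^T.

    G1 is shared by both relations, which is what couples the two data sets.
    Returns the factors and the total squared error after each iteration.
    """
    rng = np.random.default_rng(seed)
    G1 = rng.standard_normal((R12.shape[0], k1))
    G2 = rng.standard_normal((R12.shape[1], k2))
    G3 = rng.standard_normal((R13.shape[1], k3))
    R_all = np.hstack([R12, R13])
    errors = []
    for _ in range(n_iter):
        # Middle factors: least-squares solution given the outer factors.
        P1 = np.linalg.pinv(G1)
        S12 = P1 @ R12 @ np.linalg.pinv(G2).T
        S13 = P1 @ R13 @ np.linalg.pinv(G3).T
        # Relation-specific factors, each fitted to its own matrix.
        G2 = R12.T @ np.linalg.pinv(G1 @ S12).T
        G3 = R13.T @ np.linalg.pinv(G1 @ S13).T
        # Shared factor, fitted to both matrices at once.
        B = np.hstack([S12 @ G2.T, S13 @ G3.T])
        G1 = R_all @ np.linalg.pinv(B)
        err = (np.linalg.norm(R12 - G1 @ S12 @ G2.T) ** 2
               + np.linalg.norm(R13 - G1 @ S13 @ G3.T) ** 2)
        errors.append(err)
    return G1, G2, G3, S12, S13, errors
```

Because each step solves its block exactly, the total error cannot increase from one iteration to the next. The products G1 S12 G2^T and G1 S13 G3^T are the completed relations from which predictions would be read.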

A short preprint is available at arXiv:1307.0803. The full paper is online at IEEE.

Last Updated on Friday, 21 August 2015 16:05
 

