Marinka Zitnik

Fusing bits and DNA

Compressive Data Fusion and Persistent Homology

My talk at the Summer School on Computational Topology in Ljubljana, Slovenia, was about coupling compressive data fusion methods with algebraic topology, in particular persistent homology. I discussed how the latent data space obtained by fusing heterogeneous biological data sets can be explored with topological methods.

In a case study from molecular biology, which included nearly two dozen data sets, we studied the persistence (lifetime) of various topological features, such as connected components, loops, voids and tunnels. We showed that significant topological features, i.e. features with long lifetimes, also carry biologically relevant information. For example, gene modules with significant topology were enriched for cellular functions and biological processes, and, similarly, persistent drug modules captured the structural similarity between drugs.
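
To make the notion of a feature's lifetime concrete, here is a minimal Python sketch of 0-dimensional persistence (connected components) on a toy point cloud. It is an illustration only, not the pipeline used in the talk: the points are made up, the function name `h0_persistence` is hypothetical, and loops and voids would need a dedicated topology library.

```python
from itertools import combinations
from math import dist


def h0_persistence(points):
    """Return the death scales of connected components, in increasing order.

    Every point is born as its own component at scale 0. Edges are added by
    increasing length (as in a Vietoris-Rips filtration), and a component
    dies at the length of the edge that merges it into another component.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # this edge merges two components
            parent[root_i] = root_j
            deaths.append(length)
    return deaths


# Two tight clusters that lie far apart (toy data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
deaths = h0_persistence(points)
```

Three components die at a small scale (points merging within a cluster), while one survives until the two clusters finally merge. That long-lived component is the kind of persistent feature referred to above. The single component that never dies is not listed.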

The slides of the talk are available.

Last Updated on Thursday, 18 February 2016 07:47

Invited Talk on Learning Latent Factor Models by Data Fusion

Our invited talk at the Workshop on Matrix Computations for Biomedical Informatics at the 15th Conference on Artificial Intelligence in Medicine, AIME in Pavia, Italy, discussed the use of collective latent factor models for various predictive modeling tasks in biomedicine, such as gene prioritization, gene function prediction, network inference and discovery of disease-disease associations.

In the talk given together with Blaz Zupan, we highlighted our recent developments of data fusion approaches via latent factor models.

The slides of the talk are available at Prezi.

Last Updated on Friday, 21 August 2015 16:05

Bioinformatics: Gene Network Inference by Fusing Diverse Distributions

Our paper on Gene network inference by fusing data from diverse distributions has been published in Bioinformatics. We will present it at ISMB 2015 in Dublin.

In the paper we describe FuseNet, a Markov network formulation that infers networks from a collection of potentially nonidentically distributed datasets.

Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as those from next-generation sequencing, often violate this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets.

FuseNet is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. We also demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies.

Last Updated on Friday, 21 August 2015 16:06

Poster Award at the Basel Computational Biology Conference

Our poster on Gene prioritization by compressive data fusion and chaining won the best poster award at the Basel Computational Biology Conference ([BC]^2).

The poster highlights our recent computational method that prioritizes genes by fusing heterogeneous data. An appealing property of our approach is its ability to consider data that might be provided in totally different input spaces, which is achieved through chaining of the latent data representation. We report on a very successful application in hunting genes responsible for bacterial resistance in Dictyostelium, where our predictions were validated in the wet lab.

Last Updated on Sunday, 14 June 2015 10:05

Data Fusion Tutorial at the Basel Computational Biology Conference

Together with Blaz Zupan we organize a tutorial on data fusion at the Basel Computational Biology Conference ([BC]^2). The tutorial is targeted at computational scientists, data mining researchers and molecular biologists interested in large-scale data integration and predictive modeling.

In the tutorial we focus on collective latent factor models, which have gained popularity in recent years through many successful applications in integrative predictive modeling. We have prepared a series of short lecture notes, which provide the intuition and mathematics behind the algorithms, explain why factorization approaches are suitable when collectively analyzing many heterogeneous data sets, and contain a number of case studies taken from recommendation systems, functional genomics, molecular and systems biology. We demonstrate several recent methodological advancements in hands-on sessions using Orange and its Data Fusion Add-on.

This tutorial would not be possible without the great support of the Bioinformatics Laboratory at the University of Ljubljana.

Last Updated on Friday, 21 August 2015 15:52

Syst Biomed: Survival Regression by Data Fusion

Our recent paper in Systems Biomedicine describes a new computational approach that predicts a patient’s survival time from a collection of heterogeneous data sets. This is the full paper of our award-winning entry at the CAMDA meeting at ISMB 2014, Boston, MA, USA.

The approach builds upon the recently proposed collective matrix factorization and the well-known Aalen’s additive model for survival regression. Unlike existing methods for survival time prediction, we formulate a joint inference procedure that simultaneously infers the model parameters of collective matrix factorization and the regression coefficients of Aalen’s model. We demonstrate improved performance of our method over several baselines in case studies involving three cancer types from the International Cancer Genome Consortium and diverse data sets, such as gene and miRNA expression profiles, somatic mutation data, methylation and gene annotations from the Gene Ontology. Both the latent data representation and joint inference, the two features of our approach, contribute substantially to the accurate prediction of survival time. Our results point to the potential benefits of data fusion when inferring survival models that are predictive of clinical outcomes.

Last Updated on Friday, 21 August 2015 16:06

J Comp Biol: Network-Guided Matrix Completion

Our recent paper in Journal of Computational Biology introduces an interaction data imputation method called network-guided matrix completion (NG-MC). The core part of NG-MC is low-rank probabilistic matrix completion that incorporates prior knowledge presented as a collection of gene networks. NG-MC assumes that interactions are transitive, such that latent gene interaction profiles inferred by NG-MC depend on the profiles of their direct neighbors in gene networks. As the NG-MC inference algorithm progresses, it propagates latent interaction profiles through each of the networks and updates gene network weights toward improved prediction.
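
The idea of letting a gene's latent profile depend on its network neighbors can be sketched in a few lines of Python with NumPy. This is a deliberately simplified stand-in, not the published NG-MC algorithm: it uses plain gradient descent instead of probabilistic inference, a single fixed network instead of several networks with learned weights, and toy data. The names `smoother`, `alpha` and `step` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, rank = 6, 2

# Toy symmetric interaction scores generated from a rank-2 model.
true_profiles = rng.uniform(-1.0, 1.0, size=(n_genes, rank))
scores = true_profiles @ true_profiles.T

# Hide roughly 30% of gene pairs; the mask is symmetric (1 = measured).
upper = np.triu(rng.random((n_genes, n_genes)) > 0.3)
observed = (upper | upper.T).astype(float)

# Hypothetical gene network: a chain 0-1-2-3-4-5. Rows are normalized so
# that `network @ profiles` averages the profiles of direct neighbors.
adjacency = np.zeros((n_genes, n_genes))
for i in range(n_genes - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
network = adjacency / adjacency.sum(axis=1, keepdims=True)
smoother = np.eye(n_genes) - network


def loss(profiles, alpha):
    # Squared error on measured pairs plus a penalty on profiles that
    # differ from the average profile of their network neighbors.
    residual = observed * (profiles @ profiles.T - scores)
    return np.sum(residual ** 2) + alpha * np.sum((smoother @ profiles) ** 2)


def gradient(profiles, alpha):
    residual = observed * (profiles @ profiles.T - scores)
    data_term = 4.0 * residual @ profiles
    network_term = 2.0 * alpha * smoother.T @ smoother @ profiles
    return data_term + network_term


alpha, step = 0.1, 0.005
profiles = 0.1 * rng.standard_normal((n_genes, rank))
initial_loss = loss(profiles, alpha)
for _ in range(2000):
    profiles -= step * gradient(profiles, alpha)
final_loss = loss(profiles, alpha)

completed = profiles @ profiles.T  # predictions for measured and hidden pairs
```

Because the penalty ties each profile to its neighbors, a gene with no measured interactions would still receive a profile from the network term.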

In a study of four different E-MAP assays that also considered protein–protein interaction and Gene Ontology similarity networks, NG-MC significantly surpassed existing alternative techniques. Inclusion of information from gene networks also allowed NG-MC to predict interactions for genes that were not included in the original E-MAP assays, a task that could not be considered by current imputation approaches.

Epistatic miniarray profile (E-MAP) is a popular large-scale genetic interaction discovery platform. E-MAPs benefit from quantitative output, which makes it possible to detect subtle interactions with greater precision. However, due to the limits of biotechnology, E-MAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, in this way completing the interaction profiles and enabling downstream analysis algorithms that could otherwise be sensitive to missing data values.

The conference paper is available from the RECOMB 2014 website. Supplementary material is available from the GitHub repository NG-MC.

Last Updated on Thursday, 18 February 2016 07:00

ACM XRDS: The Anatomy of a Human Disease Network

The Winter 2014 issue of ACM XRDS is here! This issue is on health informatics, which has received considerable attention both in research and among the general public in recent years. You can read about the opportunities of social media in health and well-being, #engaging health initiatives and challenges in personal health tracking, among others.

My department contributed the column about the anatomy of a human disease network. In the column, we explore the human disease network and demonstrate how network-based tools can help us understand relations between diseases at a higher level of organismal organization without considering any prior biomedical knowledge.

Tools of network analysis have recently been applied to many complex systems, to both simplify and highlight their underlying structure and the relationships that they represent. The results obtained from network-based approaches provide not only insight into interactions between online users, but also new clues about how to improve our understanding of biological systems. Network medicine in particular, a network-based approach to studying human disease, has proven effective in studying interdependence between molecular components in cells, and in identifying disease modules and biological pathways.

Last Updated on Friday, 21 August 2015 15:01

IEEE TPAMI: Data Fusion by Matrix Factorization

We recently published a paper on a new data fusion method in IEEE Transactions on Pattern Analysis and Machine Intelligence.

For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system’s constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization, called data fusion by matrix factorization (DFMF), that simultaneously factorizes many data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF on a gene function prediction task with eleven different data sources and on the prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
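
The core idea of simultaneous tri-factorization, where each relation matrix R_ij is approximated as G_i S_ij G_j^T and the factor G_i is shared by every matrix that involves object type i, can be sketched with NumPy. This is a simplified illustration and not the DFMF update rules from the paper: it assumes nonnegative data, uses standard multiplicative updates, omits the constraint (penalty) matrices, and runs on random toy matrices. All sizes and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-9  # guards against division by zero in the multiplicative updates

n1, n2, n3 = 8, 6, 5  # number of objects of each type (toy sizes)
k1, k2, k3 = 3, 2, 2  # factorization rank chosen for each type

# Two relation matrices that share object type 2 (random, for illustration).
R12 = rng.random((n1, n2))
R23 = rng.random((n2, n3))

G1, G2, G3 = rng.random((n1, k1)), rng.random((n2, k2)), rng.random((n3, k3))
S12, S23 = rng.random((k1, k2)), rng.random((k2, k3))


def objective():
    # Total squared reconstruction error over both relations.
    return (np.linalg.norm(R12 - G1 @ S12 @ G2.T) ** 2
            + np.linalg.norm(R23 - G2 @ S23 @ G3.T) ** 2)


initial_error = objective()
for _ in range(200):
    G1 *= (R12 @ G2 @ S12.T) / (G1 @ S12 @ G2.T @ G2 @ S12.T + EPS)
    G3 *= (R23.T @ G2 @ S23) / (G3 @ S23.T @ G2.T @ G2 @ S23 + EPS)
    # G2 is shared, so its update pools evidence from both relations.
    numerator = R12.T @ G1 @ S12 + R23 @ G3 @ S23.T
    denominator = G2 @ (S12.T @ G1.T @ G1 @ S12 + S23 @ G3.T @ G3 @ S23.T) + EPS
    G2 *= numerator / denominator
    S12 *= (G1.T @ R12 @ G2) / (G1.T @ G1 @ S12 @ G2.T @ G2 + EPS)
    S23 *= (G2.T @ R23 @ G3) / (G2.T @ G2 @ S23 @ G3.T @ G3 + EPS)
final_error = objective()
```

Sharing `G2` is what makes this fusion rather than two independent factorizations: information in `R23` shapes the latent representation that is also used to reconstruct `R12`.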

A short preprint is available at arXiv:1307.0803. The full paper is online at IEEE.

Last Updated on Friday, 21 August 2015 16:05

ACM XRDS: Dynamics of News from The New York Times

A new issue of ACM XRDS is here! The focus of this issue is on techniques for natural language processing in the broader sense. You will find interesting stories about how to detect influencers in social media discussions and how to successfully transition from academia to entrepreneurship, and you can read about the hurdles and opportunities in research on ancient written languages.

My department contributed a short column on exploring news from The New York Times. Techniques of information extraction and natural language processing allow us to search for news articles and to analyze dynamics of published content. News organizations, such as The New York Times, provide programmatic access to their articles to retrieve headlines, abstracts, and links to published multimedia. In the column we use The New York Times Article Search API to demonstrate how to construct search queries that retrieve documents from various news sections and time periods. We also explore the pulse of climate change over the years using data extracted from published news articles.
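
As a rough idea of what constructing such search queries can look like, here is a small Python sketch that builds Article Search request URLs for one topic, one year at a time. It is not the code from the column. The endpoint and the `q`, `fq`, `begin_date`, `end_date`, `page` and `api-key` parameters follow version 2 of the API as I understand it, but the `section_name` filter field and the helper `build_query` are assumptions, so check the current NYT developer documentation before use. `YOUR_API_KEY` is a placeholder and no request is actually sent.

```python
from urllib.parse import urlencode

# Article Search API v2 endpoint (verify against the current NYT docs).
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_query(term, begin_date, end_date, api_key, section=None, page=0):
    """Build an Article Search URL for `term` between two YYYYMMDD dates."""
    params = {
        "q": term,
        "begin_date": begin_date,
        "end_date": end_date,
        "page": page,
        "api-key": api_key,
    }
    if section is not None:
        # Lucene-style filter query; the field name is an assumption.
        params["fq"] = f'section_name:("{section}")'
    return f"{BASE_URL}?{urlencode(params)}"


# One query per year gives a crude yearly "pulse" for a topic.
urls = [
    build_query("climate change", f"{year}0101", f"{year}1231", "YOUR_API_KEY")
    for year in range(2010, 2014)
]
science_url = build_query(
    "climate change", "20100101", "20101231", "YOUR_API_KEY", section="Science"
)
```

Fetching each URL and counting the returned articles per year would then give the kind of time series needed to follow a topic over the years.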

Last Updated on Friday, 21 August 2015 15:00

@Heidelberg Laureate Forum 2014

I recently participated as a young researcher in computer science at the Heidelberg Laureate Forum. I encourage the reader to check the recordings of some of the talks, which are available at the official HLF website. If you are short on time, I recommend at least one of the talks by Michael Atiyah, Manuel Blum, Wendelin Werner, Vint Cerf, Leslie Lamport, Manjul Bhargava, Daniel Spielman, Efim Zelmanov or John Hopcroft. They are engaging, full of useful tips and strategies, and should be accessible to an interested listener.

Many CS & Math bloggers followed the event; their comments and discussions about the laureates' talks can be found at HLF Blogs. Among others, our poster was highlighted by John D. Cook. Overall, HLF was an awesome experience for me, with many opportunities to network with Turing, Fields, Abel and Nevanlinna laureates and to meet other young researchers in computer science and mathematics from around the world.

Last Updated on Sunday, 28 September 2014 20:13
