Marinka Zitnik

Fusing bits and DNA

Data Fusion Tutorial at the Basel Computational Biology Conference

Together with Blaz Zupan, we are organizing a tutorial on data fusion at the Basel Computational Biology Conference ([BC]^2). The tutorial is aimed at computational scientists, data mining researchers and molecular biologists interested in large-scale data integration and predictive modeling.

In the tutorial we focus on collective latent factor models, which have gained popularity in recent years through many successful applications in integrative predictive modeling. We have prepared a series of short lecture notes that provide the intuition and mathematics behind the algorithms, explain why factorization approaches are suitable for collectively analyzing many heterogeneous data sets, and contain a number of case studies taken from recommendation systems, functional genomics, and molecular and systems biology. We demonstrate several recent methodological advances in hands-on sessions using Orange and its Data Fusion Add-on.

This tutorial would not be possible without the great support of the Bioinformatics Laboratory at the University of Ljubljana.

Last Updated on Friday, 21 August 2015 15:52
 

Syst Biomed: Survival Regression by Data Fusion

Our recent paper in Systems Biomedicine describes a new computational approach that predicts a patient’s survival time from a collection of heterogeneous data sets. This is the full paper of our award-winning entry at the CAMDA meeting at ISMB 2014, Boston, MA, USA.

The approach builds upon the recently proposed collective matrix factorization and the well-known Aalen’s additive model for survival regression. Unlike existing methods for survival time prediction, we formulated a joint inference procedure that simultaneously infers the model parameters of collective matrix factorization and the regression coefficients of Aalen’s model. We demonstrated improved performance of our method over several baselines in case studies involving three cancer types from the International Cancer Genome Consortium and diverse data sets, such as gene and miRNA expression profiles, somatic mutation data, methylation data and gene annotations from the Gene Ontology. We show that both latent data representation and joint inference, the two features of our approach, contribute substantially to the accurate prediction of survival time. Our results point to the potential benefits of data fusion when inferring survival models that are predictive of clinical outcomes.
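
For readers unfamiliar with Aalen’s additive model, its standard form writes the hazard of patient i at time t as a linear function of covariates with time-varying coefficients. In the sketch below, the covariates are taken to be a latent patient profile, as might be obtained from collective matrix factorization. The symbols g_{ij} (the j-th latent factor of patient i) and k (the number of latent factors) are notation assumed here for illustration and are not taken from the paper.

```latex
\lambda_i(t) \;=\; \beta_0(t) + \sum_{j=1}^{k} \beta_j(t)\, g_{ij}
```

Under this reading, joint inference means that the latent factors g_{ij} and the coefficients \beta_j(t) are estimated together rather than in two separate stages.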

Last Updated on Friday, 21 August 2015 16:06
 

J Comp Biol: Network-Guided Matrix Completion

Our recent paper in the Journal of Computational Biology introduces an interaction data imputation method called network-guided matrix completion (NG-MC). The core part of NG-MC is low-rank probabilistic matrix completion that incorporates prior knowledge represented as a collection of gene networks. NG-MC assumes that interactions are transitive, such that latent gene interaction profiles inferred by NG-MC depend on the profiles of their direct neighbors in gene networks. As the NG-MC inference algorithm progresses, it propagates latent interaction profiles through each of the networks and updates gene network weights toward improved prediction.
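
The general idea of completing a matrix with help from a gene network can be illustrated with a much simpler stand-in than NG-MC itself. The sketch below is not the NG-MC algorithm: it uses plain gradient descent on a squared-error objective and a single fixed network with a graph Laplacian penalty, whereas NG-MC is probabilistic and learns weights for several networks. The function name `complete_with_network` and all parameter values are hypothetical.

```python
import numpy as np

def complete_with_network(X, mask, A, rank=2, lam=0.1, lr=0.01,
                          n_iter=2000, seed=0):
    """Low-rank completion of X with a network smoothness penalty on rows.

    X    : (n, m) matrix; entries where mask == 0 are ignored
    mask : (n, m) matrix of 0/1, 1 marks an observed entry
    A    : (n, n) symmetric adjacency matrix of a network over the rows
    Minimizes 0.5*||mask*(U V^T - X)||^2 + 0.5*lam*trace(U^T L U),
    where L is the graph Laplacian of A (an assumed, simplified objective).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    X = np.where(mask > 0, X, 0.0)          # guard against NaN placeholders
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    L = np.diag(A.sum(axis=1)) - A          # graph Laplacian
    for _ in range(n_iter):
        E = mask * (U @ V.T - X)            # residual on observed entries only
        grad_U = E @ V + lam * (L @ U)      # data term + neighbor smoothing
        grad_V = E.T @ U
        U -= lr * grad_U
        V -= lr * grad_V
    return U @ V.T                          # dense matrix with imputed entries
```

The Laplacian term pulls the latent profiles of neighboring genes toward each other, which is one simple way to let a gene with few or no measurements borrow information from its network neighbors.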

In a study with four different E-MAP data sets, using protein–protein interaction and Gene Ontology similarity networks as prior knowledge, NG-MC significantly outperformed existing alternative techniques. Including information from gene networks also allowed NG-MC to predict interactions for genes that were not included in the original E-MAP assays, a task that current imputation approaches could not address.

The epistatic miniarray profile (E-MAP) is a popular large-scale genetic interaction discovery platform. E-MAPs benefit from quantitative output, which makes it possible to detect subtle interactions with greater precision. However, due to the limits of biotechnology, E-MAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, thereby completing the interaction profiles and enabling downstream analysis algorithms that could otherwise be sensitive to missing data values.

The conference paper is available from the RECOMB 2014 website. Supplementary material is available from the GitHub repository NG-MC.

Last Updated on Thursday, 18 February 2016 07:00
 

ACM XRDS: The Anatomy of a Human Disease Network

The Winter 2014 issue of ACM XRDS is here! This issue is on health informatics, which has received considerable attention from both researchers and the general public in recent years. You can read about the opportunities of social media in health and well-being, #engaging health initiatives and challenges in personal health tracking, among others.

My department contributed the column about the anatomy of a human disease network. In the column, we explore the human disease network and demonstrate how network-based tools can help us understand relations between diseases at a higher level of organismal organization without considering any prior biomedical knowledge.

Tools of network analysis have recently been applied to many complex systems, to both simplify and highlight their underlying structure and the relationships that they represent. The results obtained from network-based approaches provide not only insight into interactions between online users, but also new clues about how to improve our understanding of biological systems. Network medicine in particular, a network-based approach to studying human disease, has proven effective in studying interdependence between molecular components in cells, and in identifying disease modules and biological pathways.

Last Updated on Friday, 21 August 2015 15:01
 

IEEE TPAMI: Data Fusion by Matrix Factorization

We recently published a paper on a new data fusion method in IEEE Transactions on Pattern Analysis and Machine Intelligence.

For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system’s constraints. In the paper we describe a data fusion approach based on penalized matrix tri-factorization, called data fusion by matrix factorization (DFMF), that simultaneously factorizes many data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF on a gene function prediction task with eleven different data sources and on the prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
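
To give a feel for what a matrix tri-factorization looks like, here is a minimal sketch for a single nonnegative relation matrix R, approximated as G1 S G2^T with standard multiplicative updates. This is not DFMF: it handles one matrix only, omits the penalty (constraint) terms, and does not share factors across several relations. The function name `tri_factorize` and the default settings are hypothetical.

```python
import numpy as np

def tri_factorize(R, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix R (n1 x n2) as G1 @ S @ G2.T.

    G1 (n1 x k1) and G2 (n2 x k2) hold latent profiles of the two object
    types; S (k1 x k2) relates the two latent spaces. All factors are kept
    nonnegative by multiplicative updates on the squared Frobenius error.
    """
    rng = np.random.default_rng(seed)
    n1, n2 = R.shape
    G1 = rng.random((n1, k1))
    G2 = rng.random((n2, k2))
    S = rng.random((k1, k2))
    errors = []
    for _ in range(n_iter):
        # Each update rescales a factor by (negative gradient part) /
        # (positive gradient part); eps avoids division by zero.
        G1 *= (R @ G2 @ S.T) / (G1 @ S @ G2.T @ G2 @ S.T + eps)
        G2 *= (R.T @ G1 @ S) / (G2 @ S.T @ G1.T @ G1 @ S + eps)
        S *= (G1.T @ R @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
        errors.append(np.linalg.norm(R - G1 @ S @ G2.T))
    return G1, S, G2, errors
```

In a fusion setting, the factor of an object type that appears in several relation matrices would be shared across their factorizations, which is what lets information flow between data sources.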

A short preprint is available at arXiv:1307.0803. The full paper is online at IEEE.

Last Updated on Friday, 21 August 2015 16:05
 

ACM XRDS: Dynamics of News from The New York Times

A new issue of ACM XRDS is here! The focus of this issue is on techniques for natural language processing in the broader sense. You will find interesting stories about how to detect influencers in social media discussions, how to successfully transition from academia to entrepreneurship, and about the hurdles and opportunities in research on ancient written languages.

My department contributed a short column on exploring news from The New York Times. Techniques of information extraction and natural language processing allow us to search for news articles and to analyze the dynamics of published content. News organizations, such as The New York Times, provide programmatic access to their articles to retrieve headlines, abstracts, and links to published multimedia. In the column we use The New York Times Article Search API to demonstrate how to construct search queries that retrieve documents from various news sections and time periods. We also explore the pulse of climate change over the years using data extracted from published news articles.
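
As a rough illustration of what such a search query looks like, the sketch below only builds a request URL and does not contact the service. It assumes the v2 Article Search endpoint and its `q`, `fq`, `begin_date`, `end_date`, `page` and `api-key` parameters; the exact filter field names accepted in `fq` should be checked against the current API documentation. The helper `build_query` is hypothetical and is not code from the column.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Article Search API, version 2.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def build_query(term, begin_date, end_date, api_key,
                filter_query=None, page=0):
    """Return a request URL for articles matching `term` in a date range.

    Dates are strings in YYYYMMDD format. `filter_query` is passed through
    unchanged as the `fq` parameter, for example to restrict the search to
    a news section (the filter syntax is not validated here).
    """
    params = {
        "q": term,
        "begin_date": begin_date,
        "end_date": end_date,
        "page": page,
        "api-key": api_key,
    }
    if filter_query:
        params["fq"] = filter_query
    return BASE_URL + "?" + urlencode(params)

# One query per year would give a crude yearly view of a topic.
url = build_query("climate change", "20100101", "20101231", "YOUR_API_KEY")
```

Fetching the URL with any HTTP client returns a JSON document; counting the hits for each year is one simple way to follow how often a topic appears over time.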

Last Updated on Friday, 21 August 2015 15:00
 

@Heidelberg Laureate Forum 2014

I recently participated as a young researcher in computer science at the Heidelberg Laureate Forum. I encourage the reader to check the recordings of some of the talks, which are available at the official HLF website. If you are short on time, I recommend at least one of the talks by Michael Atiyah, Manuel Blum, Wendelin Werner, Vint Cerf, Leslie Lamport, Manjul Bhargava, Daniel Spielman, Efim Zelmanov or John Hopcroft. They are engaging, full of useful tips and strategies, and should be accessible to an interested listener.

Many CS and math bloggers followed the event; their comments and discussions about the laureates' talks can be found at HLF Blogs. Among others, our poster was highlighted by John D. Cook. Overall, HLF was an awesome experience for me, with many opportunities to network with Turing, Fields, Abel and Nevanlinna laureates and to meet other young researchers in computer science and mathematics from around the world.

Last Updated on Sunday, 28 September 2014 20:13
 

Google Global Planning Committee for Women in Computer Science

I have been given the opportunity to join the Google Global Planning Committee for Women in Computer Science in an effort to identify ways we can have the greatest impact and reach more women in tech. As a member of this committee I will partner with Google to build the community and direct outreach activities for women in computer science. To kick things off, we will have our global meeting at the Grace Hopper Conference in Phoenix, AZ, USA. I am excited to be part of this great program that encourages women to excel in computer science and information technology.

Stay tuned, there will be many possibilities to engage with fellow technologists!

Last Updated on Tuesday, 16 September 2014 16:25
 

@Stanford University, Department of Computer Science

I am visiting the Department of Computer Science at Stanford University, CA, USA, in summer and fall 2014. During my stay we will study the interplay between network analysis, data integration and biology. There are many exciting challenges to explore in these areas and I am very enthusiastic about the work.

Last Updated on Thursday, 21 August 2014 05:53
 

ISMB 2014: Epistasis-Based Gene Network Inference

I presented our recent approach for epistasis-based gene network inference at ISMB 2014. We propose a factorized model of interactions that is used to score different types of gene-gene relationships, such as epistasis, parallelism and partial interdependence, and to assemble gene networks that are consistent with the estimated pairwise relationships. A detailed derivation of the method and its empirical comparison with existing approaches are described in our paper published in Bioinformatics.

Last Updated on Thursday, 09 July 2015 15:08
 

CAMDA 2014: Survival Regression by Data Fusion

At CAMDA 2014, I presented an extension of our recent matrix factorization-based data fusion approach that couples data fusion with survival regression. CAMDA 2014 ran as a satellite meeting of ISMB 2014, Boston, MA, USA. Our presentation received the CAMDA best presentation award.

Any knowledge discovery task could in principle benefit from the fusion of directly or even indirectly related data sources. In this work, we explore whether a recently proposed simultaneous matrix factorization data fusion approach can be adapted for survival regression. We propose a new method that jointly infers latent factors by data fusion and estimates the regression coefficients of a survival model. We applied the method to the CAMDA 2014 large-scale Cancer Genomes Challenge and modeled survival time as a function of gene, protein and miRNA expression data, and data on methylated and mutated regions. We find that both the joint inference of factors and regression coefficients and the data fusion procedure are crucial for performance. Our approach is substantially more accurate than the baseline Aalen's additive model. Latent factors inferred by our approach could be mined further; we found that the most informative factors are related to known cancer processes.

Last Updated on Thursday, 09 July 2015 15:08
 

