Our poster at ISMB 2015 is concerned with data set selection and sensitivity estimation in collective factor models.
Molecular biology data is rich in volume as well as heterogeneity. We can view individual data sets as relations between objects of different types; for example, function annotations describe relationships between genes and functions. We represent a large data compendium with a multiscale and multiplex relation graph. Recently, latent factor models were developed to fuse such representations and collectively infer accurate prediction models (Zitnik & Zupan, IEEE TPAMI 2015). Here, we are interested in how changes in one relation (data set) affect the latent model of another relation in the context of a given collective latent factor model. For example, in a user-movie recommendation system, how would a change of casting affect users' movie preferences? In bioinformatics, how would a change in gene expression data influence the prediction of gene-disease associations?
We address this challenge by developing an approach that estimates the dependence between any two relations within a single run of the inference algorithm. The approach, called Forensic, builds on the theory of Fréchet derivatives and matrix conditioning and can be used with any collective matrix factorization model.
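The question of how a change in one relation moves the model of another can be illustrated numerically. The sketch below is not the method from the poster. It assumes a toy collective model in which two relations share their row objects and a common row factor, taken from a truncated SVD of the concatenated matrices, and it estimates a directional derivative by finite differences rather than analytically. All function names are hypothetical.

```python
import numpy as np


def shared_factor_reconstruction(r_a, r_b, rank):
    """Reconstruct r_b using a row factor shared with r_a.

    Toy collective model (an assumption for illustration): the shared row
    factor is the top-`rank` left singular vectors of [r_a, r_b].
    """
    u, _, _ = np.linalg.svd(np.hstack([r_a, r_b]), full_matrices=False)
    u_k = u[:, :rank]
    # Project r_b onto the subspace spanned by the shared factor.
    return u_k @ (u_k.T @ r_b)


def finite_difference_sensitivity(r_a, r_b, direction, rank, step=1e-6):
    """Estimate how strongly the model of r_b reacts to a change in r_a.

    Returns the norm of the change in the reconstruction of r_b per unit
    perturbation of r_a along `direction` (a forward difference).
    """
    direction = direction / np.linalg.norm(direction)
    base = shared_factor_reconstruction(r_a, r_b, rank)
    moved = shared_factor_reconstruction(r_a + step * direction, r_b, rank)
    return np.linalg.norm(moved - base) / step


rng = np.random.default_rng(0)
r_a = rng.normal(size=(6, 5))        # e.g. genes x experiments
r_b = rng.normal(size=(6, 4))        # e.g. genes x diseases
direction = rng.normal(size=(6, 5))  # perturbation applied to r_a only
print(finite_difference_sensitivity(r_a, r_b, direction, rank=2))
```

In this toy model the sensitivity vanishes when the rank equals the number of shared row objects, because the projection then reproduces `r_b` exactly no matter what `r_a` contains. Dependence between relations only appears once the shared factor is a genuine compression.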
See our poster for more details.
Thursday, 25 June 2015 14:38
Marinka
My talk at the Summer School on Computational Topology in Ljubljana, Slovenia, was about coupling compressive data fusion methods with algebraic topology, in particular persistent homology. There, I discussed how the latent data space obtained by fusing heterogeneous biological data sets can be explored with topological methods.
In a case study from molecular biology, which included nearly two dozen data sets, we studied the persistence (lifetime) of various topological features, such as connected components, loops, voids and tunnels. We showed that significant topological features, i.e. features with long lifetimes, also carry biologically relevant information. For example, gene modules with significant topology were enriched for cellular functions and biological processes, and, similarly, persistent drug modules captured the structural similarity between drugs.
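The notion of a feature's lifetime is easiest to see for connected components. The sketch below is an illustration and not the pipeline used in the talk. It assumes a Vietoris-Rips style filtration over Euclidean distances between points (for example, objects embedded in a latent space) and tracks only zero-dimensional features with a union-find structure. The function name is hypothetical.

```python
import numpy as np


def component_lifetimes(points):
    """Lifetimes of connected components as the distance threshold grows.

    Every point starts as its own component at threshold 0. A component
    dies when an edge first merges it into another one, so the returned
    values are the merge distances in increasing order. The one component
    that never dies is not listed.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))

    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(float(length))
    return deaths


# Two well-separated pairs of points.
print(component_lifetimes([[0, 0], [0, 1], [10, 0], [10, 1]]))
```

For the four points above, two components die at distance 1 and one survives until distance 10. That long lifetime is the signature of two well-separated groups, which is the sense in which persistent features are called significant.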
The slides of the talk are available.
Last Updated on Thursday, 18 February 2016 07:47
Sunday, 21 June 2015 17:04
Marinka
Our invited talk at the Workshop on Matrix Computations for Biomedical Informatics at the 15th Conference on Artificial Intelligence in Medicine (AIME) in Pavia, Italy, discussed the use of collective latent factor models for various predictive modeling tasks in biomedicine, such as gene prioritization, gene function prediction, network inference and discovery of disease-disease associations.
In the talk given together with Blaz Zupan, we highlighted our recent developments of data fusion approaches via latent factor models.
The slides of the talk are available at Prezi.
Last Updated on Friday, 21 August 2015 16:05
Sunday, 14 June 2015 10:54
Marinka
Our paper on Gene network inference by fusing data from diverse distributions has been published in Bioinformatics. We will present it at ISMB 2015 in Dublin.
In the paper we describe FuseNet, a Markov network formulation that infers networks from a collection of potentially non-identically distributed datasets.
Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as that from next-generation sequencing, often violates this assumption. Furthermore, when collected data arise from multiple related but otherwise non-identical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets.
FuseNet is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. We also demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies.
Last Updated on Friday, 21 August 2015 16:06
Sunday, 14 June 2015 09:43
Marinka
Our poster on Gene prioritization by compressive data fusion and chaining received the best poster award at the Basel Computational Biology Conference ([BC]^2).
The poster highlights our recent computational method that prioritizes genes by fusing heterogeneous data. An appealing property of our approach is its ability to consider data that might be provided in totally different input spaces, which is achieved through chaining of the latent data representation. We report on a very successful application in hunting genes responsible for bacterial resistance in Dictyostelium, where our predictions were validated in the wet lab.
Last Updated on Sunday, 14 June 2015 10:05
Monday, 08 June 2015 20:11
Marinka
Together with Blaz Zupan, we are organizing a tutorial on data fusion at the Basel Computational Biology Conference ([BC]^2). The tutorial is targeted at computational scientists, data mining researchers and molecular biologists interested in large-scale data integration and predictive modeling.
In the tutorial we focus on collective latent factor models, which have gained popularity in recent years through many successful applications in integrative predictive modeling. We have prepared a series of short lecture notes, which provide the intuition and mathematics behind the algorithms, explain why factorization approaches are suitable when collectively analyzing many heterogeneous data sets, and contain a number of case studies taken from recommendation systems, functional genomics, molecular and systems biology. We demonstrate several recent methodological advancements in hands-on sessions using Orange and its Data Fusion Add-on.
This tutorial would not be possible without the great support of the Bioinformatics Laboratory at the University of Ljubljana.
Last Updated on Friday, 21 August 2015 15:52
Monday, 06 April 2015 15:22
Marinka
Our recent paper in Systems Biomedicine describes a new computational approach that predicts a patient’s survival time from a collection of heterogeneous data sets. This is the full paper of our award-winning entry at the CAMDA meeting at ISMB 2014, Boston, MA, USA.
The approach builds upon recently proposed collective matrix factorization and the well-known Aalen’s additive model for survival regression. Unlike existing methods for survival time prediction, we formulated a joint inference procedure that allows us to simultaneously infer model parameters of collective matrix factorization and regression coefficients of Aalen’s model. We demonstrated improved performance of our method over several baselines in case studies involving three cancer types from the International Cancer Genome Consortium and diverse data sets, such as gene and miRNA expression profiles, somatic mutation data, methylation and gene annotations from the Gene Ontology. We also demonstrated that both latent data representation and joint inference, the two features of our approach, contribute substantially to the accurate prediction of survival time. Our results point to the potential benefits of data fusion when inferring survival models that are predictive of clinical outcomes.
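For reference, Aalen's additive model writes the hazard of an individual as a sum of time-varying covariate effects. In the form below, the covariate vector x is assumed, for illustration only, to be a patient's k-dimensional latent profile obtained from the factorization. The symbols are not taken from the paper.

```latex
\lambda(t \mid \mathbf{x}) = \beta_0(t) + \sum_{j=1}^{k} \beta_j(t)\, x_j
```

Here \beta_0(t) is the baseline hazard and \beta_j(t) are the time-dependent regression coefficients.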
Last Updated on Friday, 21 August 2015 16:06
Tuesday, 10 February 2015 16:50
Marinka
Our recent paper in the Journal of Computational Biology introduces an interaction data imputation method called network-guided matrix completion (NGMC). The core part of NGMC is low-rank probabilistic matrix completion that incorporates prior knowledge presented as a collection of gene networks. NGMC assumes that interactions are transitive, such that latent gene interaction profiles inferred by NGMC depend on the profiles of their direct neighbors in gene networks. As the NGMC inference algorithm progresses, it propagates latent interaction profiles through each of the networks and updates gene network weights toward improved prediction.
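The general idea of letting a gene network guide a low-rank completion can be sketched as follows. This is a simplified stand-in and not NGMC itself. It assumes a single fixed network, replaces the probabilistic model with a plain squared-error objective minimized by gradient descent, and uses a graph Laplacian penalty to pull each gene's latent profile toward those of its neighbors. NGMC additionally handles several networks and learns their weights. The function and parameter names are hypothetical.

```python
import numpy as np


def network_regularized_completion(x, observed, adjacency, rank=2, reg=0.1,
                                   net_weight=0.1, lr=0.005, iters=3000, seed=0):
    """Fill in missing entries of x with a low-rank model f @ g.T.

    x         : (n, m) interaction matrix; values at unobserved entries are ignored
    observed  : (n, m) boolean mask of measured entries
    adjacency : (n, n) symmetric gene network over the rows of x
    """
    rng = np.random.default_rng(seed)
    n, m = x.shape
    f = 0.1 * rng.standard_normal((n, rank))  # latent profiles of row genes
    g = 0.1 * rng.standard_normal((m, rank))  # latent profiles of column genes
    # Graph Laplacian: tr(f.T @ L @ f) is small when neighbors have similar rows.
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    x_obs = np.where(observed, x, 0.0)

    for _ in range(iters):
        # The error is computed on measured entries only.
        residual = observed * (f @ g.T - x_obs)
        grad_f = residual @ g + reg * f + net_weight * (laplacian @ f)
        grad_g = residual.T @ f + reg * g
        f -= lr * grad_f
        g -= lr * grad_g
    return f @ g.T
```

The returned matrix has a value for every gene pair, so the entries where `observed` is `False` serve as the imputed interaction scores.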
In a study with four different EMAP data sets that considered protein–protein interaction and Gene Ontology similarity networks, NGMC significantly surpassed existing alternative techniques. Inclusion of information from gene networks also allowed NGMC to predict interactions for genes that were not included in the original EMAP assays, a task that could not be considered by current imputation approaches.
Epistatic miniarray profile (EMAP) is a popular large-scale genetic interaction discovery platform. EMAPs benefit from quantitative output, which makes it possible to detect subtle interactions with greater precision. However, due to the limits of biotechnology, EMAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, which complete the interaction profiles and enable downstream analysis algorithms that could otherwise be sensitive to missing data values.
The conference paper is available from the RECOMB 2014 website. Supplementary material is available from the GitHub repository NGMC.
Last Updated on Thursday, 18 February 2016 07:00
Tuesday, 23 December 2014 21:02
Marinka
The Winter 2014 issue of ACM XRDS is here! This issue is on health informatics, which has received considerable attention in recent years, both in research and among the general public. You can read about the opportunities of social media in health and wellbeing, #engaging health initiatives and challenges in personal health tracking, among others.
My department contributed the column about the anatomy of a human disease network. In the column, we explore the human disease network and demonstrate how network-based tools can help us understand relations between diseases at a higher level of organismal organization without considering any prior biomedical knowledge.
Tools of network analysis have recently been applied to many complex systems, to both simplify and highlight their underlying structure and the relationships that they represent. The results obtained from network-based approaches provide not only insight into interactions between online users, but also new clues about how to improve our understanding of biological systems. Network medicine in particular, a network-based approach to studying human disease, has proven effective in studying interdependence between molecular components in cells, and in identifying disease modules and biological pathways.
Last Updated on Friday, 21 August 2015 15:01
Thursday, 11 December 2014 21:40
Marinka
We recently published a paper on a new data fusion method in IEEE Transactions on Pattern Analysis and Machine Intelligence.
For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system’s constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization, called data fusion by matrix factorization (DFMF), that simultaneously factorizes many data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF for a gene function prediction task with eleven different data sources and for prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
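The structure of a collective tri-factorization can be illustrated with a small sketch. It is not the DFMF algorithm: DFMF also learns the factors and, being penalized, includes constraint terms that are left out here. The sketch assumes that one latent factor matrix per object type is already given, approximates each relation between types i and j as G_i S_ij G_j^T, and solves for the backbone matrices S_ij by least squares. The names and the dictionary layout are hypothetical.

```python
import numpy as np


def fit_backbones(relations, factors):
    """Least-squares backbone matrices S_ij for fixed factors.

    relations : dict mapping (row_type, col_type) to a data matrix R_ij
    factors   : dict mapping an object type to its latent factor matrix G_i
    Each S_ij minimizes ||R_ij - G_i @ S_ij @ G_j.T||_F.
    """
    backbones = {}
    for (row_type, col_type), r in relations.items():
        g_row_pinv = np.linalg.pinv(factors[row_type])
        g_col_pinv = np.linalg.pinv(factors[col_type])
        backbones[(row_type, col_type)] = g_row_pinv @ r @ g_col_pinv.T
    return backbones


def collective_error(relations, factors, backbones):
    """Sum of squared reconstruction errors over all relations."""
    total = 0.0
    for (row_type, col_type), r in relations.items():
        approx = factors[row_type] @ backbones[(row_type, col_type)] @ factors[col_type].T
        total += np.linalg.norm(r - approx) ** 2
    return total


rng = np.random.default_rng(0)
# One factor matrix per object type. The "gene" factor is shared by both relations.
factors = {
    "gene": rng.random((6, 2)),
    "function": rng.random((5, 3)),
    "disease": rng.random((4, 2)),
}
relations = {
    ("gene", "function"): rng.random((6, 5)),
    ("gene", "disease"): rng.random((6, 4)),
}
backbones = fit_backbones(relations, factors)
print(collective_error(relations, factors, backbones))
```

Because the same gene factor appears in every relation that involves genes, information from one data matrix constrains the reconstruction of the others. This sharing is what makes the factorization collective.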
A short preprint is available at arXiv:1307.0803. The full paper is online at IEEE.
Last Updated on Friday, 21 August 2015 16:05

