Marinka Zitnik

Fusing bits and DNA


ACM XRDS: The Infinite Mixtures of Food Products

The Fall issue of ACM XRDS is here! In this issue of XRDS, we take a closer look at the marriage of physics and computer science through quantum computing. Quantum computing is a model of computation that breaks with the tradition of the digital computers that surround us. The issue covers recent advances in the field of quantum computing, such as computer simulation, complexity theory, simulated annealing and machine learning, as well as an in-depth profile of David Deutsch, who pioneered the field of quantum computation.

My department contributed a column on infinite mixture models applied to the problem of clustering food products. Infinite mixture models are useful because they do not impose any a priori bound on the number of clusters in the data. This is in contrast with finite mixture models, which assume a fixed number of clusters that has to be specified before the analysis starts. The column describes infinite mixture models through a generative story and then uses Gibbs sampling to cluster the food facts. The number of clusters detected by the model varies as we feed in more food products: as expected, the model discovers more clusters as more food products arrive. Additionally, the results show that the detected food clusters have distinct nutritional profiles, revealing interesting nutrition patterns.
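The generative story behind an infinite mixture model is often told as a Chinese restaurant process. The snippet below is a minimal sketch of that process only, not the Gibbs sampler from the column; the function name and the concentration parameter `alpha` are assumptions made for illustration. Each new item joins an existing cluster with probability proportional to the cluster's size, or opens a new cluster with probability proportional to `alpha`, so the number of clusters is never fixed in advance.

```python
import random

def chinese_restaurant_process(n_items, alpha, seed=0):
    """Assign n_items to clusters following a Chinese restaurant process."""
    rng = random.Random(seed)
    cluster_sizes = []   # cluster_sizes[k] = number of items in cluster k
    assignments = []     # assignments[i] = cluster index of item i
    for _ in range(n_items):
        # An existing cluster k has weight size_k; a new cluster has weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)      # open a new cluster
        else:
            cluster_sizes[k] += 1        # join an existing cluster
        assignments.append(k)
    return assignments, cluster_sizes

if __name__ == "__main__":
    for n in (10, 100, 1000):
        _, sizes = chinese_restaurant_process(n, alpha=1.0)
        print(n, "items ->", len(sizes), "clusters")
```

Because the same seed is reused, a longer run extends a shorter one, so the cluster count can only stay the same or grow as more items arrive.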

 

ISMB 2016: Connecting Gene-Disease Contexts

We presented our recent approach for disease module detection at ISMB 2016. Slides are available. The method is capable of making inference over heterogeneous data collections in new and interesting ways. One of them, an approach we call jumping across data contexts, connects entities, such as genes and diseases, through semantically distinct chains, which are estimated by a collective latent variable model.

 

Bioinformatics: Jumping Across Contexts Using Compressive Fusion

Our paper on Jumping across biomedical contexts using compressive data fusion has just appeared in Bioinformatics. We will present the paper at ISMB 2016 in July 2016.

The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects—such as a gene and a disease—can be related in different ways, for example, directly via gene–disease associations or indirectly via functional annotations, chemicals and pathways. In this paper, we show that different ways of relating these objects carry different semantic meanings that are largely ignored by established computational methods.

We present an approach that operates on large-scale heterogeneous data collections and explicitly distinguishes between diverse data semantics. The approach detects size-k modules of objects that, taken together, appear most significant to another set of objects. The method builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program.
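To give a feel for growing a size-k module under a submodular objective, here is a minimal sketch of the classic greedy algorithm. It is not the objective or the optimization used in the paper: the coverage function, the function name `greedy_module` and the toy data are all assumptions chosen for illustration. For monotone submodular objectives under a cardinality constraint, this greedy rule is known to achieve at least a (1 - 1/e) fraction of the optimal value.

```python
def greedy_module(candidates, k):
    """Greedily pick up to k candidates that maximize the number of covered targets.

    candidates: dict mapping a candidate name to the set of targets it covers.
    """
    module, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0
        for name in sorted(candidates):          # sorted for deterministic tie-breaking
            if name in module:
                continue
            gain = len(candidates[name] - covered)   # marginal gain of adding `name`
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:                         # no remaining candidate adds anything
            break
        module.append(best)
        covered |= candidates[best]
    return module, covered

if __name__ == "__main__":
    # Hypothetical toy data: candidate objects and the targets each one relates to.
    toy = {"g1": {1, 2, 3}, "g2": {3, 4}, "g3": {4, 5, 6, 7}, "g4": {1}}
    print(greedy_module(toy, k=2))
```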

In a systematic study on more than three hundred complex diseases, we show the effectiveness of the approach in associating genes with diseases and detecting disease modules.

 

ACM XRDS: Cultures of Computing

The Summer issue of ACM XRDS is here! The issue is centered around computing, culture, postcoloniality and questions of power. In it, many fascinating authors ask whether an Anglo-European culture of computing could be made more aware of its politics and what alternative cultures of computing could be realized. Our amazing issue editors, Ahmed Ansari (CMU) and Raghavendra Kandala (CMU), have tried to give the readers a slice of the incredible heterogeneity and plurality of critical scholarship and practice around the world.

The issue provides a brief introduction to decolonial computing and raises various issues around design and innovation in China, participation of Africans in the global HCI community, the life at the forefront of Indonesia's tech emancipation, and plans to develop hundreds of smart cities in India, revealing the complex politics of technological development and class.

Jennifer Jacobs (MIT) and I served as co-editors for the issue.

 

ACM XRDS: The Brownian Wanderlust of Things

The Spring issue of ACM XRDS is here! This issue is centered around digital fabrication, which in many ways highlights the expanded role of computers in today's society. Digital fabrication is not merely about 3D printing knickknacks; rather, it enables individuals to create their own systems and devices using new technologies.

My department contributed a column on the Brownian wanderlust of things. Consider a gambler who starts with an initial fortune and plays the following simple coin tossing game. At each turn, the dealer tosses an unbiased coin. If the coin comes up heads, the gambler wins a unit; if it comes up tails, the gambler loses a unit. The gambler continues to play until he is either bankrupt or his current holdings reach some fixed desired amount.
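The game is easy to simulate. The sketch below (function names and default parameters are assumptions for illustration) estimates the probability that the gambler reaches the desired amount before going bankrupt. For a fair coin, the known closed-form answer is the starting fortune divided by the goal, which gives a reference value to compare the estimate against.

```python
import random

def play_until_done(start, goal, rng):
    """Play the fair coin game; return True if the gambler reaches goal, False if ruined."""
    fortune = start
    while 0 < fortune < goal:
        fortune += 1 if rng.random() < 0.5 else -1   # heads: win a unit, tails: lose a unit
    return fortune == goal

def estimate_win_probability(start, goal, n_games=20000, seed=0):
    """Monte Carlo estimate of the probability of reaching goal before bankruptcy."""
    rng = random.Random(seed)
    wins = sum(play_until_done(start, goal, rng) for _ in range(n_games))
    return wins / n_games

if __name__ == "__main__":
    start, goal = 3, 10
    print("simulated:", estimate_win_probability(start, goal))
    print("analytic: ", start / goal)
```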

Stochastic models of this kind can have much wider implications than just estimating the fortune of a gambler flipping a coin. For example, the way in which information flows within social media outlets can affect mobilization and strategic interactions between participants of mass social movements, such as protests. While traditionally social movements have spread through on-the-ground unions, the use of communication platforms—such as Twitter and Facebook—has offered alternative ways for organizing such events. As we see in the column, to truly capture propagation in such environments, we need to take into consideration the stochastic nature of information propagation.

 

Bioinformatics: Orthogonal Factorization of RNA-Binding Proteins

Our paper on integrative analysis of multiple RNA-binding proteins has just appeared in Bioinformatics. RNA binding proteins (RBPs) are important for many cellular processes, including post-transcriptional control of gene expression, splicing, transport, polyadenylation and RNA stability. To better understand the RBP mechanisms we aimed to integrate the rapidly growing RBP experimental data with the latest genome annotation, gene function, RNA sequence and structure.

We have developed an integrative orthogonality-regularized nonnegative matrix factorization that can integrate multiple data sets and discover non-overlapping and class-specific RNA binding patterns of varying strengths. The orthogonality constraint is important here because it enables us to substantially reduce the effective size of inferred factor models.

The new models have proved powerful in predicting RBP interaction sites on RNA. We also showed that joint analysis of multiple data sets can boost retrieval accuracy of RNA binding sites, which we studied using the largest RBP data compendium to date.

 

ACM XRDS: Activities of Daily Living in the Era of Internet of Things

The Winter issue of ACM XRDS is here! This issue discusses the Internet of Things (IoT), a collection of emerging technologies that promise to seamlessly expand our sensing capabilities across the globe with imagination as our only limit. In the issue you can read about the prospects for the IoT as seen by leaders in the field, the challenges of building network awareness, and the trends in IoT platforms. You will also find columns about ontology-supported stream reasoning for querying flying robots, the importance of encrypted data in IoT systems, and ways of managing droughts using tech.

My department contributed a column on predicting activities of daily living (ADL) from sensor activation profiles. Together with Lara Zupan we used an open-source data mining tool for visual programming called Orange to analyze patterns of how humans interact with household devices. These interaction patterns provided powerful clues that helped us recognize various activities that take place in home environments.

 

PSB 2016: Collective Pairwise Classification for Multi-Way Analysis

Our paper on Collective Pairwise Classification for Multi-Way Analysis has been published in the Proceedings of the 21st Pacific Symposium on Biocomputing. We will present the work at the PSB conference in January 2016.

In the paper, we develop a collective pairwise classification approach for multi-way data analysis. The approach leverages the superiority of latent factor models for analyzing large heterogeneous relational data sets and provides probabilistic estimates of relationships by optimizing a pairwise ranking loss. Although the method bears correspondence with the maximization of a non-differentiable area under the receiver operating characteristic curve, we were able to design a learning algorithm that scales well on large multi-relational data.
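The link between a pairwise ranking loss and the area under the ROC curve can be illustrated with a small sketch. The AUC counts the fraction of (positive, negative) pairs that are ordered correctly, which is a step function of the scores and therefore not differentiable. A smooth surrogate replaces the step with a logistic loss on the score difference. The code below shows this generic idea only; it is an assumption for illustration and not the loss or learning algorithm from the paper.

```python
import math

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos_scores) * len(neg_scores))

def pairwise_logistic_loss(pos_scores, neg_scores):
    """Smooth surrogate: mean of log(1 + exp(-(p - n))) over all pairs."""
    losses = [math.log1p(math.exp(-(p - n))) for p in pos_scores for n in neg_scores]
    return sum(losses) / len(losses)

if __name__ == "__main__":
    pos, neg = [0.9, 0.8], [0.1, 0.2]
    print("AUC:", auc(pos, neg))
    print("surrogate loss:", pairwise_logistic_loss(pos, neg))
```

The surrogate decreases as positives are scored further above negatives, so gradient-based learners can optimize it directly.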

We used the method to infer relationships from multiplex drug data and to predict connections between clinical manifestations of diseases and their underlying molecular signatures. An appealing property of the method is its ability to make category-jumping inferences, such as predictions about diseases based solely on genomic and clinical data generated far outside the molecular context.

 

ACM XRDS: The Marvel Comic Book Universe

The Fall issue of ACM XRDS is here! In this issue we write about virtual reality. Among others, you can read about the virtual reality revolution and ways to bring virtual reality home. The issue also discusses how to use your own muscles to achieve realistic physical experience, how to manage cybersickness in virtual reality, and how to avoid danger with mine disaster simulations.

My department contributed a column on mining the Marvel comic book universe. Together with Lara Zupan we scraped Wikipedia to obtain information on the Marvel comics characters and then analyzed the structure of the Marvel multiverse network, where two characters were considered linked if they shared a skill set. Here, the analysis of complex networks allowed us to better understand how properties of fictitious networks emerge from non-trivial interactions between characters.
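A network of this kind can be built in a few lines. The sketch below assumes that "sharing a skill set" means having at least one skill in common, which is an interpretation rather than the column's exact rule, and it uses placeholder character names and skills instead of the scraped Wikipedia data.

```python
from itertools import combinations

def shared_skill_network(skills):
    """Return edges between characters that share at least one skill.

    skills: dict mapping a character name to a set of skill labels.
    """
    edges = {}
    for a, b in combinations(sorted(skills), 2):
        common = skills[a] & skills[b]
        if common:                      # link the pair only if some skill is shared
            edges[(a, b)] = common
    return edges

def degrees(edges):
    """Count the number of links per character."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

if __name__ == "__main__":
    # Placeholder data, not taken from the column.
    toy = {"A": {"flight", "strength"}, "B": {"strength"}, "C": {"telepathy"}}
    edges = shared_skill_network(toy)
    print(edges)
    print(degrees(edges))
```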

 

BMC Bioinformatics: Extracting Gene Regulatory Networks from Text

Our paper on Sieve-based relation extraction of gene regulatory networks from biological literature has been published in BMC Bioinformatics.

In the paper, we describe a network extraction algorithm, which is an improvement on our winning submission to BioNLP 2013. Our method is designed as a sieve-based system and uses linear-chain conditional random fields and rules for relation extraction. To enable extraction of distant relations we transform the data into skip-mention sequences. We then infer multiple models, each of which is able to extract a particular relationship type (e.g., inhibition, activation, binding). Further analysis following the challenge showed that all relation extraction sieves contribute to the predictive performance of the proposed approach. Also, features constructed by considering mention words and their prefixes and suffixes are the most important features for higher accuracy of extraction. The analysis also showed that our choice of transforming data into skip-mention sequences is appropriate for detecting relations between distant mentions.

 

PLoS CompBio: Gene Prioritization by Compressive Data Fusion

Our paper on Gene prioritization by compressive data fusion and chaining has been published in PLoS Computational Biology.

In the paper, we present Collage, a new data fusion approach to gene prioritization. Together with collaborators from Baylor College of Medicine, we tested Collage by prioritizing bacterial response genes in Dictyostelium as a novel model system for prokaryote-eukaryote interactions.

We started from four bacterial response genes and 14 different data sets ranging from gene expression to pathway and literature information. Collage proposed eight candidate genes that were tested in the wet laboratory. Mutations in all eight candidates reduced the ability of the amoebae to grow on Gram-negative bacteria. Furthermore, five out of the eight candidate genes were required for growth on Gram-negative bacteria but had no discernible effect on growth on Gram-positive bacteria. This is a remarkably accurate result since only about a hundred of the 12,000 Dictyostelium genes are estimated to be responsible for bacterial response.

 

Data Fusion Tutorial at the IEEE Engineering in Medicine and Biology

Together with Blaz Zupan we organize a tutorial on data fusion at the International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC).

In the tutorial, we will explore latent factor models, a popular class of approaches that have in recent years seen many successful applications in integrative data analysis. We will describe the intuition behind matrix factorization and explain why latent factor models are suitable when collectively analyzing many heterogeneous data sets. To practice data fusion, we will construct visual data fusion workflows using Orange and its Data Fusion add-on.

This tutorial would not be possible without the great support by the Bioinformatics Laboratory at University of Ljubljana.

 

ACM XRDS: Understanding Cancer with Matrix Factorization

The Summer issue of ACM XRDS is here! In this issue we write about computational biology. Our features and interviews present different perspectives about some of the most recent advances of computational biology. You can read about personalized medicine and the use of genetic data to improve drug treatment, pharmacogenetics, machine learning techniques for mapping genetic differences to phenotypes in large-scale genome-wide association studies, computational approaches towards prediction of patient outcomes based on electronic health records, statistical techniques for drug discovery, etc. This issue also includes discussions on cutting-edge techniques, such as the analysis of single cell measurement data.

My department contributed a column on mining cancer data with matrix factorization, an established class of algorithms that has proved useful in many bioinformatics studies. The diversity and abundance of data provided by cancer projects like the International Cancer Genome Consortium challenge computer scientists of all kinds to develop innovative software, hardware, and analytic solutions for data analysis. We expect that with computationally and statistically stronger approaches, such as factorization models, we will one day be able to reveal biological features that drive cancer development, define cancer types relevant for prognosis, and, ultimately, enable the development of new cancer therapies.

 

ISMB 2015: Gene Network Inference via Data Fusion

Our paper at ISMB 2015 addresses a challenging task of inferring gene networks by taking into consideration potentially many data sets. Importantly, these data sets might be nonidentically distributed and can follow any combination of exponential family distributions. To tackle this challenge we develop an efficient Markov network model that achieves fusion by reusing latent model parameters.

Empirical studies on cancer genome data sets show an advantage of joint inference over separate network inference and the merits of incorporating information about the underlying data distribution into inference.

The slides of the talk are available.

 

ISMB 2015: Integrate Everything but the Kitchen Sink

Our poster at ISMB 2015 is concerned with data set selection and sensitivity estimation in collective factor models.

Molecular biology data is rich in volume as well as heterogeneity. We can view individual data sets as relations between objects of different types, for example, function annotations describe relationships between genes and functions. We represent a large data compendium with a multiscale and multiplex relation graph. Recently, latent factor models were developed to fuse such representations and collectively infer accurate prediction models (Zitnik & Zupan, IEEE TPAMI 2015). Here, we are interested in how changes in one relation (data set) affect the latent model of another relation in the context of a given collective latent factor model. For example, in a user-movie recommendation system, how would a change of casting affect user's movie preferences? In bioinformatics, how would a change in gene expression data influence prediction of gene-disease associations?

We address this challenge by developing an approach, called Forensic, that estimates the dependence between any two relations within a single run of the inference algorithm. Forensic derives from the theory of Fréchet differentiation and matrix conditioning and can be used with any collective matrix factorization.

See our poster for more details.

 

Compressive Data Fusion and Persistent Homology

My talk at the Summer School on Computational Topology in Ljubljana, Slovenia was about coupling compressive data fusion methods with algebraic topology, in particular persistent homology. There, I discussed how the latent data space obtained by fusion of heterogeneous biological data sets can be explored with topological methods.

In a case study from molecular biology, which included nearly two dozen data sets, we studied persistence (lifetime) of various topological features, e.g. connected components, loops, voids, tunnels, etc. We showed that significant topological features, i.e. features with long lifetime, also carry biologically relevant information. For example, gene modules with significant topology were enriched for cellular functions and biological processes, and, similarly, persistent drug modules captured the structural similarity between drugs.

The slides of the talk are available.

Last Updated on Thursday, 18 February 2016 07:47
 

Invited Talk on Learning Latent Factor Models by Data Fusion

Our invited talk at the Workshop on Matrix Computations for Biomedical Informatics at the 15th Conference on Artificial Intelligence in Medicine, AIME in Pavia, Italy, discussed the use of collective latent factor models for various predictive modeling tasks in biomedicine, such as gene prioritization, gene function prediction, network inference and discovery of disease-disease associations.

In the talk given together with Blaz Zupan, we highlighted our recent developments of data fusion approaches via latent factor models.

The slides of the talk are available at Prezi.

Last Updated on Friday, 21 August 2015 16:05
 

Bioinformatics: Gene Network Inference by Fusing Diverse Distributions

Our paper on Gene network inference by fusing data from diverse distributions has been published in Bioinformatics. We will present it at ISMB 2015 in Dublin.

In the paper we describe FuseNet, a Markov network formulation that infers networks from a collection of potentially nonidentically distributed datasets.

Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as that from next generation sequencing, often violates this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets.

FuseNet is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. We also demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies.

Last Updated on Friday, 21 August 2015 16:06
 

Poster Award at the Basel Computational Biology Conference

Our poster on Gene prioritization by compressive data fusion and chaining received the best poster award at the Basel Computational Biology Conference ([BC]^2).

The poster highlights our recent computational method that prioritizes genes by fusing heterogeneous data. An appealing property of our approach is its ability to consider data that might be provided in totally different input spaces, which is achieved through chaining of the latent data representation. We report on a very successful application in hunting genes responsible for bacterial resistance in Dictyostelium, where our predictions were validated in the wet lab.

Last Updated on Sunday, 14 June 2015 10:05
 

Data Fusion Tutorial at the Basel Computational Biology Conference

Together with Blaz Zupan we organize a tutorial on data fusion at the Basel Computational Biology Conference ([BC]^2). The tutorial is targeted at computational scientists, data mining researchers and molecular biologists interested in large-scale data integration and predictive modeling.

In the tutorial we focus on collective latent factor models, which have gained popularity in recent years through many successful applications in integrative predictive modeling. We have prepared a series of short lecture notes, which provide the intuition and mathematics behind the algorithms, explain why factorization approaches are suitable when collectively analyzing many heterogeneous data sets, and contain a number of case studies taken from recommendation systems, functional genomics, molecular and systems biology. We demonstrate several recent methodological advancements in hands-on sessions using Orange and its Data Fusion Add-on.

This tutorial would not be possible without the great support by the Bioinformatics Laboratory at University of Ljubljana.

Last Updated on Friday, 21 August 2015 15:52
 

Syst Biomed: Survival Regression by Data Fusion

Our recent paper in Systems Biomedicine describes a new computational approach that predicts a patient’s survival time from a collection of heterogeneous data sets. This is the full paper of our award-winning entry at the CAMDA meeting at ISMB 2014, Boston, MA, USA.

The approach builds upon the recently proposed collective matrix factorization and the well-known Aalen’s additive model for survival regression. Unlike existing methods for survival time prediction, we formulated a joint inference procedure that allows us to simultaneously infer model parameters of collective matrix factorization and regression coefficients of Aalen’s model. We demonstrated improved performance of our method over several baselines in case studies involving three cancer types from the International Cancer Genome Consortium and diverse data sets, such as gene and miRNA expression profiles, somatic mutation data, methylation and gene annotations from the Gene Ontology. We also demonstrated that both latent data representation and joint inference, the two features of our approach, contribute substantially to the accurate prediction of survival time. Our results point to the potential benefits of data fusion when inferring survival models that are predictive of clinical outcomes.

Last Updated on Friday, 21 August 2015 16:06
 

J Comp Biol: Network-Guided Matrix Completion

Our recent paper in Journal of Computational Biology introduces an interaction data imputation method called network-guided matrix completion (NG-MC). The core part of NG-MC is low-rank probabilistic matrix completion that incorporates prior knowledge presented as a collection of gene networks. NG-MC assumes that interactions are transitive, such that latent gene interaction profiles inferred by NG-MC depend on the profiles of their direct neighbors in gene networks. As the NG-MC inference algorithm progresses, it propagates latent interaction profiles through each of the networks and updates gene network weights toward improved prediction.
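The low-rank completion idea at the core of such methods can be sketched with plain masked matrix factorization. The code below is not NG-MC: it omits the probabilistic formulation and the gene network priors entirely, and the function name, learning rate and other defaults are assumptions for illustration. It fits two low-rank factors to the observed entries only and uses their product to fill in the missing ones.

```python
import numpy as np

def complete_matrix(X, observed, rank=2, lr=0.005, reg=0.01, n_iter=3000, seed=0):
    """Fill missing entries of X with a low-rank model fit to observed entries only.

    X: 2-D array (missing entries may hold any value, including NaN).
    observed: boolean array of the same shape, True where X was measured.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    target = np.where(observed, X, 0.0)          # neutralize missing values
    for _ in range(n_iter):
        residual = observed * (U @ V.T - target) # error on observed entries only
        # Simultaneous gradient step on both factors with L2 regularization.
        U, V = U - lr * (residual @ V + reg * U), V - lr * (residual.T @ U + reg * V)
    return U @ V.T

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    truth = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
    observed = rng.random((20, 15)) > 0.3        # roughly 30% of entries missing
    filled = complete_matrix(np.where(observed, truth, np.nan), observed)
    print("RMSE on missing entries:", np.sqrt(np.mean((filled - truth)[~observed] ** 2)))
```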

In a study of four different E-MAP data assays, with protein–protein interaction and gene ontology similarity networks as prior knowledge, NG-MC significantly surpassed existing alternative techniques. Inclusion of information from gene networks also allowed NG-MC to predict interactions for genes that were not included in the original E-MAP assays, a task that could not be considered by current imputation approaches.

Epistatic miniarray profile (E-MAP) is a popular large-scale genetic interaction discovery platform. E-MAPs benefit from quantitative output, which makes it possible to detect subtle interactions with greater precision. However, due to the limits of biotechnology, E-MAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, in this way completing the interaction profiles and enabling downstream analysis algorithms that could otherwise be sensitive to missing data values.

The conference paper is available from the RECOMB 2014 website. Supplementary material is available from the GitHub repository NG-MC.

Last Updated on Thursday, 18 February 2016 07:00
 

ACM XRDS: The Anatomy of a Human Disease Network

The Winter 2014 issue of ACM XRDS is here! This issue is on health informatics, which has received considerable attention both in research and among the general public in recent years. You can read about the opportunities of social media in health and well-being, #engaging health initiatives and challenges in personal health tracking, among others.

My department contributed the column about the anatomy of a human disease network. In the column, we explore the human disease network and demonstrate how network-based tools can help us understand relations between diseases at a higher level of organismal organization without considering any prior biomedical knowledge.

Tools of network analysis have recently been applied to many complex systems, to both simplify and highlight their underlying structure and the relationships that they represent. The results obtained from network-based approaches provide not only insight into interactions between online users, but also new clues about how to improve our understanding of biological systems. Network medicine in particular, a network-based approach to studying human disease, has proven effective in studying interdependence between molecular components in cells, and in identifying disease modules and biological pathways.

Last Updated on Friday, 21 August 2015 15:01
 

IEEE TPAMI: Data Fusion by Matrix Factorization

We recently published a paper on a new data fusion method in IEEE Transactions on Pattern Analysis and Machine Intelligence.

For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about system’s constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization, called data fusion by matrix factorization (DFMF), that simultaneously factorizes many data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF for gene function prediction task with eleven different data sources and for prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
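The building block of a tri-factorization can be sketched for a single nonnegative matrix. The code below approximates R by the product G1 S G2^T using standard multiplicative updates. It is a simplified illustration under stated assumptions, not DFMF itself: the paper's method factorizes many matrices at once with shared factors and penalties, none of which appear here, and the function name and defaults are hypothetical.

```python
import numpy as np

def tri_factorize(R, k1, k2, n_iter=200, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix R (n x m) as G1 @ S @ G2.T.

    G1 (n x k1) and G2 (m x k2) hold latent profiles of row and column objects;
    S (k1 x k2) relates the two latent spaces.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    G1 = rng.random((n, k1)) + eps
    G2 = rng.random((m, k2)) + eps
    S = rng.random((k1, k2)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep all factors nonnegative.
        G1 *= (R @ G2 @ S.T) / (G1 @ S @ G2.T @ G2 @ S.T + eps)
        G2 *= (R.T @ G1 @ S) / (G2 @ S.T @ G1.T @ G1 @ S + eps)
        S *= (G1.T @ R @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
        errors.append(np.linalg.norm(R - G1 @ S @ G2.T))   # Frobenius reconstruction error
    return G1, S, G2, errors

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    R = rng.random((12, 3)) @ rng.random((3, 3)) @ rng.random((3, 9))
    _, _, _, errors = tri_factorize(R, 3, 3)
    print("error after first and last iteration:", errors[0], errors[-1])
```

The middle factor S is what makes the scheme convenient for fusion: each pair of object types gets its own S, while the G factors can be shared across all matrices that involve the same object type.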

A short preprint is available at arXiv:1307.0803. The full paper is online at IEEE.

Last Updated on Friday, 21 August 2015 16:05
 

ACM XRDS: Dynamics of News from The New York Times

The new issue of ACM XRDS is here! The focus of this issue is on techniques for natural language processing in the broader sense. You will find interesting stories about how to detect influencers in social media discussions and how to successfully transition from academia to entrepreneurship, and you can read about the hurdles and opportunities in research on ancient written languages.

My department contributed a short column on exploring news from The New York Times. Techniques of information extraction and natural language processing allow us to search for news articles and to analyze dynamics of published content. News organizations, such as The New York Times, provide programmatic access to their articles to retrieve headlines, abstracts, and links to published multimedia. In the column we use The New York Times Article Search API to demonstrate how to construct search queries that retrieve documents from various news sections and time periods. We also explore the pulse of climate change over the years using data extracted from published news articles.

Last Updated on Friday, 21 August 2015 15:00
 

@Heidelberg Laureate Forum 2014

Recently, I participated as a young researcher in computer science at the Heidelberg Laureate Forum. I encourage the reader to check the recordings of some of the talks, which are available at the official HLF website. If limited by time, I recommend at least one of the talks by Michael Atiyah, Manuel Blum, Wendelin Werner, Vint Cerf, Leslie Lamport, Manjul Bhargava, Daniel Spielman, Efim Zelmanov or John Hopcroft. They are engaging, full of useful tips and strategies, and should be accessible to an interested listener.

Many CS & Math bloggers followed the event; their comments and discussions about the laureates' talks can be found at HLF Blogs. Among others, our poster was highlighted by John D. Cook. Overall, HLF was an awesome experience for me, with many opportunities to network with Turing, Fields, Abel and Nevanlinna laureates and to meet other young researchers in computer science and mathematics from around the world.

Last Updated on Sunday, 28 September 2014 20:13
 

Google Global Planning Committee for Women in Computer Science

I have been given the opportunity to join the Google Global Planning Committee for Women in Computer Science in an effort to identify ways we can have the greatest impact and reach more women in tech. As a member of this committee I will partner with Google to build the community and direct outreach activities for women in computer science. To kick things off, we will have our global meeting at the Grace Hopper Conference in Phoenix, AZ, USA. I am excited to be part of this great program, which encourages women to excel in computer science and information technology.

Stay tuned, there will be many possibilities to engage with fellow technologists!

Last Updated on Tuesday, 16 September 2014 16:25
 

@Stanford University, Department of Computer Science

I am visiting the Department of Computer Science at Stanford University, CA, USA in Summer and Fall 2014. During my stay we will study the interplay between network analysis, data integration and biology. There are many exciting challenges one can explore in these areas and I am very enthusiastic about the work.

Last Updated on Thursday, 21 August 2014 05:53
 

ISMB 2014: Epistasis-Based Gene Network Inference

I presented our recent approach for epistasis-based gene network inference at ISMB 2014. We propose a factorized model of interactions that is used for scoring different types of gene-gene relationships, such as epistasis, parallelism and partial interdependence, and for assembling gene networks that are consistent with the estimated pairwise relationships. A detailed derivation of the method and its empirical comparison with existing approaches are described in our paper published in Bioinformatics.

Last Updated on Thursday, 09 July 2015 15:08
 

CAMDA 2014: Survival Regression by Data Fusion

At CAMDA 2014, I presented an extension of our recent matrix factorization-based data fusion approach that couples data fusion with survival regression. CAMDA 2014 runs as a satellite meeting of ISMB 2014, Boston, MA, USA. Our presentation received the CAMDA best presentation award.

Any knowledge discovery could in principle benefit from the fusion of directly or even indirectly related data sources. In this work, we explore whether a recently proposed simultaneous matrix factorization data fusion approach could be adapted for survival regression. We propose a new method that jointly infers latent factors by data fusion and estimates regression coefficients of a survival model. We applied the method to the CAMDA 2014 large-scale Cancer Genomes Challenge and modeled survival time as a function of gene, protein and miRNA expression data, and data on methylated and mutated regions. We find that both the joint inference of factors and regression coefficients on one side and the data fusion procedure on the other are crucial for performance. Our approach is substantially more accurate than the baseline Aalen's additive model. Latent factors inferred by our approach could be mined further; we found that the most informative factors are related to known cancer processes.

Last Updated on Thursday, 09 July 2015 15:08
 

Gene network inference by probabilistic scoring of relationships from a factorized model of interactions

Bioinformatics just published a special issue devoted to ISMB 2014 proceedings papers that will be presented next month at the world's premier conference on computational biology -- ISMB 2014 in Boston, MA, USA.

Our paper, Gene network inference by probabilistic scoring of relationships from a factorized model of interactions, which you will find in this issue of Bioinformatics, describes a conceptually new probabilistic approach to gene network inference from quantitative interaction data called Red. Red is founded on epistasis analysis. Epistasis analysis is an essential tool of classical genetics for inferring the order of function of genes in a common pathway. Typically, it considers single and double mutant phenotypes and for a pair of genes observes if a change in the first gene masks the effects of the mutation in the second gene. Despite the recent emergence of biotechnology techniques that can provide gene interaction data on a large, possibly genomic scale, very few methods are available for quantitative epistasis analysis and epistasis-based network reconstruction.

The features of Red are joint treatment of the mutant phenotype data with a factorized model and probabilistic scoring of pairwise gene relationships that are inferred from the latent gene representation. The resulting gene network is assembled from scored pairwise relationships. In an experimental study, we show that the proposed approach can accurately reconstruct several known pathways and that it surpasses the accuracy of current approaches.
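The classical masking rule described above can be written as a tiny deterministic check. This is a toy sketch and not Red, which scores relationships probabilistically from a latent gene representation; the function name, the tolerance `tol` and the phenotype values are assumptions made for illustration.

```python
def epistatic_gene(single_a, single_b, double_ab, tol=0.1):
    """Return which gene masks the other in a double mutant, or None.

    Arguments are quantitative phenotypes of the two single mutants
    and of the double mutant, measured on the same scale.
    """
    if abs(single_a - single_b) <= tol:
        return None            # single mutants look alike, so masking cannot be resolved
    if abs(double_ab - single_a) <= tol:
        return "a"             # double mutant resembles mutant a: a masks b
    if abs(double_ab - single_b) <= tol:
        return "b"             # double mutant resembles mutant b: b masks a
    return None                # double mutant resembles neither single mutant

if __name__ == "__main__":
    print(epistatic_gene(0.2, 0.9, 0.25))
    print(epistatic_gene(0.2, 0.9, 0.5))
```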

Last Updated on Wednesday, 13 August 2014 05:21
 

