Marinka Zitnik

Fusing bits and DNA

Compressive Data Fusion and Persistent Homology

My talk at the Summer School on Computational Topology in Ljubljana, Slovenia, was about coupling compressive data fusion methods with algebraic topology, in particular persistent homology. I discussed how the latent data space obtained by fusing heterogeneous biological data sets can be explored with topological methods.

In a case study from molecular biology, which included nearly two dozen data sets, we studied the persistence (lifetime) of various topological features, such as connected components, loops, voids and tunnels. We showed that significant topological features, i.e. features with long lifetimes, also carry biologically relevant information. For example, gene modules with significant topology were enriched for cellular functions and biological processes, and, similarly, persistent drug modules captured the structural similarity between drugs.
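To make the notion of lifetime concrete, here is a minimal sketch that computes the persistence of connected components (the 0-dimensional features) for a small point cloud. It is not the pipeline used in the case study; the function name `zero_dim_persistence` and the toy points are assumptions made for illustration. Every component is born at scale 0 and dies at the distance at which it merges with another component.

```python
from itertools import combinations
import math

def zero_dim_persistence(points):
    """Return the scales at which connected components die (all are born at 0)."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all pairwise distances from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj      # two components merge, so one of them dies here
            deaths.append(dist)
    return deaths                # one component survives at every scale

# Toy data: two well-separated clusters.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (11.0, 0.0)]
deaths = zero_dim_persistence(points)
```

In this toy example three components die at scale 1 and one dies at scale 9. The long-lived one reflects the gap between the two clusters, which is the kind of feature one would call significant.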

The slides of the talk are available.

Last Updated on Thursday, 18 February 2016 07:47
 

Invited Talk on Learning Latent Factor Models by Data Fusion

Our invited talk at the Workshop on Matrix Computations for Biomedical Informatics at the 15th Conference on Artificial Intelligence in Medicine (AIME) in Pavia, Italy, discussed the use of collective latent factor models for various predictive modeling tasks in biomedicine, such as gene prioritization, gene function prediction, network inference and discovery of disease-disease associations.

In the talk, given together with Blaz Zupan, we highlighted our recent developments in data fusion via latent factor models.

The slides of the talk are available at Prezi.

Last Updated on Friday, 21 August 2015 16:05
 

Bioinformatics: Gene Network Inference by Fusing Diverse Distributions

Our paper on Gene network inference by fusing data from diverse distributions has been published in Bioinformatics. We will present it at ISMB 2015 in Dublin.

In the paper we describe FuseNet, a Markov network formulation that infers networks from a collection of potentially nonidentically distributed datasets.

Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as those from next-generation sequencing, often violate this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets.

FuseNet is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. We also demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies.

Last Updated on Friday, 21 August 2015 16:06
 

Poster Award at the Basel Computational Biology Conference

Our poster on Gene prioritization by compressive data fusion and chaining received the best poster award at the Basel Computational Biology Conference ([BC]^2).

The poster highlights our recent computational method that prioritizes genes by fusing heterogeneous data. An appealing property of our approach is its ability to consider data that might be provided in totally different input spaces, which is achieved through chaining of the latent data representation. We report on a very successful application in hunting genes responsible for bacterial resistance in Dictyostelium, where our predictions were validated in the wet lab.

Last Updated on Sunday, 14 June 2015 10:05
 

Data Fusion Tutorial at the Basel Computational Biology Conference

Together with Blaz Zupan, we are organizing a tutorial on data fusion at the Basel Computational Biology Conference ([BC]^2). The tutorial is targeted at computational scientists, data mining researchers and molecular biologists interested in large-scale data integration and predictive modeling.

In the tutorial we focus on collective latent factor models, which have gained popularity in recent years through many successful applications in integrative predictive modeling. We have prepared a series of short lecture notes, which provide the intuition and mathematics behind the algorithms, explain why factorization approaches are suitable when collectively analyzing many heterogeneous data sets, and contain a number of case studies taken from recommendation systems, functional genomics, molecular and systems biology. We demonstrate several recent methodological advancements in hands-on sessions using Orange and its Data Fusion Add-on.

This tutorial would not be possible without the great support of the Bioinformatics Laboratory at the University of Ljubljana.

Last Updated on Friday, 21 August 2015 15:52
 

Syst Biomed: Survival Regression by Data Fusion

Our recent paper in Systems Biomedicine describes a new computational approach that predicts a patient’s survival time from a collection of heterogeneous data sets. This is the full paper of our award-winning entry at the CAMDA meeting at ISMB 2014, Boston, MA, USA.

The approach builds upon recently proposed collective matrix factorization and the well-known Aalen’s additive model for survival regression. Unlike existing methods for survival time prediction, we formulated a joint inference procedure that simultaneously infers the model parameters of collective matrix factorization and the regression coefficients of Aalen’s model. We demonstrated improved performance of our method over several baselines in case studies involving three cancer types from the International Cancer Genome Consortium and diverse data sets, such as gene and miRNA expression profiles, somatic mutation data, methylation and gene annotations from the Gene Ontology. We demonstrate that both latent data representation and joint inference, the two features of our approach, contribute substantially to the accurate prediction of survival time. Our results point to the potential benefits of data fusion when inferring survival models that are predictive of clinical outcomes.

Last Updated on Friday, 21 August 2015 16:06
 

J Comp Biol: Network-Guided Matrix Completion

Our recent paper in the Journal of Computational Biology introduces an interaction data imputation method called network-guided matrix completion (NG-MC). The core part of NG-MC is low-rank probabilistic matrix completion that incorporates prior knowledge presented as a collection of gene networks. NG-MC assumes that interactions are transitive, such that latent gene interaction profiles inferred by NG-MC depend on the profiles of their direct neighbors in gene networks. As the NG-MC inference algorithm progresses, it propagates latent interaction profiles through each of the networks and updates gene network weights toward improved prediction.
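The idea of guiding a low-rank completion with a gene network can be illustrated with a much simpler stand-in. The sketch below is not NG-MC: it is a plain gradient-descent fit of a symmetric low-rank model with a graph-Laplacian penalty that pulls the latent profiles of neighboring genes together, and it uses a single fixed network instead of learning network weights. The function name, the hyperparameters and the toy data are all assumptions made for illustration.

```python
import numpy as np

def complete_with_network(X, mask, A, rank=2, lam=0.1, lr=0.01, steps=2000, seed=0):
    """Fill missing entries of a symmetric score matrix X with a low-rank model F F^T.

    X    : (n, n) interaction scores; entries where mask is False are ignored
    mask : (n, n) boolean array, True where a score was measured
    A    : (n, n) symmetric 0/1 adjacency matrix of a gene network
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    F = 0.1 * rng.standard_normal((n, rank))   # latent interaction profiles
    L = np.diag(A.sum(axis=1)) - A             # graph Laplacian of the network
    for _ in range(steps):
        # Residual on measured entries only; missing entries contribute nothing.
        R = np.where(mask, F @ F.T - X, 0.0)
        # Gradient of the squared error plus lam * trace(F^T L F).
        grad = 2 * (R + R.T) @ F + 2 * lam * (L @ F)
        F -= lr * grad
    return F @ F.T

# Toy data: four genes in two groups, with one gene pair left unmeasured.
v = np.array([1.0, 1.0, -1.0, -1.0])
X = np.outer(v, v)
mask = np.ones((4, 4), dtype=bool)
mask[0, 3] = mask[3, 0] = False
X[~mask] = np.nan
A = np.zeros((4, 4))
A[0, 1] = A[1, 0] = A[2, 3] = A[3, 2] = 1.0    # network links genes within a group
completed = complete_with_network(X, mask, A)
```

The Laplacian term is what carries information from the network into the factorization: genes that are linked are encouraged to have similar latent profiles, so a gene with few measurements can borrow from its neighbors.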

In a study with four different E-MAP data assays, using protein–protein interaction and Gene Ontology similarity networks as prior knowledge, NG-MC significantly surpassed existing alternative techniques. Inclusion of information from gene networks also allowed NG-MC to predict interactions for genes that were not included in the original E-MAP assays, a task that could not be considered by current imputation approaches.

Epistatic miniarray profile (E-MAP) is a popular large-scale genetic interaction discovery platform. E-MAPs benefit from quantitative output, which makes it possible to detect subtle interactions with greater precision. However, due to the limits of biotechnology, E-MAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, in this way completing the interaction profiles and enabling downstream analysis algorithms that could otherwise be sensitive to missing data values.

The conference paper is available from the RECOMB 2014 website. Supplementary material is available from the GitHub repository NG-MC.

Last Updated on Thursday, 18 February 2016 07:00
 

ACM XRDS: The Anatomy of a Human Disease Network

The Winter 2014 issue of ACM XRDS is here! This issue is on health informatics, which has received considerable attention in recent years, both in research and among the general public. You can read about the opportunities of social media in health and well-being, #engaging health initiatives and challenges in personal health tracking, among other topics.

My department contributed the column about the anatomy of a human disease network. In the column, we explore the human disease network and demonstrate how network-based tools can help us understand relations between diseases at a higher level of organismal organization without considering any prior biomedical knowledge.

Tools of network analysis have recently been applied to many complex systems, to both simplify and highlight their underlying structure and the relationships that they represent. The results obtained from network-based approaches provide not only insight into interactions between online users, but also new clues about how to improve our understanding of biological systems. Network medicine in particular, a network-based approach to studying human disease, has proven effective in studying interdependence between molecular components in cells, and in identifying disease modules and biological pathways.

Last Updated on Friday, 21 August 2015 15:01
 

IEEE TPAMI: Data Fusion by Matrix Factorization

We recently published a paper on a new data fusion method in IEEE Transactions on Pattern Analysis and Machine Intelligence.

For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system’s constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization, called data fusion by matrix factorization (DFMF), that simultaneously factorizes many data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF for a gene function prediction task with eleven different data sources and for prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
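What "simultaneously factorizes many data matrices" means can be shown on two relation matrices that share one object type. The sketch below is a simplification and not the DFMF algorithm: it uses unconstrained alternating least squares, with no penalties or constraint matrices, and the function name `fuse`, the ranks and the synthetic data are assumptions. The relation R12 (say, genes by terms) is modeled as G1 S12 G2^T and R13 (genes by drugs) as G1 S13 G3^T, so the gene factor G1 must explain both relations at once.

```python
import numpy as np

def fuse(R12, R13, k1=2, k2=2, k3=2, iters=50, seed=0):
    """Jointly tri-factorize two relation matrices that share the factor G1."""
    rng = np.random.default_rng(seed)
    pinv = np.linalg.pinv
    G1 = rng.standard_normal((R12.shape[0], k1))
    G2 = rng.standard_normal((R12.shape[1], k2))
    G3 = rng.standard_normal((R13.shape[1], k3))
    for _ in range(iters):
        # Least-squares update of the small "backbone" matrices.
        S12 = pinv(G1) @ R12 @ pinv(G2.T)
        S13 = pinv(G1) @ R13 @ pinv(G3.T)
        # G1 is shared, so it is fitted to both relations together.
        B = np.hstack([S12 @ G2.T, S13 @ G3.T])
        G1 = np.hstack([R12, R13]) @ pinv(B)
        # G2 and G3 each belong to one relation only.
        G2 = (pinv(G1 @ S12) @ R12).T
        G3 = (pinv(G1 @ S13) @ R13).T
    S12 = pinv(G1) @ R12 @ pinv(G2.T)
    S13 = pinv(G1) @ R13 @ pinv(G3.T)
    return G1, (S12, G2), (S13, G3)

# Synthetic data in which both relations depend on the same hidden gene factors.
rng = np.random.default_rng(1)
hidden = rng.standard_normal((6, 2))
R12 = hidden @ rng.standard_normal((2, 5))
R13 = hidden @ rng.standard_normal((2, 4))

G1, (S12, G2), (S13, G3) = fuse(R12, R13)
err = np.linalg.norm(R12 - G1 @ S12 @ G2.T) + np.linalg.norm(R13 - G1 @ S13 @ G3.T)
rel_err = err / (np.linalg.norm(R12) + np.linalg.norm(R13))
```

The design point to notice is the shared G1: information in one relation shapes the latent gene representation that is then used to reconstruct the other.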

A short preprint is available at arXiv:1307.0803. The full paper is online at IEEE.

Last Updated on Friday, 21 August 2015 16:05
 

ACM XRDS: Dynamics of News from The New York Times

A new issue of ACM XRDS is here! The focus of this issue is on techniques for natural language processing in the broader sense. You will find interesting stories about how to detect influencers in social media discussions and how to successfully transition from academia to entrepreneurship, and you can read about the hurdles and opportunities in research on ancient written languages.

My department contributed a short column on exploring news from The New York Times. Techniques of information extraction and natural language processing allow us to search for news articles and to analyze dynamics of published content. News organizations, such as The New York Times, provide programmatic access to their articles to retrieve headlines, abstracts, and links to published multimedia. In the column we use The New York Times Article Search API to demonstrate how to construct search queries that retrieve documents from various news sections and time periods. We also explore the pulse of climate change over the years using data extracted from published news articles.
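As a rough illustration of constructing such a search query, the sketch below only assembles a request URL and does not contact the service. The endpoint and the parameter names (`q`, `fq`, `begin_date`, `end_date`, `api-key`) follow my understanding of version 2 of the Article Search API and should be checked against the current developer documentation; the helper `build_query` and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Article Search API (version 2).
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

def build_query(term, section, year, api_key):
    """Build a search URL for one term, one news section and one calendar year."""
    params = {
        "q": term,                               # free-text search term
        "fq": f'section_name:("{section}")',     # filter query restricting the section
        "begin_date": f"{year}0101",             # dates in YYYYMMDD format
        "end_date": f"{year}1231",
        "api-key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_query("climate change", "Science", 2013, "YOUR_API_KEY")
```

Calling `build_query` once per year and recording the number of hits reported in each response is one simple way to trace how coverage of a topic changes over time.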

Last Updated on Friday, 21 August 2015 15:00
 

@Heidelberg Laureate Forum 2014

I recently participated as a young researcher in computer science at the Heidelberg Laureate Forum. I encourage the reader to check the recordings of some of the talks, which are available at the official HLF website. If you are limited by time, I recommend at least one of the talks by Michael Atiyah, Manuel Blum, Wendelin Werner, Vint Cerf, Leslie Lamport, Manjul Bhargava, Daniel Spielman, Efim Zelmanov or John Hopcroft. They are engaging, full of useful tips and strategies, and should be accessible to an interested listener.

Many CS & Math bloggers followed the event; their comments and discussions about the laureates' talks can be found at HLF Blogs. Among others, our poster has been highlighted by John D. Cook. Overall, HLF has been an awesome experience for me, with many opportunities to network with Turing, Fields, Abel and Nevanlinna laureates and to meet other young researchers in computer science and mathematics from around the world.

Last Updated on Sunday, 28 September 2014 20:13
 

Google Global Planning Committee for Women in Computer Science

I have been given an opportunity to join the Google Global Planning Committee for Women in Computer Science in an effort to identify ways we can have the greatest impact and reach more women in tech. As a member of this committee I will partner with Google to build the community and direct outreach activities for women in computer science. To kick things off, we will have our global meeting at the Grace Hopper Conference in Phoenix, AZ, USA. I am excited to be part of this great program that encourages women to excel in computer science and information technology.

Stay tuned, there will be many possibilities to engage with fellow technologists!

Last Updated on Tuesday, 16 September 2014 16:25
 

@Stanford University, Department of Computer Science

I am visiting the Department of Computer Science at Stanford University, CA, USA in Summer and Fall 2014. During my stay we will study the interplay between network analysis, data integration and biology. There are many exciting challenges one can explore in these areas and I am very enthusiastic about the work.

Last Updated on Thursday, 21 August 2014 05:53
 

ISMB 2014: Epistasis-Based Gene Network Inference

I presented our recent approach for epistasis-based gene network inference at ISMB 2014. We propose a factorized model of interactions that is used for scoring different types of gene-gene relationships, such as epistasis, parallelism and partial interdependence, and for assembling gene networks that are consistent with the estimated pairwise relationships. A detailed derivation of the method and its empirical comparison with existing approaches are described in our paper published in Bioinformatics.

Last Updated on Thursday, 09 July 2015 15:08
 

CAMDA 2014: Survival Regression by Data Fusion

At CAMDA 2014 I presented an extension of our recent matrix factorization-based data fusion approach that couples data fusion with survival regression. CAMDA 2014 runs as a satellite meeting at ISMB 2014, Boston, MA, USA. Our presentation received the CAMDA best presentation award.

Any knowledge discovery could in principle benefit from the fusion of directly or even indirectly related data sources. In this work, we explore whether a recently proposed simultaneous matrix factorization data fusion approach could be adapted for survival regression. We propose a new method that jointly infers latent factors by data fusion and estimates the regression coefficients of a survival model. We have applied the method to the CAMDA 2014 large-scale Cancer Genomes Challenge and modeled survival time as a function of gene, protein and miRNA expression data, and data on methylated and mutated regions. We find that both the joint inference of factors and regression coefficients on one side and the data fusion procedure on the other are crucial for performance. Our approach is substantially more accurate than the baseline Aalen's additive model. Latent factors inferred by our approach could be mined further; we found that the most informative factors are related to known cancer processes.

Last Updated on Thursday, 09 July 2015 15:08
 

Gene network inference by probabilistic scoring of relationships from a factorized model of interactions

Bioinformatics just published a special issue devoted to ISMB 2014 proceedings papers that will be presented next month at the world's premier conference on computational biology -- ISMB 2014 in Boston, MA, USA.

Our paper, Gene network inference by probabilistic scoring of relationships from a factorized model of interactions, which you will find in this issue of Bioinformatics, describes a conceptually new probabilistic approach to gene network inference from quantitative interaction data called Red. Red is founded on epistasis analysis. Epistasis analysis is an essential tool of classical genetics for inferring the order of function of genes in a common pathway. Typically, it considers single and double mutant phenotypes and for a pair of genes observes if a change in the first gene masks the effects of the mutation in the second gene. Despite the recent emergence of biotechnology techniques that can provide gene interaction data on a large, possibly genomic scale, very few methods are available for quantitative epistasis analysis and epistasis-based network reconstruction.
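The classical masking test described above can be written down in a few lines. This is only the textbook rule, not Red's probabilistic scoring: the function name, the tolerance and the phenotype values are hypothetical, and phenotypes are assumed to be single numbers on a common scale.

```python
def epistatic_gene(pheno_a, pheno_b, pheno_ab, tol=0.1):
    """Return 'a' or 'b' if that gene's mutation masks the other one, else None.

    pheno_a, pheno_b : phenotypes of the two single mutants
    pheno_ab         : phenotype of the double mutant
    tol              : how close two phenotypes must be to count as the same
    """
    diff_a = abs(pheno_ab - pheno_a)
    diff_b = abs(pheno_ab - pheno_b)
    if diff_a <= tol < diff_b:
        return "a"   # double mutant looks like mutant a: a masks b
    if diff_b <= tol < diff_a:
        return "b"   # double mutant looks like mutant b: b masks a
    return None      # no clear masking under this simple rule

# Hypothetical measurements: the double mutant resembles the single mutant of gene a.
call = epistatic_gene(pheno_a=0.2, pheno_b=0.9, pheno_ab=0.22)
```

A hard threshold like `tol` is exactly what makes this rule brittle on noisy, large-scale data, which is the motivation for scoring relationships probabilistically instead.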

The features of Red are joint treatment of the mutant phenotype data with a factorized model and probabilistic scoring of pairwise gene relationships that are inferred from the latent gene representation. The resulting gene network is assembled from scored pairwise relationships. In an experimental study, we show that the proposed approach can accurately reconstruct several known pathways and that it surpasses the accuracy of current approaches.

Last Updated on Wednesday, 13 August 2014 05:21
 

ACM XRDS: Exploring Data with Topological Tools

The Summer issue of ACM XRDS is here! This issue focuses on diversity in computer science. You will find columns about how to make tech more inclusive, women in computing, self-teaching and how hip-hop lyrics can be used in combination with artificial intelligence to engage more students in computer science. Also, you should not miss the Features section! There, you will learn, among other things, about a research project in Germany that integrates gender and diversity in STEM fields and read about how neuroscience has revealed that we sometimes judge others by their gender or ethnicity without even realizing it. What can be done to address these issues? Check out the ACM XRDS's advice.

For the computationally inspired among you I have contributed a column that describes one of many possible usages of computational topology for exploratory data analysis. Tools from topology increasingly serve to inspire the development of novel computational methods for data analysis. With these methods we can study qualitative geometric information of the data to understand how they are organized on a large scale and focus on intrinsic shape properties rather than on characteristics that depend on a particular choice of a coordinate system. The column applies a topological tool called Mapper to extract and visualize simple descriptions of data sets.

Last Updated on Friday, 21 August 2015 15:01
 

Young Researcher in the Heidelberg Laureate Forum 2014

I have been selected to participate as a young researcher in the Heidelberg Laureate Forum 2014 (HLF). The Forum will take place in September and will bring together winners of the Abel Prize and Fields Medal (mathematics) as well as the Turing Award and Nevanlinna Prize (computer science) with young researchers from around the world selected by an international committee of experts, primarily from the award-granting organizations. I am fortunate to be one of the 200 young researchers (there are 100 spaces for each of the two disciplines, mathematics and computer science) who will be part of this Forum.

The HLF is an event inspired by the Lindau Nobel Laureate Meetings, which provide a forum where people dedicated to science, both role models and young researchers in physics, chemistry and life sciences, can interact. This event spawned an idea to create something similar for the scientific disciplines of mathematics and computer science. The list of participating Laureates is impressive and includes, among others, Manuel Blum, Stephen Cook, Antony Hoare, John Hopcroft, Leslie Lamport, John Torrence Tate and Wendelin Werner. I am looking forward to meeting these distinguished experts from both disciplines and learning many new things.

Last Updated on Friday, 21 August 2015 16:06
 

ACM XRDS: Efficient Sensor Placement for Environmental Monitoring

The Spring 2014 issue of XRDS: Crossroads, the ACM magazine for students, is about cyber-physical systems.

My XRDS department contributed a column on efficient sensor placement for environmental monitoring. The column is about the important problem of observation selection, which has received considerable research attention in recent years. Consider, for example, air quality monitoring in a large research lab, the monitoring of algae biomass in a lake or the placement of a network of sensors in a water distribution system for early detection of contaminants. In all these settings we have to decide where to place the sensors in order to effectively collect information about the environment. Since acquiring observations is typically expensive and we have a limited budget, we want to select a small number of the most informative locations for monitoring. Thus, we usually trade off the informativeness of sensor measurements against the cost of data acquisition. The column gives an example of a large sensor deployment in a research lab and applies tools of submodular optimization to tackle the task effectively, with theoretical performance guarantees of near-optimal observation selection.
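The flavor of the approach can be sketched with the standard greedy algorithm for submodular maximization. The example is an assumption-laden simplification: informativeness is modeled as plain set coverage (each candidate location observes a set of regions) instead of the objective used in the column, and the location names and coverage sets are made up. For monotone submodular objectives under a budget on the number of sensors, this greedy rule is known to achieve at least a (1 - 1/e) fraction of the optimal value.

```python
def greedy_placement(coverage, budget):
    """Pick up to `budget` locations, each time adding the largest marginal gain.

    coverage : dict mapping a candidate location to the set of regions it observes
    budget   : maximum number of sensors to place
    """
    chosen, covered = [], set()
    for _ in range(budget):
        candidates = [loc for loc in coverage if loc not in chosen]
        if not candidates:
            break
        # Marginal gain = number of regions not yet observed by chosen sensors.
        best = max(candidates, key=lambda loc: len(coverage[loc] - covered))
        if not coverage[best] - covered:
            break                      # no remaining location adds information
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

# Hypothetical lab layout: regions are numbered 1 to 6.
coverage = {
    "hall": {1, 2, 3, 4},
    "lab": {3, 4, 5},
    "office": {5, 6},
    "kitchen": {1, 6},
}
chosen, covered = greedy_placement(coverage, budget=2)
```

Here the first pick is the hall, and the second pick is the office because it adds two unobserved regions while the lab and the kitchen would each add only one. Diminishing marginal gains of this kind are what the submodularity property formalizes.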

Last Updated on Friday, 21 August 2015 15:01
 

@RECOMB 2014, Pittsburgh, PA (Part II)

We are presenting a poster about our recent data fusion methodology (ArXiv preprint) at the RECOMB conference. Thanks to Prof. Blaz Zupan for the storyline and Prof. Richard H. Kessin for valuable comments. xkcd.com served as an inspiration for the poster design (HiRes). See also the other post (Part I) about our RECOMB paper.

Best Poster Award at RECOMB 2014!

Last Updated on Sunday, 14 June 2015 10:52
 

@RECOMB 2014, Pittsburgh, PA (Part I)

Our paper on Imputation of Quantitative Genetic Interactions in Epistatic MAPs by Interaction Propagation Matrix Completion has been accepted to RECOMB 2014.

Epistatic Miniarray Profile (E-MAP) is a popular large-scale gene interaction discovery platform. E-MAPs benefit from quantitative output, which makes it possible to detect subtle interactions. However, due to the limits of biotechnology, E-MAP studies fail to measure genetic interactions for up to 40% of gene pairs in an assay. Missing measurements can be recovered by computational techniques for data imputation, thus completing the interaction profiles and enabling downstream analysis algorithms that could otherwise be sensitive to largely incomplete data sets. In the paper, we introduce a new interaction data imputation method called interaction propagation matrix completion (IP-MC). The core part of IP-MC is a low-rank (latent) probabilistic matrix completion approach that considers additional knowledge presented through a gene network. IP-MC assumes that interactions are transitive, such that latent gene interaction profiles depend on the profiles of their direct neighbors in a given gene network. As the IP-MC inference algorithm progresses, the latent interaction profiles propagate through the branches of the network. In a study with three different E-MAP data assays and the considered protein-protein interaction and Gene Ontology similarity networks, IP-MC significantly surpassed existing alternative techniques. Inclusion of information from gene networks also allows IP-MC to predict interactions for genes that were not included in original E-MAP assays, a task that could not be considered by current imputation approaches.

The presentation is available at Prezi.

Last Updated on Wednesday, 02 April 2014 21:48
 

@Pacific Symposium on Biocomputing 2014, Hawaii

I am participating in PSB 2014, the Pacific Symposium on Biocomputing, an international conference on current research in the theory and application of computational methods in problems of biological significance, which is held on the Big Island of Hawaii.

Our paper on Matrix factorization-based data fusion for gene function prediction in baker's yeast and slime mold has been accepted to PSB. In the paper, we examined the applicability of our recently proposed matrix factorization-based data fusion approach to the problem of gene function prediction. We studied three fusion scenarios to demonstrate the high accuracy of our approach when learning from disparate, incomplete and noisy data. The studies were successfully carried out for two different organisms: the protein-protein interaction network for yeast is nearly complete but noisy, whereas the sets of available interactions for slime mold are rather sparse and only about one-tenth of its genes have experimentally derived annotations.

Last Updated on Monday, 07 December 2015 21:17
 

@Baylor College of Medicine, Department of Molecular and Human Genetics

Between December 2013 and August 2014 I am visiting the Department of Molecular and Human Genetics at Baylor College of Medicine, Houston, TX, USA. During my stay we will do research on computational methods for data fusion and their applications in systems biology. We will investigate our recently developed data fusion algorithms and apply them to tasks such as gene function prediction, gene ranking (prioritization), missing value imputation, association mining and inference of gene networks from mutant data. I anticipate that large-scale applications of our methods may provide valuable feedback on whether such functionality is useful for the biological community and provide new insights into the correspondence between biological and algorithmic concepts.

Last Updated on Sunday, 14 June 2015 10:52
 

ACM XRDS: On Constructing the Tree of Life

The Winter 2013 issue of XRDS: Crossroads, the ACM magazine for students, features the latest in wearable computing, such as wearable brain-computer interfaces, human motion capture, tracking how we read, augmented reality and airwriting. This issue also offers a fascinating insider's look at what a Google technical interview is all about. Check it out!

I contributed a column on constructing, interpreting and visualizing phylogenetic trees, which are diagrams of relatedness between organisms, species or genes that show a history of descent from common ancestry. As more and more life sciences data are freely available in public databases, some of the analyses that would have been performed in well-equipped research laboratories just a few years ago are nowadays accessible to any interested individual with a commodity computer. Such a shift was only possible due to unprecedented technological and theoretical advancements across a broad spectrum of science and technology. Check it out!

Last Updated on Friday, 21 August 2015 15:00
 

Press Coverage of Our Recent Study About Connections Between Human Diseases

BioTechniques, The International Journal of Life Science Methods highlighted our recent paper on Discovering disease-disease associations by fusing systems-level molecular data, which was published by Nature's Scientific Reports. In the paper we applied our novel computational approach for data fusion to a plethora of molecular data in order to discover disease-disease associations.

The complete article featuring our study and a commentary by the paper's senior author, Prof. Blaz Zupan, PhD, are available at the BioTechniques site.

Last Updated on Sunday, 30 March 2014 16:37
 

Discovering Disease-Disease Associations by Fusing Molecular Data

Nature's Scientific Reports has published our latest paper on data fusion, Discovering disease-disease associations by fusing systems-level molecular data, in which we combine various sources of biological information to discover human disease-disease associations.

The advent of genome-scale genetic and genomic studies allows new insight into disease classification. Recently, a shift was made from linking diseases simply based on their shared genes towards systems-level integration of molecular data. We aim to find relationships between diseases based on evidence from fusing all available molecular interaction and ontology data. We propose a multi-level hierarchy of disease classes that significantly overlaps with existing disease classification. In it, we find 14 disease-disease associations currently not present in Disease Ontology and provide evidence for their relationships through comorbidity data and literature curation. Interestingly, even though the number of known human genetic interactions is currently very small, we find they are the most important predictor of a link between diseases. Finally, we show that omission of any one of the included data sources reduces prediction quality, further highlighting the importance of the paradigm shift towards systems-level data fusion. Check it out!

Last Updated on Wednesday, 15 June 2016 22:07
 

ACM XRDS: Zero-Knowledge Proofs

The Fall 2013 issue of XRDS: Crossroads, the ACM magazine for students, is about the complexities of privacy and anonymity.

The issue is motivated by the current research problems and recent societal concerns about digital privacy. When real and digital worlds collide things can get messy. Complicated problems surrounding privacy and anonymity arise as our interconnected world evolves technically, culturally, and politically. But what do we mean by privacy? By anonymity? Inside this issue there are contributions from lawyers, researchers, computer scientists, policy makers, and industry heavyweights all of whom try to answer the tough questions surrounding privacy, anonymity, and security. From cryptocurrencies to differential privacy, the issue looks at how technology is used to protect our digital selves, and how that same technology can expose our vulnerabilities causing lasting, real-world effects. Check it out!

The department that I am responsible for contributed a column on zero-knowledge proofs. A zero-knowledge proof allows one person to convince another person of some statement without revealing any information about the proof other than the fact that the statement is indeed true. Zero-knowledge proofs are of practical and theoretical interest in cryptography and mathematics. They achieve the seemingly contradictory goal of proving a statement without revealing why it is true. In the column we describe interactive proof systems and some implications that zero-knowledge proofs have for complexity theory. We conclude with an application of zero-knowledge proofs in cryptography, the Fiat-Shamir identification protocol, which is the basis of current zero-knowledge entity authentication schemes. Check it out!
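One round of a Fiat-Shamir style identification can be sketched as follows. This is a toy illustration and not secure: the modulus is tiny, and the function names are assumptions. The prover knows a secret s and publishes v = s^2 mod n; in each round the prover commits to a random square, the verifier sends a one-bit challenge, and the verifier accepts if the response squares to the expected value.

```python
import secrets
from math import gcd

def keygen(n):
    """Return a secret s coprime to n and the public value v = s^2 mod n."""
    while True:
        s = secrets.randbelow(n - 2) + 2
        if gcd(s, n) == 1:
            return s, pow(s, 2, n)

def run_round(n, s, v):
    """Run one round between an honest prover (knows s) and a verifier (knows v)."""
    r = secrets.randbelow(n - 1) + 1      # prover: fresh random value
    x = pow(r, 2, n)                      # prover -> verifier: commitment
    e = secrets.randbelow(2)              # verifier -> prover: challenge bit
    y = (r * pow(s, e, n)) % n            # prover -> verifier: response
    # Verifier's check: y^2 must equal x * v^e modulo n.
    return pow(y, 2, n) == (x * pow(v, e, n)) % n

n = 61 * 53                               # toy modulus; real use needs large secret primes
s, v = keygen(n)
accepted = all(run_round(n, s, v) for _ in range(20))
```

Someone who does not know s can prepare a valid answer for only one of the two possible challenges, so a cheater passes a single round with probability 1/2 and passes 20 independent rounds with probability 2^-20. The verifier only ever sees a random square and a value consistent with it, which is why nothing about s is revealed.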

Last Updated on Friday, 21 August 2015 15:00
 

MLSS 2013, Max Planck Institute for Intelligent Systems, Tübingen


This year I am participating in the Machine Learning Summer School (MLSS), held in Tübingen, Germany. The Summer School offers an opportunity to learn about fundamental and advanced aspects of machine learning, data analysis and inference from leaders of the field. Topics are diverse and include graphical models, multilayer networks, cognitive and kernel learning, network modeling and information propagation, distributed M, structured-output prediction, reinforcement learning, sparse models, learning theory, causality and much more. I am looking forward to it. Also, posters are a long-standing tradition at the MLSS. Below is an image of a poster presentation that covers some of my recent work.

 

Last Updated on Thursday, 09 July 2015 15:09
 

Extracting Gene Regulation Networks Using Linear-Chain Conditional Random Fields and Rules @ACL 2013, BioNLP Workshop


This week Slavko Zitnik will present our paper (he is the first author) at the ACL BioNLP Workshop. The paper extends linear-chain conditional random fields (CRF) with skip-mentions to extract gene regulatory networks from biomedical literature. It also describes a sieve-based system architecture, a complete data processing pipeline that includes data preparation, linear-chain CRF and rule-based relation detection, and data cleaning.

Published literature in molecular genetics may collectively provide much information on gene regulation networks. Dedicated computational approaches are required to sift through large volumes of text and infer gene interactions. We propose a novel sieve-based relation extraction system that uses linear-chain conditional random fields and rules. Also, we introduce a new skip-mention data representation to enable distant relation extraction using first-order models. To account for a variety of relation types, multiple models are inferred. The system was applied to the BioNLP 2013 Gene Regulation Network Shared Task. Our approach was ranked first of five, with a slot error rate of 0.73.
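The abstract does not spell out the skip-mention representation, so the following is only a sketch of one reading, assumed here: a k-skip-mention sequence keeps every (k+1)-th mention of the original sequence. Mentions that were k+1 positions apart then become neighbours, which lets a first-order (linear-chain) model score relations between them. The function name and the example mentions are hypothetical.

```python
def skip_mention_sequences(mentions, skip):
    """Split a list of mentions into skip-mention sequences.

    Assumption: for skip = k, each sequence takes every (k+1)-th mention,
    starting from offsets 0..k. skip = 0 returns the original sequence.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    step = skip + 1
    sequences = [mentions[offset::step] for offset in range(step)]
    # Drop empty sequences that arise when there are fewer mentions than offsets.
    return [seq for seq in sequences if seq]


if __name__ == "__main__":
    # Hypothetical mentions in the order they occur in a sentence.
    mentions = ["geneA", "geneB", "geneC", "geneD", "geneE"]
    for k in range(3):
        print(k, skip_mention_sequences(mentions, k))
```

With one skipped mention, geneA and geneC end up adjacent in the same sequence, so a relation between them can be labelled by a model that only looks at neighbouring items.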

Presentation slides.

Last Updated on Sunday, 25 August 2013 21:40
 

ISMB/ECCB 2013 - 21st International Conference on Intelligent Systems in Molecular Biology & 12th European Conference on Computational Biology


I participated in the CAMDA Satellite Meeting on critical assessment of massive data analysis on 19th and 20th July at ISMB in Berlin, where I presented our matrix factorization-based data fusion approach to predicting drug-induced liver injury from toxicogenomics data sets and circumstantial evidence from related data sources. The outcome was positive and our work was recognized as excellent research.

The main conference days of the 21st Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) and the 12th European Conference on Computational Biology (ECCB) were in Berlin, 21st to 23rd July. Overall, the meeting was enjoyable and the talks there offered novel insights from both computational and biological perspectives. As a side note, in 2014 ISMB and ECCB will be organized separately: the ISMB conference will be in July in Boston and the ECCB meeting will be in September in Strasbourg.

Here, I list some of the talks I attended at ISMB/ECCB. With nine parallel sessions, it was sometimes difficult to pick the most interesting talk. Note that only the presenting authors are provided here.

First day:

  • Simple topological properties predict functional misannotations in a metabolic network (J. Pinney).
  • Of men and not mice. Comparative genome analysis of human diseases and mouse models (W. Xiao).
  • Integration of heterogeneous -seq and -omics data sets: ongoing research and development projects at CLC bio (M. Lappe). Technology track.
  • System based metatranscriptomic analysis (X. Xiong).
  • Integrative analysis of large scale data (M. Spivakov, S. Menon). Workshop track.
  • Multi-task learning for host-pathogen interactions (M. Kshirsagar).
  • Integrative modelling coupled with mass spectrometry-based approaches reveals the structure and dynamics of protein assemblies (A. Politis).
  • Synthetic lethality between gene defects affecting a single non-essential molecular pathway with reversible steps (I. Kupperstein).

Second day:

  • KeyPathwayMiner - extracting disease specific pathways by combining omics data and biological networks (J. Baumbach). Technology track.
  • Compressive genomics (M. Baym).
  • Predicting drug-target interactions using restricted Boltzmann machines (J. Zeng).
  • Efficient network-guided multi locus association mapping with graph cuts (C. Azencott).
  • Differential genetic interactions of S. cerevisiae stress response pathways (P. Beltrao). Special session on dynamic interaction networks.
  • Coordination of post-translational modifications in human protein interaction networks (J. Woodsmith). Special session on dynamic interaction networks.
  • Prediction and analysis of protein interaction networks (A. Valencia). Special session on dynamic interaction networks.
  • Characterizing the context of human protein-protein interactions for an improved understanding of drug mechanism of action (M. Kotlyar). Special session on dynamic interaction networks.
  • GPU acceleration of bioinformatics pipeline (M. Berger and a team from NVIDIA).

Third day:

  • Using the world's public big data to find novel uses for drugs (P. Bourne).
  • A top-down systems biology approach to novel therapeutic strategies (P. Aloy).
  • A large-scale evaluation of computational protein function prediction (P. Radivojac).
  • Deciphering the gene expression code via a combined synthetic computational biology approach (T. Tuller).
  • Interplay of microRNAs, transcription factors and genes: linking dynamic expression changes to function (P. Nazarov).
  • Visual analytics, the human back in the loop (J. Aerts).
  • Turning networks into ontologies of gene function (J. Dutkowski).
  • A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text (S. Ananiadou).

I enjoyed the keynote talks:

  • How chromatin organization and epigenetics talk with alternative splicing (G. Ast).
  • Insights from sequencing thousands of human genomes (G. Abecasis).
  • Sequencing based functional genomics (analysis) (L. Pachter).
  • Searching for signals in sequences (G. Stormo).
  • Results may vary. What is reproducible? Why do open science and who gets the credit? (C. A. Goble).
  • Protein interactions in health and disease (D. Eisenberg).

It has been quite lively on Twitter as well. The official hashtag was #ISMBECCB, and at some point it was even a trending hashtag on Twitter. Check the archive: tweets captured important insights and take-away messages from the talks, as well as some entertaining ideas such as the unofficial ISMB Bingo card by @jonathancairns.

Last Updated on Thursday, 25 July 2013 19:50
 

CAMDA 2013: Matrix Factorization-Based Data Fusion for Drug-Induced Liver Injury Prediction


This work won first prize for excellent research at the ISMB/ECCB CAMDA 2013 Conference.

I am giving a talk at the CAMDA 2013 Conference, which runs as a satellite meeting of the ISMB/ECCB 2013 Conference. CAMDA focuses on challenges in the analysis of the massive data sets that are increasingly produced in several fields of the life sciences. The conference offers researchers from the computer sciences, statistics, molecular biology, and other fields a unique opportunity to benefit from a critical comparative evaluation of the latest approaches in the analysis of life sciences' “Big Data”.

Currently, the Big Data explosion is the grand challenge in life sciences. Analysing large data sets is emerging as one of the key scientific techniques in the post-genomic era. Still, the data analysis bottleneck prevents new biotechnologies from providing new medical and biological insights on a larger scale. This trend towards the need for analysing massive data sets is further accelerated by novel high-throughput sequencing technologies and the increasing size of biomedical studies. CAMDA provides new approaches and solutions to the big data problem, and presents new techniques in the fields of bioinformatics, data analysis, and statistics for handling and processing large data sets. This year, CAMDA's scientific committee set up two challenges: the prediction of drug compatibility from an extremely large toxicogenomic data set, and the decoding of genomes from the Korean Personal Genome Project.

The keynote talks were given by Atul Butte from Stanford University School of Medicine and Nikolaus Rajewsky from Max-Delbrück-Center for Molecular Medicine in Berlin. Atul Butte talked about translational bioinformatics and emphasized the importance of converting molecular, clinical and epidemiological data into diagnostics and therapeutics to ease the bench-to-bedside translation. Nikolaus Rajewsky presented his group's work on circular RNAs and findings on RNA-protein interactions.

I was involved in the prediction of drug compatibility from an extremely large toxicogenomic data set, which aims to answer two of the most important questions in toxicology. We investigated whether animal studies can be replaced with in vitro assays and whether liver injuries in humans can be predicted using toxicogenomics data from animals.

In this work, we demonstrate that data fusion allows us to simultaneously consider the available data for outcome prediction of drug-induced liver injury. Its models can surpass the accuracy of standard machine learning approaches. Our results also indicate that future prediction models should exploit circumstantial evidence from related data sources in addition to standard toxicogenomics data sets. We anticipate that efforts in data analysis hold promise for replacing animal studies with in vitro assays and for predicting the outcome of liver injuries in humans using toxicogenomics data from animals.

 

Last Updated on Thursday, 09 July 2015 15:08
 

