Sunday, 15 December 2013 05:13
Marinka
The Winter 2013 issue of XRDS: Crossroads, the ACM magazine for students features the latest in wearable computing, such as wearable brain computer interface, human motion capturing and tracking how we read, the augmented reality and airwriting. In this issue there is a fascinating insider's look at what a Google technical interview is all about. Check it out!
I contributed a column on constructing, interpreting and visualizing phylogenetic trees, diagrams of relatedness between organisms, species, or genes that show a history of descent from common ancestry. As more and more life sciences data are freely available in public databases, some of the analyses that would have been performed in wellequipped research laboratories just few years ago are nowadays accessible to any interested individual with a commodity computer. Such a shift was only possible due to unprecedented technological and theoretical advancements across a broad spectrum of science and technology. Check it out!
Last Updated on Friday, 21 August 2015 15:00
Saturday, 14 December 2013 04:18
Marinka
BioTechniques, The International Journal of Life Science Methods highlighted our recent paper on Discovering diseasedisease associations by fusing systemslevel molecular data, which was published by Nature's Scientific Reports. In the paper we applied our novel computational approach for data fusion to a plethora of molecular data in order to discover diseasedisease associations.
Complete article featuring our study and a commmentary by paper's senior author prof. Blaz Zupan, PhD are available at BioTechniques site.
Last Updated on Sunday, 30 March 2014 16:37
Sunday, 24 November 2013 15:01
Marinka
Nature's Scientific Reports has published our latest paper on data fusion, Discovering diseasedisease associations by fusing systemslevel molecular data, in which we combine various sources of biological information to discover human diseasedisease associations.
The advent of genomescale genetic and genomic studies allows new insight into disease classification. Recently, a shift was made from linking diseases simply based on their shared genes towards systemslevel integration of molecular data. We aim to find relationships between diseases based on evidence from fusing all available molecular interaction and ontology data. We propose a multilevel hierarchy of disease classes that significantly overlaps with existing disease classification. In it, we find 14 diseasedisease associations currently not present in Disease Ontology and provide evidence for their relationships through comorbidity data and literature curation. Interestingly, even though the number of known human genetic interactions is currently very small, we find they are the most important predictor of a link between diseases. Finally, we show that omission of any one of the included data sources reduces prediction quality, further highlighting the importance in the paradigm shift towards systemslevel data fusion. Check it out!
Last Updated on Wednesday, 15 June 2016 22:07
Sunday, 24 November 2013 14:42
Marinka
The Fall 2013 issue of XRDS: Crossroads, the ACM magazine for students is about the complexities of privacy and anonymity.
The issue is motivated by the current research problems and recent societal concerns about digital privacy. When real and digital worlds collide things can get messy. Complicated problems surrounding privacy and anonymity arise as our interconnected world evolves technically, culturally, and politically. But what do we mean by privacy? By anonymity? Inside this issue there are contributions from lawyers, researchers, computer scientists, policy makers, and industry heavyweights all of whom try to answer the tough questions surrounding privacy, anonymity, and security. From cryptocurrencies to differential privacy, the issue looks at how technology is used to protect our digital selves, and how that same technology can expose our vulnerabilities causing lasting, realworld effects. Check it out!
Department that I'm responsible for contributed a column on zeroknowledge proofs. A zeroknowledge proof allows one person to convince another person of some statement without revealing any information about the proof other than the fact that the statement is indeed true. Zeroknowledge proofs are of practical and theoretical interests in cryptography and mathematics. They achieve a seemingly contradictory goal of proving a statement without revealing it. In the column we describe the interactive proof systems and some implications that zeroknowledge proofs have on the complexity theory. We conclude with an application of zeroknowledge proofs in cryptography, the FiatShamir identification protocol, which is the basis of current zeroknowledge entity authentication schemes. Check it out!
Last Updated on Friday, 21 August 2015 15:00
Monday, 19 August 2013 15:29
Marinka
This year I am participating at Machine Learning Summer School (MLSS) that is held in Tübingen, Germany. The Summer School offers an opportunity to learn about fundamental and advanced aspects of machine learning, data analysis and inference, from leaders of the field. Topics are diverse and include graphical models, multilayer networks, cognitive and kernel learning, network modeling and information propagation, distributed M, structuredoutput prediction, reinforcement learning, sparse models, learning theory, causality and much more. I am looking forward to it. Also, posters are a longstanding tradition at the MLSS. Below is an image of a poster presentation that covers some of my recent work.
Last Updated on Thursday, 09 July 2015 15:09
Saturday, 03 August 2013 23:01
Marinka
This week Slavko Zitnik will present our paper (he is the first author) at ACL, ACL BioNLP Workshop on extending linearchain conditional random fields (CRF) with skipmentions to extract gene regulatory networks from biomedical literature and a sievebased system architecture, which is the complete pipeline of data processing that includes data preparation, linearchain CRF and rule based relation detection and data cleaning.
Published literature in molecular genetics may collectively provide much information on gene regulation networks. Dedicated computational approaches are required to sip through large volumes of text and infer gene interactions. We propose a novel sievebased relation extraction system that uses linearchain conditional random fields and rules. Also, we introduce a new skipmention data representation to enable distant relation extraction using firstorder models. To account for a variety of relation types, multiple models are inferred. The system was applied to the BioNLP 2013 Gene Regulation Network Shared Task. Our approach was ranked first of five, with a slot error rate of 0.73.
Presentation slides.
Last Updated on Sunday, 25 August 2013 21:40
Monday, 22 July 2013 21:57
Marinka
I participated in CAMDA Satellite Meeting on critical assessment of massive data analysis during 29th and 20th July at ISMB in Berlin, where I presented our matrix factorizationbased data fusion approach to predicting druginduced liver injury from toxicogenomics data sets and circumstantial evidence from related data sources. The outcome was positive and our work has been recognized as an excellent research.
The main conference days of 21st Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) and 12th European Conference on Computational Biology (ECCB) were in Berlin, 21st to 23rd July. Overall, the meeting was enjoyable and the talks there offered novel insights from both computational and biological perspectives. As a side note, in 2014 ISMB and ECCB will be organized separately, the ISMB conference will be in July in Boston and the ECCB meeting will be in September in Strasbourg.
Here, I list some of the talks I attended at ISMB/ECCB. At some point it was difficult to pick the most interesting talk due to nine parallel sessions. Note that only the presenting authors are provided here.
First day:
 Simple topological properties predict functional misannotations in a metabolic network (J. Pinney).
 Of men ad not mice. Comparative genome analysis of human diseases and mouse models (W. Xiao).
 Integration of heterogeneous seq and omics data sets: ongoing research and development projects at CLC bio (M. Lappe). Technology track.
 System based metatranscriptomic analysis (X. Xiong).
 Integrative analysis of large scale data (M. Spivakov, S. Menon). Workshop track.
 Multitask learning for hostpathogen interactions (M. Kshirsagar).
 Integrative modelling coupled with mass spectrometrybased approaches reveals the structure and dynamics of protein assemblies (A. Politis).
 Synthetic lethality between gene defects affecting a single nonessential molecular pathway with reversible steps (I. Kupperstein).
Second day:
 KeyPathwayMiner  extracting disease specific pathways by combining omics data and biological networks (J. Baumbach). Technology track.
 Compressive genomics (M. Baym).
 Predicting drugtarget interactions using restricted Boltzmann machines (J. Zeng).
 Efficient networkguided multi locus associationmapping with graph cuts (C. Azencott).
 Differential genetic interactions of S. cerevisiae stress response pathways (P. Beltrao). Special session on dynamic interaction networks.
 Coordination of posttranslational modifications in human protein interaction networks (J. Woodsmith). Special session on dynamic interaction networks.
 Prediction and analysis of protein interaction networks (A. Valencia). Special session on dynamic interaction networks.
 Characterizing the context of human proteinprotein interactions for an improved understanding of drug mechanism of action (M. Kotlyar). Special session on dynamic interaction networks.
 GPU acceleration of bioinformatics pipeline (M. Berger and a team from NVIDIA).
Third day:
 Using the world's public big data to find novel uses for drugs (P. Bourne).
 A topdown systems biology approach to novel therapeutic strategies (P. Aloy).
 A largescale evaluation of computational protein function prediction (P. Radivojac).
 Deciphering the gene expression code via a combined synthetic computational biology approach (T. Tuller).
 Interplay of microRNAs, transcription factors and genes: linking dynamic expression changes to function (P. Nazarov).
 Visual analytics, the human back in the loop (J. Aerts).
 Turning networks into ontologies of gene function (J. Dutkowski).
 A method for integrating and ranking the evidence for biochemical pathways by mining reactions from text (S. Ananiadou).
I enjoyed the keynote talks:
 How chromatin organization and epigenetics talk with alternative splicing (G. Ast).
 Insights from sequencing thousands of human genomes (G. Abecasis).
 Sequencing based functional genomics (analysis) (L. Pachter).
 Searching for signals in sequences (G. Stormo).
 Results may vary. What is reproducible? Why do open science and who gets the credit? (C. A. Goble).
 Protein interactions in health and disease (D. Eisenberg).
It has been quite lively on Twitter as well. The official hashtag was #ISMBECCB, at some point it was even a trending hashtag on Twitter. Check the archive, tweets captured important insights from the talks and takeaway messages as well as some entertaining ideas such as the unofficial ISMB Bingo card by @jonathancairns.
Last Updated on Thursday, 25 July 2013 19:50
Monday, 15 July 2013 15:37
Marinka
This work was recognized as first prize winner for excellent research at ISMB/ECCB CAMDA 2013 Conference.
I am giving a talk at CAMDA 2013 Conference, which runs as a satellite meeting of ISMB/ECCB 2013 Conference. CAMDA focuses on challenges in the analysis of the massive data sets that are increasingly produced in several fields of the life sciences. The conference offers researchers from the computer sciences, statistics, molecular biology, and other fields a unique opportunity to benefit from a critical comparative evaluation of the latest approaches in the analysis of life sciences' “Big Data”.
Currently, the Big Data explosion is the grand challenge in life sciences. Analysing large data sets is emerging to one of the scientific key techniques in the post genomic era. Still the data analysis bottleneck prevents new biotechnologies from providing new medical and biological insights in a larger scale. This trend towards the need for analysing massive data sets is further accelerated by novel high throughput sequencing technologies and the increasing size of biomedical studies. CAMDA provides new approaches and solutions to the big data problem, presents new techniques in the field of bioinformatics, data analysis, and statistics for handling and processing large data sets. This year, CAMDA's scientific committee set up two challenges; the prediction of drug compatibility from an extremely large toxicogenomic data set, and the decoding of genomes from the Korean Personal Genome Project.
The keynote talks were given by Atul Butte from Stanford University School of Medicine and Nikolaus Rajewsky from MaxDelbrückCenter for Molecular Medicine in Berlin. Atul Butte talked about translational bioinformatics and emphasized the importance of converting molecular, clinical and epidemiological data into diagnostics and therapeutics to ease the benchtobedsize translation. Nikolaus Rajewsky presented his group work on circular RNAs and findings on RNAprotein interactions.
I was involved in the prediction of drug compatibility from an extremely large toxicogenomic data set to answer two most important questions in toxicology. We investigated whether animal studies can be replaced with in vitro assays and if liver injuries in humans can be predicted using toxicogenomics data from animals.
In this work, we demonstrate that data fusion allows us to simultaneously consider the available data for outcome prediction of druginduced liver injury. Its models can surpass accuracy of standard machine learning approaches. Our results also indicate that future prediction models should exploit circumstantial evidence from related data sources in addition to standard toxicogenomics data sets. We anticipate that efforts in data analysis have the promise to replace animal studies with in vitro assays and predict the outcome of liver injuries in humans using toxicogenomics data from animals.
Last Updated on Thursday, 09 July 2015 15:08
Tuesday, 09 July 2013 19:06
Marinka
I have spent some time recently studying matrix functions, both from theoretical and computational perspective. There is a nice book by Nick J. Higham on functions of matrices, which I highly recommend to interested reader and which provides a thorough overview of current theoretical results on matrix functions and several efficient numerical methods for computing them. Another well written text is by Rajendra Bhatia on matrix analysis (graduate texts in mathematics), which includes topics such as the theory of majorization, variational principles for eigenvalues, operator monotone and convex functions, matrix inequalities and perturbation of matrix functions. Bhatia's book is more functional analytic in spirit, whereas Higham's book focuses more on numerical linear algebra.
Below you will find a report that I produced and which contains a few interesting (some are elementary) proofs and implementations of algorithms. Interested reader should check the literature above to be able to follow the text.
Last Updated on Sunday, 25 August 2013 21:36
Monday, 01 July 2013 23:04
Marinka
I had a talk at ACAT Summer school on computational topology and topological data analysis held at University of Ljubljana.
Abstract: Fast growth in the amount of data emerging from studies across various scientific disciplines and engineering requires alternative approaches to understand large and complex data sets in order to turn data into useful knowledge. Topological methods are making an increasing contribution in revealing patterns and shapes of highdimensional data sets. Ideas, such as studying the shapes in a coordinate free ways, compressed representations and invariance to data deformations are important when one is dealing with large data sets. In this talk we consider which key concepts make topological methods appropriate for data analysis and survey some machine learning techniques proposed in the literature, which exploit them. We illustrate their utility with examples from computational biology, text classification and data visualization.
Slides (in English).
Last Updated on Thursday, 25 June 2015 14:41
Thursday, 06 June 2013 20:43
Marinka
In May I participated at the first BioDay meeting organized by Hekovnik in Ljubljana, Slovenia. The aim of the BioDay events is the exchange of ideas, knowledge and fostering collaboration and networking between life scientists, computer scientists, bioinformaticians, mathematicians and physicists.
The first event focused on recent trends in bioinformatics, specifically on experimental methods in systems biology (by Spela Baebler, PhD) and biomedical data fusion. I presented the latter topic and discussed how heterogeneous data sources in biology can be collectively mined by data fusion. The video of the event is available at video.hekovnik.com/bioday_trendi_v_bioinformatiki (in Slovene). Enjoy!
Last Updated on Sunday, 25 August 2013 21:33
Tuesday, 23 April 2013 16:32
Marinka
I have recently participated in BioNLP Shared Task 2013 Challenge together with Slavko Zitnik and won the first place in the task extracting gene regulation networks.
The goal of the challenge was to assess the performance of information extraction systems to extract a gene regulation network of a specific cellular function in Bacillus Subutilis. This function was sporulation and is related to the adaptation of bacteria to scarce resource conditions. The automatic reconstruction of gene regulation networks is of great importance in biology, because it furthers the understanding of cellular regulation systems.
We were provided a manually curated annotation of the training corpus including entities, events and relations with gene interactions. Also, the regulation network that can be reconstructed with interactions mentioned in sentences of training data was provided (picture on the right). The task required to estimate gene regulation network from test data by specifying a directed graph where vertices represent genes, and arcs represent interactions between genes extracted from the text. The arcs were labeled with an interaction type (e.g., inhibition, activation, binding, transcription).
We hope to describe our approach using conditional random fields and rules in a paper but the details are not public yet (stay tuned).
P.S. I have been accepted to Machine Learning Summer School (MLSS) 2013 (acceptance rate 26%) that will take place at Max Planck Institute for Intelligent Systems, Tubingen, Germany late in August this year. There is a list of highly acclaimed speakers and I am looking forward to it!
Last Updated on Sunday, 25 August 2013 21:40
Sunday, 17 March 2013 00:30
Marinka
Some of the ACM XRDS Editors are participating these days in a meeting to discuss the magazine's future direction in print and online. We will do our best to further promote the XRDS, enhance its departments, improve web presence, build a community of readers, and provide high quality content from various CS disciplines.
See the current ACM XRDS issue on Information and Communication Technologies and Development (ICTD). The Spring issue is coming very soon! It will be on Scientific Computing. And, the Summer issue will be on Computer Science and Creativity.
If you are interested in submitting an article or have some crazy good ideas, please share them with us (contact the editorial staff). Or, share with your colleagues any interesting columns/featured articles you read in ACM XRDS.
XRDS is the ACM's flagship magazine for students, established in 1994. It is published quarterly and invites submissions of high quality articles of interest to computer science students (from Editorial Calendar).
See http://xrds.acm.org.
Last Updated on Thursday, 09 July 2015 15:09
Wednesday, 05 December 2012 20:35
Marinka
In first week of December U of Ljubljana celebrates traditional "Week of University" (Why?) during which numerous invited lectures, presentations and award ceremonies are organized.
This year, I was awarded the faculty award for best students and Preseren Award of U of Ljubljana for thesis "A Matrix Factorization Approach for Inference of Prediction Models from Heterogeneous Data Sources" (slo: univerzitetna Prešernova nagrada za delo "Pristop matrične faktorizacije za gradnjo napovednih modelov iz heterogenih podatkovnih virov"). I would like to thank my supervisor and mentor Prof. dr. Blaž Zupan for encouragement and advice he provides throughout my time as his student. I am lucky to have a supervisor who cares so much about my work and responds to my questions promptly. I could not have won the award without his support and mentoring.
Last Updated on Sunday, 30 March 2014 16:37
Monday, 26 November 2012 22:28
Marinka
Recent days at Department of Computing, Imperial College London, were pleasant (though intense) and our efforts in data fusion produced some very good results.
Below are images taken at Imperial and nearby Chrome Web Lab, located in Science Museum. More about Google Chrome Web Lab experiment.
Last Updated on Sunday, 25 August 2013 21:31
Saturday, 17 November 2012 22:29
Marinka
I have just arrived to London, United Kingdom, where I will stay until the end of November this year. I will be working at Imperial College London, Department of Computing, Computational Network Biology Research Group led by Prof. Dr. Natasa Przulj. For this great opportunity I need to thank to my supervisor Prof. Dr. Blaz Zupan, Head of Bioinformatics Laboratory at UofLj.
My work here will be about network integration for disease classification, specifically inferring prediction models from heterogenous data sources through matrix factorization. More about it in the next days. For now you can check the Interactive map of the Diseasome (Below is an image showing a part of diseasome. Interested reader is referred to Barabasi's paper The Human Disease Network.) linked from the NYTimes article Redefining Disease, Genes and All.
I am very much looking forward to it :)
Last Updated on Sunday, 25 August 2013 21:40
Tuesday, 28 August 2012 00:00
Marinka
The 13th international conference on systems biology was held in Toronto, 19th23rd August 2012. Here is a list of talks from platform sessions which I found especially interesting:
 Modeling the regulatory diversity of human cancers (S. Nelander)
 Tissue specific modeling of functional genomics data: from networks to understanding human disease (O. Troyanskaya)
 An evaluation of methods for the modeling of transcription factor sequence specificity (M. T. Weirauch)
 SEEK and find: data management for systems biology projects (O. Krebs)
 Excerbt: nextgeneration knowledge extraction and hypothesis generation from massive amounts of biomedical literature (B. Wachinger)
 Combining multiple biological domains using patient network fusion (B. Wang)
 Combining many interaction networks to predict gene function and analyze gene lists (Q. Morris)
 Assembling global maps of cellular function through integrative analysis of physical and genetic networks (R. K. Srivas)
 iCAVE: immersive 3d visualization of biomolecular interaction network (Z. Gumus)
 Systemslevel insights from the global yeast genetic interaction network (C. Myers)
 Monopoly systems edition: advance to GO collect $200 (T. Idekar) (*actually about NeXO, a network extracted ontology and functional enrichment)
 Genomescale metabolic models: a bridge between bioinformatics and systems biology (J. Nielsen)
The organizers came up with a nice social program, parts of it is depicted on images below. At opening ceremony Tanja Tagaq, an Inuit woman, performed a unique style of traditional throat singing, Amanda and Rasmus from Sweden made performance at first poster session, Serena Ryder entertained us at conference reception dinner. Shonen Knife, a Japanese punk band that opened Nirvana, played at second poster session at Hart House.
I attended workshops on Designing experiments using state of the art Bayesian global parameter search methodology (M. Goldstein), Introduction to the statistical inference or regulatory networks (F. EmmertStreib), Imaging flow cytometry: a new view on systems biology (R. DeMarco). In addition to parallel sessions I also enjoyed special lectures and plenary sessions. A few of them are: Reading and writing omes (G. Church), Towards unification of genetic and hierarchy models of tumor heterogeneity (J. Dick), Interactome networks and human disease (M. Vidal), The genetics of individuals (B. Lehner), Synthetic genetic interaction analysis by highthroughput imaging to map cellular networks, Unraveling principles of gene regulation using thousands of designed promotor sequences (E. Segal), Systems biology applications of imaging flow cytometry (T. Galitski).
Last Updated on Sunday, 25 August 2013 21:41
Monday, 10 September 2012 20:25
Marinka
So what have I been up to in recent weeks here at Toronto? Highlights include my first ride with famous American yellow school bus to a reception at ICSB12 conference, some sightseeing in Toronto city and a trip to Niagara Falls.
Besides, I have finished with data analysis of realtime yeast S. cerevisiae microscopy screens, an idea about it can be captured here. I am now starting with time series analysis and will probably have time to work on integration of phenomics data with genetic interaction and protein interaction data.
Recently a quantum optics research group here at UofT demonstrated a violation of Heisenberg's uncertainty principle and I was really excited about their work. "The quantum world is still full of uncertainty, but at least our attempts to look at it don't have to add as much uncertainty as we used to think!" ... and an easy reading to motivate you to learn more.
I have also come upon a nice realworld (I do not like this term) implementation of an argument based machine learning offered through classification module in CellProfiler Analyst package, participated in a discussion about Gaussian processes (intro, notes) at ccbrstats meeting and much more. The Lab organized a farewell lunch for summer students only two weeks after my arrival to Toronto, as here and in US classes have already begun (after the Labour Day), I considered it as a welcome event :)
Below are images of Toronto CN Tower, Niagara Falls as seen from Skylon Tower and squirrels at UofT campus (Yes, one cannot miss numerous squirrels playing in parks at campus. A careful look should reveal four of them.), respectively.
Last Updated on Sunday, 25 August 2013 21:42
Sunday, 19 August 2012 18:15
Marinka
In the past few days I have settled in Toronto, Canada, where I will stay until October this year. As a graduate student I will be working at the University of Toronto, Terrence Donnelly Centre for Cellular and Biomolecular Research in the Charlie Boone's Lab.
My work will be mostly data analysis of S. cerevisiae screens by employing various statistical and machine learning methods to gain new knowledge about identification of yeast mutant strains with nonWT phenotype. Possibly I will also work on timeseries analysis of actin patches in yeast cells to differentiate them. First impressions are great, I have already met some great people and am looking forward to meet some at the International Conference on Systems Biology (ICSB12), which is held in Toronto in the next week and have a fortunate opportunity to attend.
Last Updated on Sunday, 30 March 2014 17:21
Wednesday, 08 August 2012 11:06
Marinka
Professor of Computer Science Dick Lipton from Georgia Tech has a few days ago posted an interesting post on his blog about a new way of solving systems of linear equations over prime fields. Solving linear systems of equations is a very well explored area and a new approach came as a pleasant surprise to me. The author of the proposed method is Prasad Raghavendra from Berekeley. He wrote a draft paper with the algorithm and the analysis in which he proves the soundness and completeness of the randomized algorithm he had proposed. The proof is not very complicated, you need to be familiar with some probabilistic bounds, specifically a generalization of the Bernstein inequality, the Hoeffding inequality, and foundations of finite fields.
Solving linear systems of equations is indeed an ancient problem and a new approach is therefore warmly welcomed. As early as 200 BC the Chinese has devised a clever method for solving 2by2 systems and there is evidence that they had developed a method essentially equivalent to Gaussian elimination (2000 years before Gauss came up with it!). Then Gabriel Cramer has in 1750 published a famous and simple O(n^3) rule for solving system of linear equations by computing matrix determinants and cofactors. Later in 1810, Carl Friedrich Gauss became interested in discovering the orbit of Pallas, the secondlargest asteroid of the solar system. His work led him to a system of six linear equations in six unknowns. Gauss invented the method of Gaussian elimination which is still used today. The method consists of performing rowoperations on the coefficient matrix to obtain an equivalent system of equations whose coefficient matrix is uppertriangular. This means that the last equation will involve only one unknown and can be easily solved. Substituting that solution into the secondtolast equation, one can then solve for another unknown. Gauss' work was continued by Wilhelm Jordan who convert a given system of linear equations to a triangular system which can be reduced to a diagonal one and then solved directly (GaussJordan method)  LU decomposition. We now have faster methods for general systems based on discoveries of Volker Strassen and almost linear time algorithms which require systems to have special structure.
Most readers probably know methods mentioned above very well, so let us do not spend any time discussing them. So, what is the trick in the new method? Prasad starts with the set of random vectors, V0. We expect that about half of them will satisfy the first linear equation. He then throws away all vectors that do not satisfy the first equation and call the remaining set S1. The idea is successively applied to further equations until we reach the set Sm. Vectors in Sm will then satisfy all linear equations and represent the solutions to the system. The problem with this winnowingdown process is that with high probability the set Si winnows down to zero before all equations in the system are satisfied. An exception is if we start with a set V0 of exponential size, but this is obviously too much.
Prasad came up with a brilliant answer that indeed we do not need a large initial V0 (he proved it can be of linear order in the number of variables) and still have a high probability of getting a solution to the complete system of linear equations if the system is feasible. The trick is to use an amplifier and a property of finite fields. Remember, the method is for systems of equations over finite field and the exploited property limits its usage. Over finite field with characteristics p, the trick is to form Vi (being of the same size as V0) as sums of random p+1 vectors from Si. Namely, summing p+1 solutions to the first p equations will give in finite field a (new) solution to the first p equations and possibly solve also (p+1)th solution. See the draft paper for the details.
The new algorithm is easy to implement. Recently I am becoming quite a fan of Clojure, so I decided to implement the algorithm in this popular functional programming language that runs on JVM. I have always been very fond of functional programming languages (examples include Haskell, OCaml, F#, Lisp, ML, Erlang, Scheme etc.) and I am glad functional paradigm is gaining popularity also in "realworld" problems (I do not like this term, but ...). Functional programming is not more difficult than object oriented programming, it is just different. It is a style of programming that emphasizes firstclass functions that are pure and is inspired by ideas from lambda calculus. I acknowledge that Matlab would be the most appropriate for this problem and imperative language implementation more understandable but it is more fun, isn't it :) ?
(ns clj_concurrency.lssolver)
(declare n m q N)
(defn randomints [X] (conj X (repeatedly n #(randint q))))
(defn modreduce [X] (map #(map (fn [x] (mod x q)) %1) X))
(defn satisfy? [Ax Sx bx]
(= (mod (apply + (map * Ax Sx)) q) bx))
(defn select [S Ai bi]
(def T #{})
(doseq [Sx S :when (satisfy? Ai Sx bi)]
(def T (conj T Sx)))
(into #{} T))
(defn recombine [T]
(def R #{})
(dotimes [_ N]
(let [Q (take (+ q 1) (shuffle T))]
(def R (conj R (reduce #(map + %1 %2) Q)))))
(into #{} R))
(defn solve [A b]
(def S (apply concat (map randomints (repeat N []))))
(dotimes [i m]
(let [T (select S (nth A i) (nth b i))]
(when (empty? T)
(throw (IllegalArgumentException. "System infeasible, T.")))
(def S (recombine T))))
(if (empty? S)
(throw (IllegalArgumentException. "System infeasible, S.")))
(do
(println "")
(println "Solution" (distinct (modreduce S)))))
; toy example 1
(def n 3) ;number of variables
(def m 3) ;number of equations
(def q 3) ;finite field characteristics
(def N (* 30 n)) ;sampling size
(solve [[2 1 1] [1 1 1] [1 2 1]] [1 0 0]) ;> (1 0 2)
; toy example 2
(def q 5)
(def m 3)
(def n 3)
(def N (* 30 n))
(solve [[1 1 1] [2 3 2] [1 3 4]] [1 4 4]) ;> (1 2 3)
Last Updated on Wednesday, 30 January 2013 23:37
Saturday, 21 July 2012 19:05
Marinka
Yesterday I defended my undergraduate Thesis entitled A Matrix Factorization Approach for Inference of Prediction Models from Heterogeneous Data Sources (slo: Pristop matrične faktorizacije za gradnjo napovednih modelov iz heterogenih podatkovnih virov) at University of Ljubljana, Faculty of Computer and Information Science and Faculty of Mathematics and Physics.
With that Thesis I completed a four year interdisciplinary university program consisting of eight semesters of lectures and one year of Diploma thesis work in less than four years. Diploma leads to the degree of University dipl.ing. of Computer Science and Mathematics. I graduated summa cum laude with the average grade being 10.0 out of 10.0.
My brother Slavko took some pictures and shot the defense. The movie and pictures are available at zitnik.si site.
Abstract
Today we are witnessing rapid growth of data both in quantity and variety in all areas of human endeavour. Integrative treatment of these sources of information is a major challenge. We propose a new computation framework for inference of prediction models based on symmetric penalized matrix trifactorization and intermediate strategy for data integration. Major advantages of the approach are an elegant mathematical formulation of the problem, an integration of any kind of data that can be expressed in matrix form, and high predictive accuracy.
We tested the effectiveness of the proposed framework on predicting gene annotations of social amoebae D. dictyostelium. The developed model integrates gene expressions, proteinprotein interactions and known gene annotations. The model achieves higher accuracy than standard techniques of early and late integration, which combine inputs and predictions, respectively, and have in the past been favourably reported for their accuracy.
With the proposed approach we have also predicted that there is a set of genes of D. dictyostelium that may have a role in bacterial resistance and which were previously not associated with this function. Until now, only a handful of genes were known to participate in related bacterial recognition pathways. Expanding the list of such genes is crucial in the studies of mechanisms for bacterial resistance and can contribute to the research in development of alternative antibacterial therapy. Our predictions were experimentally confirmed in wetlab experiments at the collaborating institution (Baylor College of Medicine, Houston, USA).
Note: Complete Thesis will be available for download after some aspects of the proposed approach will be presented in the scientific journal.
I am looking forward to starting my PhD studies in the autumn. Until then ... many new interesting projects ...
Last Updated on Sunday, 30 March 2014 16:36
Monday, 04 June 2012 22:23
Marinka
I have been following the Cryptography and Theory of Coding 2 course this semester (summer 12).
During the course I have been working on the project about Commitment schemes. Please find below attached the final report produced and a short presentation. Additionally, some homework solutions are attached as well. The language is Slovene, but it is mostly pure math with proofs, so it should not be too difficult to capture the main idea.
Please note these are reports and have not been subject to the usual scrutiny of published papers.
Last Updated on Wednesday, 30 January 2013 23:42
Tuesday, 14 February 2012 16:20
Marinka
XRDS, Crossroads is the flagship academic magazine for student members of the ACM. It is published quarterly, both in print and online for ACM student members. The publication was previously named Crossroads and has been running since 1994, edited and maintained voluntarily by students. The magazine is distributed to tens of thousands of students worldwide.
Issues focus on Computer Science and include various articles from highlights in new and important research to roundtable discussions, interviews and introductory overviews. In the magazine are published exciting pieces written by top leaders in the field as well as students who submit unsolicited work.
I highly recommend reading the issues! Here is the current one.
I am writing this since I have been selected as the Department Editor for this magazine as January 2012. I am looking forward to contribute to this great platform.
Last Updated on Wednesday, 30 January 2013 23:37
Monday, 16 January 2012 23:12
Marinka
I followed a Machine Learning course this semester at the university (along with the one, offered by Stanford Uni). I have been working on a seminar studying Reliable probability estimation in classification using calibration. The final report is in the form of the scientific article, which is attached below.
Here is the abstract:
Estimating reliable class membership probabilities is of vital importance for many applications in data mining in which classification results are combined with other sources of information to produce decisions. Other sources include domain knowledge, outputs of other classifiers or exampledependent misclassification costs. We revisit the problem of classification calibration motivated by the issues of the isotonic regression and binning calibration. These methods can behave badly on small or noisy calibration sets, producing inappropriate intervals or boundary generalization. We propose an improvement of the calibration with isotonic regression and binning method by using bootstrapping technique, named bootisotonic regression and bootbinning, respectively. Confidence intervals obtained by repeatedly calibrating the set sampled with replacement from the original training set are used for merging unreliable or too narrow calibration intervals. This method has been experimentally evaluated with respect to two calibration measures, several classification methods and several problem domains. The results show that the new method outperforms the basic isotonic regression and binning methods in most configurations.
Short presentation about work and final report:
Last Updated on Sunday, 25 August 2013 21:31
Tuesday, 27 December 2011 18:34
Marinka
In November 2011, a show called Ugriznimo znanost (Bite the Science) has been shot at the Faculty of Computer and Information Science by the national public broadcasting organization (RTVSLO).
Bite the Science is a weekly series on the educational channel TV Slovenia. It is the series which aims to explain the science in a relaxed and humorous manner. The topics range from current events in science to accomplishments of Slovene scientists and scientific research in other parts of the world.
The episode scheduled for the last week of December has discussed programming. In it I am presenting my achievements at the Google Summer of Code program (Nimfa library) and discuss Orange, opensource data mining and visualization tool developed by Biolab, Bioinfomatics Laboratory. Further, researchers talk in the episode about smart house, Faculty Robot League, summer school of programming and mathematical data analysis used for predicting the results of the Eurosong competition.
Take a look (link to the episode is below)!
Relevant links:
Last Updated on Wednesday, 30 January 2013 23:38
Saturday, 17 December 2011 01:26
Marinka
Traditionally Week of The University in Ljubljana is organized in the beginning of December, during which numerous ceremonial events are held at the university and it's faculty members.
This year I was awarded Best Student award of University for overall performance in courses (slo. Svečana listina za izjemen študijski uspeh Univerze Ljubljana) in addition to faculty award which I have received in last years. The reception at the University was well organized with few music intermezzos between speeches.
Nevertheless, new challenges are on their way!
Last Updated on Wednesday, 30 January 2013 23:36
Thursday, 17 November 2011 00:07
Marinka
In this short post I would like to update you on some recent changes in the MF  Matrix Factorization Techniques for Data Mining Library.
Recently, the library MF  Matrix Factorization Techniques for Data Mining has been moved and renamed. It is now called Nimfa  A Python Library for Nonnegative Matrix Factorization or short, nimfa. The latest version with documentation, link to source code and working examples can be found at nimfa.biolab.si. Those of you still having old links, are kindly requested to update the new information  you will be automatically redirected to the new site if you visit old page.
Hope you like the new name.
Enjoy.
Last Updated on Sunday, 25 August 2013 21:36
Saturday, 24 September 2011 13:41
Marinka
I have always been fascinated by the world of fractals and have been deeply enthusiastic exploring the maths behind them. This post is announcing the support of the fractal dimension computation in the MF  Matrix Factorization for Data Mining library.
In the following paragraphs we shortly revise the most important concepts and definitions of the fractal dimension.
The embedding dimensionality of a data set is the number of attributes in the data set. The intrinsic dimensionality is defined as the actual number of dimensions in which the mdimensional original vectors can be embedded under the assumption that some distance in the reduced space is kept among them. Given a fractal as a selfsimilar set of points with selfsimilar pieces, where each is scaled down by a factor of , the fractal dimension of the object is defined as
.
Example: Sierpinski triangle (My seminar work for CG on Lsystems  figure 1 in Appendix of the Report document) consists of three selfsimilar parts and each is scaled down by a factor of two, therefore its fractal dimension is
For the finite set of points in a vector space, we say the set is statistically selfsimilar on a range of scaleson which the selfsimilarity is true. However in the theory selfsimilar object should have infinitely many points because each selfsimilar part is a scaleddown version of the original object. As a measure of the intrinsic fractal dimension of a data set, the slope of the correlation integral is used. The correlation integral for the data set is defined as
Given a data set which is statistically selfsimilar in the range, its correlation fractal dimension is
It has been shown that the correlation fractal dimension corresponds to the intrinsic dimension of a data set. Many properties hold for the correlation fractal dimension, see [1] and [2]. For us it is especially important, that the intrinsic dimensionality gives a lower bound on the number on attributes needed to keep the vital characteristics of the data set.
A fast algorithm for the computation of the intrinsic dimension of a data set presented in [2] is implemented in the MF  Matrix Factorization for Data Mining library. Intuitive explanation of the correlation fractal dimension is that is measures how the number of neighbor points increases with the increase of the distance. It therefore measures the spread of the data and the fractal dimension equal to the embedding dimension means that the spread of the points in the data set is maximum.
Of high importance is a Conjecture 1 in [1]: With all the parameters being equal, a dimensionality reduction method which achieves higher fractal dimension in the reduced space is better than the rest for any data mining task. Therefore correlation fractal dimension of a data set can be used:
 for determining the optimal number of dimensions in the reduced space,
 as a performance comparison tool between dimensionality reduction methods,
and all this can be done in way that is scalable to large data sets.
Recommended reading:
 Kumaraswamy, K., (2003). Fractal Dimension for Data Mining. Carnegie Mellon University.
 Jr, C. T., Traina, A., Wu, L., Faloutsos, C., (2010). Fast Feature Selection using Fractal Dimension. Science, 1(1), 316.
In [1] a concept of intrinsic fractal dimension of a data set is introduced and it is shown how fractal dimension can be used to aid in several data mining tasks. In [2] a fast O(n) algorithm to compute fractal dimension of a data set is presented. On top of that a fast, scalable algorithm to quickly select the most important attributes of the given set of ndimensional vectors is described.
Last Updated on Sunday, 25 August 2013 21:36
Friday, 26 August 2011 15:09
Marinka
Google Summer of Code 2011 has finished. On 22th of August it was firm "pencils down" date and today, on 26th of August, has been final evaluation deadline. Therefore, it is time for a small review to be published here on my blog.
I successfully completed the program and have met all the goals, outlined in the original project plan with some (2) additional factorization methods I have implemented. I have been very satisfied with the support and mentoring of both the organization and mentor.
The project, I have worked on, has been developing library MF  Matrix Factorization Techniques for Data Mining which includes a number of published matrix factorization algorithms, initialization methods, quality and performance measures and facilitates the combination of these to produce new strategies. The library contains examples of usage, applications of factorization methods on both synthetic and real world data sets are provided.
Matrix factorization methods have been shown to be a useful decomposition for multivariate data as low dimensional data representations are crucial to numerous applications in statistics, signal processing and machine learning.
An incomplete list of applications of matrix factorization methods includes:
 bioinformatics,
 environmetrics and chemometrics,
 image processing and computer graphics,
 text analysis,
 miscelllaneous, such as extracting speech features, transcription of polyphonic music passages, object characterization, spectral data analysis, multiway clustering, learning sound dictionaries, etc.
Example using synthetic data set is intended as demonstration of the MF library since all currently implemented factorization algorithms with different initialization methods and specific settings are ran. Others include applications on real world data sets in:
 bioinformatics,
 text analysis,
 image processing,
 recommendation systems.
I will outline only the most important content of the MF library here (for any details refer to documentation (or code)), as this is project review and not library reference (references to articles are provided in the documentation).
 Matrix Factorization Methods
 BD  Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
 BMF  Binary matrix factorization [Zhang2007]
 ICM  Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
 LFNMF  Fisher nonnegative matrix factorization for learning Local features [Wang2004], [Li2001]
 LSNMF  Alternative nonnegative least squares matrix factorization using projected gradient method for subproblems [Lin2007]
 NMF  Standard nonnegative matrix factorization with Euclidean / KullbackLeibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
 NSNMF  Nonsmooth nonnegative matrix factorization [Montano2006]
 PMF  Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
 PSMF  Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
 SNMF  (SNMF/L, SNMF/R) Sparse nonnegative matrix factorization based on alternating nonnegativity constrained least squares [Park2007]
 SNMNMF  Sparse network regularized multiple nonnegative matrix factorization [Zhang2011]
 Quality and Performance Measures
 Distance
 Residuals
 Connectivity matrix
 Consensus matrix
 Entropy of the fitted NMF model [Park2007]
 Dominant basis components computation
 Explained variance
 Feature score computation representing its specificity to basis vectors [Park2007]
 Computation of most basis specific features for basis vectors [Park2007]
 Purity [Park2007]
 Residual sum of squares  can be used for rank estimate [Hutchins2008], [Frigyesi2008]
 Sparseness [Hoyer2004]
 Cophenetic correlation coefficient of consensus matrix  can be used for rank estimate [Brunet2004]
 Dispersion [Park2007]
 Selected matrix factorization method specific
 Fitted factorization model tracker across multiple runs
 Residuals tracker across multiple factorizations / runs
 Different factorization models
Relevant links:
Join the GSoC next year! It is a great opportunity to spend the summer learning something new and having fun at the same time.
Last Updated on Sunday, 25 August 2013 21:36
Thursday, 04 August 2011 16:50
Marinka
It has been some time since my last post, but here is the new one. Perhaps the title sounds a bit inappropriate, but indeed it is well suited. Read till the end, where I explain it for those not figuring it yet (or consider it a puzzle :))
So, what have I been up to lately? Despite summer holidays I have been involved in quite a few projects.
First, GSoC Matrix Factorization Techniques for Data Mining project for Orange has been progressing well. Code is almost finished, no major changes in framework, factorization/initialization methods, quality measures, etc. are expected. Project is on schedule and has not diverged from initial plan, all intended techniques (plus a few additional I have found interesting along research) are implemented. I have been doing some testing, and have yet to provide more use cases/examples along with thorough explanation and example data sets. I will not go into details here, as implemented methods' descriptions with paper references are published at Orange wiki project site. The project is great, a mix of linear algebra, optimization methods, statistics and probability, numerical methods (analysis if you want to read some convergence or derivation proofs) with intensive applications in data mining, machine learning, computer vision, bioinformatics etc. and I have been really enjoying working on it, here is my post at Orange blog. The Orange and its GSoCers have been spotlighted at Google Open Source Blog.
Next, there is some image processing; segmentation, primary and secondary object detection, object tracking, morphology measures, filters etc. (no details).
Minor for keeping contact with MS world, Sharepoint Server 2010 (SP 10). I have some experience with it (and its previous version MOSS 2007), both in administration and especially in code. This time it was not about coding workflows using Win Workflow Foundation, developing Web parts/sites/custom content types/web services (...) but providing an insite publishing hierarchy for data in custom lists and integration with Excel Services (not with new 365 Cloud service). Obstacles were limited server access (hosting plan), old versions of software and usual MS stuffs (:)). In SP 10 these are SPFieldLookups filters and cascading lookups, data connections between sites/lists/other content. As always there are some nice workarounds which have resolved all issues.
Last (not least) I have been catching up with all the reading material I was forced to put aside during the year (well not entirely true: the more I read, the more should be read, so the pile of papers in iBooks and Mendeley app is not getting any smaller :)).
Here we are, what about the post's title? The sunrise problem was introduced by Laplace (french mathematician known for Bayesian interpretation of probability, Laplace transform, Laplace equation, differential operator, work in mechanics and physics). Is the probability that the sun will rise tomorrow 1 if we can infer from the observed data that is has risen every day on record? :) So what is the answer of the question in the title? The inferred probability depends on the record  whether we take the past experience of one person, humanity, or the Earth history. This is the reference class problem  with Bayes any probability is the conditional probability given what a person knows. Simple principle emerged from this, addone or Laplacian smoothing (Example: Doing spam email classification with a bag of words model or text classification with multinomial model, this allows the assignment of positive probabilities to words which do not occur in the sample) and corresponds to the expected value of the posterior.
Last Updated on Wednesday, 30 January 2013 23:42

