Marinka Zitnik

Fusing bits and DNA


Extracting Gene Regulation Networks Using Linear-Chain Conditional Random Fields and Rules @ACL 2013, BioNLP Workshop

This week Slavko Zitnik will present our paper (he is the first author) at the ACL BioNLP Workshop. The paper extends linear-chain conditional random fields (CRF) with skip-mentions to extract gene regulatory networks from biomedical literature, and describes a sieve-based system architecture: a complete data-processing pipeline that includes data preparation, linear-chain CRF and rule-based relation detection, and data cleaning.

Published literature in molecular genetics may collectively provide much information on gene regulation networks. Dedicated computational approaches are required to sift through large volumes of text and infer gene interactions. We propose a novel sieve-based relation extraction system that uses linear-chain conditional random fields and rules. Also, we introduce a new skip-mention data representation to enable distant relation extraction using first-order models. To account for a variety of relation types, multiple models are inferred. The system was applied to the BioNLP 2013 Gene Regulation Network Shared Task. Our approach was ranked first of five, with a slot error rate of 0.73.
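The paper's actual features and models are more involved, but the core trick behind skip-mentions can be illustrated with a toy sketch (the function name and the representation below are hypothetical, not taken from the paper): a first-order linear-chain model only relates adjacent items, so mentions that are k positions apart are made adjacent by building new sequences that keep every (k+1)-th mention.

```python
# Hypothetical sketch of the skip-mention idea: to let a first-order
# (linear-chain) model connect mentions that are k positions apart,
# build the k-skip subsequences in which those mentions become adjacent.
def skip_mention_sequences(mentions, k):
    """Return the k-skip sequences over a list of mentions.

    For k = 1 and mentions [A, B, C, D] this yields [A, C] and [B, D],
    so the distant pairs (A, C) and (B, D) become adjacent pairs that a
    linear-chain CRF can score.
    """
    step = k + 1
    return [mentions[start::step] for start in range(step)]

# Example: distance-1 sequences over four mentions.
print(skip_mention_sequences(["gene1", "binds", "gene2", "activates"], 1))
# -> [['gene1', 'gene2'], ['binds', 'activates']]
```

A separate model can then be trained per skip distance, which matches the paper's strategy of inferring multiple models for different relation types and distances.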

Presentation slides.

 

Recent Invited Talks & Events

 

GSoC & Orange: Matrix Factorization Techniques for Data Mining

This year I have applied for Google Summer of Code, namely with the Orange project.

We'll see if I am accepted. :)

Update 25.04.2011: Google has announced the results. My proposal has been accepted and I am looking forward to starting work. :)

Some links to articles in Slovenian news:

Project title: Matrix Factorization Techniques for Data Mining

Description: Matrix factorization is a fundamental building block for many of current data mining approaches and factorization techniques are widely used in applications of data mining. Our objective is to provide the Orange community with a unified and efficient interface to matrix factorization algorithms and methods. For that purpose we will develop a scripting library which will include a number of published factorization algorithms and initialization methods and will facilitate the combination of these to produce new strategies. Extensive documentation with working examples that will demonstrate real applications, commonly used benchmark data and visualization methods will be provided to help with the interpretation and comprehension of the results.

Main factorization techniques and their variations planned to be included in the library are: Bayesian decomposition (BD) together with linearly constrained and variational BD using Gibbs sampling, probabilistic matrix factorization (PMF), Bayesian factor regression modeling (BFRM), and the family of nonnegative matrix factorizations (NMF), including sparse NMF, non-smooth NMF, local factorization with Fisher NMF, and least-squares NMF. Different multiplicative update algorithms for NMF, which minimize the LS error or the generalized KL divergence, will be analyzed. Further, nonnegative matrix approximations (NNMA) with extensions will be implemented. For completeness, algorithms such as NCA, ICA and PCA could be added to the library.
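As a flavor of what such a library computes, here is a minimal sketch (my own illustration, not the library's API) of NMF with the classic Lee-Seung multiplicative updates that minimize the least-squares error ||V - WH||²:

```python
import numpy as np

# Minimal sketch of NMF via Lee-Seung multiplicative updates minimizing
# the Frobenius (LS) error ||V - W H||_F^2. Starting from positive W, H,
# the multiplicative factors are nonnegative, so W and H stay nonnegative.
def nmf(V, rank, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update W with H fixed
    return W, H

V = np.random.default_rng(1).random((6, 5))
W, H = nmf(V, rank=2)
print(np.linalg.norm(V - W @ H))  # reconstruction error shrinks over iterations
```

The KL-divergence variant mentioned above uses different multiplicative factors but the same alternating structure; initialization methods (random, NNDSVD, etc.) plug in where the random start is.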

Here is the proposal document.

 

Mandelbrot Set & R Language

I've been following a course on Statistical Aspects of Data Mining lately, which is not what I will write about, but this article got its inspiration from it. The software environment used in this course is the R programming language, which is used for statistical computing and graphics (it is available for Windows, Linux and Mac as part of the GNU project). If you download it from R's website, you get it with the command-line interpreter; of course there are some IDEs as well, such as Rcmdr or Tinn-R. The capabilities of R are extended with numerous user-submitted packages; for the animation of the Mandelbrot set at least the following libraries are needed: spam, fields, bitops, caTools - all freely available at R's website. R is influenced by S and Scheme, but I won't go into details, as there is plenty of information about it on the web.

I tried to draw the classic Mandelbrot set (the basic code for it is available here), which is just iterating the formula z = z^2 + c, where c is a complex parameter, starting at z = 0. The Mandelbrot set is defined as the set of all points c for which the sequence obtained by this iteration does not escape to infinity. Some of the set's properties are local connectivity, self-similarity, correspondence with the structure of Julia sets, etc. A very simple formula that gives fascinating results. In the R animation you can observe the main cardioid, the period bulbs, and hyperbolic components.
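The post's animation is written in R, but the escape-time iteration itself is tiny in any language; here is a Python sketch of the same idea (a point is kept if |z| never exceeds the escape radius 2):

```python
# Escape-time test for the Mandelbrot set: iterate z = z^2 + c from z = 0
# and report the iteration at which |z| exceeds 2 (a proven escape radius),
# or max_iter if the orbit never escapes, i.e. c is taken to be in the set.
def mandelbrot_escape(c, max_iter=100):
    z = 0
    for n in range(max_iter):
        if abs(z) > 2:
            return n
        z = z * z + c
    return max_iter

# c = 0 lies in the main cardioid, so its orbit never escapes;
# c = -1 lies in the period-2 bulb (the orbit cycles 0, -1, 0, -1, ...).
print(mandelbrot_escape(0))   # -> 100
print(mandelbrot_escape(1))   # -> 3 (escapes quickly)
```

Coloring each pixel of the complex plane by this escape count is what produces the familiar pictures, including the cardioid and bulbs mentioned above.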

Classic Mandelbrot Set

 

What is the probability that the sun will rise tomorrow?

It has been some time since my last post, but here is a new one. Perhaps the title sounds a bit inappropriate, but it is actually well suited. Read to the end, where I explain it for those who have not figured it out yet (or treat it as a puzzle :)).

So, what have I been up to lately? Despite summer holidays I have been involved in quite a few projects.

First, the GSoC project Matrix Factorization Techniques for Data Mining for Orange has been progressing well. The code is almost finished; no major changes to the framework, factorization/initialization methods, quality measures, etc. are expected. The project is on schedule and has not diverged from the initial plan: all intended techniques (plus a few additional ones I found interesting along the way) are implemented. I have been doing some testing, and have yet to provide more use cases/examples along with thorough explanations and example data sets. I will not go into details here, as descriptions of the implemented methods, with paper references, are published at the Orange wiki project site. The project is great: a mix of linear algebra, optimization methods, statistics and probability, and numerical methods (analysis, if you want to read some convergence or derivation proofs), with intensive applications in data mining, machine learning, computer vision, bioinformatics, etc., and I have really enjoyed working on it; here is my post at the Orange blog. Orange and its GSoCers have been spotlighted at the Google Open Source Blog.

Next, there has been some image processing: segmentation, primary and secondary object detection, object tracking, morphology measures, filters, etc. (no details).

A minor item, to keep in touch with the MS world: SharePoint Server 2010 (SP 2010). I have some experience with it (and its previous version, MOSS 2007), both in administration and especially in code. This time it was not about coding workflows with Windows Workflow Foundation or developing web parts/sites/custom content types/web services (...), but about providing an in-site publishing hierarchy for data in custom lists and integration with Excel Services (not with the new 365 cloud service). The obstacles were limited server access (hosting plan), old versions of software, and the usual MS quirks (:)). In SP 2010 these are SPFieldLookup filters and cascading lookups, and data connections between sites/lists/other content. As always, some nice workarounds resolved all the issues.

Last (but not least), I have been catching up with all the reading material I was forced to put aside during the year (well, not entirely true: the more I read, the more there is to read, so the pile of papers in the iBooks and Mendeley apps is not getting any smaller :)).

Here we are; what about the post's title? The sunrise problem was introduced by Laplace (the French mathematician known for the Bayesian interpretation of probability, the Laplace transform, the Laplace equation, the Laplace differential operator, and work in mechanics and physics). Is the probability that the sun will rise tomorrow 1, if we can infer from the observed data that it has risen every day on record? :) So what is the answer to the question in the title? The inferred probability depends on the record: whether we take the past experience of one person, of humanity, or of the Earth's history. This is the reference class problem; with Bayes, any probability is a conditional probability given what a person knows. A simple principle emerged from this: add-one, or Laplacian, smoothing (example: in spam email classification with a bag-of-words model, or text classification with a multinomial model, it assigns positive probabilities to words that do not occur in the sample), which corresponds to the expected value of the posterior.
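In numbers, Laplace's rule of succession says that after s successes in n trials, a uniform prior gives probability (s + 1)/(n + 2) for the next trial, which is exactly add-one smoothing. A tiny sketch (the day count below is an arbitrary example, not a real record):

```python
from fractions import Fraction

# Laplace's rule of succession: with a uniform Beta(1, 1) prior over the
# success probability, the posterior mean after s successes in n trials
# is (s + 1) / (n + 2) -- the same add-one smoothing that gives unseen
# words a positive probability in a multinomial text classifier.
def rule_of_succession(successes, trials):
    return Fraction(successes + 1, trials + 2)

# If the sun has risen on every one of 10,000 observed days:
print(rule_of_succession(10_000, 10_000))  # -> 10001/10002, close to but not 1
```

Note how the answer never reaches 1, and how it changes with the size of the record, which is precisely the reference class problem described above.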

 

