Marinka Zitnik

Fusing bits and DNA


CRY: Commitment Schemes


I have been following the Cryptography and Theory of Coding 2 course this semester (summer 2012).

During the course I worked on a project about commitment schemes. Attached below are the final report and a short presentation, along with some homework solutions. The language is Slovene, but the material is mostly pure math with proofs, so it should not be too difficult to capture the main ideas.
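To give a feel for the topic, here is a tiny illustrative sketch of my own (not taken from the report) of a hash-based commitment scheme:

```python
# Commit-and-reveal with a hash: the commitment hides the value (thanks to a
# random nonce) and binds the committer to it, assuming the hash behaves
# like a random oracle.
import hashlib
import secrets

def commit(value: bytes):
    nonce = secrets.token_bytes(32)                # random blinding factor
    c = hashlib.sha256(nonce + value).hexdigest()  # the published commitment
    return c, nonce

def verify(c: str, nonce: bytes, value: bytes) -> bool:
    return hashlib.sha256(nonce + value).hexdigest() == c

c, nonce = commit(b"my sealed bid: 42")        # publish c now ...
print(verify(c, nonce, b"my sealed bid: 42"))  # ... open later by revealing nonce and value
```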

Please note these are reports and have not been subject to the usual scrutiny of published papers.

Last Updated on Wednesday, 30 January 2013 23:42
 

XRDS: Crossroads, The ACM Magazine for Students


XRDS: Crossroads is the flagship academic magazine for student members of the ACM, published quarterly in print and online. The publication, previously named Crossroads, has been running since 1994 and is edited and maintained by student volunteers. The magazine is distributed to tens of thousands of students worldwide.

Issues focus on computer science and include articles ranging from highlights of new and important research to roundtable discussions, interviews, and introductory overviews. The magazine publishes exciting pieces written by top leaders in the field as well as unsolicited work submitted by students.

I highly recommend reading the issues! Here is the current one.

 

I am writing this because I have been selected as a Department Editor for the magazine as of January 2012. I am looking forward to contributing to this great platform.

Last Updated on Wednesday, 30 January 2013 23:37
 

ML: Reliable Calibrated Probability Estimation in Classification


I followed a Machine Learning course this semester at the university (along with the one offered by Stanford University). I worked on a seminar studying reliable probability estimation in classification using calibration. The final report is written in the form of a scientific article and is attached below.

Here is the abstract:

 

Estimating reliable class membership probabilities is of vital importance for many applications in data mining in which classification results are combined with other sources of information to produce decisions. Other sources include domain knowledge, outputs of other classifiers, or example-dependent misclassification costs. We revisit the problem of classifier calibration, motivated by issues with isotonic regression and binning calibration. These methods can behave badly on small or noisy calibration sets, producing inappropriate intervals or poor boundary generalization. We propose improvements of isotonic regression and binning calibration based on bootstrapping, named boot-isotonic regression and boot-binning, respectively. Confidence intervals, obtained by repeatedly calibrating on sets sampled with replacement from the original training set, are used to merge unreliable or too-narrow calibration intervals. The method has been experimentally evaluated with respect to two calibration measures, several classification methods, and several problem domains. The results show that the new methods outperform the basic isotonic regression and binning methods in most configurations.
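To illustrate the bootstrapping idea from the abstract, here is a minimal sketch (illustrative only, not the exact procedure from the report) of averaging isotonic calibration maps fitted on resampled calibration sets, using scikit-learn's IsotonicRegression:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def boot_isotonic(scores, labels, test_scores, n_boot=100, seed=0):
    """Average isotonic calibration maps fitted on bootstrap resamples,
    smoothing the unstable intervals that plain isotonic regression can
    produce on small or noisy calibration sets."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    preds = np.empty((n_boot, len(test_scores)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample with replacement
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(scores[idx], labels[idx])
        preds[b] = iso.predict(test_scores)
    return preds.mean(axis=0)  # aggregated calibrated probabilities
```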

 

A short presentation about the work and the final report:

 

Last Updated on Sunday, 25 August 2013 21:31
 

RTVSLO: Ugriznimo znanost (Bite the Science)


In November 2011, an episode of the show Ugriznimo znanost (Bite the Science) was shot at the Faculty of Computer and Information Science by the national public broadcasting organization (RTVSLO).

Bite the Science is a weekly series on the educational channel of TV Slovenia that aims to explain science in a relaxed and humorous manner. Topics range from current events in science to the accomplishments of Slovene scientists and scientific research in other parts of the world.

The episode scheduled for the last week of December discusses programming. In it I present my work in the Google Summer of Code program (the Nimfa library) and discuss Orange, an open-source data mining and visualization tool developed by Biolab, the Bioinformatics Laboratory. Other researchers in the episode talk about a smart house, the Faculty Robot League, a summer school of programming, and mathematical data analysis used for predicting the results of the Eurosong competition.

Take a look (link to the episode is below)!

Shooting of Bite the Science

Relevant links:

Last Updated on Wednesday, 30 January 2013 23:38
 

University Award 2011


Traditionally, the Week of the University in Ljubljana is organized at the beginning of December, during which numerous ceremonial events are held at the university and its member faculties.

This year I was awarded the university's best student award for overall performance in courses (slo. Svečana listina za izjemen študijski uspeh Univerze Ljubljana), in addition to the faculty award I have received in past years. The reception at the university was well organized, with a few musical intermezzos between the speeches.

Nevertheless, new challenges are on their way!

Last Updated on Wednesday, 30 January 2013 23:36
 

Renaming: MF - Matrix Factorization Techniques for Data Mining


In this short post I would like to update you on some recent changes to the MF - Matrix Factorization Techniques for Data Mining library.

Recently, the MF - Matrix Factorization Techniques for Data Mining library has been moved and renamed. It is now called Nimfa - A Python Library for Nonnegative Matrix Factorization, or nimfa for short. The latest version, with documentation, a link to the source code, and working examples, can be found at nimfa.biolab.si. Those of you still holding old links are kindly requested to update them; visiting the old page will automatically redirect you to the new site.

Hope you like the new name.

Enjoy.

Last Updated on Sunday, 25 August 2013 21:36
 

Fractal Dimension Computation Support in MF Library


I have always been fascinated by the world of fractals and have greatly enjoyed exploring the math behind them. This post announces support for fractal dimension computation in the MF - Matrix Factorization Techniques for Data Mining library.

In the following paragraphs we briefly review the most important concepts and definitions related to the fractal dimension.

The embedding dimensionality of a data set is the number of attributes in the data set. The intrinsic dimensionality is the actual number of dimensions in which the n original m-dimensional vectors can be embedded while preserving the distances among them. Given a fractal, a self-similar set of points composed of r self-similar pieces, each scaled down by a factor of s, the fractal dimension D of the object is defined as

D = log(r) / log(s).

Example: the Sierpinski triangle (see my seminar work for CG on L-systems, figure 1 in the appendix of the report) consists of three self-similar parts, each scaled down by a factor of two; its fractal dimension is therefore D = log(3) / log(2) ≈ 1.58.

For a finite set of points in a vector space, we say the set is statistically self-similar on a range of scales (a, b) on which the self-similarity holds. (In theory, a self-similar object should have infinitely many points, because each self-similar part is a scaled-down version of the original object.) As a measure of the intrinsic fractal dimension of a data set, the slope of the correlation integral is used. The correlation integral C(r) for a data set S is defined as

C(r) = Count(dist(u, v) ≤ r; u, v ∈ S, u ≠ v).

Given a data set S that is statistically self-similar in the range (a, b), its correlation fractal dimension D is

D = ∂ log C(r) / ∂ log r,  for r ∈ [a, b].

It has been shown that the correlation fractal dimension corresponds to the intrinsic dimension of a data set. Many properties hold for the correlation fractal dimension; see [1] and [2]. For us it is especially important that the intrinsic dimensionality gives a lower bound on the number of attributes needed to keep the vital characteristics of the data set.

A fast algorithm for computing the intrinsic dimension of a data set, presented in [2], is implemented in the MF - Matrix Factorization Techniques for Data Mining library. An intuitive explanation of the correlation fractal dimension is that it measures how the number of neighboring points increases as the distance increases. It therefore measures the spread of the data; a fractal dimension equal to the embedding dimension means that the spread of the points in the data set is maximal.
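To make the definitions concrete, here is a naive numpy sketch (illustrative only; the library implements the fast algorithm from [2], not this O(n^2) version) that estimates D as the slope of log C(r) versus log r:

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, radii):
    """Estimate the correlation fractal dimension as the slope of
    log C(r) versus log r over the given range of radii."""
    d = pdist(X)                                   # all pairwise distances
    C = np.array([np.sum(d <= r) for r in radii])  # correlation integral C(r)
    mask = C > 0                                   # avoid log(0) at tiny radii
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
    return slope

# Points on a line embedded in 3D: the estimate should be close to 1,
# even though the embedding dimensionality is 3.
t = np.random.rand(2000, 1)
X = np.hstack([t, 2 * t, -t])
print(correlation_dimension(X, np.logspace(-2, -0.5, 20)))
```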

Of high importance is Conjecture 1 in [1]: with all other parameters being equal, a dimensionality reduction method that achieves a higher fractal dimension in the reduced space is better than the rest for any data mining task. The correlation fractal dimension of a data set can therefore be used:

  • for determining the optimal number of dimensions in the reduced space,
  • as a performance comparison tool between dimensionality reduction methods,
and all of this can be done in a way that is scalable to large data sets.

 

Recommended reading:

  1. Kumaraswamy, K. (2003). Fractal Dimension for Data Mining. Carnegie Mellon University.
  2. Traina Jr., C., Traina, A., Wu, L., Faloutsos, C. (2010). Fast Feature Selection Using Fractal Dimension. Journal of Information and Data Management, 1(1), 3-16.

In [1] the concept of the intrinsic fractal dimension of a data set is introduced, and it is shown how the fractal dimension can be used to aid several data mining tasks. In [2] a fast O(n) algorithm for computing the fractal dimension of a data set is presented, along with a fast, scalable algorithm for selecting the most important attributes of a given set of n-dimensional vectors.

Last Updated on Sunday, 25 August 2013 21:36
 

GSoC: MF - Matrix Factorization Techniques for Data Mining Review


Google Summer of Code 2011 has finished. The firm "pencils down" date was on the 22nd of August, and today, the 26th of August, was the final evaluation deadline. It is therefore time to publish a small review here on my blog.

I successfully completed the program and met all the goals outlined in the original project plan, plus implemented two additional factorization methods. I have been very satisfied with the support and mentoring from both the organization and my mentor.

The project I worked on developed the MF - Matrix Factorization Techniques for Data Mining library, which includes a number of published matrix factorization algorithms, initialization methods, and quality and performance measures, and facilitates combining these to produce new strategies. The library contains usage examples; applications of the factorization methods to both synthetic and real-world data sets are provided.

Matrix factorization methods have been shown to provide useful decompositions of multivariate data, as low-dimensional data representations are crucial to numerous applications in statistics, signal processing, and machine learning.

An incomplete list of applications of matrix factorization methods includes:

  • bioinformatics,
  • environmetrics and chemometrics,
  • image processing and computer graphics,
  • text analysis,
  • miscellaneous, such as extracting speech features, transcription of polyphonic music passages, object characterization, spectral data analysis, multiway clustering, learning sound dictionaries, etc.

The example using a synthetic data set is intended as a demonstration of the MF library, since all currently implemented factorization algorithms are run with different initialization methods and specific settings. The other examples are applications to real-world data sets in:

  • bioinformatics,
  • text analysis,
  • image processing,
  • recommendation systems.

I will outline only the most important content of the MF library here (for details refer to the documentation or the code), as this is a project review and not a library reference (references to the articles are provided in the documentation). A minimal usage sketch follows the list below.

  • Matrix Factorization Methods
      ◦ BD - Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
      ◦ BMF - Binary matrix factorization [Zhang2007]
      ◦ ICM - Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
      ◦ LFNMF - Fisher nonnegative matrix factorization for learning local features [Wang2004], [Li2001]
      ◦ LSNMF - Alternating nonnegative least squares matrix factorization using the projected gradient method for subproblems [Lin2007]
      ◦ NMF - Standard nonnegative matrix factorization with Euclidean / Kullback-Leibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
      ◦ NSNMF - Nonsmooth nonnegative matrix factorization [Montano2006]
      ◦ PMF - Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
      ◦ PSMF - Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
      ◦ SNMF - (SNMF/L, SNMF/R) Sparse nonnegative matrix factorization based on alternating nonnegativity-constrained least squares [Park2007]
      ◦ SNMNMF - Sparse network-regularized multiple nonnegative matrix factorization [Zhang2011]
  • Initialization Methods
  • Quality and Performance Measures
      ◦ Distance
      ◦ Residuals
      ◦ Connectivity matrix
      ◦ Consensus matrix
      ◦ Entropy of the fitted NMF model [Park2007]
      ◦ Dominant basis components computation
      ◦ Explained variance
      ◦ Feature score computation representing a feature's specificity to basis vectors [Park2007]
      ◦ Computation of the most basis-specific features for basis vectors [Park2007]
      ◦ Purity [Park2007]
      ◦ Residual sum of squares (can be used for rank estimation) [Hutchins2008], [Frigyesi2008]
      ◦ Sparseness [Hoyer2004]
      ◦ Cophenetic correlation coefficient of the consensus matrix (can be used for rank estimation) [Brunet2004]
      ◦ Dispersion [Park2007]
      ◦ Measures specific to selected matrix factorization methods
  • Utils
      ◦ Fitted factorization model tracker across multiple runs
      ◦ Residuals tracker across multiple factorizations / runs
      ◦ Different factorization models
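To give a flavor of the library, below is a minimal usage sketch of standard NMF. It is written against the current nimfa interface, so treat the names and parameters as indicative rather than a fixed API:

```python
import numpy as np
import nimfa

V = np.random.rand(40, 100)  # the data matrix to factorize, V ~ W * H

# Standard NMF with Euclidean update equations and a Frobenius objective.
nmf = nimfa.Nmf(V, seed="random_vcol", rank=10, max_iter=30,
                update="euclidean", objective="fro")
fit = nmf()

W = fit.basis()                       # basis matrix, 40 x 10
H = fit.coef()                        # mixture (coefficient) matrix, 10 x 100
print("RSS: %5.4f" % fit.fit.rss())  # residual sum of squares quality measure
```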

Relevant links:

Join the GSoC next year! It is a great opportunity to spend the summer learning something new and having fun at the same time.

Last Updated on Sunday, 25 August 2013 21:36
 

What is the probability that the sun will rise tomorrow?


It has been some time since my last post, but here is a new one. Perhaps the title sounds a bit odd, but it is in fact well suited. Read to the end, where I explain it for those who have not figured it out yet (or treat it as a puzzle :)).

So, what have I been up to lately? Despite summer holidays I have been involved in quite a few projects.

First, the GSoC Matrix Factorization Techniques for Data Mining project for Orange has been progressing well. The code is almost finished; no major changes to the framework, factorization/initialization methods, quality measures, etc. are expected. The project is on schedule and has not diverged from the initial plan; all intended techniques (plus a few additional ones I found interesting along the way) are implemented. I have been doing some testing, and I have yet to provide more use cases/examples along with thorough explanations and example data sets. I will not go into details here, as descriptions of the implemented methods with paper references are published at the Orange wiki project site. The project is great: a mix of linear algebra, optimization methods, statistics and probability, and numerical methods (analysis, if you want to read some convergence or derivation proofs), with intensive applications in data mining, machine learning, computer vision, bioinformatics, etc., and I have really enjoyed working on it; here is my post at the Orange blog. Orange and its GSoCers have been spotlighted on the Google Open Source Blog.

Next, there is some image processing: segmentation, primary and secondary object detection, object tracking, morphology measures, filters, etc. (no details).

A minor one, to keep in contact with the MS world: SharePoint Server 2010 (SP 2010). I have some experience with it (and its previous version, MOSS 2007), both in administration and especially in code. This time it was not about coding workflows using Windows Workflow Foundation or developing web parts/sites/custom content types/web services (...), but about providing an in-site publishing hierarchy for data in custom lists and integration with Excel Services (not with the new Office 365 cloud service). The obstacles were limited server access (hosting plan), old versions of software, and the usual MS stuff (:)). In SP 2010 these involve SPFieldLookup filters and cascading lookups, and data connections between sites/lists/other content. As always, there are some nice workarounds, which resolved all the issues.

Last (but not least), I have been catching up with all the reading material I was forced to put aside during the year (well, not entirely true: the more I read, the more there is to read, so the pile of papers in the iBooks and Mendeley apps is not getting any smaller :)).

Here we are, so what about the post's title? The sunrise problem was introduced by Laplace (the French mathematician known for the Bayesian interpretation of probability, the Laplace transform, the Laplace equation and differential operator, and work in mechanics and physics). Is the probability that the sun will rise tomorrow 1, if we can infer from the observed data that it has risen every day on record? :) So what is the answer to the question in the title? The inferred probability depends on the record: whether we take the past experience of one person, of humanity, or of the Earth's history. This is the reference class problem; with Bayes, any probability is a conditional probability given what a person knows. A simple principle emerged from this, add-one or Laplacian smoothing, which corresponds to the expected value of the posterior. (Example: in spam email classification with a bag-of-words model, or text classification with a multinomial model, it allows the assignment of positive probabilities to words that do not occur in the sample.)
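To make the add-one idea concrete, here is a small sketch of my own (not from the original post; the numbers are illustrative):

```python
# Laplace's rule of succession: after s successes in n trials, the posterior
# mean under a uniform Beta(1, 1) prior is (s + 1) / (n + 2).
def rule_of_succession(successes, trials):
    return (successes + 1) / (trials + 2)

# One person's record: roughly 30 years of observed sunrises, no failures.
n = 30 * 365
print(rule_of_succession(n, n))  # ~0.99991: close to 1, but never exactly 1

# The same add-one idea in a multinomial bag-of-words model: an unseen word
# still receives a positive probability.
def smoothed_word_prob(word_count, total_words, vocab_size):
    return (word_count + 1) / (total_words + vocab_size)

print(smoothed_word_prob(0, 1000, 5000))  # 1/6000, not zero
```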

Last Updated on Wednesday, 30 January 2013 23:42
 

Google Scholars' Retreat 2011


My impressions from the Google Scholars' Retreat 2011, which took place last month in Zurich, Switzerland, where the retreat for the EMEA region was organized, are great. It was a really valuable experience to meet many Googlers and other scholars and to learn about their research work.

Below is this year's logo.

This year's logo for the Google Scholarship.

Let me mention just a few workshops and talks I have attended during the Retreat:

  • Keynote speech
  • Working in Industry
  • Product Design Workshop
  • Career Panel
  • Poster Show
  • Android Technical Talk
  • Open source and Google Support of its Advancement in the Community (Details: Open source community at Google, development model and how Google uses it and supports advancement of this community.)
  • Google Web API (Details: Generic principles of Google Web APIs and the tools Google provide for learning and experimenting with them. Focus on interaction with these tools for research, be it to harvest data or to help present data in an efficient way.)
  • *Privacy, the Ultimate Interdisciplinary Challenge (Details: The questions of deletion and digital rights management, online reputation and self-representation, the right to be forgotten.)
  • *Google.org - Using Google's Strengths to Address Global Challenges (Details: Google.org is tasked with using Google's strengths in technology and the Internet to address global challenges. The session covered development of Google Flu Trends as well as more recent work including Google Earth Engine and Google Crisis Response.)
  • *Street View 3D (Details: Many problems arise when developing such a product, such as augmenting the panoramas with 3D features for better navigation and a more immersive experience.)
  • *Priority Inbox (Details: Because importance is highly personal, email importance is predicted by learning a per-user statistical model, updated as frequently as possible. The challenges include inferring the importance of mail without explicit user labeling; finding learning methods that deal with non-stationary and noisy training data; constructing models that reduce training data requirements; storing and processing terabytes of per-user feature data; predicting in a distributed and fault-tolerant way.)
Besides that, there were some very informative talks on Google's business visions and ways of getting involved with Google, office tours (I liked the slides between the floors :)), and lots of small talk with Googlers.
Last Updated on Thursday, 09 July 2015 15:09
 

CG: L-Systems Fractal Generation of 3D Objects


One of the courses I attended this semester was Computer Graphics (CG).

I spent some time studying algorithmic botany and especially L-systems, formal grammars for describing fractal objects. These can be used to generate objects in biology and botany, and even buildings and entire cities. Rome Reborn is an example of such a project, in which formal grammars were used to create a 3D digital model illustrating the urban development of ancient Rome.

So I decided to visualize some 3D fractal objects using OpenGL and the LWJGL library. Below are links to a short report and presentation, along with a tiny string-rewriting sketch. Take a look :)
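To show how simple the core mechanism is, here is a tiny sketch (illustrative, not from the report) of L-system string rewriting; the report renders such strings as turtle movements in 3D:

```python
def expand(axiom, rules, iterations):
    """Apply the production rules to every symbol, `iterations` times."""
    s = axiom
    for _ in range(iterations):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

# Koch curve rules: F means "draw forward", + and - mean "turn 90 degrees".
print(expand("F", {"F": "F+F-F-F+F"}, iterations=2))
```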

For those of you who are interested, there is a great book on this topic co-authored by the father of algorithmic botany, Aristid Lindenmayer: Prusinkiewicz, P., Lindenmayer, A. (1990). The Algorithmic Beauty of Plants (The Virtual Laboratory). Springer-Verlag. ISBN 0-387-97297-8.

Last Updated on Wednesday, 23 July 2014 16:06
 

