Marinka Zitnik

Fusing bits and DNA


GSoC: MF - Matrix Factorization Techniques for Data Mining Review


Google Summer of Code 2011 has finished. The firm "pencils down" date was the 22nd of August, and today, the 26th of August, was the final evaluation deadline. It is therefore time for a small review to be published here on my blog.

I successfully completed the program and met all the goals outlined in the original project plan, along with two additional factorization methods I implemented. I have been very satisfied with the support and mentoring from both the organization and my mentor.

The project I worked on was developing MF - Matrix Factorization Techniques for Data Mining, a library that includes a number of published matrix factorization algorithms, initialization methods, and quality and performance measures, and that facilitates combining these to produce new strategies. The library contains usage examples; applications of the factorization methods to both synthetic and real-world data sets are provided.

Matrix factorization methods have been shown to be a useful decomposition for multivariate data, as low-dimensional data representations are crucial to numerous applications in statistics, signal processing and machine learning.

An incomplete list of applications of matrix factorization methods includes:

  • bioinformatics,
  • environmetrics and chemometrics,
  • image processing and computer graphics,
  • text analysis,
  • miscellaneous applications, such as extracting speech features, transcribing polyphonic music passages, object characterization, spectral data analysis, multiway clustering, learning sound dictionaries, etc.

The example using a synthetic data set is intended as a demonstration of the MF library, since all currently implemented factorization algorithms are run with different initialization methods and specific settings. Other examples are applications to real-world data sets in:

  • bioinformatics,
  • text analysis,
  • image processing,
  • recommendation systems.

I will outline only the most important content of the MF library here; for details refer to the documentation (or the code), as this is a project review and not a library reference (references to the articles are provided in the documentation).

  • Matrix Factorization Methods
      • BD - Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
      • BMF - Binary matrix factorization [Zhang2007]
      • ICM - Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
      • LFNMF - Fisher nonnegative matrix factorization for learning local features [Wang2004], [Li2001]
      • LSNMF - Alternating nonnegative least squares matrix factorization using the projected gradient method for subproblems [Lin2007]
      • NMF - Standard nonnegative matrix factorization with Euclidean / Kullback-Leibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
      • NSNMF - Nonsmooth nonnegative matrix factorization [Montano2006]
      • PMF - Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
      • PSMF - Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
      • SNMF - (SNMF/L, SNMF/R) Sparse nonnegative matrix factorization based on alternating nonnegativity-constrained least squares [Park2007]
      • SNMNMF - Sparse network-regularized multiple nonnegative matrix factorization [Zhang2011]
  • Initialization Methods
  • Quality and Performance Measures
      • Distance
      • Residuals
      • Connectivity matrix
      • Consensus matrix
      • Entropy of the fitted NMF model [Park2007]
      • Dominant basis components computation
      • Explained variance
      • Feature score computation representing a feature's specificity to basis vectors [Park2007]
      • Computation of the most basis-specific features for basis vectors [Park2007]
      • Purity [Park2007]
      • Residual sum of squares - can be used for rank estimation [Hutchins2008], [Frigyesi2008]
      • Sparseness [Hoyer2004]
      • Cophenetic correlation coefficient of the consensus matrix - can be used for rank estimation [Brunet2004]
      • Dispersion [Park2007]
      • Measures specific to selected matrix factorization methods
  • Utils
      • Fitted factorization model tracker across multiple runs
      • Residuals tracker across multiple factorizations / runs
      • Different factorization models
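To give a feel for the simplest entry in the list, the standard NMF with Euclidean updates [Lee2001] boils down to a pair of multiplicative update rules. The following is a minimal NumPy sketch of those rules for illustration only, not the MF library's actual code:

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    """Factorize V ~ W H with nonnegative W, H using Lee-Seung
    multiplicative updates for the Euclidean (Frobenius) cost."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    for _ in range(n_iter):
        # The update rules keep W and H nonnegative by construction;
        # eps guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.default_rng(1).random((20, 12)))
W, H = nmf(V, rank=4)
print(np.linalg.norm(V - W @ H))  # Frobenius residual; non-increasing over iterations
```

The Euclidean cost is provably non-increasing under these updates, which is what makes the scheme attractive despite its simplicity.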

Relevant links:

Join the GSoC next year! It is a great opportunity to spend the summer learning something new and having fun at the same time.

Last Updated on Sunday, 25 August 2013 21:36

What is the probability that the sun will rise tomorrow?


It has been some time since my last post, but here is a new one. Perhaps the title sounds a bit inappropriate, but it is actually well suited. Read to the end, where I explain it for those who have not figured it out yet (or consider it a puzzle :)).

So, what have I been up to lately? Despite the summer holidays I have been involved in quite a few projects.

First, the GSoC Matrix Factorization Techniques for Data Mining project for Orange has been progressing well. The code is almost finished; no major changes to the framework, factorization/initialization methods, quality measures, etc. are expected. The project is on schedule and has not diverged from the initial plan; all intended techniques (plus a few additional ones I found interesting during my research) are implemented. I have been doing some testing, and have yet to provide more use cases/examples along with thorough explanations and example data sets. I will not go into details here, as descriptions of the implemented methods with paper references are published on the Orange wiki project site. The project is great: a mix of linear algebra, optimization methods, statistics and probability, and numerical methods (analysis, if you want to read some convergence or derivation proofs), with intensive applications in data mining, machine learning, computer vision, bioinformatics, etc., and I have really been enjoying working on it; here is my post on the Orange blog. Orange and its GSoCers have been spotlighted on the Google Open Source Blog.

Next, there has been some image processing: segmentation, primary and secondary object detection, object tracking, morphology measures, filters, etc. (no details).

A minor project, to keep in contact with the MS world, involved SharePoint Server 2010 (SP 2010). I have some experience with it (and its previous version, MOSS 2007), both in administration and especially in code. This time it was not about coding workflows using Windows Workflow Foundation or developing web parts/sites/custom content types/web services (...), but about providing an in-site publishing hierarchy for data in custom lists and integration with Excel Services (not with the new 365 cloud service). The obstacles were limited server access (hosting plan), old versions of software and the usual MS quirks (:)). In SP 2010 these are SPFieldLookup filters, cascading lookups, and data connections between sites/lists/other content. As always, there are some nice workarounds, which resolved all the issues.

Last (but not least), I have been catching up with all the reading material I was forced to put aside during the year (well, not entirely true: the more I read, the more there is to be read, so the pile of papers in the iBooks and Mendeley apps is not getting any smaller :)).

Here we are: what about the post's title? The sunrise problem was introduced by Laplace (the French mathematician known for the Bayesian interpretation of probability, the Laplace transform, the Laplace equation and differential operator, and work in mechanics and physics). Is the probability that the sun will rise tomorrow 1, if we can infer from the observed data that it has risen every day on record? :) So what is the answer to the question in the title? The inferred probability depends on the record - whether we take the past experience of one person, of humanity, or of the Earth's history. This is the reference class problem: with Bayes, any probability is a conditional probability given what a person knows. A simple principle emerged from this, add-one or Laplacian smoothing (example: in spam email classification with a bag-of-words model, or text classification with a multinomial model, it allows the assignment of positive probabilities to words which do not occur in the sample), and it corresponds to the expected value of the posterior.
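Both ideas above fit in a few lines. A small Python sketch (the spam/ham vocabulary below is invented purely for illustration):

```python
from collections import Counter

def rule_of_succession(successes, trials):
    """Laplace's estimate: P(next success) = (s + 1) / (n + 2)."""
    return (successes + 1) / (trials + 2)

# The sun has risen on every one of n recorded days:
# close to 1, but never exactly 1, and it grows with the record.
print(rule_of_succession(10_000, 10_000))

def laplace_word_probs(tokens, vocabulary):
    """Add-one smoothed unigram probabilities: unseen words get a
    small positive probability instead of zero."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocabulary)  # one pseudo-count per vocabulary word
    return {w: (counts[w] + 1) / total for w in vocabulary}

probs = laplace_word_probs(["spam", "offer", "spam"], {"spam", "offer", "meeting"})
print(probs["meeting"])  # positive even though "meeting" never occurred
```

Note how the smoothed probabilities still sum to one over the vocabulary, which is exactly why the trick is safe to use inside a multinomial classifier.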

Last Updated on Wednesday, 30 January 2013 23:42

Google Scholars' Retreat 2011


The impressions from the Google Scholars' Retreat 2011, which took place last month in Zurich, Switzerland, where the Retreat for the EMEA region was organized, are great. It has really been a valuable experience to meet many Googlers and other scholars and to learn about their research work.

Below is this year's logo.

This year's logo for the Google Scholarship.

Let me mention just a few workshops and talks I have attended during the Retreat:

  • Keynote speech
  • Working in Industry
  • Product Design Workshop
  • Career Panel
  • Poster Show
  • Android Technical Talk
  • Open Source and Google's Support of its Advancement in the Community (Details: the open source community at Google, the development model, and how Google uses it and supports the advancement of this community.)
  • Google Web APIs (Details: generic principles of Google Web APIs and the tools Google provides for learning and experimenting with them. The focus was on interaction with these tools for research, be it to harvest data or to help present data in an efficient way.)
  • Privacy, the Ultimate Interdisciplinary Challenge (Details: the questions of deletion and digital rights management, online reputation and self-representation, and the right to be forgotten.)
  • Using Google's Strengths to Address Global Challenges (Details: the team is tasked with using Google's strengths in technology and the Internet to address global challenges. The session covered the development of Google Flu Trends as well as more recent work, including Google Earth Engine and Google Crisis Response.)
  • Street View 3D (Details: many problems arise when developing such a product, such as augmenting the panoramas with 3D features for better navigation and a more immersive experience.)
  • Priority Inbox (Details: because importance is highly personal, email importance is predicted by learning a per-user statistical model, updated as frequently as possible. The challenges include inferring the importance of mail without explicit user labeling; finding learning methods that deal with non-stationary and noisy training data; constructing models that reduce training data requirements; storing and processing terabytes of per-user feature data; and predicting in a distributed and fault-tolerant way.)
Besides that, there were some very informative talks on Google's business visions and on ways of getting involved with Google, office tours (I liked the slides between the floors :)) and lots of small talk with Googlers.

Last Updated on Thursday, 09 July 2015 15:09

CG: L-Systems Fractal Generation of 3D Objects


One of the courses I attended this semester was Computer Graphics (CG).

I have spent some time studying algorithmic botany and especially L-systems, formal grammars for describing fractal objects. These can be used for the generation of objects in biology and botany, and even of buildings and entire cities. Rome Reborn is an example of such a project, in which formal grammars were used to create the 3D digital model illustrating the urban development of ancient Rome.

So I decided to visualize some 3D fractal objects using OpenGL and the LWJGL library. Below are links to a short report and presentation. Take a look :)
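The string-rewriting core of an L-system is tiny; a minimal Python sketch (my renderer used OpenGL/LWJGL and is not reproduced here - symbols like F, +, - would drive the turtle there), with Lindenmayer's original algae system as the example:

```python
def l_system(axiom, rules, iterations):
    """Rewrite every symbol of the string in parallel on each pass,
    leaving symbols without a production unchanged."""
    s = axiom
    for _ in range(iterations):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

# Lindenmayer's original algae system: A -> AB, B -> A
print(l_system("A", {"A": "AB", "B": "A"}, 4))  # ABAABABA

# A Koch-curve-style rule; F (draw), + and - (turn) would feed the renderer
print(l_system("F", {"F": "F+F-F-F+F"}, 1))  # F+F-F-F+F
```

The string lengths of the algae system follow the Fibonacci numbers, a neat first check that the rewriting is correct.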

For those of you who are interested, there is a great book on this topic co-authored by the father of algorithmic botany, Aristid Lindenmayer: Prusinkiewicz, Przemyslaw; Lindenmayer, Aristid (1990). The Algorithmic Beauty of Plants (The Virtual Laboratory). Springer-Verlag. ISBN 0-387-97297-8.

Last Updated on Wednesday, 23 July 2014 16:06

Recognized as Google Anita Borg Scholarship Finalist


Yet another piece of great news concerning my (little) involvement with Google. I wrote a few weeks ago about being accepted to the Google Summer of Code 2011 with the project on matrix factorization techniques in data mining for the Orange platform.

Now Google has announced the Google Anita Borg Scholarship recipients and finalists, a scholarship for which I applied this year, and I am among the 147 undergraduate and graduate students worldwide who were chosen. Just for clarification - this is completely unrelated to the GSoC (the only common denominator being Google itself); the scholarship is awarded based on the strength of candidates' academic performance, leadership experience and demonstrated passion for computer science.

Scholars from Europe have the Scholars' Retreat at the European Google centre in Zurich in June, and I am very much looking forward to this event and to meeting some fascinating people. The retreat will include workshops, speakers, panelists, breakout sessions and social activities scheduled over a couple of days.

  • (Official Google Blog with the published results of the Scholars' selection process) link
  • (Official Google Students Blog with the published announcement of the Scholars) link
  • (Faculty News) link in Slovene

Last Updated on Wednesday, 25 December 2013 02:58

Visualizing geographic data with the WebGL Globe


The Google team has shared a new Chrome Experiment, called the WebGL Globe, a visualization platform for geographic data that runs in WebGL-enabled browsers - Chrome, Firefox. (Check whether your browser supports the WebGL standard.)

To speed up the visualization of the 3D geometry, they used a vertex shader and took advantage of GLSL with two fragment shaders. The 3D data spikes are drawn with Three.js, a JS library for building lightweight 3D graphics.

I have embedded a simple globe showing Google search traffic. Try it, or try more examples that ship with this cool open source project. Or create your own globe using the JSON data format.
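The JSON format the globe consumes is, to my understanding of the project's README, an array of series, each a name plus a flat [latitude, longitude, magnitude, ...] list; check the field order against the repository before relying on it. A small Python sketch emitting such data (the coordinates and series name are invented for illustration):

```python
import json

# Hypothetical data points: (latitude, longitude, magnitude in [0, 1])
points = [
    (46.05, 14.51, 0.8),   # Ljubljana
    (47.37,  8.54, 0.6),   # Zurich
    (38.72, -9.14, 0.4),   # Lisbon
]

# One series: ["name", [lat, lng, magnitude, lat, lng, magnitude, ...]]
flat = [value for point in points for value in point]
data = [["search_traffic", flat]]

print(json.dumps(data))  # write this to a .json file and point the globe at it
```

Keeping the magnitudes normalized to [0, 1] makes the spike heights directly comparable across series.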

Here is the post on the Official Google Code Blog. Nice job :)

Last Updated on Wednesday, 30 January 2013 23:39

Part 1: Matrix Computations Notes

Labels: Factorization, Maths

  • Constrained LS Problems
  • Subset Selection Using SVD
  • Total LS
  • Comparing Subspaces Using SVD
  • Some Modified Eigenvalue Problems
  • Updating the QR Factorization
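To make one of the topics above concrete: the total least squares problem, which allows errors in both A and b, is solved with a single SVD of the augmented matrix [A | b] (the classic Golub & Van Loan construction; a minimal NumPy sketch with synthetic data):

```python
import numpy as np

def total_least_squares(A, b):
    """Solve A x ~ b allowing perturbations of both A and b:
    take the right singular vector of [A | b] belonging to the
    smallest singular value and rescale its last component to -1."""
    n = A.shape[1]
    C = np.column_stack([A, b])
    _, _, Vt = np.linalg.svd(C)
    v = Vt[-1]            # direction of the smallest singular value
    return -v[:n] / v[n]  # assumes v[n] != 0 (the generic case)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 2))
x_true = np.array([2.0, -1.0])
b = A @ x_true + 0.01 * rng.standard_normal(50)  # small noise
x_tls = total_least_squares(A, b)
print(x_tls)  # close to [2, -1]
```

When v[n] is (near) zero the TLS problem is degenerate and needs the more careful treatment the notes cover.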


Last Updated on Wednesday, 30 January 2013 23:43

GSoC & Orange: Matrix Factorization Techniques for Data Mining


This year I applied for the Google Summer of Code, namely to the Orange project.

We will see if I get accepted. :)

Update 25.04.2011: Google has announced the results. My proposal has been accepted and I am looking forward to starting work. :)

Some links to articles in Slovenian news:

Project title: Matrix Factorization Techniques for Data Mining

Description: Matrix factorization is a fundamental building block for many current data mining approaches, and factorization techniques are widely used in data mining applications. Our objective is to provide the Orange community with a unified and efficient interface to matrix factorization algorithms and methods. For that purpose we will develop a scripting library which will include a number of published factorization algorithms and initialization methods and will facilitate combining these to produce new strategies. Extensive documentation with working examples that demonstrate real applications, commonly used benchmark data and visualization methods will be provided to help with the interpretation and comprehension of the results.

The main factorization techniques and their variations planned for inclusion in the library are: Bayesian decomposition (BD), together with linearly constrained and variational BD using Gibbs sampling; probabilistic matrix factorization (PMF); Bayesian factor regression modeling (BFRM); and a family of nonnegative matrix factorizations (NMF), including sparse NMF, non-smooth NMF, local factorization with Fisher NMF, and least-squares NMF. Different multiplicative update algorithms for NMF will be analyzed, which minimize the LS error or the generalized KL divergence. Further, nonnegative matrix approximations (NNMA) with extensions will be implemented. For completeness, algorithms such as NCA, ICA and PCA could be added to the library.

Here is the proposal document.

Last Updated on Sunday, 25 August 2013 21:33



Sniffit is a simple network packet analyzer which I implemented in C using the libpcap library.

Here is a very short presentation with a list of the functional requirements which Sniffit supports.


Last Updated on Wednesday, 30 January 2013 23:40

EuroSkills Informatics 2010


Just today I returned from Lisbon, Portugal, where EuroSkills 2010 was held. In the Office ICT category (Informatics) I won two silver medals, one as the Project Manager of the team and the second together with the Slovenian team (members: me, Slavko Zitnik, Miha Longino, Peter Virant).

The contest was organized very well; we were accommodated in the part of Lisbon where the Expo was held, and the competition also took place there. Across fifty trades there were 500 competitors and a few hundred experts, observers and guests.

Competitors in the Office ICT category.


Slovenian team in informatics (Marinka, Peter, Slavko, Miha) with team leader.

President of the Republic of Slovenia

Receptions for the competitors:


Last Updated on Wednesday, 30 January 2013 23:41

ES 2010 is Here


Tomorrow we are going to Lisbon, Portugal, for EuroSkills 2010, category Informatics.

The details of our task at the competition will be revealed after it.

Some news on the contest will be published daily by

The contest in our category is scheduled from 9.12.2010 to 11.12.2010.

Some of the interviews with our team members published in the news:


Last Updated on Wednesday, 30 January 2013 23:40



I have prepared a presentation on XML DBs, which I will give tomorrow, 1 December 2010, in the Basics of Database Systems course.

XQuery Examples.

Last Updated on Wednesday, 30 January 2013 23:38

ES 2010 Preparations


When I started writing this article, the countdown said: ES 2010 in 17 days, 0 hours, 9 minutes and 4 seconds.

Since September we (a team of four) have actively devoted ourselves to the Office ICT Test Project, a draft of the project we will have to implement in Lisbon. The project has progressed well; we have gained some expertise in the vast majority (if not all - we will see; it depends on the final version of the TP, which will not be known before the beginning of the competition) of the features. Basically, the task is to set up the entire ICT system of a fictitious international corporation, from functional and network design (IPv4, IPv6, NATv6, DHCPv6, security measures, OSPF & EIGRP routing, route redistribution, dynamic VLANs, network authentication with RADIUS, wifi, etc.) to common business services (VPN, mail, DA, AD, DNS, DHCP, VoIP with Asterisk, AAA, MDT, WSUS, SNMP with Nagios, virtualization, etc.) and the company's portfolio, documentation maintenance, cost and time management, etc.

Some photos of our preparations are available on our team member's blog.

Last Updated on Wednesday, 30 January 2013 23:38

Embedded Business Intelligence in SP 2010


It has been a while since SP 2010 was released, and since I have been developing quite a lot on SP 2007 and am now exploring the 2010 version, I feel a moral duty :) to write something about this latest version as well. So I decided to write about embedded business intelligence, which is not mentioned very often, but which I think will become one of the integral parts of SP.

First of all, embedded BI is the result of the incorporation of PerformancePoint Server as PerformancePoint Services. Before SP 2010, PerformancePoint Server was an independent, separate product, but it is not anymore. The new services enrich SP with KPI indices, scorecarding, matrices and much more that can easily be rendered as dashboards or charting web parts, consumed through Visio Services, or used via the numerous improvements to Excel Services.

Together with PerformancePoint Services you get a Dashboard Designer, with which it is possible, through a GUI, to hook up the data you want to drive your scorecard or KPI from (e.g. connect it to SQL Server Analysis Services, easily create KPIs in the designer and render them as web parts in SharePoint). As data mining and OLAP are becoming more and more important BI technologies, it is necessary to stress their benefits. First, you are in control of what is happening with your data: data sources can be configured by admins and dashboards by department business units. Furthermore, it is very easy to slice and dice the data to get the answers you are looking for. One among numerous new features is the decomposition tree - we can drill into key nodes and get more details in a very visual, graphical way, which enriches the models from which we pull the data, so users can get quality answers quickly.

A few improvements are included in Excel Services, allowing users to publish and share bits of, or whole, workbooks, while the owner still retains total control of the services users consume.

There is another novelty worth mentioning, namely Visio Services. It is simply a matter of creating a graph (e.g. a network diagram or a graph of the resources used on a project) that is data-bound and which then reflects real, current data (e.g. changing pictures or states in accordance with the progress of the project). It is more like creating a simple, user-friendly workflow that updates itself than a diagram. Do not confuse this with another powerful tool, WF - Windows Workflow Foundation, for creating complex workflows in .NET and VS.

Last Updated on Wednesday, 30 January 2013 23:37

Mandelbrot Set & R Language


I have been following a course on Statistical Aspects of Data Mining lately, which is not what I will write about, but this article got its inspiration from it. The software environment used in this course is the R programming language, which is used for statistical computing and graphics (it is available for Windows, Linux and Mac as part of the GNU project). If you download it from R's website, you get it with the command line interpreter; of course there are some IDEs as well, such as Rcmdr or Tinn-R. The capabilities of R are extended with numerous user-submitted packages - for the animation of the Mandelbrot Set at least the following libraries are needed: spam, fields, bitops, caTools - all freely available at R's website. R is influenced by S and Scheme, but I won't go into details, as there is plenty of information about it on the web.

I tried to draw the classic Mandelbrot Set (the basic code for it is available here), which is just iterating the formula z = z^2 + c, where c is a complex parameter, starting at z = 0. The Mandelbrot Set is defined as the set of all points such that the sequence obtained by iteration does not escape to infinity. Some of the set's properties are local connectivity, self-similarity, correspondence with the structure of Julia sets, etc. A very simple formula which gives fascinating results. In the R animation you can observe the main cardioid, the period bulbs and hyperbolic components.
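The R animation code is linked above rather than reproduced; as a language-neutral illustration, the escape-time iteration at the heart of it fits in a few lines (a plain Python sketch):

```python
def escape_time(c, max_iter=100):
    """Iterate z -> z^2 + c from z = 0; return the number of steps
    before |z| exceeds 2, or max_iter if it never escapes."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2:
            return n
        z = z * z + c
    return max_iter

# Points inside the set never escape; points far away escape quickly
print(escape_time(0 + 0j))   # 100 (in the main cardioid)
print(escape_time(-1 + 0j))  # 100 (in the period-2 bulb)
print(escape_time(1 + 1j))   # 2 (escapes almost immediately)

# A crude ASCII rendering: '*' marks points that survive 50 iterations
for y in range(-10, 11):
    print("".join("*" if escape_time(complex(x / 20, y / 10), 50) == 50 else " "
                  for x in range(-40, 21)))
```

Once |z| exceeds 2 the orbit is guaranteed to diverge, which is why 2 serves as the escape radius.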

Classic Mandelbrot Set

Last Updated on Wednesday, 30 January 2013 23:42
