Marinka Zitnik

Fusing bits and DNA


Marinka moved to Harvard University: scholar.harvard.edu/marinka.


This site is archived. See Marinka's new website: zitniklab.hms.harvard.edu.

 

Google Scholars' Retreat 2011

My impressions of the Google Scholars' Retreat 2011 are great. The Retreat for the EMEA region took place last month in Zurich, Switzerland. It was a really valuable experience to meet many Googlers and other scholars and to learn about their research work.

Below is this year's logo.

This year's logo for the Google Scholarship.

Let me mention just a few of the workshops and talks I attended during the Retreat:

  • Keynote speech
  • Working in Industry
  • Product Design Workshop
  • Career Panel
  • Poster Show
  • Android Technical Talk
  • Open source and Google Support of its Advancement in the Community (Details: Open source community at Google, development model and how Google uses it and supports advancement of this community.)
  • Google Web API (Details: Generic principles of Google Web APIs and the tools Google provides for learning and experimenting with them. Focus on interaction with these tools for research, be it to harvest data or to help present data in an efficient way.)
  • *Privacy, the Ultimate Interdisciplinary Challenge (Details: The questions of deletion and digital rights management, online reputation and self-representation, the right to be forgotten.)
  • *Google.org - Using Google's Strengths to Address Global Challenges (Details: Google.org is tasked with using Google's strengths in technology and the Internet to address global challenges. The session covered development of Google Flu Trends as well as more recent work including Google Earth Engine and Google Crisis Response.)
  • *Street View 3D (Details: Many problems arise when developing such a product, such as augmenting the panoramas with 3D features for better navigation and a more immersive experience.)
  • *Priority Inbox (Details: Because importance is highly personal, email importance is predicted by learning a per-user statistical model, updated as frequently as possible. The challenges include inferring the importance of mail without explicit user labeling; finding learning methods that deal with non-stationary and noisy training data; constructing models that reduce training data requirements; storing and processing terabytes of per-user feature data; predicting in a distributed and fault-tolerant way.)
Besides that, there were some very informative talks on Google's business visions and on ways of getting involved with Google, office tours (I liked the slides between the floors :)) and lots of small talk with Googlers.
 

CRY: Commitment Schemes

I have been following the Cryptography and Theory of Coding 2 course this semester (summer 12).

During the course I worked on a project about commitment schemes. The final report and a short presentation are attached below, along with some homework solutions. They are written in Slovene, but the content is mostly pure math with proofs, so the main ideas should not be too difficult to follow.

Please note these are reports and have not been subject to the usual scrutiny of published papers.
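
For readers unfamiliar with the topic, here is a minimal sketch of a hash-based commitment scheme in Python. It is an illustration only and not necessarily one of the schemes treated in the report: the function names `commit` and `verify` are made up for this example, and the choice of SHA-256 with a 32-byte random nonce is an assumption. Hiding here rests on treating the hash as a random oracle, and binding on its collision resistance.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def commit(message: bytes) -> Tuple[bytes, bytes]:
    """Commit to a message; return (commitment, nonce).

    The commitment can be published right away. The nonce is kept secret
    until the reveal phase. (Illustrative construction, see the lead-in.)
    """
    nonce = secrets.token_bytes(32)  # fresh randomness hides the message
    commitment = hashlib.sha256(nonce + message).digest()
    return commitment, nonce


def verify(commitment: bytes, nonce: bytes, message: bytes) -> bool:
    """Reveal phase: check that (nonce, message) opens the commitment."""
    expected = hashlib.sha256(nonce + message).digest()
    return hmac.compare_digest(commitment, expected)


# Commit phase: the sender publishes only the commitment.
c, r = commit(b"heads")
# Reveal phase: the sender discloses the nonce and the message.
print(verify(c, r, b"heads"))  # the honest opening is accepted
print(verify(c, r, b"tails"))  # a different message is rejected
```

The nonce has a fixed length, so concatenating it with the message is unambiguous.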

 

GSoC: MF - Matrix Factorization Techniques for Data Mining Review

Google Summer of Code 2011 has finished. The firm "pencils down" date was the 22nd of August, and today, the 26th of August, is the final evaluation deadline. It is therefore time for a short review here on my blog.

I successfully completed the program and met all the goals outlined in the original project plan, and I implemented two additional factorization methods. I have been very satisfied with the support and mentoring from both the organization and my mentor.

The project I worked on was the development of the library MF - Matrix Factorization Techniques for Data Mining, which includes a number of published matrix factorization algorithms, initialization methods, and quality and performance measures, and facilitates combining these to produce new strategies. The library contains usage examples, with applications of factorization methods to both synthetic and real-world data sets.

Matrix factorization methods have been shown to be a useful decomposition for multivariate data, as low-dimensional data representations are crucial to numerous applications in statistics, signal processing and machine learning.
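
To make the idea of a low-dimensional representation concrete, here is a minimal sketch of nonnegative matrix factorization with the multiplicative update rules for the Euclidean cost, in the spirit of the standard NMF method. It is written in plain Python to stay dependency-free and does not use the MF library; the names `nmf`, `matmul`, `transpose` and `frobenius_error`, as well as the small constant `eps` that guards against division by zero, are assumptions made for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize a nonnegative n x m matrix V into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Random positive initialization (one of many possible strategies).
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# A 3 x 3 matrix of rank 1, so a rank-1 factorization can fit it closely.
v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
w, h = nmf(v, rank=1, iterations=50)
print(frobenius_error(v, w, h))
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative if they start that way, which is why a positive random initialization is used here.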

An incomplete list of applications of matrix factorization methods includes:

  • bioinformatics,
  • environmetrics and chemometrics,
  • image processing and computer graphics,
  • text analysis,
  • miscellaneous, such as extracting speech features, transcription of polyphonic music passages, object characterization, spectral data analysis, multiway clustering, learning sound dictionaries, etc.

The example using a synthetic data set is intended as a demonstration of the MF library, since it runs all currently implemented factorization algorithms with different initialization methods and specific settings. The other examples are applications to real-world data sets in:

  • bioinformatics,
  • text analysis,
  • image processing,
  • recommendation systems.

I will outline only the most important content of the MF library here, as this is a project review and not a library reference. For details, refer to the documentation (or the code); references to the cited articles are provided in the documentation.

  • Matrix Factorization Methods
      • BD - Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
      • BMF - Binary matrix factorization [Zhang2007]
      • ICM - Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
      • LFNMF - Fisher nonnegative matrix factorization for learning Local features [Wang2004], [Li2001]
      • LSNMF - Alternative nonnegative least squares matrix factorization using projected gradient method for subproblems [Lin2007]
      • NMF - Standard nonnegative matrix factorization with Euclidean / Kullback-Leibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
      • NSNMF - Nonsmooth nonnegative matrix factorization [Montano2006]
      • PMF - Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
      • PSMF - Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
      • SNMF - (SNMF/L, SNMF/R) Sparse nonnegative matrix factorization based on alternating nonnegativity constrained least squares [Park2007]
      • SNMNMF - Sparse network regularized multiple nonnegative matrix factorization [Zhang2011]
  • Initialization Methods
  • Quality and Performance Measures
      • Distance
      • Residuals
      • Connectivity matrix
      • Consensus matrix
      • Entropy of the fitted NMF model [Park2007]
      • Dominant basis components computation
      • Explained variance
      • Feature score computation representing its specificity to basis vectors [Park2007]
      • Computation of most basis specific features for basis vectors [Park2007]
      • Purity [Park2007]
      • Residual sum of squares - can be used for rank estimate [Hutchins2008], [Frigyesi2008]
      • Sparseness [Hoyer2004]
      • Cophenetic correlation coefficient of consensus matrix - can be used for rank estimate [Brunet2004]
      • Dispersion [Park2007]
      • Selected matrix factorization method specific
  • Utils:
      • Fitted factorization model tracker across multiple runs
      • Residuals tracker across multiple factorizations / runs
  • Different factorization models
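
As an illustration of three of the quality measures listed above, here is a minimal plain-Python sketch of sparseness (following the usual definition from [Hoyer2004]) and of the connectivity and consensus matrices (as described in [Brunet2004]). The function names and signatures are made up for this example and are not the MF library's API; the library may compute these measures differently in detail.

```python
import math


def sparseness(x):
    """Hoyer sparseness of a vector.

    Returns 1 when exactly one entry is nonzero and 0 when all entries
    have the same magnitude.
    """
    n = len(x)
    l1 = sum(abs(value) for value in x)
    l2 = math.sqrt(sum(value * value for value in x))
    if n < 2 or l2 == 0:
        raise ValueError("need at least two entries and a nonzero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def connectivity(h):
    """Connectivity matrix of a mixture matrix H (rank x samples).

    Entry (i, j) is 1 if samples i and j have the same dominant basis
    component (largest entry in their column of H), else 0.
    """
    rank, m = len(h), len(h[0])
    labels = [max(range(rank), key=lambda a: h[a][j]) for j in range(m)]
    return [[1 if labels[i] == labels[j] else 0 for j in range(m)]
            for i in range(m)]


def consensus(connectivity_matrices):
    """Consensus matrix: the mean connectivity matrix over multiple runs."""
    runs = len(connectivity_matrices)
    m = len(connectivity_matrices[0])
    return [[sum(c[i][j] for c in connectivity_matrices) / runs
             for j in range(m)] for i in range(m)]


# Two hypothetical runs that disagree on where the middle sample belongs.
run1 = connectivity([[0.9, 0.8, 0.1], [0.1, 0.2, 0.7]])
run2 = connectivity([[0.9, 0.1, 0.1], [0.1, 0.8, 0.7]])
print(consensus([run1, run2]))
print(sparseness([0.0, 0.0, 5.0, 0.0]))
```

Consensus entries close to 0 or 1 indicate sample pairs that are assigned consistently across runs, which is the basis for using the consensus matrix in rank estimation.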


Join the GSoC next year! It is a great opportunity to spend the summer learning something new and having fun at the same time.

 

BioDay: Trends in Bioinformatics @Hekovnik

In May I participated in the first BioDay meeting, organized by Hekovnik in Ljubljana, Slovenia. The aim of the BioDay events is to exchange ideas and knowledge and to foster collaboration and networking among life scientists, computer scientists, bioinformaticians, mathematicians and physicists.

The first event focused on recent trends in bioinformatics, specifically on experimental methods in systems biology (by Spela Baebler, PhD) and biomedical data fusion. I presented the latter topic and discussed how heterogeneous data sources in biology can be collectively mined by data fusion. The video of the event is available at video.hekovnik.com/bioday_trendi_v_bioinformatiki (in Slovene). Enjoy!

 

