Google Summer of Code 2011 has finished. The firm "pencils down" date was 22 August, and today, 26 August, is the final evaluation deadline. So it is time for a short review here on my blog.

I successfully completed the program and met all the goals outlined in the original project plan; I also implemented two additional factorization methods. I have been very satisfied with the support and mentoring from both the organization and my mentor.

The project I worked on was the development of the library **MF - Matrix Factorization Techniques for Data Mining**, which includes a number of published matrix factorization algorithms, initialization methods, and quality and performance measures, and makes it easy to combine them into new strategies. The library comes with usage examples, including applications of factorization methods to both synthetic and real-world data sets.

Matrix factorization methods have been shown to be a useful way to decompose multivariate data, since low-dimensional data representations are crucial to numerous applications in statistics, signal processing and machine learning.

An incomplete list of applications of matrix factorization methods includes:

- bioinformatics,
- environmetrics and chemometrics,
- image processing and computer graphics,
- text analysis,
- miscellaneous, such as extracting speech features, transcription of polyphonic music passages, object characterization, spectral data analysis, multiway clustering, learning sound dictionaries, etc.

The example on a synthetic data set is intended as a demonstration of the MF library, since it runs all currently implemented factorization algorithms with different initialization methods and specific settings. The other examples apply the library to real-world data sets in:

- bioinformatics,
- text analysis,
- image processing,
- recommendation systems.

I will outline only the most important content of the MF library here, as this is a project review and not a library reference. For details, refer to the documentation or the code; references to the cited articles are provided in the documentation.

- Matrix Factorization Methods

- BD - Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
- BMF - Binary matrix factorization [Zhang2007]
- ICM - Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
- LFNMF - Fisher nonnegative matrix factorization for learning Local features [Wang2004], [Li2001]
- LSNMF - Alternative nonnegative least squares matrix factorization using projected gradient method for subproblems [Lin2007]
- NMF - Standard nonnegative matrix factorization with Euclidean / Kullback-Leibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
- NSNMF - Nonsmooth nonnegative matrix factorization [Montano2006]
- PMF - Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
- PSMF - Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
- SNMF (SNMF/L, SNMF/R) - Sparse nonnegative matrix factorization based on alternating nonnegativity constrained least squares [Park2007]
- SNMNMF - Sparse network regularized multiple nonnegative matrix factorization [Zhang2011]
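
To give a feel for what the standard NMF entry refers to, here is a minimal NumPy sketch of the Euclidean multiplicative update equations from [Lee2001], together with the Frobenius cost. It is an illustration only and not the MF library's code or API: the names `nmf_euclidean` and `frobenius_distance`, the parameters, and the small constant `eps` are assumptions made up for this post.

```python
import numpy as np


def nmf_euclidean(V, rank, max_iter=200, seed=0, eps=1e-9):
    """Approximate nonnegative V (m x n) as W (m x rank) @ H (rank x n).

    Hypothetical helper for illustration; not part of the MF library.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Random nonnegative starting point (a plain "random" initialization).
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(max_iter):
        # Multiplicative updates keep W and H nonnegative.
        # eps in the denominator only guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def frobenius_distance(V, W, H):
    """Half of the squared Frobenius norm of the residual V - WH."""
    return 0.5 * np.sum((V - W @ H) ** 2)
```

Running more iterations from the same starting point should not increase `frobenius_distance`, which is a quick sanity check when experimenting with the sketch.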

- Initialization Methods

- Random
- Fixed
- NNDSVD [Boutsidis2007]
- Random C [Albright2006]
- Random VCol [Albright2006]
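
As an example of how cheap such an initialization can be, below is a rough sketch of the idea behind Random VCol [Albright2006] as I understand it: each column of the basis matrix W is the average of `p` randomly chosen columns of the target matrix V, and each row of the mixture matrix H is the average of `p` randomly chosen rows of V. The function name `random_vcol`, its signature and the default `p=3` are assumptions for illustration and do not mirror the MF library's interface.

```python
import numpy as np


def random_vcol(V, rank, p=3, seed=0):
    """Sketch of a Random VCol style initialization (hypothetical helper).

    p is the number of columns / rows averaged per factor; the default
    value here is arbitrary and chosen only for the example.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    p_cols = min(p, n)  # cannot sample more columns than V has
    p_rows = min(p, m)  # cannot sample more rows than V has
    W = np.empty((m, rank))
    H = np.empty((rank, n))
    for k in range(rank):
        cols = rng.choice(n, size=p_cols, replace=False)
        W[:, k] = V[:, cols].mean(axis=1)  # average of random columns of V
        rows = rng.choice(m, size=p_rows, replace=False)
        H[k, :] = V[rows, :].mean(axis=0)  # average of random rows of V
    return W, H
```

Because the factors are averages of entries of V, they are nonnegative whenever V is, which makes them a valid starting point for nonnegative factorization methods.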

- Quality and Performance Measures

- Distance
- Residuals
- Connectivity matrix
- Consensus matrix
- Entropy of the fitted NMF model [Park2007]
- Dominant basis components computation
- Explained variance
- Feature score computation representing its specificity to basis vectors [Park2007]
- Computation of most basis specific features for basis vectors [Park2007]
- Purity [Park2007]
- Residual sum of squares - can be used for rank estimate [Hutchins2008], [Frigyesi2008]
- Sparseness [Hoyer2004]
- Cophenetic correlation coefficient of consensus matrix - can be used for rank estimate [Brunet2004]
- Dispersion [Park2007]
- Measures specific to the selected matrix factorization method
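
A few of the measures above are simple enough to sketch in a handful of lines. The snippet below shows sparseness as defined in [Hoyer2004], a connectivity matrix built by assigning each sample to its dominant basis component, and a consensus matrix as the average of connectivity matrices over several runs, in the spirit of [Brunet2004]. The function names are made up for this post and are not the MF library's API.

```python
import numpy as np


def sparseness(x):
    """Hoyer sparseness of a vector: 1 for a single nonzero entry,
    0 when all entries are equal. Assumes x has more than one entry
    and is not all zeros."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)


def connectivity(H):
    """Connectivity matrix of a mixture matrix H (rank x n_samples).

    Entry (i, j) is 1 if samples i and j have the same dominant
    basis component (largest entry in their column of H), else 0.
    """
    labels = np.argmax(H, axis=0)
    return (labels[:, None] == labels[None, :]).astype(float)


def consensus(H_list):
    """Average connectivity matrix over several factorization runs."""
    return np.mean([connectivity(H) for H in H_list], axis=0)
```

Entries of the consensus matrix close to 0 or 1 indicate that the sample assignments are stable across runs, which is the property the cophenetic correlation coefficient summarizes.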

- Utils:

- Fitted factorization model tracker across multiple runs
- Residuals tracker across multiple factorizations / runs
- Different factorization models

Relevant links:

- Online documentation (the library will soon be integrated into Orange and this link is subject to change)
- Orange wiki site of the project
- GitHub repository (the library will soon be integrated into Orange and this link is subject to change)

*Join the GSoC next year! It is a great opportunity to spend the summer learning something new and having fun at the same time.*