First of all, embedded BI is the result of incorporating Performance Point Server into SharePoint as PerformancePoint Services. Before SharePoint 2010, Performance Point Server was an independent, separate product, but this is no longer the case. The new services enrich SharePoint with KPIs, scorecards, matrices and much more, all of which can easily be rendered as dashboards or charting web parts, consumed through Visio Services, or used via the numerous improvements to Excel Services.

Together with PerformancePoint Services you get the Dashboard Designer, a GUI with which you can hook up the data you want to drive your scorecard or KPI (e.g. connect it to SQL Server Analysis Services, easily create KPIs in the designer and render them as web parts in SharePoint). As data mining and OLAP are becoming ever more important BI technologies, it is worth stressing their benefits. First, you are in control of what happens with your data: data sources can be configured by admins and dashboards by departmental business units. Furthermore, it is very easy to slice and dice the data to get the answers you are looking for. One of the numerous new features is the decomposition tree: we can drill into key nodes and get more detail in a very visual, graphical way, which enriches the models from which we pull the data, so users can get quality answers quickly.

A few improvements have been made to Excel Services, allowing users to publish and share parts of workbooks or whole workbooks, while the owner retains total control over what users consume.

There is another novelty worth mentioning, namely Visio Services. It is simply a matter of creating a diagram (e.g. a network diagram or a graph of resources used on a project) that is data bound and then reflects real, current data (e.g. changing pictures or states in accordance with the progress of the project). It is less a static diagram and more a simple, user-friendly workflow that updates itself. Do not confuse this with another powerful tool, Windows Workflow Foundation (WWF), used to create complex workflows in .NET and Visual Studio.

Google has announced the Google Anita Borg Scholarship recipients and finalists, a scholarship for which I applied this year, and I am among the 147 undergraduate and graduate students chosen worldwide. Just for clarification: this is completely unrelated to GSoC (the only common denominator being Google itself); the scholarship is awarded based on the strength of candidates' academic performance, leadership experience and demonstrated passion for computer science.

Scholars from Europe attend a Scholars' Retreat at the European Google centre in Zurich in June, and I am very much looking forward to this event and to meeting some fascinating people. The retreat will include workshops, speakers, panelists, breakout sessions and social activities scheduled over a couple of days.

- (Official Google Blog with the published results of the Scholars' selection process) link
- (Official Google Students Blog with the published announcement of the Scholars) link
- (Faculty News) link, in Slovene

In the following paragraphs we briefly review the most important concepts and definitions related to the fractal dimension.

The **embedding dimensionality** of a data set is the number of attributes in the data set. The **intrinsic dimensionality** is defined as the actual number of dimensions in which the {mathpublisher}n{/mathpublisher} m-dimensional original vectors can be embedded, under the assumption that some distance between them is preserved in the reduced space. Given a fractal as a self-similar set of points with {mathpublisher}r{/mathpublisher} self-similar pieces, each scaled down by a factor of {mathpublisher}s{/mathpublisher}, the fractal dimension {mathpublisher}D{/mathpublisher} of the object is defined as

{mathpublisher}D = { \log r} / { \log s} {/mathpublisher}.

*Example: the Sierpinski triangle (My seminar work for CG on L-systems - figure 1 in the Appendix of the Report document) consists of three self-similar parts, each scaled down by a factor of two, therefore its fractal dimension is {mathpublisher}D = {\log 3} / {\log 2} \approx 1.58.{/mathpublisher}*
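The arithmetic of this example can be checked directly (a trivial sketch of my own, not part of any library mentioned here):

```python
import math

# D = log r / log s for a self-similar set with r self-similar pieces,
# each scaled down by a factor of s.
def fractal_dimension(r, s):
    return math.log(r) / math.log(s)

# Sierpinski triangle: r = 3 self-similar parts, scaling factor s = 2.
print(round(fractal_dimension(3, 2), 2))  # 1.58
```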

For a finite set of points in a vector space, we say the set is **statistically self-similar** on a range of scales {mathpublisher}(a, b){/mathpublisher} on which the self-similarity holds. However, in theory a self-similar object should have infinitely many points, because each self-similar part is a scaled-down version of the original object. As a measure of the intrinsic fractal dimension of a data set, the slope of the correlation integral is used. The correlation integral {mathpublisher}C(r){/mathpublisher} for the data set {mathpublisher}S{/mathpublisher} is defined as

{mathpublisher}C(r) = Count(dist(u, v) <= r; u \in S, v \in S, u != v). {/mathpublisher}

Given a data set {mathpublisher}S{/mathpublisher} which is statistically self-similar in the range {mathpublisher}(a,b){/mathpublisher}, its **correlation fractal dimension** {mathpublisher}D{/mathpublisher} is

{mathpublisher}D = {\partial \log C(r)} / {\partial \log r}, r \in [a, b]. {/mathpublisher}

It has been shown that the **correlation fractal dimension corresponds to the intrinsic dimension of a data set**. Many properties hold for the correlation fractal dimension; see [1] and [2]. For us it is especially important that the **intrinsic dimensionality gives a lower bound on the number of attributes needed to keep the vital characteristics of the data set.**

A fast algorithm for computing the intrinsic dimension of a data set, presented in [2], is implemented in the MF - Matrix Factorization for Data Mining library. An intuitive explanation of the correlation fractal dimension is that it measures how the number of neighboring points increases as the distance increases. It therefore measures the spread of the data; a fractal dimension equal to the embedding dimension means that the spread of the points in the data set is maximal.
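To make the definitions concrete, here is a brute-force numpy sketch of the correlation integral and the slope estimate. This is an O(n^2) illustration with function names of my own choosing, not the fast algorithm from [2] or the MF library's API:

```python
import numpy as np

def correlation_integral(points, r):
    """C(r): count of ordered pairs (u, v), u != v, with dist(u, v) <= r."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return int((dist <= r).sum()) - len(points)  # drop the zero diagonal

def correlation_dimension(points, radii):
    """Slope of log C(r) versus log r, fitted over the self-similar range."""
    counts = [correlation_integral(points, r) for r in radii]
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

# Points spread uniformly over a 2-D square: the estimate approaches the
# embedding dimension 2, signalling maximal spread of the data.
rng = np.random.default_rng(0)
pts = rng.uniform(size=(1000, 2))
print(correlation_dimension(pts, radii=np.geomspace(0.05, 0.2, 8)))
```

For a line of points the same estimate would be close to 1, which is exactly the "lower bound on the number of attributes" intuition from above.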

Of high importance is Conjecture 1 in [1]: *With all the parameters being equal, a dimensionality reduction method which achieves higher fractal dimension in the reduced space is better than the rest for any data mining task.* Therefore the correlation fractal dimension of a data set can be used:

- for determining the optimal number of dimensions in the reduced space,
- as a performance comparison tool between dimensionality reduction methods,

and all this can be done in a way that is scalable to large data sets.

Recommended reading:

- Kumaraswamy, K., (2003). Fractal Dimension for Data Mining. Carnegie Mellon University.
- Traina Jr., C., Traina, A., Wu, L., Faloutsos, C. (2010). Fast Feature Selection Using Fractal Dimension. Journal of Information and Data Management, 1(1), 3-16.

In [1] the concept of the intrinsic fractal dimension of a data set is introduced, and it is shown how the fractal dimension can aid in several data mining tasks. In [2] a fast O(n) algorithm for computing the fractal dimension of a data set is presented. On top of that, a fast, scalable algorithm for quickly selecting the most important attributes of a given set of n-dimensional vectors is described.

Below you will find a report that I produced, containing a few interesting (some elementary) proofs and implementations of algorithms. The interested reader should check the literature above to be able to follow the text.



Below is this year's logo.

Let me mention just a few workshops and talks I have attended during the Retreat:

- Keynote speech
- Working in Industry
- Product Design Workshop
- Career Panel
- Poster Show
- Android Technical Talk
- Open source and Google Support of its Advancement in the Community (Details: Open source community at Google, development model and how Google uses it and supports advancement of this community.)
- Google Web API (Details: Generic principles of Google Web APIs and the tools Google provide for learning and experimenting with them. Focus on interaction with these tools for research, be it to harvest data or to help present data in an efficient way.)
- Privacy, the Ultimate Interdisciplinary Challenge (Details: The questions of deletion and digital rights management, online reputation and self-representation, the right to be forgotten.)
- Google.org - Using Google's Strengths to Address Global Challenges (Details: Google.org is tasked with using Google's strengths in technology and the Internet to address global challenges. The session covered the development of Google Flu Trends as well as more recent work including Google Earth Engine and Google Crisis Response.)
- Street View 3D (Details: Many problems arise when developing such a product, such as augmenting the panoramas with 3D features for better navigation and a more immersive experience.)
- Priority Inbox (Details: Because importance is highly personal, email importance is predicted by learning a per-user statistical model, updated as frequently as possible. The challenges include inferring the importance of mail without explicit user labeling; finding learning methods that deal with non-stationary and noisy training data; constructing models that reduce training data requirements; storing and processing terabytes of per-user feature data; predicting in a distributed and fault-tolerant way.)

Besides that, there were some very informative talks on Google's business vision and on ways of getting involved with Google, office tours (I liked the slides between the floors :)) and lots of small talk with Googlers.

I successfully completed the program and met all the goals outlined in the original project plan, plus two additional factorization methods that I implemented. I have been very satisfied with the support and mentoring from both the organization and my mentor.

The project I worked on was developing the library **MF - Matrix Factorization Techniques for Data Mining**, which includes a number of published matrix factorization algorithms, initialization methods, and quality and performance measures, and facilitates combining these to produce new strategies. The library contains examples of usage; applications of the factorization methods to both synthetic and real-world data sets are provided.

Matrix factorization has been shown to be a useful decomposition for multivariate data, as low-dimensional data representations are crucial to numerous applications in statistics, signal processing and machine learning.

An incomplete list of applications of matrix factorization methods includes:

- bioinformatics,
- environmetrics and chemometrics,
- image processing and computer graphics,
- text analysis,
- miscellaneous, such as extracting speech features, transcription of polyphonic music passages, object characterization, spectral data analysis, multiway clustering, learning sound dictionaries, etc.

The example using a synthetic data set is intended as a demonstration of the MF library, since all currently implemented factorization algorithms are run with different initialization methods and specific settings. The other examples are applications to real-world data sets in:

- bioinformatics,
- text analysis,
- image processing,
- recommendation systems.

I will outline only the most important content of the MF library here (for details refer to the documentation or the code), as this is a project review and not a library reference (references to articles are provided in the documentation).

- Matrix Factorization Methods

- BD - Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
- BMF - Binary matrix factorization [Zhang2007]
- ICM - Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
- LFNMF - Fisher nonnegative matrix factorization for learning local features [Wang2004], [Li2001]
- LSNMF - Alternating nonnegative least squares matrix factorization using the projected gradient method for subproblems [Lin2007]
- NMF - Standard nonnegative matrix factorization with Euclidean / Kullback-Leibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
- NSNMF - Nonsmooth nonnegative matrix factorization [Montano2006]
- PMF - Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
- PSMF - Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
- SNMF (SNMF/L, SNMF/R) - Sparse nonnegative matrix factorization based on alternating nonnegativity-constrained least squares [Park2007]
- SNMNMF - Sparse network-regularized multiple nonnegative matrix factorization [Zhang2011]

- Initialization Methods

- Random
- Fixed
- NNDSVD [Boutsidis2007]
- Random C [Albright2006]
- Random VCol [Albright2006]

- Quality and Performance Measures

- Distance
- Residuals
- Connectivity matrix
- Consensus matrix
- Entropy of the fitted NMF model [Park2007]
- Dominant basis components computation
- Explained variance
- Feature score computation representing its specificity to basis vectors [Park2007]
- Computation of most basis specific features for basis vectors [Park2007]
- Purity [Park2007]
- Residual sum of squares - can be used for rank estimate [Hutchins2008], [Frigyesi2008]
- Sparseness [Hoyer2004]
- Cophenetic correlation coefficient of consensus matrix - can be used for rank estimate [Brunet2004]
- Dispersion [Park2007]
- Measures specific to selected matrix factorization methods

- Utils:

- Fitted factorization model tracker across multiple runs
- Residuals tracker across multiple factorizations / runs
- Different factorization models
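As a flavor of what the standard NMF in the list above does, here is a minimal numpy sketch of the Euclidean (Frobenius) multiplicative updates of [Lee2001]. This is an illustration of the technique only, with names of my own choosing, not the MF library's API:

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    """Factor V ~ W H with W, H nonnegative, by minimizing the Frobenius
    cost ||V - WH|| with multiplicative update equations [Lee2001]."""
    rng = np.random.default_rng(0)
    W = rng.uniform(size=(V.shape[0], rank))
    H = rng.uniform(size=(rank, V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update mixture coefficients
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update basis vectors
    return W, H

# Fit a rank-5 model to a random nonnegative 20 x 10 matrix.
V = np.abs(np.random.default_rng(1).normal(size=(20, 10)))
W, H = nmf(V, rank=5)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))  # relative fit error
```

The updates preserve nonnegativity because every factor in them is nonnegative, which is the key design point of the multiplicative scheme.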

Relevant links:

- Online documentation (the library will soon be integrated into Orange and this link is subject to change)
- Orange wiki site of the project
- Github repository (the library will soon be integrated into Orange and this link is subject to change)

*Join the GSoC next year! It is a great opportunity to spend the summer learning something new and having fun at the same time.*


During the course I worked on a project about commitment schemes. Please find attached below the final report and a short presentation. Some homework solutions are attached as well. The language is Slovene, but it is mostly pure math with proofs, so it should not be too difficult to follow the main ideas.
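As a small taste of the topic (a toy example of my own, not taken from the report): a hash-based commitment is hiding because of the random nonce, and binding as long as the hash is collision resistant.

```python
import hashlib
import secrets

def commit(message: bytes):
    """Commit phase: publish c = H(r || m); keep r and m secret."""
    r = secrets.token_bytes(32)  # random nonce, makes the commitment hiding
    return hashlib.sha256(r + message).hexdigest(), r

def reveal_ok(c, r, message: bytes) -> bool:
    """Reveal phase: the receiver recomputes the hash (binding check)."""
    return hashlib.sha256(r + message).hexdigest() == c

# Coin-flipping over the phone: commit to "heads" before the flip.
c, r = commit(b"heads")
print(reveal_ok(c, r, b"heads"), reveal_ok(c, r, b"tails"))  # True False
```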

Please note these are reports and have not been subject to the usual scrutiny of published papers.

- Commitment Schemes report (Slovene) (.pdf)
- Commitment Schemes presentation (Slovene) (.pdf)
- HW 1 Solutions (Slovene) (.pdf)
- HW 2 Solutions (Slovene) (.pdf)
- HW 3 Solutions (Slovene) (.pdf)
- HW 4 Solutions (Slovene) (.pdf)
- HW 5 Solutions (Slovene) (.pdf)
- HW 6 Solutions (Slovene) (.pdf)

The first event focused on recent trends in bioinformatics, specifically on experimental methods in systems biology (by Spela Baebler, PhD) and on biomedical data fusion. I presented the latter topic and discussed how heterogeneous data sources in biology can be mined collectively by data fusion. The video of the event is available at video.hekovnik.com/bioday_trendi_v_bioinformatiki (in Slovene). Enjoy!
