Marinka Zitnik

Fusing bits and DNA

• • • Fractal Dimension Computation Support in MF Library

I have always been fascinated by the world of fractals and have been deeply enthusiastic exploring the maths behind them. This post is announcing the support of the fractal dimension computation in the MF - Matrix Factorization for Data Mining library.

In the following paragraphs we shortly revise the most important concepts and definitions of the fractal dimension.

The embedding dimensionality of a data set is the number of attributes in the data set. The intrinsic dimensionality is defined as the actual number of dimensions in which the m-dimensional original vectors can be embedded under the assumption that some distance in the reduced space is kept among them. Given a fractal as a self-similar set of points with self-similar pieces, where each is scaled down by a factor of , the fractal dimensionĀ Āof the object is defined as .

Example: Sierpinski triangle (My seminar work for CG on L-systems - figure 1 in Appendix of the Report document) consists of three self-similar parts and each is scaled down by a factor of two, therefore its fractal dimension is For the finite set of points in a vector space, we say the set is statistically self-similar on a range of scales on which the self-similarity is true. However in the theory self-similar object should have infinitely many points because each self-similar part is a scaled-down version of the original object. As a measure of the intrinsic fractal dimension of a data set, the slope of the correlation integral is used. The correlation integralĀ Āfor the data setĀ is defined as Given a data setĀ which is statistically self-similar in the range , its correlation fractal dimension is It has been shown that the correlation fractal dimension corresponds to the intrinsic dimension of a data set. Many properties hold for the correlation fractal dimension, see  and . For us it is especially important, that the intrinsic dimensionality gives a lower bound on the number on attributes needed to keep the vital characteristics of the data set.

A fast algorithm for the computation of the intrinsic dimension of a data set presented in  is implemented in the MF - Matrix Factorization for Data Mining library. Intuitive explanation of the correlation fractal dimension is that is measures how the number of neighbor points increases with the increase of the distance. It therefore measures the spread of the data and the fractal dimension equal to the embedding dimension means that the spread of the points in the data set is maximum.

Of high importance is a Conjecture 1 in : With all the parameters being equal, a dimensionality reduction method which achieves higher fractal dimension in the reduced space is better than the rest for any data mining task. Therefore correlation fractal dimension of a data set can be used:

• for determining the optimal number of dimensions in the reduced space,
• as a performance comparison tool between dimensionality reduction methods,
and all this can be done in way that is scalable to large data sets.

Ā