Data cube analysis is a powerful tool for analysing multidimensional data. Computing interesting measures for data cubes and subsequently mining interesting cube groups over web-scale data sets are critical for many important real-world analyses. Previous approaches have focused on algebraic measures such as SUM, which are amenable to parallel computation and can readily benefit from recent parallel computing infrastructure such as MapReduce. However, dealing with holistic measures, such as TOP-K or the number of distinct users, is nontrivial. In this paper, we present real-world challenges in cube materialization and mining tasks on web-scale data sets. We begin by identifying an important subset of holistic measures and introduce MR-Cube, a MapReduce-based framework for efficient cube computation and identification of interesting cube groups on holistic measures. We conclude that, unlike existing techniques, which cannot scale to the 100-million-tuple mark for our data sets, MR-Cube successfully and efficiently computes cubes with holistic measures over billion-tuple data sets.
Data cube, algebraic measures, cube materialization, MapReduce, holistic measures
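
The distinction between algebraic and holistic measures drawn in the abstract can be illustrated with a small sketch. It is not part of MR-Cube; the toy tuples, the two-way partitioning, and all function names below are assumptions made purely for illustration. The sketch shows that per-partition SUM results can be merged by addition, whereas per-partition distinct-user counts cannot, because the same user may appear in more than one partition.

```python
from collections import defaultdict

# Hypothetical toy data: (region, user_id, revenue) tuples split across two partitions.
partition_a = [("us", "u1", 10), ("us", "u2", 5), ("eu", "u3", 7)]
partition_b = [("us", "u1", 3), ("eu", "u3", 2), ("eu", "u4", 1)]


def partial_sum(rows):
    """Per-partition SUM of revenue for each region (an algebraic measure)."""
    totals = defaultdict(int)
    for region, _user, revenue in rows:
        totals[region] += revenue
    return dict(totals)


def distinct_users(rows):
    """Distinct-user count for each region (a holistic measure)."""
    users = defaultdict(set)
    for region, user, _revenue in rows:
        users[region].add(user)
    return {region: len(ids) for region, ids in users.items()}


def merge_by_addition(left, right):
    """Combine two per-partition results by adding values key by key."""
    return {k: left.get(k, 0) + right.get(k, 0) for k in set(left) | set(right)}


all_rows = partition_a + partition_b

# SUM: merging the partial results equals the result over the full data.
merged_sum = merge_by_addition(partial_sum(partition_a), partial_sum(partition_b))
exact_sum = partial_sum(all_rows)

# Distinct users: adding partial counts overcounts users seen in both partitions.
naive_distinct = merge_by_addition(distinct_users(partition_a), distinct_users(partition_b))
exact_distinct = distinct_users(all_rows)

print("SUM merged:", merged_sum, "exact:", exact_sum)
print("DISTINCT naive:", naive_distinct, "exact:", exact_distinct)
```

In this toy setting, an exact distinct count for a group requires all of that group's tuples (or their full user sets) to be brought together, which is the kind of cost that makes holistic measures hard to parallelize at scale.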