Efficient Processing of Skyline Queries Using MapReduce
Academic projects in Bangalore
The skyline operator has attracted considerable attention recently due to its broad applications. However, computing a skyline is challenging today since we have to deal with big data
The objective of this system is to check whether each skyline candidate point is actually a skyline point in every partition using MapReduce. We also develop the workload balancing methods to make the estimated execution times of all available machines to be similar.
The skyline operator has attracted considerable attention recently due to its broad applications. However, computing a skyline is challenging today since we have to deal with big data. For data intensive applications, the MapReduce framework has been widely used recently. In this paper, we propose the efficient parallel algorithm SKY-MR+ for processing skyline queries using MapReduce. We first build a quadtree-based histogram for space partitioning by deciding whether to split each leaf node judiciously based on the benefit of splitting in terms of the estimated execution time. In addition, we apply the dominance power filtering method to effectively prune non-skyline points in advance. We naxt partition data based on the regions divided by the quadtree and compute candidate skyline points for each partition using MapReduce. Finally, we check whether each skyline candidate point is actually a skyline point in every partition using MapReduce.
The skyline operator and its variants have attracted considerable attention recently due to their broad applications such as product recommendations, review evaluations with user ratings, querying wire-less sensor networks and graph analysis. However, computing a skyline is challenging today since we have to deal with big data. For data-intensive applications including similarity joins and top-k substring matching, the MapReduce framework has been considered as a de facto standard. Thus, several skyline processing algorithms, using MapReduce have been proposed.
SKY-MR builds a sky-quadtree from a sample of data based on the user-defined parameter split threshold which is the maximum number of points in each leaf node. As the split threshold decreases, the number of leaf nodes in the quadtree tends to increase and more points are allowed to be pruned by the dominance relationships between leaf nodes in the local skyline phase. In contrast, decreasing split threshold has an adverse effect on the network overhead by transmitting more duplicates of local skyline points to other leaf nodes in the global skyline phase. Since there is a trade-off between the costs of the local
and global skyline phases, when a reasonable split threshold is not provided, its performance suffers. Furthermore, since SKY-MR as well as the other MapReduce skyline algorithms do not consider workload balancing, the performances of the algorithms could degrade. Finally, there is still a lot of room for improvement to reduce the commu-nication and computation costs of the local skyline phase.