An Alternative Algorithm for Bulk-loading xBR+-trees

21 Jun 2016

12:30-13:00

XENIA HOTEL, PORTARIA

Nowadays, applications and systems using spatial information and embedding spatial database capabilities (e.g. location based services) are becoming more common and continuously evolving.

Spatial indexes play a crucial role in spatial databases for the efficient execution of queries involving spatial constraints. The Quadtree is a well-known hierarchical index structure based on the principle of recursive regular decomposition of space. The External Balanced Regular Plus Tree (xBR+-tree) [1] is a balanced disk-based index structure for multidimensional point data that belongs to the Quadtree family. Unlike any other Quadtree variant, this tree is a totally disk-based, height-balanced, pointer-based, multiway tree. Members of the R-tree family and especially the R*-tree have been a de-facto choice in spatial databases, so far. Recently, in [2] the performance of the xBR+-tree vs. R-trees for tree building and processing of the most popular spatial queries has been extensively studied, using medium and big datasets, and it was shown that the xBR+-tree is a big winner in execution time in all cases and a winner in I/O in most cases.

Fast building is a key issue for spatial indexes. Bulk-loading refers to the process of creating an index from scratch as a whole, when the dataset to be indexed is available beforehand, instead of creating (loading) the index gradually, when the dataset items are available one-by-one. Bulk-loading xBR+-trees is necessary for embedding this indexing method in SpatialHadoop.

In [3] an algorithm for bulk-loading xBR+-trees for big datasets residing on disk, using a limited amount of RAM was presented. It consists of four phases. During the first phase, the initial dataset file is transformed to binary format and is split in two files. During the second phase, each of the two input item files is partitioned into item blocks of size <= MemoryLimit in a regular fashion. The resulting blocks are transferred in main memory, as input for the next phase. During the third phase, for each block of items, a Quadtree is built in main memory by splitting this block as long as they correspond to regions containing more items than the capacity of xBR+-tree leaves. This Quadtree is gradually transformed to an xBR+-tree in main memory in a bottom-up fashion. During the last phase, the tree created in main memory is merged with the xBR+-tree already built in secondary memory (created during the previous iteration of the bulk-loading process), discriminating between four different cases among the heights of the trees to be merged.

In this work, we present an alternative algorithm for the third phase. Instead of building a Quadtree that is gradually transformed to an xBR+-tree in a bottom-up fashion, we directly built an xBR+-tree in main memory in a bottom-up fashion, by recursively partitioning elements that represent data items (when building leaves), or lower level nodes (when building internal nodes).

The new method focuses on using less memory, since a temporary Quadtree is not stored. Main memory is a key factor for the performance of the bulk-loading process, especially for big datasets. In the future, we plan to study the performance of the new method and also to embed it in the process of incrementally building xBR+-trees, when the dataset items are available one-by-one.

[1] G. Roumelis, M. Vassilakopoulos, T. Loukopoulos, A. Corral and Y. Manolopoulos: The xBR+-tree: An Efficient Access Method for Points, DEXA Conference, Valencia, Spain, pp. 43-58, September 2015.

[2] G. Roumelis, M. Vassilakopoulos, A. Corral and Y. Manolopoulos: Performance Comparison of xBR+-trees and R-trees for Processing Spatial Queries, submitted for publication, March 2016.

[3] G. Roumelis, M. Vassilakopoulos, A. Corral and Y. Manolopoulos: Bulk-Loading xBR+-trees, submitted for publication, May 2016.