Pontificia Universidad Católica de Chile Pontificia Universidad Católica de Chile
Arroyuelo D., Carmona G., Larranaga H., Riveros F., Rojas-Morales C., Sepulveda E. (2025)

Engineering rank/select data structures for large-alphabet strings

Revista : COMPUTER JOURNAL
Tipo de publicación : ISI Ir a publicación

Abstract

Large-alphabet strings, prevalent in information retrieval and natural language processing, pose unique storage and processing challenges. This paper explores the efficient implementation of the alphabet-partition approach, introducing a compressed data structure that efficiently supports the operations ${mathsf{rank}}$ and ${mathsf{select}}$. Our implementation significantly outperforms existing methods, improving the ${mathsf{select}}$ operation speed by 80% with only 11% additional space. We demonstrate the utility of our structure in various applications, including inverted list intersections, run-length compressed strings, and distributed computation of ${mathsf{rank}}$ and ${mathsf{select}}$. Notably, for run-length compressed strings using the Burrows-Wheeler transform, our data structure requires only 0.98-1.09 times the space of state-of-the-art RLFM-indexes to achieve 1.23-2.33 times faster pattern occurrence counting while also providing better theoretical guarantees.