Engineering compressed static functions

Wednesday, February 21, 2018

4:00 p.m.

ISI Seminar Room 1st Floor

Prof. Sebastiano Vigna Università degli Studi di Milano

Recent advances in the compact representation of static functions (with  constant access time) have made it possible to fully exploit constructions based on random linear system. Such constructions, albeit  theoretically appealing, were previously too slow to be usable. In this paper, we extend such techniques to the problem of storing compressed  static functions, in  the sense that the space used per key should be close to the entropy of the list of values. From a theoretical  viewpoint, we are inspired by the approach of Hreinsson, Krøyer and Pagh. Values are represented using a near-optimal instantaneous code.  Then, a bit array is created so that by XOR’ing its content at a fixed  number of positions depending on the key one obtains the value,  represented by its associated codeword. In the construction phase, every  bit of the array is associated with an equation on Z/2Z, and solving the  associated system provides the desired representation. Thus, we pass from one equation per key (the non-compressed case) to one equation  per bit: the size of the system is thus approximately multiplied by the  empirical entropy of the values, making the problem much more  challenging.  We show that by carefully engineering the value  representation we can obtain a practical data structure. For example, we  can store a function with geometrically distributed output in just 2.28  bits per key, independently of the key set, with a construction time  double with respect to that of a state-of-the-art non-compressed  function, which requires ≈ log log n bits per key, where n is the number  of keys, and slightly improved lookup time. We can also store a function with an output of 10^6 values distributed following a power law of  exponent 2 in just 2.75 bits per key, whereas a non-compressed function  would require more than 20, with a threefold increase in construction  time and significantly faster lookups.