Probabilistic data mechanisms
Systems that deal with data operate under constraints, and it isn’t always possible to perform computations or lookups the straightforward way.
The most common hurdles include large volumes of data, heavy computation, limited memory, or a need for (processing) speed.
One way to mitigate these is to use probabilistic data mechanisms.
What are probabilistic data mechanisms? #
It’s hard to generalize, but in essence they all provide approximate answers to problems that would normally need a lot of resources to solve exactly.
They tend to improve performance and resource usage at the cost of solution exactness–that’s their tradeoff.
Let’s explore a few of these.
Bloom filters #
A Bloom filter is used to determine if an element is a member of a set.
How it works #
It’s implemented as an array of bits, and supports two operations: adding an element and testing membership of an element.
It also uses a fixed number of hash functions, each of which maps an element to a position in the array.
The array starts out with all bits set to zero.
To add an element, it is run through every hash function, and the bit at each returned array position is set to 1.
To test membership, the element is run through the same hash functions.
- If the value is 0 at any returned position → the element is definitely not in the set
- If the value is 1 at all returned positions → the element might be in the set
The size of the array and the number of hash functions depend on the particular implementation; together they mostly determine the false-positive rate of the test operation.
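As a rough illustration, here is a minimal Python sketch of the idea. The array size, the number of hash functions, and the salted SHA-256 hashing are arbitrary choices for the example, not a tuned implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus k salted hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # the array starts out as all zeroes

    def _positions(self, element):
        # Derive k array positions by salting one hash function with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, element):
        # Set a 1 at every position returned by the hash functions.
        for pos in self._positions(element):
            self.bits[pos] = 1

    def might_contain(self, element):
        # 0 at any position -> definitely not in the set;
        # 1 at every position -> possibly in the set (false positives allowed).
        return all(self.bits[pos] == 1 for pos in self._positions(element))


bf = BloomFilter()
bf.add("https://example.com/page")
print(bf.might_contain("https://example.com/page"))   # True
print(bf.might_contain("https://example.com/other"))  # False (almost certainly)
```

In practice the bit array would be an actual bitset and the hashes something cheaper than SHA-256, but the add/test flow is the same.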
Advantages #
For testing membership, the usual choices would be a hash table or a prefix tree / trie. Their computational complexity is hard to beat: hash table lookups are constant time on average, and trie lookups scale with the length of the input.
Bloom filters also offer constant-time operations, but where they really win is space complexity. They store only bits, never the elements themselves, so they are extremely compact.
Applications #
Bloom filters are useful anywhere an exact membership check would be slow or expensive on its own.
Some examples are a web crawler checking whether a URL has already been visited, or a database checking whether a record exists–both checks can be made faster by consulting a Bloom filter first.
HyperLogLog #
This is an approach to count the number of distinct elements in a multiset–basically a set where elements can repeat.
How it works #
This algorithm uses an array of registers, where each register is an integer. The number of registers is based on the number of bits used to index into the array.
A single hash function is used to map input data to a sequence of bits.
Adding
Each element of the dataset is hashed, and the hash value is divided into two parts:
- index bits: determines which register to update
- value bits: used to calculate the number of leading zeroes
First, the index bits select a register. Then, the number of leading zeroes in the value bits is compared with the register’s current value; if the new count is larger, the register is updated with it.
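A minimal sketch of the add step could look like this, assuming a 64-bit hash, 14 index bits, and the common convention of storing the leading-zero count plus one (the “rank”); the parameter choices here are purely illustrative.

```python
import hashlib

P = 14                          # number of index bits (illustrative)
NUM_REGISTERS = 1 << P          # 2^14 = 16384 registers
registers = [0] * NUM_REGISTERS


def hll_add(registers, element, p=P):
    # Hash the element to a 64-bit integer.
    h = int.from_bytes(hashlib.sha256(str(element).encode()).digest()[:8], "big")
    # Index bits: the top p bits pick the register.
    index = h >> (64 - p)
    # Value bits: the remaining 64 - p bits.
    value = h & ((1 << (64 - p)) - 1)
    # Rank = number of leading zeroes in the value bits, plus one.
    rank = (64 - p) - value.bit_length() + 1
    # Each register keeps the largest rank it has seen.
    registers[index] = max(registers[index], rank)
```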
Counting
A harmonic mean is used to help calculate the number of unique elements.
$$h = \frac{m}{\sum_{j=1}^m 2^{-M[j]}}$$
where:
- $m$ is the number of registers
- $M[j]$ is the value stored in register $j$
There is also a constant $\alpha_m$ that depends on the number of registers.
The estimated number of unique elements is then $\alpha_m \times m \times h$.
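Translated into code, a raw estimator over the registers from the sketch above might look like the following; the $\alpha_m$ constants are the ones given in the original HyperLogLog paper, and its small- and large-range corrections are omitted for brevity.

```python
def hll_alpha(m):
    # Bias-correction constant from the HyperLogLog paper.
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)


def hll_count(registers):
    m = len(registers)
    # Harmonic mean of 2^M[j]:  h = m / sum(2^-M[j])
    h = m / sum(2.0 ** -r for r in registers)
    # Raw estimate: alpha_m * m * h
    return hll_alpha(m) * m * h
```

Calling `hll_count(registers)` after streaming elements through `hll_add` gives the approximate number of distinct elements.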
Advantages #
HyperLogLog really shines on large datasets: it uses a small, fixed amount of memory no matter how many elements it has seen, so it’s very space-efficient.
Also, adding and counting can be done in constant time, so it tends to be fast as well.
Applications #
This approach can be used to count unique visitors or unique views of a web page.
Also, data warehouses like BigQuery and Presto use HyperLogLog to provide approximate count-distinct queries over their data.
Count-min sketch #
This is a data structure that tries to determine how frequent elements are in a set of data–mainly a stream of data. A count-min sketch’s estimate is always equal to or greater than the true frequency of an element.
How it works #
It’s implemented as a 2D array. It has a width $w$ for the number of columns and a depth $d$ for the number of rows. The cells in this array are called buckets.
There are also $d$ independent hash functions corresponding to the number of rows. These hash functions map inputs to a column position, $[0, w - 1]$.
The matrix starts out with all buckets set to zero.
Updating the sketch
An input element comes in from the data. The element is run through each of the $d$ hash functions.
The $i$th hash function will yield a column position $c$. Then, the bucket at row $i$, column $c$ is incremented by 1.
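A minimal update sketch, with illustrative dimensions and one salted hash per row standing in for the $d$ independent hash functions:

```python
import hashlib

WIDTH = 2000   # w columns (illustrative)
DEPTH = 5      # d rows, one hash function per row (illustrative)
table = [[0] * WIDTH for _ in range(DEPTH)]


def cm_hash(element, row, width=WIDTH):
    # Salting a single hash with the row index stands in for d independent hashes.
    digest = hashlib.sha256(f"{row}:{element}".encode()).hexdigest()
    return int(digest, 16) % width


def cm_update(table, element):
    # For each row i, increment the bucket in the column the i-th hash returns.
    for i, row in enumerate(table):
        row[cm_hash(element, i)] += 1
```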
Querying the sketch
Suppose there is an input element $x$. All $d$ hash functions are applied to get the column position at each row.
There are $d$ bucket values as a result. Taking the minimum across these values gives the estimated frequency of $x$.
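Continuing the sketch above (same `table` and `cm_hash` helper), querying just takes the minimum across the rows:

```python
def cm_query(table, element):
    # The smallest of the d buckets is the least-inflated estimate;
    # collisions can make it overcount, but it never undercounts.
    return min(row[cm_hash(element, i)] for i, row in enumerate(table))


cm_update(table, "item-42")
cm_update(table, "item-42")
print(cm_query(table, "item-42"))  # at least 2
```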
Choosing parameters
Generally:
- a larger width $w$ means fewer hash collisions within each row, which tightens the error bound
- a larger depth $d$ means more independent estimates to take the minimum over, which raises the confidence that the bound holds

Both of these can be tuned to hit a target error bound and confidence. The error factor bounds how far above the true frequency the estimate can be (as a fraction of the total number of updates), and it drives $w$. The confidence is the probability that the estimate actually stays within that bound, and it drives $d$. A concrete sizing rule is sketched below.
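Under the standard count-min analysis, the dimensions follow directly from a target error factor $\varepsilon$ and failure probability $\delta$; a small helper (names are illustrative):

```python
import math


def cm_dimensions(epsilon, delta):
    # w = ceil(e / epsilon): bounds the overcount at epsilon * N per row,
    # where N is the total number of updates seen so far.
    # d = ceil(ln(1 / delta)): the chance that every row is badly collided,
    # so the minimum still exceeds the bound, is at most delta.
    w = math.ceil(math.e / epsilon)
    d = math.ceil(math.log(1.0 / delta))
    return w, d


print(cm_dimensions(0.001, 0.01))  # (2719, 5)
```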
Advantages #
A count-min sketch is quite space-efficient compared to its non-probabilistic counterpart, the hash map. It can also be updated and queried in constant time, which makes it fast.
The estimate is also never less than the true frequency, so there’s no risk of underestimation.
Applications #
One use case is in data streams, where count-min sketches can provide a quick idea of emerging trends or elements rising in frequency.
Another application is tracking popular products–or any setting where a few heavy hitters are expected among many low-frequency items.