Introduction
Recently I came across an interesting problem, which went roughly like this:
You have a web server that logs every page hit. At the end of the day there are 50 million entries. You want to know how many of those 50 million visitors were unique. Here is the problem: you only have 12 KB of memory at your disposal. How would you do this?
Let's formalize the problem. We have a stream of elements $S = (a_1, a_2, \ldots, a_n)$ and we let $D$ denote the number of distinct elements in $S$. This is the count-distinct problem.
The straightforward approach would be to store every user ID in a hash set. Assuming 64-bit IDs and a 0.5 load factor, that is 16 bytes per stored ID, so tens of millions of unique visitors already cost hundreds of megabytes of memory. The secret to doing this with far less memory is probabilistic algorithms, where the goal is to give the best possible estimate for $D$.
In this post, we'll build up towards the HyperLogLog algorithm, which can estimate $D$ (which can be billions of elements!) using less than 12 KB of memory with under 1% error. It sounds crazy!
Hash Functions
Every algorithm we'll see relies on a hash function $h$ that maps each element in the stream to an $L$-bit binary string. What does that look like in practice? Let's walk through an example.
Let's say we have the user ID alice. Feeding it into a hash function might produce the bitstring:
$$ h(\texttt{alice})=0011010110011010... $$
This is a fixed-length string of 0s and 1s. Every time we see alice in the stream, we get the exact same bitstring back.
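As a concrete sketch of this (the choice of BLAKE2b is my assumption, the post doesn't prescribe a particular hash function), here is how an element can be turned into a fixed 64-bit bitstring:

```python
import hashlib

def hash_bits(x: str, L: int = 64) -> str:
    """Map an element to an L-bit binary string (BLAKE2b as a stand-in hash)."""
    digest = hashlib.blake2b(x.encode(), digest_size=8).digest()  # 8 bytes = 64 bits
    return format(int.from_bytes(digest, "big"), f"0{L}b")

print(hash_bits("alice"))                       # a fixed 64-bit string of 0s and 1s
print(hash_bits("alice") == hash_bits("alice")) # → True: determinism
```

The second print illustrates the determinism property: feeding in alice twice produces the exact same bitstring.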
We need two properties of our hash functions:
- Determinism. $h(x)$ is the same every time we see $x$, so duplicates map to the same hash, giving us deduplication for free.
- Uniformity. For distinct elements, each bit of $h(x)$ behaves like an independent fair coin flip.
Using these properties, hash functions make the trick in the next section possible.
Leading Zeros
This is the trick that makes the cardinality estimation possible. Let's define $\rho(x)$ as the number of leading zeros in $h(x)$, i.e. the count of 0s before the first 1. For example:
$$ h(x) = 00010110\ldots \implies \rho(x) = 3 $$ $$ h(y) = 10011\ldots \implies \rho(y) = 0 $$
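On a bitstring, this definition is a one-liner in Python:

```python
def rho(bits: str) -> int:
    """Count of 0s before the first 1 in a bitstring."""
    return len(bits) - len(bits.lstrip("0"))

print(rho("00010110"))  # → 3
print(rho("10011"))     # → 0
```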
Next up we bring in some probability. Notice how since each bit is an independent fair coin flip, we can write:
$$ \Pr[\rho(x) \geq k] = \frac{1}{2^k} $$
This is the probability of getting $k$ (or more) leading zeros for a single element: the first $k$ bits must all be 0, and each bit is 0 with probability $\frac{1}{2}$ independently, so the probability is $\frac{1}{2^k}$.
Now consider our $D$ distinct elements, each independently producing a leading-zero count. Let us define the maximum leading zero count as being:
$$R = \max_x \rho(x)$$
Now, what is the probability that the maximum $R$ is less than $k$? For that to happen, every single one of the $D$ elements must individually have fewer than $k$ leading zeros. Since the elements are independent:
$$ \Pr[R < k] = \left(1 - \frac{1}{2^k}\right)^D $$
We can further simplify the right-hand side using an approximation. Let's use the limit definition for $e$:
$$\lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n = e^x$$
Substituting $x = -D/2^k$ and $n = D$:
$$\left(1 - \frac{1}{2^k}\right)^D = \left(1 + \frac{-D/2^k}{D}\right)^D \approx e^{-D/2^k}$$
The approximation is good when $2^k$ is large (so that $1/2^k$ is small), which is exactly the regime we care about: the interesting values of $k$ sit near $\log_2 D$, and we only need a probabilistic algorithm when $D$ is large in the first place.
So to recap, we have: $$ \Pr[R < k] \approx e^{-D/2^k} $$
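We can sanity-check this approximation with a small simulation. Here `random.getrandbits(32)` stands in for a uniform 32-bit hash; the simulated frequency of $R < k$ should land close to $e^{-D/2^k}$:

```python
import math
import random

random.seed(42)  # reproducible simulation

def leading_zeros32(h: int) -> int:
    """Leading zeros of a 32-bit value."""
    return 32 - h.bit_length()

def pr_R_less_than_k(D: int, k: int, trials: int = 2000) -> float:
    """Estimate Pr[R < k] by drawing D random 32-bit 'hashes' per trial."""
    hits = 0
    for _ in range(trials):
        R = max(leading_zeros32(random.getrandbits(32)) for _ in range(D))
        hits += R < k
    return hits / trials

D, k = 1000, 10
print(pr_R_less_than_k(D, k))  # simulated probability
print(math.exp(-D / 2**k))     # the e^{-D/2^k} approximation, ≈ 0.377
```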
If we graph this, we see that it transitions from $\approx 0$ to $\approx 1$ around $k = \log_2 D$, and the transition gets sharper as $D$ grows. In other words, the maximum number of leading zeros across all distinct hashes concentrates around $\log_2 D$, so we can estimate $D$ with:
$$ D \approx 2^R $$
Pretty cool right?!
Implementation in Python
This is remarkably simple to implement in Python, just a few lines:
def estimate_distinct(stream, L=64):
    R = 0
    for x in stream:
        h = hash(x) & ((1 << L) - 1)    # mask to L bits; built-in hash is a stand-in, use a uniform hash in production
        R = max(R, L - h.bit_length())  # leading zeros of an L-bit value
    return 2**R
The program loops through the stream, hashes each element, computes its leading-zero count, and keeps the running maximum in $R$.
The memory reduction here is extraordinary. In the naive approach, we store each element, requiring $O(n)$ memory. Here we only store $R$, which can be at most $\log_2 n$ due to the leading zero count. Therefore storing this in memory takes only $\log_2(\log_2 n)$ bits. For $n = 10^{12}$, that is about 6 bits to summarize a trillion-element stream. In other words, this algorithm uses only $O(\log \log n)$ memory.
However, as with many simple probabilistic algorithms, the variance is high: $R$ fluctuates by a few units around $\log_2 D$, and each unit shifts the estimate $2^R$ by a factor of 2, since the estimate can only ever be a power of 2. The relative standard error works out to:
$$ \frac{\sigma(2^R)}{D} \approx 0.78 $$
That is 78% error. This was proved in the original 1985 paper by Flajolet and Martin using a technique called analytic combinatorics, including the Mellin transform (named after the Finnish mathematician Hjalmar Mellin!). It's one of the most unique applications I have ever seen of the transform, so if you are into the details I suggest checking it out!
To combat the large standard error, Flajolet and Martin used a simple technique which we will discuss next.
Flajolet–Martin (1985)
Philippe Flajolet and G. Nigel Martin applied the trick of averaging independent copies to lower the standard error.
Run $k$ independent copies of the algorithm, each with its own hash function $h_i$, maintaining registers $R_1, R_2, \ldots, R_k$. The distinct count estimation therefore is:
$$ D_{FM} = \frac{1}{k} \sum_{i=1}^{k} 2^{R_i} $$
By independence, the standard error drops as $1/\sqrt{k}$:
$$ \sigma_\text{FM} \approx \frac{0.78}{\sqrt{k}} $$
For 2% error, we need $0.78/\sqrt{k} = 0.02$, giving $k \approx 1521$. Each register is only a few bits, so the memory remains fine (~1 KB). However, computing 1500 hash functions per element is expensive and this motivated the next breakthrough.
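As a sketch of the averaging idea (salting the input with `f"{i}:{x}"` to simulate $k$ independent hash functions is my shortcut, not the paper's exact construction):

```python
import hashlib

def fm_estimate(stream, k=64, L=64):
    """Average 2**R_i over k independent copies of the single-register algorithm."""
    R = [0] * k
    for x in stream:
        for i in range(k):  # k hashes per element: exactly the cost that motivated LogLog
            h = int.from_bytes(
                hashlib.blake2b(f"{i}:{x}".encode(), digest_size=8).digest(), "big")
            R[i] = max(R[i], L - h.bit_length())
    return sum(2**r for r in R) / k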
LogLog (2003)
The next improvement came when Durand and Flajolet realized that we do not need $k$ hash functions for all elements after all. One hash per element is enough, if you split it cleverly.
The idea is to maintain $m$ registers just like before, but instead of running $k$ independent hashes, we compute a single hash $h(x)$ of $L$ bits and split it in two:
h(x) = [ b₁ b₂ ··· bₚ | bₚ₊₁ bₚ₊₂ ··· b_L ]
       └── bucket j ──┘ └── compute ρ ──┘
where $m = 2^p$. It may seem tricky to understand why we do this, but remember that the goal is to minimize memory usage. We use the first part of the hash for getting the register index and the second part for the actual leading-zero count.
- You can split the hash however you want and still get a valid leading-zero count. Because the hash is uniform, the right-hand portion is still a uniformly random bitstring, which means that the leading zeros there carry the same statistical signal as before.
- The left-hand portion routes each element to exactly one register. Instead of updating all $k$ registers on every element, we only update the one register $R_j$ that the element hashes to. Each register gets updated on average $D/m$ times, so the work is spread evenly.
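As a sketch of the split itself (with $p = 12$ and BLAKE2b as an assumed 64-bit hash):

```python
import hashlib

P, L = 12, 64  # p index bits -> m = 2**p = 4096 registers

def bucket_and_rho(x):
    """Split one 64-bit hash: top P bits pick the register, the rest feed rho."""
    h = int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")
    j = h >> (L - P)                 # register index in [0, 2**P)
    rest = h & ((1 << (L - P)) - 1)  # remaining L - P bits
    return j, (L - P) - rest.bit_length()  # (bucket, leading zeros of the rest)

print(bucket_and_rho("alice"))  # one hash gives both the bucket and the rho value
```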
Each register $R_j$ tracks the maximum leading-zero count among the elements routed to it. Since each register sees roughly $D/m$ of the distinct elements, the argument from the previous sections gives:
$$ \frac{1}{m}\sum_{j=1}^{m} R_j \approx \log_2\left(\frac{D}{m}\right) $$
Exponentiating, solving for $D$, and correcting for bias yields the LogLog estimator $D_{LL}$:
$$ D_{LL} = \alpha_m \cdot m \cdot 2^{\frac{1}{m} \sum_{j=1}^{m} R_j} $$
where $\alpha_m$ is a bias-correction constant. The full derivation of the constant is given in their paper using more complex methods including Poissonization analysis. They arrive at the value:
$$\alpha_m = \left(\Gamma\left(-\tfrac{1}{m}\right) \cdot \frac{1 - 2^{1/m}}{\log 2}\right)^{-m}$$
In the LogLog case, the standard error for this estimator is:
$$ \sigma_\text{LL} \approx \frac{1.30}{\sqrt{m}} $$
The error constant is a bit higher than the original algorithm by Flajolet–Martin (which had an error of $0.78$), but using only one hash per element makes this algorithm much more practical to use.
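Putting the pieces together, here is a compact LogLog sketch. Two assumptions on top of the text: BLAKE2b stands in for the hash, and the registers store the paper's $\rho$, which is the 1-based position of the first 1-bit (leading zeros plus one), the convention the constant $\alpha_m$ is derived for:

```python
import hashlib
import math

def loglog_estimate(stream, p=12, L=64):
    """Durand-Flajolet LogLog: one hash per element, split into bucket + rho.
    Registers hold the 1-based position of the first 1-bit (leading zeros + 1),
    the convention alpha_m assumes."""
    m = 1 << p
    reg = [0] * m
    for x in stream:
        h = int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")
        j = h >> (L - p)                 # first p bits: register index
        rest = h & ((1 << (L - p)) - 1)  # remaining L - p bits
        reg[j] = max(reg[j], (L - p) - rest.bit_length() + 1)
    # bias-correction constant from the paper
    alpha = (math.gamma(-1 / m) * (1 - 2 ** (1 / m)) / math.log(2)) ** -m
    return alpha * m * 2 ** (sum(reg) / m)
```

With $m = 4096$ the predicted standard error is $1.30/64 \approx 2\%$, so running this on a stream of 100,000 distinct elements should land within a few percent of the true count.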
HyperLogLog (2007)
Finally we arrive at HyperLogLog, where we only change how the registers are combined. This single change drops the error constant from $1.30$ to $1.04$.
The harmonic mean
The distribution of $2^{R_j}$ is quite heavy-tailed. One register with a high maximum leading-zero count drags the arithmetic mean upward.
Therefore, we use the harmonic mean instead of the arithmetic mean when combining the registers in the final step. The harmonic mean of positive values $x_1, \ldots, x_m$ is:
$$ H = \frac{m}{\sum_{i=1}^{m} \frac{1}{x_i}} $$
If one $x_i$ is huge, $1/x_i$ is tiny and barely affects the sum. For example, for the values $\{100, 100, 100, 10000\}$ the harmonic mean is $\approx 133$, close to the typical value, whereas the arithmetic mean would be $2575$.
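The contrast is easy to check directly:

```python
def harmonic_mean(xs):
    """m divided by the sum of reciprocals: large outliers barely move it."""
    return len(xs) / sum(1 / x for x in xs)

vals = [100, 100, 100, 10000]
print(sum(vals) / len(vals))  # arithmetic mean: 2575.0, dragged up by the outlier
print(harmonic_mean(vals))    # ≈ 132.9, close to the typical value
```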
The HyperLogLog estimator
Apply the harmonic mean to the per-bucket estimates $2^{R_j}$:
$$ D_{HLL} = \alpha_m \cdot m^2 \cdot \left(\sum_{j=1}^{m} 2^{-R_j}\right)^{-1} $$
where the bias-correction constant is:
$$ \alpha_m = \left(m \int_0^\infty \left(\log_2\frac{2+u}{1+u}\right)^m du\right)^{-1} $$
For big $m$ (which is usually the case), we can use the following approximation which was derived in the paper: $$ \alpha_m \approx \frac{0.7213}{1 + 1.079/m} $$
And finally the standard error is derived to be:
$$ \sigma_\text{HLL} \approx \frac{1.04}{\sqrt{m}} $$
...which is a really good drop!
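To make the whole algorithm concrete, here is a minimal HyperLogLog sketch under a few assumptions: BLAKE2b stands in for the hash, registers store the paper's 1-based $\rho$ (position of the first 1-bit, i.e. leading zeros plus one, the convention $\alpha_m$ assumes), and the paper's small- and large-range corrections are omitted:

```python
import hashlib

def hll_estimate(stream, p=12, L=64):
    """Raw HyperLogLog estimator (no small/large-range corrections)."""
    m = 1 << p
    reg = [0] * m
    for x in stream:
        h = int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")
        j = h >> (L - p)                 # first p bits: register index
        rest = h & ((1 << (L - p)) - 1)  # remaining L - p bits
        # 1-based position of the first 1-bit in the remaining bits
        reg[j] = max(reg[j], (L - p) - rest.bit_length() + 1)
    alpha = 0.7213 / (1 + 1.079 / m)    # large-m approximation of alpha_m
    return alpha * m * m / sum(2.0 ** -r for r in reg)
```

On a stream of 100,000 distinct elements with $m = 4096$, this should land within a couple of percent of the true count, matching the $1.04/\sqrt{m} \approx 1.6\%$ prediction.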
Concrete examples
Let's try this with some concrete numbers. Let's pick $p = 12$, which gives us $m = 4096$ registers. Each hash is 64 bits long. We use the first $p = 12$ bits as the bucket index, leaving $64 - 12 = 52$ bits for computing the leading-zero count $\rho$. Since that remaining portion is 52 bits long, the maximum possible value any register can hold is 52, since you cannot get more leading zeros than you have bits. Therefore $\lceil \log_2(52) \rceil = 6$ bits is enough to store any register value.
$$ \text{Memory} = 4096 \times 6 \text{ bits} = 24{,}576 \text{ bits} \approx 3 \text{ KB} $$
$$ \text{Standard error} = \frac{1.04}{\sqrt{4096}} = \frac{1.04}{64} \approx 0.016 $$
Another common configuration is $p = 14$ ($m = 16384$): ~12 KB of memory and under 1% standard error, which is what Redis uses.
Real-world applications
Although the HyperLogLog algorithm may seem theoretical, it sees plenty of practical use. In 2013, Google published a paper describing how they run the algorithm at scale, along with further optimizations.
Here are a few areas where the algorithm is used:
- Web analytics. A site with 50M daily hits from 10M unique visitors would need ~160 MB for an exact set, whereas HyperLogLog uses 12 KB.
- Network security. Routers monitor distinct source IPs per destination at wire speed; a sudden spike in estimated cardinality signals a DDoS attack, which HyperLogLog can detect cheaply.
- Database query planning. PostgreSQL, BigQuery, Spark, and Presto maintain HyperLogLog column statistics. Bad cardinality estimates can slow queries by 1000×.
- Redis. The PFADD, PFCOUNT, and PFMERGE commands provide HyperLogLog in 12 KB with < 1% error.
Casimir Rönnlöf, April 2026
Further reading
- Flajolet & Martin (1985). Probabilistic Counting Algorithms for Data Base Applications. JCSS.
- Durand & Flajolet (2003). LogLog Counting of Large Cardinalities. ESA.
- Flajolet, Fusy, Gandouet & Meunier (2007). HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. AofA.
- Heule, Nunkesser & Hall (2013). HyperLogLog in Practice. EDBT. (Google's HyperLogLog++.)