Generating Zipfian Distributions Using Rejection Inversion

Introduction

In a previous post, I wrote about how rejection inversion works. In this article I will take you through using this to generate zipfian workloads to benchmark and test your software.

We will see how it works in two projects: a count min sketch within a buffer manager and a storage engine benchmarking tool.

Recap

Zipfian distributions are distributions where a few events occur exponentially more often than the rest. The most popular event occurs 2x more than the second and 3x than the third and so on.

It is exhibited in linguistics, social sciences and even in software. In a cache, certain keys tend to get significantly more hits than others. In databases, a few tables are queried more than the rest. Simulating these patterns helps you ensure that your software can handle them when they occur in the real world.

Algorithm and implementation

From the rejection-inversion paper, we have a well defined algorithm.

Computer and store $1-q$ , $\,$ $\frac{1}{1-q}$
Define:
$H(x) = \exp((1-q) * \ln(v+x)) * (\frac{1}{1-q})$
$H^{-1} = -v + \exp(\frac{1}{1-q}) * \ln((1-q) * x)$
as macros or functions.
Compute and store:
$H(x_o) = H(\frac{1}{2}) - \exp(\ln(v) * (-q))$ $s = 1 - H^{-1}(H(\frac{3}{2}) - \exp(\ln(v + 1) * (-q)))$
While true:
a. Generate a uniform random number $U \in [0, 1]$
b. Set $U = H(i_{max}+\frac{1}{2}) + U*(H(x_o) - H(i_{max} + \frac{1}{2}))$ .
c. Set $X = H^{-1}(U)$
d. Round $X$ to nearest integer and set its value to $K$ .
e. $If \,(K-X \le s)$ , then return $K$ ;
f. $Else \,if\, (U \ge H(K + \frac{1}{2}) - \exp(-\ln(v + K) * q))$ , return $K$ ;

We will use $C$ in our implementation but you can easily port this to any language of your choice. Every code written here, and more, can be found in this GitHub repo.

We’ll start with creating two functions that calculate $H(x)$ and $H^{-1}(x)$ .

double h_x(zipf* z, double x)
{
    assert(z != NULL);
    return exp(z->one_minus_q * log(z->v + x)) * z->one_minus_q_inv;
}

double h_x_inv(zipf* z, double x)
{
    assert(z != NULL);
    return (-(z->v) + (exp(z->one_minus_q_inv * log(z->one_minus_q * x))));
}

These function apply the formulas described in the algorithm. You can also have them as macros. Functions required to perform the calculations like log() and exp() are provided by the $math.h$ library.

Next we initialize our generator. We’ll need three user provided arguments:

$q$ - Zipf Exponent. This will determine how much skew is applied to our distribution.
$v$ - offset. Shifts the distribution left or right.
$i_{max}$ - This maximum discrete value to generate.

We will create a $zipf$ struct.

/*
 * Zipf contains stored constants for a zipf generator session.
 * These are precomputed to avoid computing them again throughout the iteration.
 * q zipfian exponent. determines how much skew to apply to the distribution.
 *    q should always be > 1. A lower value of q produces heavy tail/slower decay.
 * v offsets the distribution left or right. should be > 0
 * one_minus_q precomputed (1-q)
 * one_minus_q precomputed 1/(1-q)
 * h_x0 H(1/2) - exp(log(v) * (-q))
 * s squeeze applied for cheap checks: s  <-  1  - H_inv(H(3/2) - exp(log(v+1) * (-q)))
 * h_upper H(i_max + 1/2)
 *
 */
typedef struct
{
    double q;
    double v;
    double i_max;
    double one_minus_q;
    double one_minus_q_inv;
    double h_xo;
    double s;
    double h_upper;
} zipf;

All other fields other than $q$ , $v$ and $i_{max}$ will be calculated and stored.

We will create a function that initializes and returns our struct. Its function signature will look like this.

/**
 * new_zipf_generator initializes the zipf generator by precomputing essential values
 * zipf_exp: zipf exponent. This determines how much skew to apply to the distribution. Should be >
 * 1 offset: how much offset to apply to the distribution. Should be > 0. i_max: max value for the
 * distrubition. usually the number of items to be generated.
 *
 * return: pointer to zipf generator, or NULL incase of memory allocation failure.
 */
zipf* new_zipf_generator(double zip_exp, double offset, double i_max);

And here is the implementation.

zipf* new_zipf_generator(double zip_exp, double offset, double i_max)
{
    assert(zip_exp > 1);
    assert(offset > 0);
    zipf* z = malloc(sizeof(zipf));

    if (z == NULL)
    {
        return NULL;
    }

    /* calculate 1 - q, 1/(1-q), H(xo), s, and H(imax + 1/2) */
    z->q = zip_exp;
    z->v = offset;
    z->i_max = i_max;
    z->one_minus_q = 1.0 - zip_exp;

    z->one_minus_q_inv = 1.0 / (z->one_minus_q);
    z->h_xo = h_x(z, 0.5) - exp(log(z->v) * -(z->q));
    z->s = 1.0 - h_x_inv(z, h_x(z, 1.5) - exp(log(z->v + 1) * -(z->q)));
    z->h_upper = h_x(z, i_max + 0.5);

    srand(time(NULL));

    return z;
}

First we will assert that provided parameters satisfy the conditions. $q > 1$ and $v > 0$ .

Then allocate memory for our new struct and calculate each value in accordance with the algorithm. This reduces how much computation we have to do while generating a large number of values.

We also need to seed rand to avoid generating the same values later on.

The function returns a pointer to the newly allocated struct.

We will then create a function that generates an integer value and performs a check to accept or reject a candidate.

/**
 * zipf_next generates the next value in a zipf distribution using the generator
 * z: initialized generator.
 *
 * return: newly generated value
 */
uint64_t zipf_next(zipf* z)
{
    assert(z != NULL);

    uint8_t accepted = 0;
    double u, x, x_flr, k;

    while (!accepted)
    {
        /* generate uniform random number U */
        u = rand() / (double)RAND_MAX;
        u = z->h_upper + u * (z->h_xo - z->h_upper);
        x = h_x_inv(z, u);
        x_flr = floor(x);
        k = x > x_flr + 0.5 ? x_flr + 1.0 : x_flr;

        /* squeeze check */
        if (k - x <= z->s)
        {
            accepted = 1;
        } /* full check */
        else if (u >= h_x(z, k + 0.5) - exp(-log(z->v + k) * z->q))
        {
            accepted = 1;
        }
    }

    return (uint64_t)k;
}

The function takes a pointer to a zipf generator as a parameter and uses the precomputed values in the algorithm.

In the while loop, we generate a random uniform value $U$ between 0 and 1 matching the CDF range. We then map this to an area under the probability function $H(x)$ or a section on the CDF scale. This is equivalent to step $4a$ in the algorithm.

We then inverse the result and round it. When rounding, we round to the nearest integer. For example, if $U \le 2.5$ we round it to $2$ , else if $2.5 \lt U \lt 3$ , it is rounded to $3$ .

Lastly, we pass our $k$ value through the squeeze. If it evaluates to true, the value is accepted else we do a full check. If $k$ is rejected, the while loop continues until a suitable value is found.

This function is called in a loop until the maximum number of required items are generated.

Lastly we will provide a function to free memory allocated to the zipf struct.

/**
 * Frees memory allocated to z
 *
 */
void free_zipf(zipf* z)
{
    if (z == NULL)
    {
        return;
    }

    free(z);
}

These methods can be added to a header file and imported into your project.

Below is a script on how you would use this. You can replace the command line arguments with your own defined variables.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include "formatter.h"
#include "zipf_generator.h"

int main(int argc, char* argv[])
{
    int num_items;
    double zipf_exponent;
    double offset;
    int i_max;

    if (argc < 2)
 {
        num_items = 1000;
    }
    else
    {
        num_items = strtol(argv[1], NULL, 10);
    }

    if (argc < 3)
    {
        zipf_exponent = 1.5;
    }
    else
    {
        zipf_exponent = strtod(argv[2], NULL);
    }

    if (argc < 4)
    {
        offset = 1.1;
    }
    else
    {
        offset = strtod(argv[3], NULL);
    }

    if (argc < 5)
    {
        i_max = num_items;
    }
    else
    {
        i_max = strtod(argv[4], NULL);
    }

    printf(GREEN
           "Starting zipf generation:\nnumber of items: %.2f\nzipf exponent: %f\noffset: "
           "%f\ni_max: %d\n" RESET,
           (double)num_items, zipf_exponent, offset, i_max);
    zipf* z = new_zipf_generator(zipf_exponent, offset, (double)i_max);
    if (z == NULL)
    {
        printf(RED "Out of memory\n" RESET);
        return -1;
    }
    printf(CYAN "\nInitialized Zipf generator:\n");
    printf("\tq: %f\n", z->q);
    printf("\tv: %f\n", z->v);
    printf("\ts: %f\n", z->s);
    printf("\tone_minus_q: %f\n", z->one_minus_q);
    printf("\tone_minus_q_inv: %f\n", z->one_minus_q_inv);
    printf("\ti_max: %f\n", z->i_max);
    printf("\tH(xo): %f\n", z->h_xo);
    printf("\th_upper: %f\n\n" RESET, z->h_upper);

    uint64_t* vals = malloc(num_items * sizeof(uint64_t));
    if (vals == NULL)
    {
        printf(RED "Out of memory\n" RESET);
        free(z);
        return -1;
    }

    uint64_t zipf_val = 0;
    for (int i = 0; i < num_items; i++)
    {
        zipf_val = zipf_next(z);
        vals[i] = zipf_val;

        if (i == 0)
        {
            printf("[ %" PRIu64 "", zipf_val);
        }
        else if (i == num_items - 1)
        {
            printf(",%" PRIu64 " ]", zipf_val);
        }
        else
        {
            printf(",%" PRIu64 "", zipf_val);
        }
    }
    printf("\n");

    free(z);
    free(vals);
    printf(GREEN "Done\n" RESET);
    return 0;
}

Let's run this. Suppose we want 10 items, with a zipfian exponent of 1.1, offset of 2.5 and $i_{max}$ as 50. We pass the number of items to generate, $q$ , $v$ and $i_{max}$ as command line arguments respectively:

./build/zipf_generator 10 1.1 2.5 50

Starting zipf generation:
number of items: 10.00
zipf exponent: 1.100000
offset: 2.500000
i_max: 50

Initialized Zipf generator:
	q: 1.100000
	v: 2.500000
	s: 0.493273
	one_minus_q: -0.100000
	one_minus_q_inv: -10.000000
	i_max: 50.000000
	H(xo): -9.324562
	h_upper: -6.723144

[ 48,8,11,6,7,6,46,3,13,4 ] <-- Generated items

Done

We have an array of integers that should match the distribution. You can generate as many items as your system allows. Feel free to play around with the generator and see the results.

Applications.

Now that we know how to build a zipf generator with rejection inversion, let’s take a look on how we can apply this.

Benchmarking cache and storage engies.

Cache and storage engine access are some of the most common areas where testing zipfian distribution is key.

Given that the generator already provides integer values, you could use them directly as the keys, or you can transform the key using a method like snprintf() and it’s equivalent in other languages.

In tidesdb, an LSM-based storage engine, we perform the transformation depending on the size of the key. Here’s a snippet:

/**
 * generate_zipfian_key
 * generates a key following zipfian distribution
 * for benchmark purposes, we make this deterministic by using the operation
 * index directly rather than generating a random zipfian index. this ensures that
 * each operation index maps to a unique key, allowing proper verification.
 * the zipfian distribution is simulated by having operations access keys in a
 * pattern where early indices (hot keys) are accessed more frequently.
 * @param buffer buffer to store key
 * @param size size of buffer
 * @param max_operations total number of benchmark operations (used to set an upper limit for the
 * generated key)
 */
void generate_zipfian_key(uint8_t* buffer, const size_t size, const double max_operations)
{
    uint8_t key_num = 0;

    /** we use rejection inversion for zipfian key generation key generation*/
    key_num = zipf_next(1.3, 0.99, max_operations);

    /** format: k%010d fits in 16 bytes (k + 10 digits + null = 12 bytes) */
    snprintf((char*)buffer, size, "k%010d", key_num);
}

If the generated key is $25$ it ends up being $k0000000025$ , depending on $size$ parameter. To match zipfian distribution, a few keys will have lots of duplicates to mirror access patterns.

The same is applied to TidesDB benchtool which performs extensive benchmarks on the storage engine.

Applying rejection inversion to tidesdb benchtool resulted in ~20% of the keys appearing ~80% of the time as opposed to ~38% prior. This better matches a Zipfian distribution and is more likely to provide more accurate results while benchmarking.

If you are interested in learning more about TidesDB and TidesDB Benchtool, please checkout the repo and feel free to join the discord community.

Testing data structures.

On my personal projects, I encountered a scenario where I needed to test a Count-Min sketch as part of a buffer manager implementation.

A Count-min sketch(CMS) is a probabilistic data structure that utilizes a fixed size 2D array to store counts of items. It is used in areas like machine learning, caching(redis) and networking.

While testing, we would need to emulate patterns such as Zipfian to ensure that the count of items matches what we expect - a few items to have higher accesses than the rest.

Below is a function to test Zipfian distribution on a CMS:

// Zipfian/skewed distribution
func TestZipfianDistribution(t *testing.T) {
	n := 10000 // number of distinct keys
	ε := 0.1   // error rate(ε) 0.1%
	δ := 0.1   // error probability(δ)  0.1%
	totOperations := 0

	// zipf parameters
	var offset float64 = 200
	var zipfExponent float64 = 2.3
	var imax float64 = 5000

	type tTest struct {
		key   []byte
		count uint64
	}

	tests := make([]tTest, n)
	var test tTest

	z := zipf.NewZipf(zipfExponent, float64(offset), float64(imax))
	if z == nil {
		panic("Unable to create Zipf generator")
	}

	for i := range n {
		test.key = []byte(helpers.RandomString(5))
		// use rejection-inversion to sample values
		test.count = z.GetNext()
		tests[i] = test
		totOperations += int(test.count)
	}

	cSketch, err := NewCMS(ε, δ, NewMapHash())

	if err != nil {
		t.Fatalf("Expected no error, got %v", err)
	}

	if cSketch == nil {
		t.Fatal("Expected CM Sketch,  got nil")
	}

	// increment count
	for i, test := range tests {
		for j := range test.count {
			t.Run(fmt.Sprintf("%d_%d_test_zipfiandistribution_increment", i, j), func(t *testing.T) {
				_, err := cSketch.Increment(test.key)

				if err != nil {
					t.Fatalf("Expected no error during incrementing count, got %v", err)
				}
			})
		}
	}

	errCount := 0
	for i, test := range tests {
		// check count
		t.Run(fmt.Sprintf("%d_test_zipfiandistribution_checker", i), func(t *testing.T) {
			currCount, err := cSketch.GetCount(test.key)
			if err != nil {
				t.Fatalf("Expected not error, got %v", err)
			}

			absoluteErr := math.Abs(float64(currCount - int64(test.count)))
			relativeErr := math.Abs((absoluteErr / float64(test.count)) * 100)
			t.Logf("Relative error: %f", relativeErr)

			// At most δ num of keys should violate this bound.
			exceeded := float64(currCount) > float64(test.count)+((ε/100)*float64(totOperations))
			if exceeded {
				errCount++
			}
		})
	}

	t.Run("test_zipfiandistribution_probabilitybound", func(t *testing.T) {
		expectedErrorCount := δ * float64(totOperations)
		fmt.Printf("Expected errors: %d, Received  errors: %d\n", uint64(expectedErrorCount), errCount)
		if errCount > int(expectedErrorCount) {
			t.Fatalf("Expected error count <= %d, but got error count of %d", errCount, uint64(expectedErrorCount))
		}
	})

}

We create a tTest struct that will contain the key and how many times it will be accessed - basically it’s frequency.

We then iterate $n$ times, generating a random key and using our zipf generator to generate frequencies for each key.

We then access each key based on the set frequency and ensure that the error rate remains within the expected bound.

This access pattern simulates zipfian distribution and can be helpful when testing such data structures.

Conclusion and comments

In software benchmarking and testing, modeling Zipfian distributions closely emulates what occurs in the real world and will help you find potential bugs and uncover bottlenecks before they occur in production.

In the field of probability and statistics, this has been extensively studied and several methods of sampling from distributions, rejection inversion being one of them , have been developed and published. They continuously prove to be incredibly useful in computer science. Generally, math complements computer science and understanding the former will increase your capability in the latter. You can observe this in things like acoustic fingerprinting and erasure coding.

Hope you enjoyed this blog post. Got questions, correction or comments? Feel free to reach me at [email protected] or on LinkedIn.

Resources

Here are some resources I think you might find useful to expolore this topic further:

Rejection-inversion paper by W. Hörmann and G. Derflinger.
Zipf generator repo.
Rejection-Inversion theory blog post.

Also, feel free to checkout the TidesDB community and explore some of the work going on with regards to databases and storage engines.