Dec 29, 2025
Deduplication is a technique for removing duplicate copies of repeating data. It is useful in many different contexts: in storage systems to reduce storage requirements, in network transfer to reduce the number of bytes sent over the wire, in message-oriented systems to avoid processing the same message twice, in targeted-ads systems to avoid showing a user the same ad, in product recommendation systems to avoid showing a user the same product, and so on. The use cases are endless. I'll do a quick overview to contextualize why deduplication is useful and finish with a walkthrough of my implementation.
Overview
Deduplication is done by calculating and comparing hashes of the target data. Unique hashes are stored and later compared against hashes of other data, incoming messages, network bytes, etc. A match occurs when two data chunks have the same hash, which likely means there is a duplicate in the system; the duplicate can then be deleted, dropped or ignored depending on the context. I used the word likely because deduplication systems typically have false positives. A false positive is when the system incorrectly flags two different pieces of data as a match. The false positive rate depends on the size of the data chunk used to calculate the hash: smaller chunks result in higher false positive rates than larger chunks. In practice, larger chunks carry a performance penalty since calculating their hashes is computationally more expensive. The trade-offs therefore involve space efficiency, computation speed and false positive rate. As we will see later with the more advanced deduplicators, the false positive rate is an important parameter that guides the selection of the other system parameters. In summary, the deduplication process usually involves the following steps:
- assigning a unique id to each chunk of data using a cryptographic hash function and storing the hash in a repository.
- checking the repository for a hash matching the data being examined, with the assumption that if the hashes match then the data they represent is also identical, which may not always be the case.
- if a hash already exists in the deduplication namespace, proceeding accordingly, e.g. by deleting the duplicate or ignoring the incoming message (a minimal sketch of this loop follows).
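Purely for illustration (this is not part of my implementation), the loop above can be sketched in memory as: hash each chunk with SHA-256, check a set of already-seen hashes, and drop the chunk on a match.
package main

import (
	"crypto/sha256"
	"fmt"
)

// dedupe drops chunks whose SHA-256 hash has already been seen.
func dedupe(chunks [][]byte) [][]byte {
	seen := make(map[[32]byte]struct{})
	var unique [][]byte
	for _, chunk := range chunks {
		sum := sha256.Sum256(chunk)
		if _, ok := seen[sum]; ok {
			continue // duplicate: delete, drop or ignore depending on context
		}
		seen[sum] = struct{}{}
		unique = append(unique, chunk)
	}
	return unique
}

func main() {
	chunks := [][]byte{[]byte("foo"), []byte("bar"), []byte("foo")}
	fmt.Println(len(dedupe(chunks))) // 2
}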
Deduplication systems can be categorized according to a number of criteria:
- post-process deduplication: new data is first stored on the device, then a later process analyzes the data looking for duplicates.
- inline deduplication: performed as data arrives on the device, looking for and eliminating duplicates before they are stored.
- target deduplication: deduplication is done where the data is stored or processed.
- source deduplication: deduplication is done where the data is created or originates.
Lastly, we'll take a look at the drawbacks of deduplication systems.
- Hash collisions: deduplication uses cryptographic hash functions to calculate hashes for each piece of data being examined. If two different pieces of data generate the same hash, that is a hash collision. When a collision occurs there needs to be an additional means of verification to distinguish the pieces of information, otherwise there is a risk of data corruption or data loss.
- Computational cost: calculating hashes is computationally expensive and can incur a performance penalty on the system. To improve performance, some systems use both weak and strong hashes: the system first calculates a weak hash for a piece of information and only calculates the strong hash to break the tie when the weak hashes collide. Weak hashes are faster to calculate but carry a greater risk of collisions; a rough sketch of this two-tier scheme follows.
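This sketch is illustrative only and assumes the standard library's hash/adler32 and crypto/sha256 packages are imported: entries are indexed by a cheap Adler-32 checksum, and SHA-256 is only computed when two payloads share the same weak hash (think of the candidates as blocks already sitting in storage).
// isDuplicateTwoTier indexes previously seen payloads by their Adler-32
// checksum and only computes SHA-256 hashes when two payloads collide on
// the weak hash.
func isDuplicateTwoTier(seen map[uint32][][]byte, data []byte) bool {
	weak := adler32.Checksum(data)
	candidates, ok := seen[weak]
	if !ok {
		// No weak-hash match: definitely new, no strong hash needed.
		seen[weak] = [][]byte{data}
		return false
	}
	// Weak hashes collide: break the tie with the strong hash.
	strong := sha256.Sum256(data)
	for _, c := range candidates {
		if sha256.Sum256(c) == strong {
			return true
		}
	}
	seen[weak] = append(candidates, data)
	return false
}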
Implementation
I implemented three different approaches to deduplication, each with its own benefits. They are:
- Expiring key repository deduplicator.
- Bloom filter deduplicator.
- Cuckoo filter deduplicator.
Expiring Key Repository Deduplicator
This is the simplest implementation of deduplication in terms of its working principle and underlying mechanisms. As the name expiring key repository suggests, we have a key-value store where the keys are hashes of the values we want to deduplicate, and each key has a time-to-live (TTL) after which it is discarded from the store.
Most message-oriented systems provide at-least-once delivery for improved reliability; this means a message is repeatedly sent at intervals to a destination until an acknowledgement (ACK) for that message is received from the destination, signaling that it was successfully processed. The risk here is that one message may end up being processed more than once, because the messaging system assumes that no ACK means the message was not processed at the destination. That's not always true: the ACK could have been lost due to transient network issues as it was being relayed back to the source, or it may just be delayed by high network traffic.
The solution is to make the service receiving messages idempotent. For every message that is successfully received and queued for processing, the deduplicator records that message with a unique identifier so that if duplicate messages arrive later they can safely be ignored.
This deduplicator also has the concept of keys expiring after their TTL is up, and you may be wondering why that is necessary. Back to messaging systems: each message sent to a destination service has an ACK timeout. Once a message is sent, a timer for the ACK timeout is started; if the timeout elapses before an ACK for the sent message is received, the system retries sending the message. At the destination, each received message gets a TTL that accounts for the worst-case time it takes to process a message and send an ACK back to the source. As long as a message's TTL has not expired, any identical messages that arrive are dropped. After the TTL expires the message is deleted from the deduplicator. At that point, if the source is still sending the same message, it probably means either the ACK was not successfully delivered or the message was not successfully processed; either way it will be queued for processing again and the cycle repeats.
For the expiring key repository, the two crucial design considerations were how and where to store the key-value pairs, and how to calculate the hashes for the keys. For the first part I used Redis, which is purpose-built as a key-value store and supports very fast key membership tests. Keys can also be stored with an expiry time and are automatically cleared when it elapses.
func NewRedisExpiringKeyRepo(
	window time.Duration, redisURL string,
) (*RedisExpiringKeyRepo, error) {
	if window < time.Millisecond*5 {
		logger.Info("[kv] window cannot be less than 5ms, defaulting to 5ms")
		window = time.Millisecond * 5
	}
	c, err := cache.NewCache(redisURL)
	if err != nil {
		logger.Error("[kv] error initializing cache", "error", err)
		return nil, err
	}
	return &RedisExpiringKeyRepo{
		cache:  c,
		window: window,
	}, nil
}
// IsDuplicate checks if the `data` is a duplicate within the configured time window.
func (r *RedisExpiringKeyRepo) IsDuplicate(
	ctx context.Context, data any,
) (bool, error) {
	payload, ok := data.(Payload)
	if !ok {
		logger.Error("[kv] error checking duplicate", "error", "bad data")
		return false, errors.New("bad data")
	}
	exists, err := r.cache.Exists(ctx, payload.Key).Result()
	if err != nil {
		logger.Error("[kv] error performing kv lookup", "error", err)
		return false, err
	}
	if exists > 0 {
		return true, nil
	}
	// First time we see this key: record it with the window as its TTL.
	err = r.cache.SetEx(ctx, payload.Key, payload.Value, r.window).Err()
	if err != nil {
		logger.Error("[kv] error performing kv insertion", "error", err)
		return false, err
	}
	return false, nil
}
The next part is how to calculate hashes for the keys. I created helper functions for different hashes: Adler-32, SHA-256 and SHA-512. Adler-32 (strictly a checksum rather than a cryptographic hash) is the weakest but fastest, while SHA-512 is the strongest but slowest.
// ValueHasherLimitMinimum is the least number of bytes used for
// calculating the hash value of a key.
const ValueHasherLimitMinimum = 64

func NewValueHasherAdler32(readLimit int64) ValueHasher {
	if readLimit < ValueHasherLimitMinimum {
		readLimit = ValueHasherLimitMinimum
	}
	return func(value []byte) (string, error) {
		h := adler32.New()
		// Hash at most readLimit bytes; io.EOF just means the value was
		// shorter than the limit.
		_, err := io.CopyN(h, bytes.NewReader(value), readLimit)
		if err != nil && !errors.Is(err, io.EOF) {
			logger.Error("[kv] error performing adler32 hashing", "error", err)
			return "", err
		}
		return string(h.Sum(nil)), nil
	}
}

func NewValueHasherSHA256(readLimit int64) ValueHasher {
	if readLimit < ValueHasherLimitMinimum {
		readLimit = ValueHasherLimitMinimum
	}
	return func(value []byte) (string, error) {
		h := sha256.New()
		_, err := io.CopyN(h, bytes.NewReader(value), readLimit)
		if err != nil && !errors.Is(err, io.EOF) {
			logger.Error("[kv] error performing sha256 hashing", "error", err)
			return "", err
		}
		return string(h.Sum(nil)), nil
	}
}

func NewValueHasherSHA512(readLimit int64) ValueHasher {
	if readLimit < ValueHasherLimitMinimum {
		readLimit = ValueHasherLimitMinimum
	}
	return func(value []byte) (string, error) {
		h := sha512.New()
		_, err := io.CopyN(h, bytes.NewReader(value), readLimit)
		if err != nil && !errors.Is(err, io.EOF) {
			logger.Error("[kv] error performing sha512 hashing", "error", err)
			return "", err
		}
		return string(h.Sum(nil)), nil
	}
}
Finally, the expiring key repository deduplicator is initialized and used as shown below:
type KeyValDeduplicator struct {
	KeyFactory keyvalue.ValueHasher
	Repository KeyRepository
	Timeout    time.Duration
}

func NewKeyValDeduplicator(
	keyFactory keyvalue.ValueHasher, ctxTimeout, window time.Duration,
) (*KeyValDeduplicator, error) {
	r, err := keyvalue.NewRedisExpiringKeyRepo(window, os.Getenv("REDIS_HOST_URL"))
	if err != nil {
		return nil, err
	}
	if keyFactory == nil {
		keyFactory = keyvalue.NewValueHasherAdler32(math.MaxInt64)
	}
	if ctxTimeout < 5*time.Millisecond {
		ctxTimeout = 5 * time.Millisecond
	}
	d := &KeyValDeduplicator{
		KeyFactory: keyFactory,
		Repository: r,
		Timeout:    ctxTimeout,
	}
	return d, nil
}

func (d *KeyValDeduplicator) IsDuplicate(
	ctx context.Context, data any,
) (bool, error) {
	var buf bytes.Buffer
	err := gob.NewEncoder(&buf).Encode(data)
	if err != nil {
		return false, err
	}
	key, err := d.KeyFactory(buf.Bytes())
	if err != nil {
		return false, err
	}
	ctx, cancel := context.WithTimeout(ctx, d.Timeout)
	defer cancel()
	return d.Repository.IsDuplicate(ctx, keyvalue.Payload{Key: key, Value: data})
}
Bloom Filter Deduplicator
A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set. Instead of storing all the items in the set, a Bloom filter only stores the items' hashed representations, which makes it very space-efficient and fast. A membership test can return false positives but never false negatives, i.e. querying a Bloom filter will tell you an element is possibly in the set or definitely not in the set. Also, elements can be added to the set but not removed.
An empty Bloom filter is a bit array of m bits, all initialized to 0, together with a set of k hash functions. To reduce the risk of hash collisions, each entry is hashed by all k hash functions and, for each hash value, the corresponding bit in the array is set. To check whether an entry exists, the entry is hashed by the same k hash functions and the corresponding bit positions are checked; if any bit is unset then it can be determined with certainty that the entry is not in the set.
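Here is a toy in-memory version of that mechanism (purely illustrative; my deduplicator below delegates to Redis instead), which derives k = 3 bit positions per entry from a single 64-bit FNV hash:
package main

import (
	"fmt"
	"hash/fnv"
)

// toyBloom is an m-bit array with k derived hash positions per entry.
type toyBloom struct {
	bits []bool
	m, k uint64
}

func newToyBloom(m, k uint64) *toyBloom {
	return &toyBloom{bits: make([]bool, m), m: m, k: k}
}

// positions derives k bit positions from one 64-bit FNV hash (double hashing).
func (b *toyBloom) positions(item []byte) []uint64 {
	h := fnv.New64a()
	h.Write(item)
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, sum>>32
	pos := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		pos[i] = (h1 + i*h2) % b.m
	}
	return pos
}

func (b *toyBloom) Add(item []byte) {
	for _, p := range b.positions(item) {
		b.bits[p] = true
	}
}

// Contains returns false only when the item is definitely not in the set.
func (b *toyBloom) Contains(item []byte) bool {
	for _, p := range b.positions(item) {
		if !b.bits[p] {
			return false
		}
	}
	return true // possibly in the set (may be a false positive)
}

func main() {
	bf := newToyBloom(1024, 3)
	bf.Add([]byte("hello"))
	fmt.Println(bf.Contains([]byte("hello"))) // true
	fmt.Println(bf.Contains([]byte("world"))) // false, barring a false positive
}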
Unlike hash tables, Bloom filters cannot be rebalanced, nor can entries be deleted from the filter. Therefore when creating a Bloom filter we need to know beforehand roughly how many entries will be stored in it. To make Bloom filters scalable, so they can hold more elements than they were designed for, filters can be stacked: once a Bloom filter reaches capacity, a new one with greater capacity is created on top of it.
Bloom filters are useful for applications where the source data is huge and would require excessive memory if conventional hashing were used. They are also suited to applications where insertions into the set happen more often than membership checks. Example use cases for Bloom filters are:
- detecting credit card fraud in payment processing services
- in ad placement services to decide whether or not a user should be shown an ad
- in product recommendation services to decide whether or not a user should be shown a product
- checking if a username is taken during user registration
- discount code/coupon validation, to check whether a code or coupon has already been redeemed
I used the Redis Bloom filter as the underlying data structure for my deduplicator. When reserving a Bloom filter there are three important parameters that need to be set:
- error rate: the upper bound on the false positive rate of the filter
- capacity: the number of entries we intend to store in the filter
- expansion: if we end up storing more entries than the filter can reliably hold while maintaining the error rate above, stacking is done to allocate more space; the value specified here is the factor by which the capacity grows at each step
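These presumably map onto the RedisBloom BF.RESERVE command, e.g. something like BF.RESERVE dedup:bloom 0.001 100000 EXPANSION 2 (the key name here is made up); the wrapper below passes the same three parameters through: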
type BloomFilter struct {
	store bfStore
	key   string
}

// New creates and returns a [BloomFilter] backed by Redis.
func New(
	connStr, bfKey string,
	errorRate float64,
	capacity, expansion int64,
) (*BloomFilter, error) {
	c, err := cache.NewCache(connStr)
	if err != nil {
		logger.Error("[bf] error initializing cache", "error", err)
		return nil, err
	}
	bf := &BloomFilter{store: c, key: bfKey}
	_, err = bf.store.BFInit(
		context.Background(),
		bfKey,
		errorRate,
		capacity,
		expansion,
	)
	if err != nil {
		logger.Error("[bf] error initializing bloom filter", "error", err)
		return nil, err
	}
	return bf, nil
}
// IsDuplicate checks if the data is present in the bloom filter,
// adding it if it was not there yet.
func (bf *BloomFilter) IsDuplicate(
	ctx context.Context, data any,
) (bool, error) {
	exists, err := bf.store.BFExists(ctx, bf.key, data)
	if err != nil {
		logger.Error("[bf] error performing bf lookup", "error", err)
		return false, err
	}
	if exists {
		return true, nil
	}
	_, err = bf.store.BFAdd(ctx, bf.key, data)
	if err != nil {
		logger.Error("[bf] error performing bf insertion", "error", err)
		return false, err
	}
	return false, nil
}
Finally, the Bloom filter deduplicator is initialized and used as shown below:
type BloomFilterDeduplicator struct {
	Repository KeyRepository
	Timeout    time.Duration
}

func NewBloomFilterDeduplicator(
	ctxTimeout time.Duration,
	filterKey string,
	errorRate float64,
	capacity, expansion int64,
) (*BloomFilterDeduplicator, error) {
	bf, err := bloomfilter.New(
		os.Getenv("REDIS_HOST_URL"),
		filterKey,
		errorRate,
		capacity,
		expansion,
	)
	if err != nil {
		return nil, err
	}
	if ctxTimeout < 5*time.Millisecond {
		ctxTimeout = 5 * time.Millisecond
	}
	d := &BloomFilterDeduplicator{
		Repository: bf,
		Timeout:    ctxTimeout,
	}
	return d, nil
}
// IsDuplicate returns `true` if the data is present in the bloom filter.
func (d *BloomFilterDeduplicator) IsDuplicate(
	ctx context.Context, data any,
) (bool, error) {
	ctx, cancel := context.WithTimeout(ctx, d.Timeout)
	defer cancel()
	return d.Repository.IsDuplicate(ctx, data)
}
Cuckoo Filter Deduplicator
We are finally at the last implementation, the Cuckoo filter. It is also a probabilistic data structure used to test set membership efficiently. Cuckoo filters come with a few modifications compared to Bloom filters, the main ones being that they support deletion from the set and have bounded false positive rates.
A cuckoo filter is made up of an array of buckets; each entry's fingerprint is stored in one of two candidate buckets, at positions decided by two hash functions. To test for membership, the entry's fingerprint is looked up in its two candidate buckets, and the filter returns true if a match is found in either of them, otherwise false. The size of the fingerprint determines the false positive rate: longer fingerprints result in lower false positive rates. The Redis implementation uses 8 bits for the fingerprint.
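The Redis module's internals aren't visible from here, but the standard construction from the original cuckoo filter paper (partial-key cuckoo hashing) derives the two candidate buckets from the item hash and its fingerprint. A minimal sketch, assuming a power-of-two bucket count and the standard library's hash/fnv:
// candidateBuckets computes an 8-bit fingerprint and the two buckets in
// which that fingerprint may live. Because i2 = i1 XOR hash(fp), either
// index can be recovered from the other plus the fingerprint, which is
// what makes relocation (and deletion) possible.
func candidateBuckets(item []byte, numBuckets uint64) (fp uint8, i1, i2 uint64) {
	h := fnv.New64a()
	h.Write(item)
	sum := h.Sum64()

	fp = uint8(sum) // 8-bit fingerprint, matching the Redis default
	if fp == 0 {
		fp = 1 // reserve 0 to mean "empty slot"
	}
	i1 = (sum >> 8) % numBuckets

	hf := fnv.New64a()
	hf.Write([]byte{fp})
	i2 = (i1 ^ hf.Sum64()) % numBuckets
	return fp, i1, i2
}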
Cuckoo filters are suited to all the use cases where a Bloom filter is relevant, with the added benefit that they perform better than Bloom filters for applications that do set membership checks more often than insertions.
The Cuckoo filter has the following parameters:
- p: false positive rate (error rate)
- f: fingerprint size in bits (8 for Redis)
- a: load factor/fill rate (0 <= a <= 1)
- b: number of entries per bucket
- m: number of buckets
- n: number of entries
- C: average number of bits per item
filter size = n * f / a (bits), i.e. C = f / a bits per item
So we need to know the load factor, which is the fraction of slots that must be filled before the filter declares itself full. Honestly, calculating the load factor was the hardest part for me. Alternatively, you can use reference values that have been empirically determined, which is what I did: for a bucket size of 4 entries per bucket the load factor is about 0.95, and from there you can calculate the space you need based on the number of entries you expect to store.
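As a quick worked example with those reference values: for n = 1,000,000 expected entries, the Redis fingerprint size f = 8 bits and a = 0.95, the filter needs roughly 1,000,000 * 8 / 0.95 ≈ 8.4 million bits, a little over 1 MB, or about 8.4 bits per stored item.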
type CuckooFilter struct {
	store  cfStore
	cfKey  string
	window time.Duration
}

// New creates and returns a [CuckooFilter] backed by Redis.
func New(
	connStr, cfKey string,
	capacity, bucketSize int64,
	window time.Duration,
) (*CuckooFilter, error) {
	c, err := cache.NewCache(connStr)
	if err != nil {
		logger.Error("[cf] error initializing cache", "error", err)
		return nil, err
	}
	cf := &CuckooFilter{store: c, cfKey: cfKey, window: window}
	_, err = cf.store.CFInit(context.Background(), cfKey, capacity, bucketSize)
	if err != nil {
		logger.Error("[cf] error initializing cuckoo filter", "error", err)
		return nil, err
	}
	return cf, nil
}
// IsDuplicate checks if the data is present in the cuckoo filter,
// adding it if it was not there yet.
func (cf *CuckooFilter) IsDuplicate(
	ctx context.Context, data any,
) (bool, error) {
	exists, err := cf.store.CFExists(ctx, cf.cfKey, data)
	if err != nil {
		logger.Error("[cf] error performing cf lookup", "error", err)
		return false, err
	}
	if exists {
		return true, nil
	}
	_, err = cf.store.CFAdd(ctx, cf.cfKey, data)
	if err != nil {
		logger.Error("[cf] error performing cf insertion", "error", err)
		return false, err
	}
	return false, nil
}
Lastly, the Cuckoo filter deduplicator is initialized and used as shown below:
type CuckooFilterDeduplicator struct {
	Repository KeyRepository
	Timeout    time.Duration
}

func NewCuckooFilterDeduplicator(
	ctxTimeout time.Duration,
	filterKey string,
	capacity, bucketSize int64,
	window time.Duration,
) (*CuckooFilterDeduplicator, error) {
	cf, err := cuckoofilter.New(
		os.Getenv("REDIS_HOST_URL"),
		filterKey,
		capacity,
		bucketSize,
		window,
	)
	if err != nil {
		return nil, err
	}
	if ctxTimeout < 5*time.Millisecond {
		ctxTimeout = 5 * time.Millisecond
	}
	d := &CuckooFilterDeduplicator{
		Repository: cf,
		Timeout:    ctxTimeout,
	}
	return d, nil
}
// IsDuplicate returns `true` if the data is present in the cuckoo filter.
func (d *CuckooFilterDeduplicator) IsDuplicate(
	ctx context.Context, data any,
) (bool, error) {
	ctx, cancel := context.WithTimeout(ctx, d.Timeout)
	defer cancel()
	return d.Repository.IsDuplicate(ctx, data)
}
Final Thoughts
My initial plan was to only build an expiring key repo, since that was what I needed for my work at the time, but while researching I learnt about Bloom and Cuckoo filters and thought, why not try them out too? It's always fascinating to learn about the ideas driving these seemingly big and complex systems out there, like targeted ad systems, product recommendations, etc.
I have tried to cover as much as I could here, but this is already long enough. If you want to see a simple demonstration of how I used the deduplicators, please check the github repo. If you have any improvements or spotted a bug, feel free to open an issue in the github repo. For a start, there's this issue.
If you've gotten this far thanks for sparing your time and see you next time. Sayōnara, tomodachi.