Is this your first time hearing about “Bloom Filter“ ? Then you have reached the best place. I was in the same place when I read BLOOM FILTER for the first time in a youtube thumbnail. This mysterious name got me curious as the name does not give any hint about its usage or usefulness. So since then i have been learning about bloom filter and this article is kind of like a compilation of many different resources in order to de-mystify the WHAT, WHY and HOW of Bloom Filter.

The What

Bloom Filter helps is filtering. Eureka! The end.

Lets go a little deep though. Bloom filter helps in finding out if something exists or not in the set or db.

It is a probabilistic data structure based on hashing. Why probabilistic ? Because bloom filter gives us a probabilistic answer. If the bloom filter returns no as an answer that means the item definitely does not exist in the collection. But if the bloom filter returns a yes that means the item might be present in the collection. Hence bloom filter is a PROBABILISTIC data structure.

If the bloom filter returns a positive answer, we need to check the collection to ensure if the item actually exist in the collection or not. If the item does not exist, then it is said to be a False Positive.

In short, if a bloom filter returns a negative then the element definitely does not exist in the collection. On the other we need be careful of false positives if the filter returns a positive answer.

Bloom Filter is like a Hash Table but instead of mapping a key to an element we store a set of bits determined by passing the element through multiple hash functions. Say we want to add “Hello“ to the set. We pass “Hello“ to a hashing function and we get 9 as a result. So in a bit array we store 1 in the 9ᵗʰ index.

Since we are storing bits in place of actual values the space complexity reduces drastically.

The Why

The reason why bloom filter is popular is because of its efficient space and time complexity. A popular usecase of the filter is check if a username is taken. That annoying step you have to go through each time you register in a site. Instead of using linear search to check each one of the million usernames, we can use bloom filter to know if a username exists or not pretty quickly.

Now, you may wonder—if the filter can return false positives, don’t we still have to perform the search anyway? The key benefit lies in how the filter handles negatives. When the Bloom filter says a username does not exist, we can trust that result and completely avoid searching the database. This eliminates unnecessary work in the vast majority of cases, making checks both faster and more scalable. And the number of false positives can be reduced depending on the number of hashing functions we want to include and available space.

Why does False Positive occur ? Let’s understand with the help of an example.

suppose we have a bit array of size 10 and use 2 hash functions.

Insert element X → hashes to positions 2 and 5. We set those bits to 1.
Insert element Y → hashes to positions 3 and 5. We set those bits to 1.

Now, if we check for element Z, it might hash to positions 2 and 3. Since both these bits are already set to 1 (from X and Y), the Bloom filter will incorrectly say Z exists — even though it was never inserted. This is a false positive.

The How

We take a fixed size bit array (length m) and a fixed number of hashing functions (k).

Inserting

We apply the k number of hashing functions on the element we want to insert in the set. We get k numbers as a result from each of the hashing functions and then we store 1 in the bit array using each of the k results as indices.

Searching

When we search for an element, we pass it through the same k hashing functions and check the bits and each indices. If any one of the bit is 0, then the element does not exist in the set. If all the bits are 1, then the element might exist in the set.

Choosing k

Without going into too much details and math, the False Positive Rate formula:

$$(1−e⁻ᵏⁿ/ᵐ)ᵏ$$

Depending on the space constraint, we would have chosen m (bit array size) and we can estimate the number of entries that will be made, n. So we can only decide the number of hashing functions to reduce the number of false positives.

From the formula (1−e⁻ᵏⁿ/ᵐ)ᵏ, it can be observed that increasing k reduces the false positive rate. So you might wonder that we can take an enormous k. But that is not computationally efficient. It will increase the processing time drastically. So the optimum number of hash functions can be found via this formula, which minimizes the false positivity rate:

$$k=ln(2)⋅m/n.$$

Time Complexity

For both searching and insertion, the time complexity is O(k) since we just need to pass the input through the k hashing functions.

Space Complexity

The space complexity for the bloom filter is O(m). Because we are just storing bits in the bit array not the actual input, if the size of the bit array taken is m then the space complexity is also O(m)

Implementation

Now lets get started with implementation, the fun part. The implementation is for academic purposes only and cannot be used for production grade applications.

The simple implementation

package main

import (
    "crypto/md5"
    "crypto/sha256"
    "encoding/base64"
    "fmt"
    "hash/fnv"
    "math"
)

type BloomFilter struct {
    m     int
    k     int
    data  []int
    count int
}

func NewBloomFilter(m, k int) *BloomFilter {
    return &BloomFilter{
        m:     m,
        k:     k,
        data:  make([]int, m),
        count: 0,
    }
}

func (bf *BloomFilter) Insert(element string) {
    if bf.k == 1 {
        h1Idx := h1(element) % bf.m
        bf.data[h1Idx] = 1
    } else if bf.k == 2 {
        h1Idx := h1(element) % bf.m
        h2Idx := h2(element) % bf.m
        bf.data[h1Idx] = 1
        bf.data[h2Idx] = 1
    }
    bf.count++
}

func (bf *BloomFilter) Search(element string) string {
    if bf.k == 1 {
        h1Idx := h1(element) % bf.m
        if bf.data[h1Idx] == 0 {
            return "Not in Bloom Filter"
        }
    } else if bf.k == 2 {
        h1Idx := h1(element) % bf.m
        h2Idx := h2(element) % bf.m
        if bf.data[h1Idx] == 0 || bf.data[h2Idx] == 0 {
            return "Not in Bloom Filter"
        }
    }

    // false positive probability
    prob := math.Pow(1.0-math.Pow(1.0-1.0/float64(bf.m), float64(bf.k*bf.count)), float64(bf.k))
    return fmt.Sprintf("Might be in Bloom Filter (false positive probability %.6f)", prob)
}

// h1 using md5
func h1(w string) int {
    hash := md5.Sum([]byte(w))
    encoded := base64.StdEncoding.EncodeToString(hash[:])
    return simpleHash(encoded[:6])
}

// h2 using sha256
func h2(w string) int {
    hash := sha256.Sum256([]byte(w))
    encoded := base64.StdEncoding.EncodeToString(hash[:])
    return simpleHash(encoded[:6])
}

// helper: convert string -> int
func simpleHash(s string) int {
    h := fnv.New32a()
    h.Write([]byte(s))
    return int(h.Sum32())
}

func main() {
    bf := NewBloomFilter(10, 2)

    bf.Insert("apple")
    bf.Insert("banana")

    fmt.Println(bf.Search("apple")) // Might be in Bloom Filter(false positive probability 0.118267)
    fmt.Println(bf.Search("cherry")) // Not in Bloom Filter
}

Reference

Bloom Filter, Brilliant.org https://brilliant.org/wiki/bloom-filter/
Geeks for Geeks https://www.geeksforgeeks.org/python/bloom-filters-introduction-and-python-implementation/
Youtube Video - ByteMonk https://www.youtube.com/watch?v=GT0En1dGntY

Bloom Filter

The What

The Why