Mar 26, 2024

Implement a bloom filter in Rust

In this article we are going to build a bloom filter in Rust without any libraries

What are Bloom filters

ChatGPT describes Bloom filters as

“A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.”

What does this mean? ‘space-efficient’ means it doesn’t take up much space in memory, ‘probabilistic’ means that the returned result is not always 100% accurate, it is used to test wether an item exists in a list, this is useful for stuff like checking if a user is in a block list, or if an email source is on a spam list

when a bloom filter tells you an item doesn’t exist it is correct, but when it returns true there is a chance that the item really doesn’t exist, this is the tradeoff for the speed and efficiency

Bloom filters are not the best option for when you need to know for sure if an item exists, for example we wouldn’t use a bloom filter to check if a username is already taken, but it’s good for a lot of purposes and regardless it is a pretty cool algorithm and suprisingly simple

check out this amazing article by samwho to learn more about bloom filters samwho.com

Let’s start by defining our struct

struct BloomFilter {
    bit_vector: Vec<bool>,
    size: usize,
    hash_functions: usize,
}

This struct has just 3 properties, bit_vector is where we will store the data, size is the size of the array, the bigger it is the less likely it is you’ll get a false positive but it will also be slightly slower, hash_functions is the amount of hashes we calculate for each item

Now let’s add the functions

impl BloomFilter {
    fn new(size: usize, hash_functions: usize) -> BloomFilter {
        BloomFilter {
            bit_vector: vec![false; size],
            size,
            hash_functions,
        }
    }
}

So far we just created the struct and added the new method so we can create a BloomFilter instance, now let’s add the hash function, this is what’s called to hash an item

inside the impl add the following function

fn hash(&self, item: &str) -> Vec<usize> {
    let mut result: Vec<usize> = Vec::new();
    for i in 0..self.hash_functions {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        i.hash(&mut hasher);
        let hash = hasher.finish() as usize % self.size;
        result.push(hash);
    }

    return result
}

This is a simplified version, we start by looping over a range so we get the amount of hashes we specified earlier, for each iteration we create a hasher and hash the item, the important thing here is how we call i.hash this makes every iteration unique, we also divide the result by the length of the bit vector and use the remainder, this ensures the result fits inside the vector

Last step we add function to add and check items

fn add(&mut self, item: &str) {
    for i in self.hash( item) {
        self.bit_vector[i] = true;
    }
}

fn contains(&self, item: &str) -> bool {
    for i in self.hash(item) {
        if !self.bit_vector[i] {
            return false;
        }
    }
    true
}

Those function are very similar, the first one is setting the bit to true for each hash, the second one is checking if the bit is false for each hash, if a single bit is false we know the item doesn’t exist so we can immediately return false, otherwisee we still don’t know for sure if the item is there but we return true

You may notice there are no methods to remove items and that’s because by default Bloom filters do not allow removing items, it makes perfect sense if you think about it, we can’t remove a bit because we don’t know if the bit belongs to any other items so we will end up removing additional items, there are variations fo bloom filters that allow that but I will leave that for another post

Now we can go ahead and test our Bloom filter, in our main function let’s add the following and check the result

fn main() {
    let mut filter = BloomFilter::new(1000, 3);
    filter.add("hello");
    filter.add("world");

    println!("Contains 'hello': {}", filter.contains("hello"));
    println!("Contains 'world': {}", filter.contains("world"));
    println!("Contains 'rust': {}", filter.contains("rust"));
}

Ruun this and you should see this in your terminal

Contains 'hello': true
Contains 'world': true
Contains 'rust': false

And that’s it, I wrote this article because I pushed off learning Bloom filters for a long time because I was intimidated by them, as you can see it is a very simple and elegant solution, I hope you found this interesting, catch you in the next one