Writing a bloom filter in go
Let’s assume you’re assigned a task to implement a feature which will check whether a username is already taken or not.
You: Sounds easy.
Most probably your first intuitive approach will be to check for username availabilty in your database everytime a new user tries to signup. Right?
Manager: Good, but does this going to scale well? can we make it more efficient?
Somehow we have to minimize:
- DB search
- Network calls
You: Perfect, I know a data structure highly appropriate for these kind of problems and it is known as Bloom Filter.
<FeelingProudOfYourself.gif>
OKay, What is a Bloom Filter?
Bloom filter is a probalistic Data Structure designed by Burton Howard Bloom in 1970. It is primarily used for answering set membership queries i.e., check if an element x is present in a set S or not.
Although prima facie it’s use-case seems similar to hash tables (fast lookup) but it is very space efficient than conventional hash tables and therefore preferred in low core memory scenarios.
It is probalistic in nature as it may return false positives. If bloom filter says an element is not present then we can be 100% sure that it’s not but for any positive response, the element may or may not be present in the set and in this case we have to query our primary source (may be DB) for final confirmation.
One may argue that we’re stil end up calling our primary source for the final confirmation then what’s the benefit, but let me tell you that with proper configuration (using mathematics) we can reduce these calls to a huge extent (almost negligible).
For calculating optimal value of m
and k
from total no. of elements n
and
error rate e
see formula in the program below or see here
While risking false positives, bloom filter has substantial space advantage over other Data Structures like hash tables, tries, binary search trees, etc. bacause these DS store the element itself which can require small no. of bits (for ints) to arbitrary large no. of bits (for strings). Bloom filter doesn’t store element at all and a separate solution like a DB should be used to store actual element.
Important: A Bloom filter with a 1%
error and an optimal value of k
, in contrast, requires
only about 9.6 bits
per element, regardless of the size of the elements
Data Structure
Bloom filter is a bit vector of m bits where initially all bits are set to
0
. It also has k (k
< < m
) hash functions where each of them maps or hashes some set element to
one of the m array positions.
Algorithm
To add an element:
- Pass element through
k
hash functions to getk
indices (k < < m
). - Set bit at these indices to 1.
To check set membership of an element
- Pass element through
k
hash functions to getk
indices - Check bit at these indices, if any bit is 0 then immediately return
false
. - If all bits are 1, then return
true
but remember this may be a false positive and we have to confirm its presence with primary source which in our case is our DB.
Why false positives?
With time, and even with highly uniform hashing algorithm, most of indices will be 1
and if when we now query an element then the probablity of returning all 1s
will be
high.
For ex:
foo
set bits at index1
,3
, and5
bar
set bits at index2
,4
,6
- Now, we want to check for
baz
but our bit vector is now almost full of1s
andbaz
maps to indices2
,5
, and6
which are all1s
but we know thatbaz
was never present in the set and hence it’s a false positive as these bits weren’t set bybaz
itself.
But with proper configuration using mathematics this situation can be avoided easily.
Program in golang
|
|
Driver Program
|
|
Conclusion
So, that was our attempt to implement a very simple bloom filter from scratch in golang.
Simple bloom filters are just the start. Bloom filter serve for variety of use-case and more advanced versions of it are being used at low latency large scale environments.
Some of the products using bloom filters are:
- Akamai uses bloom filters to avoid “one-hit wonders”
- Medium relies on bloom filters to avoid recommending articles user has previously read
- Chrome used to use them to protect users from opening malicious sites
- Many DBs use them to avoid disk lookups for non-existent rows and columns
- Ethereum uses Bloom filters for quickly finding logs on the Ethereum blockchain
- and infinitely more use-cases and applications.
To learn more about bloom filters in details see it’s Wikipedia page
And, for math behind bloom filters, see this
#TillThenHappyCoding