Story telling
Here’s a story that I’m sure is familiar to you. You want to do something so you sign up for an online service to help you as quickly as possible. Typing is boring and filling out forms is boring.
How much time do you spend carefully crafting the password for that new site?
If you’re reading this, you would probably tell me it takes a few seconds for you to summon your password manager, type in your master password and generate a new password.
For many people though, it is simply the time it takes for them to remember their only password. The same password they use for everything from email to random startup in some far-off country.
Imagine you’re a hacker. You want to log in to other people’s accounts as quickly as possible to spread pictures of your cat all around the internet because she’s the best.
You could try every combination of letters, numbers and symbols for every account but that’d take forever.
It turns out that other hackers before you have already posted millions of passwords online. These are passwords that people actually used. It’s a dream come true. As a hacker, you can try these, or the most common ones. If you strike gold quickly that’s great. If not, you just move on to the next account looking for the easy pickings.
Sadly, this is where we are today. As a developer at a SaaS company, how can I keep my users safe when they insist on reusing leaked or insecure passwords?
Downloadable lists
I could make them choose an arbitrarily long password but then they may just type “passwordpasswordpassword”.
There is an alternative thankfully. We can use these same lists of millions of passwords to help our users. Troy Hunt has ethically collated many lists for us. He has even gone so far as to provide an API!
Wait, you want me to send my users’ passwords to an API? Not quite. The API takes the first few hex characters of a password hash and it will return the remainder of the hashes matching this prefix for us to check the against the full hash.
This is great but some of us are a little uneasy about sending our users’ passwords externally in any form even if it is just part of a hash.
Troy provides the hashes (currently 517m of them) to download. 517 million? How many GiB of storage does that require? 10.3 GiB compressed.
Fancy that in a docker container and loading that into RAM before the service can start?
As I didn’t really want a 10.3 GiB docker container in my Kubernetes cluster at work I decided to try something else. What if I don’t actually store the passwords? I was able to reduce the storage requirements to 2.1 GiB in exchange for a less than 1 in 1,000,000 chance of being wrong.
Bloom filters
Enter the Bloom filter. Invented by Burton Howard Bloom, it can store a hash of each piece of data and when queried can say that the item was never stored or that it was probably stored.
What do you mean “probably?” A Bloom filter is a probabilistic data structure. It works by hashing data and using the hash to determine which bits to set or “turn on”. When checking for the presence of a value, it is hashed and the bits are checked. If a bit is not set or “off” then we can be certain that the value was not added to the Bloom filter. If all of the hashed bits are set, then this hash or other hashes have set these same bits. If this sounds interesting, I encourage you to investigate Bloom filters further.
I said earlier that Troy provides hashes of passwords and the Bloom filter hashes the values. It is important to choose a hashing algorithm that has a uniform distribution for a Bloom filter and in my tests this appears to be true of SHA1 (the hash used by Troy). While some hashes are faster than others, hashing is slow and the fastest algorithm does no work. I decided to eliminate the second hashing step. Passwords are hashed and sent to the password checking service and it simply looks up the hash in the filter. This avoids transferring plain text passwords over the internal network.
After developing this service, my employer has embraced it and now we prevent our users from setting or changing their password to something that is already known by the internet and any would-be hackers.
Challenges
No system is perfect. It turns out that loading the Bloom filter (2 GiB of data) from disk into RAM is not super fast so I have set a large initial delay for the readiness and liveness probes in Kubernetes. I also had to decide on some wording for a validation message to try and explain to a less-technical user why they couldn’t use that password.
New passwords are stolen and leaked on a frequent basis and it isn’t possible to keep up with them all. This list of passwords will also become out of date. While these concerns are valid and a list like this isn’t bulletproof, it will go a long way to help and leaked passwords from years ago are still leaked.
The one-off processing of the list into the Bloom filter takes some time. It is no surprise that loading 30 GiB of uncompressed data is slow. This happens when the CI server builds the docker image so users of the service don’t need to download and process this data at runtime. I did try to parallelise the loading of data but it became significantly slower and the number of lines and complexity of the code grew substantially. I did manage to improve performance somewhat over a single core by batching inserts into the Bloom filter and looping through the bytes in the filter once per batch instead of once per hash. I looked at how complex the code had become and weighed that against the modest gain in performance and decided that for a job that only runs on CI occasionally, it was not worth the cost in maintenance, potential bugs and readability.
Try it yourself
My password checking service is written in Go and the source code is available on Github. There are some library functions available if you need to customise it or use a different monitoring system. Why not help your users and help them secure their accounts?