I've talked a lot about revocation in recent years, and the issues with the current methods for checking the revocation status of a certificate are well understood and widely documented. We may now have something that changes that, so let's take a look at CRLite!
We'll start with a few pieces of information to set the scene on what revocation is, why it's important and what's wrong with the current methods of revocation checking.
What is revocation?
Any website that wants HTTPS needs to obtain a certificate. Historically you'd have to pay for a certificate, but they can now be had for free from many providers including Let's Encrypt. These certificates are valid for a period of time, currently a maximum of 825 days (but soon 398 days), and can only be used during that time. The problem comes if someone manages to steal your private key, because then they can also use your certificate. This is of course a really bad situation, and the whole idea of having certificates becomes meaningless at that point, as someone else can now impersonate you and prove they are you, or even decrypt visitors' encrypted traffic to your site. In order to stop someone using your certificate you have to mark it as revoked so that browsers know not to trust it if they see it. The problem right now is that there are two ways you can do that and neither of them works in practice.
CRL and OCSP
I have a whole blog post on why Revocation is Broken, but I will give a TL;DR here covering both Certificate Revocation Lists and the Online Certificate Status Protocol.
A CRL is maintained by a CA and is a list of all of its revoked certificates. A browser needs to download these lists to check the certificates it encounters against them, and when these lists can weigh in at tens of megabytes, that becomes problematic very quickly. CRLs simply got too large and there's no real way to combat that problem. The other concern is that CRLs keep on growing, especially as the encrypted web is now growing faster than ever before, so they're not going to shrink and help solve the problem. Coupled with that is the problem posed by events like Heartbleed: a large-scale key compromise brings with it the requirement of a large-scale revocation event. Just look at the Netcraft stats following Heartbleed!
If you keep pumping all of these new revocations into the CRLs then they're going to see a big spike in size, and they did. The more interesting point to note is that the current rate of certificate revocations far exceeds the daily rates from 2014, even if you include Heartbleed!
Bearing these things in mind, it's clear that CRLs were not the way to go and haven't been for a very long time.
After CRLs came OCSP, a lighter approach that involved simply asking the CA for the revocation status of a particular certificate. This was great for performance compared to downloading CRLs, but terrible for privacy: just imagine asking the CA "Hey, is the certificate for scotthelme.co.uk revoked?" as you visit my site. What does the CA now know about you? Your browsing history and activity! Now imagine that happening with all of the websites you visit...
To try and fix this privacy compromise, a new feature called OCSP Stapling was introduced. It did help, but it only fixed the privacy and performance issues, not all of the issues with OCSP. With OCSP Stapling it is the website, in this example scotthelme.co.uk, that contacts the OCSP responder to fetch the OCSP response and then relays that to the client, meaning the client doesn't need to contact the OCSP responder itself. With no contact from the client to the OCSP responder, the privacy and performance issues are fixed, but OCSP Stapling was an optional feature for website operators to turn on.
To try and fix the optional nature of OCSP Stapling, which was the main security limitation and the reason it didn't fix revocation, OCSP Must-Staple was introduced. Unfortunately it isn't widely supported or deployed, so OCSP still falls short even today.
On top of these mechanisms having their own problems, they both have another in common: availability. What if the CA infrastructure is down? Can a CA keep this infrastructure online and performing well 24/7/365 whilst serving billions of requests per day? Back in 2013 Comodo hit a milestone of serving 2,000,000,000 OCSP responses per day! As time went by they'd go on to hit rates of 150,000 OCSP requests per second during their busiest periods, and this was 7 years ago when the encrypted web was a fraction of the size it is now. I can't find any recent numbers from CAs or their CDN providers, but just stop to think what that volume looks like today.
The trouble, of course, is that if their service stops working the whole thing breaks, and therein lies the major problem. We have a huge single point of failure for the whole system, and that's something we really can't afford, which is what led us to soft-fail revocation checking and ultimately why the whole revocation thing fell apart. We have to assume that a responder might be offline, broken or unavailable for some other reason, so browsers treat the OCSP response as optional and allow the check to fail open. With the OCSP response being optional, we can never rely on it.
How do we protect ourselves now?
One of the only things that site operators can do right now to protect themselves is to use shorter validity periods on their certificates. With a shorter-lived certificate an attacker has less time to use it after they steal the key, so you place yourself at less risk. I've covered other reasons why we need shorter certificates, and as much as it does help with the revocation problem, it's really not the ideal solution. Whilst certificates can currently be valid for a maximum of about 2 years (825 days), there was another recent attempt with Ballot SC22 to reduce that to 1 year (398 days). That ballot failed, but Apple have since decided to enforce the requirement to protect their users. As great as it is to see the maximum validity period of a certificate continue to tumble, we simply can't make enough progress fast enough.
The only other methods we currently have at our disposal are effective, but not on a large scale. In Chrome we have CRLSets and in Firefox we have OneCRL. These are lists of revoked certificates that each vendor maintains and bundles into their browser. Between them they cover barely a fraction of a percent of all revoked certificates, and they suffer from the same main issue as the CRLs mentioned above: size. It simply wouldn't be possible for a client like a browser to bundle in all CRLs, as that would require in the region of 2-3GB of storage. Whilst CRLSets and OneCRL protect us against the worst possible scenarios, they aren't going to help me and my visitors or anyone else reading this article.
What do we need from a revocation system for it to be viable?
There are clear issues with the 2 existing revocation mechanisms we've looked at, CRL and OCSP, so any new mechanism needs to avoid all of their shortcomings:
- ⚖️ The size issue of CRLs.
- 🎭 The privacy issues of OCSP.
- 📈 The performance costs of online checks.
- ❌ The single point of failure.
- 👎 The soft fail approach to revocation checking.
That's actually quite a tall order, and fixing all of those issues at once is going to take some serious work. The biggest issue that jumps out at me is that any external dependency is going to come with privacy and performance costs, along with the availability concern. If we bring everything local to the client, though, how do we avoid the huge size concern? Say hello to CRLite.
CRLite: A Scalable System for Pushing All TLS Revocations to All Browsers
That's the title of the official whitepaper introducing CRLite, and it's absolutely worth a read if you'd like the really technical details. While I will be going into some detail here, I'm not going to go to the depths that the whitepaper does. With that said, let's dig in and cover the first issue that CRLite addresses: size. If you want to build all of that revocation information into the client, it has to be small. Really small.
To get to the level of smallness required, CRLite uses something called a Bloom Filter to store data. According to Wikipedia, "A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set." That sounds really promising in that it's 'space-efficient', and the 'element' that we want to test is a certificate that may be a member of the revocation 'set'. The probabilistic part is quite interesting though, and it does present a problem. It means that a Bloom Filter can only tell us one of two things: that an item is "definitely not in the set" or that it is "maybe in the set". For the purposes of revocation, "definitely not in the set" is going to be the most frequent and desired answer, but "maybe in the set" does present some problems that we'll tackle later.
For now though, let's look at how a Bloom Filter works. We start with an empty bit array of, in my example, 10 bits, numbered 1 to 10 from left to right and all set to 0: 0000000000.
Now, to insert items into the Bloom Filter we need to hash them and set the appropriate bits. Let's say we want to insert 2 certificates: we'd hash them and then set their bits in the bit array.
hash(cert1) = 0010100100
hash(cert2) = 0100010010
With those bits now set, the array reads 0110110110, and the next step is to query the Bloom Filter. To do that you hash an item and check to see if all of its bits are set in the array.
hash(cert3) = 1000001100
We can say that cert3 is "definitely not in the set" because not all of the bits required are set. Whilst bit 8 is set, bits 1 and 7 are not, which gives us our definitive answer. What about another certificate though?
hash(cert4) = 0110000010
If we check the Bloom Filter for cert4 we're going to get a "maybe in the set", even though we know that cert4 is not in the set. A combination of bits from cert1 and cert2 has set all of the bits required for cert4 (bit 3 from cert1, bits 2 and 9 from cert2), and this is where the 'probabilistic' part comes in. It can look like an item was present in the set even though it was not: false positives are possible while false negatives are not. To reduce false positives like this you can control the size of the filter and also how many bits are set per item, but you can only ever reduce the chance of a false positive, you can't remove it. To completely remove the chance of a false positive you'd need to know every item that could ever be queried against the filter, so that you could find every collision in advance.
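To make the walkthrough concrete, here is a minimal Python sketch that replays the toy example with the exact bit patterns used above. It treats each "hash" as a fixed 10-bit string (leftmost character is bit 1) rather than computing real hashes, so the `TOY_HASHES` table and the helper names are purely illustrative.

```python
# Toy reproduction of the 10-bit example. Each "hash" is the fixed bit string
# from the text (leftmost character = bit 1); a real Bloom filter would derive
# bit positions from hash functions instead of a lookup table.
TOY_HASHES = {
    "cert1": "0010100100",
    "cert2": "0100010010",
    "cert3": "1000001100",
    "cert4": "0110000010",
}


def to_mask(bits: str) -> int:
    """Convert a bit string such as '0010100100' into an integer bitmask."""
    return int(bits, 2)


def insert(filter_mask: int, item: str) -> int:
    """Set every bit of the item's pattern in the filter (bitwise OR)."""
    return filter_mask | to_mask(TOY_HASHES[item])


def query(filter_mask: int, item: str) -> str:
    """Report 'maybe in the set' only if ALL of the item's bits are set."""
    pattern = to_mask(TOY_HASHES[item])
    if (filter_mask & pattern) == pattern:
        return "maybe in the set"
    return "definitely not in the set"


bloom = 0  # empty 10-bit array
for revoked_cert in ("cert1", "cert2"):
    bloom = insert(bloom, revoked_cert)

print(format(bloom, "010b"))  # 0110110110
for cert in ("cert1", "cert2", "cert3", "cert4"):
    print(cert, query(bloom, cert))
# cert1 maybe in the set
# cert2 maybe in the set
# cert3 definitely not in the set
# cert4 maybe in the set   <- the false positive
```

Insertion is just an OR of the item's bits into the array, and a query is an AND against them; that's why collisions can only ever produce false positives, never false negatives.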
I wonder if there's a way to know about all certificates in existence that could ever possibly be queried against our Bloom Filter... Thankfully we have Certificate Transparency, which provides us with a list of all currently issued certificates, and it turns out that comes in quite handy!
Using the example above, cert4 is a problem for us as a false positive: the browser might check it and determine that it's revoked when it actually isn't. If we query all current certificates in CT against the filter, false positives like this can be identified and resolved using another Bloom Filter, giving us a cascade of filters.
Cascading Bloom Filters
Here's our original Bloom Filter, with cert1 and cert2 inserted: 0110110110
If we detect a collision with cert4 when populating the first filter, we create a second Bloom Filter, which will be much smaller as it will contain fewer entries, and then set the bits for cert4 in it using a different hash:
different_hash(cert4) = 010100
This second filter acts like a whitelist of non-revoked certificates that were identified as revoked because of a collision in the first filter. If we get a "maybe in the set" from the first filter, we check the certificate against the second filter. In the second filter we don't find cert1 or cert2:
different_hash(cert1) = 101000
different_hash(cert2) = 000011
This means that cert1 and cert2 are definitely revoked: they were present in the first filter, and checking the second filter ruled out a false positive. We do have an issue with cert4, though. It matches the second filter, which tells us it was a false positive in the first filter, but a revoked certificate could also collide with bits in the second filter and wrongly look like a false positive. How do we catch that? Well, that's where the third filter comes in! I'm not going to roll out another example here, but you get the idea. Each new filter is only a small fraction of the size of the previous one, and at each level we alternate between checking for false positives against the list of all revocations and against the list of all known certificates. Eventually, we arrive at a level with no false positives, and a client can be sure whether a certificate is revoked or not.
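The cascade idea can be sketched in Python as below. This is a simplified illustration under my own assumptions, not CRLite's implementation: each level is a plain Bloom filter using salted SHA-256 (the per-level salt plays the role of `different_hash`), every level uses a fixed 8 bits per entry and 3 hash functions, and `build_cascade` and `is_revoked` are hypothetical names. It also assumes the certificate being queried comes from the known universe (the revoked and non-revoked sets, as CT would provide); for anything outside that universe the answer means nothing.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; bit positions come from salted SHA-256 (an assumption)."""

    def __init__(self, num_bits: int, num_hashes: int, salt: bytes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.salt = salt  # a different salt per level acts as "different_hash"
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions from SHA-256(salt || i || item).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(self.salt + bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True means "maybe in the set"; False means "definitely not in the set".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def build_cascade(revoked: set, valid: set, bits_per_item: int = 8, num_hashes: int = 3):
    """Add levels until one produces no false positives.

    Level 0 holds the revoked certificates, level 1 holds valid certificates
    that falsely matched level 0, level 2 holds revoked certificates that
    falsely matched level 1, and so on.
    """
    if revoked & valid:
        raise ValueError("a certificate cannot be both revoked and valid")
    levels = []
    include, exclude = set(revoked), set(valid)
    while include:
        level = BloomFilter(bits_per_item * len(include), num_hashes,
                            salt=len(levels).to_bytes(4, "big"))
        for item in include:
            level.add(item)
        levels.append(level)
        # Items from the other set that collide with this level become the
        # next level's entries; this level's entries are tested next time.
        include, exclude = {item for item in exclude if item in level}, include
    return levels


def is_revoked(levels, cert: bytes) -> bool:
    """Only meaningful for certificates in the known (CT-derived) universe."""
    for depth, level in enumerate(levels):
        if cert not in level:
            # Rejected by a revoked-level (even) -> not revoked;
            # rejected by a valid-level (odd) -> revoked.
            return depth % 2 == 1
    # Matched every level: the last level has no false positives, so its
    # contents decide (even index = revoked certificates).
    return len(levels) % 2 == 1


# Usage with made-up certificate identifiers.
revoked = {f"revoked-{i}".encode() for i in range(1_000)}
valid = {f"valid-{i}".encode() for i in range(20_000)}
cascade = build_cascade(revoked, valid)
assert all(is_revoked(cascade, c) for c in revoked)
assert not any(is_revoked(cascade, c) for c in valid)
print(f"{len(cascade)} levels")
```

The parity rule in `is_revoked` is the key design point: the first level that answers "definitely not in the set" ends the walk, and whether that level holds revoked or non-revoked entries decides the answer.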
Just think about that: the client can now make an accurate determination about the revocation status of a certificate, reliably, using data that it can hold locally! Let's look at our earlier list of requirements for a new revocation mechanism and see where we're up to.
- ⚖️ The size issue of CRLs.
- 🎭 The privacy issues of OCSP. ✔️
- 📈 The performance costs of online checks. ✔️
- ❌ The single point of failure. ✔️
- 👎 The soft fail approach to revocation checking. ✔️
We've removed the privacy issue because there's no online check like there was with OCSP. With no online check there's also no performance issue to worry about. There's no single point of failure because all we're depending on is the client itself, and we can hard fail on the check because the client has all of the information it needs stored locally. That just leaves the size issue at the top of the list, and looking at the original whitepaper, things do look promising in that regard:
"Moreover, CRLite has low bandwidth costs: it can represent all certificates with an initial download of 10 MB (less than 1 byte per revocation) followed by daily updates of 580 KB on average."
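For some intuition about why a single Bloom Filter isn't enough at around a byte per entry, the standard approximation for a Bloom filter's false positive rate is (1 - e^(-kn/m))^k, for m bits, n entries and k hash functions. The sketch below plugs in a few bits-per-entry values. It describes a single generic Bloom filter only; it is not how the CRLite whitepaper sizes its cascade levels, and the function names are my own.

```python
import math


def false_positive_rate(bits_per_item: float, num_hashes: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k, with bits_per_item = m/n."""
    return (1 - math.exp(-num_hashes / bits_per_item)) ** num_hashes


def best_num_hashes(bits_per_item: float) -> int:
    """k = (m/n) * ln 2 minimises the rate; rounded to a whole number of hashes."""
    return max(1, round(bits_per_item * math.log(2)))


for bits in (4, 8, 16):
    k = best_num_hashes(bits)
    print(f"{bits:>2} bits per entry, k={k}: {false_positive_rate(bits, k):.3%} false positives")
```

At 8 bits (1 byte) per entry, a single filter would still give roughly 2% false positives, and with every unrevoked certificate in CT checked against the first level, that's a lot of collisions. The cascade matters because it lets each level stay small while still ending with no false positives at all.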
Looking at a more recent blog post from Mozilla, we can see the current filter is only 1.3MB, but there are some caveats that come with the current implementation. To be included, a CA must publish a CRL, and not all CAs do; Let's Encrypt is the most notable CA that does not, so its certificates are not included in the filter and are not checked for revocation with this method. Even accounting for the CAs that are currently excluded, the projections for the size of a filter representing all revocations look good, so maybe we can just about cross off that size issue!
- ⚖️ The size issue of CRLs. ✔️
- 🎭 The privacy issues of OCSP. ✔️
- 📈 The performance costs of online checks. ✔️
- ❌ The single point of failure. ✔️
- 👎 The soft fail approach to revocation checking. ✔️
CRLite looks really promising and is currently deployed in Firefox Nightly for the purposes of gathering telemetry. The revocation status determined by CRLite isn't acted on just yet, and the browser still depends on OCSP, but if the data looks good then CRLite could start to be enabled in the future.
Having the ability to check revocation status like this, reliably and locally, will be fantastic, but it does raise a few questions. For example, what if I see a certificate that was issued after I last updated my CRLite data set? That certificate would clearly not be covered by CRLite, so what should the browser do? In this case I honestly think the browser should just accept the cert, as bad as that is. We know what all of the issues with OCSP are, and if it's a soft-fail check that's going to take place, we may as well skip it altogether.
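Here's a rough Python sketch of the policy I'm suggesting. Everything in it is hypothetical: `CertInfo`, `CRLiteSnapshot`, `covered_until` and `enrolled_issuers` are made-up names, not Firefox internals, and the `is_revoked` callable stands in for a lookup into the filter cascade. The point is only the ordering: first decide whether the local data covers the certificate at all (its CA publishes a CRL that was included, and it was issued before the last update), and only then trust the filter's answer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class CertInfo:
    """Hypothetical fields a client would need; not a real browser API."""
    fingerprint: bytes
    issuer: str
    issued_at: datetime


@dataclass
class CRLiteSnapshot:
    """Hypothetical local CRLite state: the cascade lookup plus what it covers."""
    is_revoked: Callable[[bytes], bool]  # stands in for a filter cascade lookup
    covered_until: datetime              # certificates issued later are unknown
    enrolled_issuers: set = field(default_factory=set)  # CAs whose CRLs were included


def check(cert: CertInfo, snapshot: CRLiteSnapshot) -> str:
    not_covered = (cert.issuer not in snapshot.enrolled_issuers
                   or cert.issued_at > snapshot.covered_until)
    if not_covered:
        # The local data says nothing about this certificate. The policy argued
        # for in the text: accept, rather than fall back to a soft-fail OCSP check.
        return "not covered: accept"
    return "revoked: reject" if snapshot.is_revoked(cert.fingerprint) else "good: accept"


snapshot = CRLiteSnapshot(
    is_revoked=lambda fp: fp == b"stolen-key-cert",
    covered_until=datetime(2020, 1, 1, tzinfo=timezone.utc),
    enrolled_issuers={"Example CA"},
)
old = CertInfo(b"stolen-key-cert", "Example CA", datetime(2019, 6, 1, tzinfo=timezone.utc))
new = CertInfo(b"brand-new-cert", "Example CA", datetime(2020, 2, 1, tzinfo=timezone.utc))
print(check(old, snapshot))  # revoked: reject
print(check(new, snapshot))  # not covered: accept
```

A browser could just as reasonably choose a different fallback in the "not covered" branch; the sketch only makes the trade-off explicit.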
Another question or point I expect to be raised is that if we do get reliable revocation checking, then we can go back to the glory days of 5-year certificates and stop bothering with all of this shorter certificates nonsense. Let me just shoot this one down now before anyone raises it. Reliable revocation does not mean we can have longer certificates. Yes, limiting the damage of not being able to revoke a certificate is one of the key reasons for shorter certificates, but it is not the only key reason. As such, despite the fact that fixing revocation is a truly awesome step forwards, it doesn't mean you can have your longer certificates back. So don't ask.
Reading through the whitepaper and looking at the implementation details of CRLite so far, I'm very optimistic about its potential to solve the revocation problem. There are a few areas where it's raised some concerns for me though, and I'll share them here to gather feedback, or perhaps someone can answer them.
Assuming we hit the 10MB download for the initial filter, and we do need 580KB/day of updates, what kind of clients are we excluding? I'm sure if you're reading this on your expensive laptop or phone, these kinds of numbers won't be a concern to you. My laptop is hooked up to my WiFi (which in the UK is using my pitiful 80Mbps Internet connection) and my phone is hooked up to 4G, which gives me around 100Mbps of bandwidth. My devices have the 10MB of storage to spare, the bandwidth cost is basically nothing, and my devices have the performance to use the filter without any noticeable impact. But what about everyone else? Is 10MB a reasonable storage impact to have on all devices? I'm thinking not just about cheaper smartphones but also about smart devices and IoT. What about devices in constrained bandwidth environments? Could CRLite be used there, should it be used, and was it ever intended for anything outside of a browser? I don't know the answers to these questions; they're just some of the things that came to mind during my research.
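To put rough numbers on that, here's a back-of-the-envelope calculation in Python using the whitepaper's figures of a 10 MB initial download and 580 KB of daily updates on average. The link speeds are example values (100 and 80 Mbps to match my connections above, plus two hypothetical constrained links), and I'm treating 1 MB as 1,000 KB.

```python
# Figures quoted from the whitepaper; 580 KB/day is an average, not a constant.
INITIAL_MB = 10
DAILY_UPDATE_KB = 580


def yearly_download_mb(days: int = 365) -> float:
    """Initial filter plus daily updates over `days`, in MB (1 MB = 1000 KB)."""
    return INITIAL_MB + DAILY_UPDATE_KB * days / 1000


def transfer_seconds(megabytes: float, link_mbps: float) -> float:
    """Time to move `megabytes` over a link of `link_mbps` megabits per second."""
    return megabytes * 8 / link_mbps


print(f"First year: {yearly_download_mb():.1f} MB")  # First year: 221.7 MB
for mbps in (100, 80, 1, 0.1):  # the last two are hypothetical constrained links
    print(f"Initial {INITIAL_MB} MB at {mbps} Mbps: {transfer_seconds(INITIAL_MB, mbps):.1f} s")
```

That works out at roughly 220 MB in the first year, and the initial 10 MB alone takes well over a minute on a 1 Mbps link, which is trivial for my devices but not necessarily for everyone's.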
Yep, that's right, there are some concerns I have about reliable revocation checking. Just think about how powerful a mechanism CRLite is. You can revoke a certificate and make sure it never gets trusted by a client again. That's a 100% fantastic thing if you've been compromised and lost your key, but maybe it's not the only time a certificate could get revoked.
I've written about this concern in various forms over the years, and I have a blog post titled When your CA turns against you. You should read the post for full details, but the point is that if a CA issues you a certificate and then, for some reason, decides to revoke it, the certificate stops working. In that particular post the reason was that the CA later became unhappy with some information in the application, so they killed the certificate with a revocation after issuance, but in what other scenarios might a CA do this?
What if a DMCA takedown notice is used to try and kill the certificate of a site? Maybe a court order or a government-enforced revocation? We then get into the realms of censorship too. What if a particular government doesn't like a particular site and compels the revocation of its certificate? Now, I know there are other mechanisms that are already used today to do such things, but it feels like going after the certificate with a reliable revocation in our (eventually) 100% encrypted world would put an awful lot of power in the hands of the CA. Based on all of my past experiences with CAs, that makes me really uncomfortable. I don't know how this will pan out, or if it will be a concern in the long run, but it's certainly worth mentioning and thinking about as we move forwards.
All said and done, I'm very impressed by CRLite and think it has a genuine chance to mostly solve the revocation problem. In your browser or your OS, CRLite would have a minimal impact for most people, and having a local database of all revoked certificates that you can query would be a great thing. Beyond that, I wonder if we will see other uses for CRLite surface too. Much like we see uses for the HSTS Preload list beyond clients upgrading network requests, perhaps we will see interesting new ways to use CRLite beyond what it was originally intended for.
Useful links and info
Introducing CRLite: All of the Web PKI’s revocations, compressed - a blog post from Mozilla introducing CRLite.
The End-to-End Design of CRLite - a more technical post from Mozilla with details on CRLite.
OCSP Stapling - my blog on OCSP Stapling
OCSP Must-Staple - my blog on OCSP Must-Staple
Revocation Is Broken - my blog detailing why revocation is broken.
Why we need to do more to reduce certificate lifetimes - my blog on technical reasons to reduce certificate lifetimes.
Revocation checking and Chrome's CRL - by Adam Langley
Revocation still doesn't work - by Adam Langley
No, don't enable revocation checking - by Adam Langley