Well, the Internet Apocalypse came and went! Due to the recent expiration of the Let's Encrypt intermediate and root certificates, I saw more widespread issues than I was expecting, but on different devices and for different reasons than I thought. Let's take a look at what happened and why.

All Great Things Come To An End

Every Certificate Authority out there is powered by Root Certificates installed on your device, and all of those Root Certificates will eventually expire. With a lifespan that's typically 20-25 years, we're only just seeing the first major Root Certificates coming up to their expiration. A Root Certificate expiring is meant to be a non-event. New Root Certificates are created to replace the expiring ones and they're distributed via updates to all of the clients out there, often many years in advance. But there's a catch: not all client devices out there are being updated, and some don't even have updates available. This was the first big reason we were expecting to see issues when the Let's Encrypt Root Certificate expired: legacy clients wouldn't have received the new Root Certificate they need to continue working.

I wrote a blog post back in Apr 2019 titled "Let's Encrypt to transition to ISRG root", which talked about the process of Let's Encrypt planning to move from their old Root Certificate to their new one, but the move was ultimately postponed. I then wrote about it again in Sep 2020, "Let's Encrypt postpone the ISRG Root transition", and as you can tell from the title, that move was also postponed. Let's Encrypt explained:

Due to concerns about insufficient ISRG root propagation on Android devices we have decided to move the date on which we will start serving a chain to our own root to January 11, 2021.

Loosely translated, they were worried that not enough Android devices had received the update that contained the new Root Certificate in the almost 5 years since it was made available. The Jan 2021 deadline was also postponed and in the end it was realised that time was running out, fast, and a new solution to this problem was required.

Supporting Legacy Devices

The concern was these older devices that were not getting updates and may never get an update, but how do we fix that? Well, we issue even more certificates of course! Let's Encrypt issues new Root and Intermediate Certificates. To oversimplify a very complicated process, it was realised that these old devices were unlikely to check the expiration date on the Root Certificate so they may actually continue to use it. The problem is that newer, more modern devices will check the expiration date so they can't use it. How do we balance the needs of modern and legacy clients at the same time? Well, that was done with something called a cross-signature and is far beyond the scope of this post-mortem, but if you are interested in more technical details, I have a 4-part blog series that will take you through everything you might like to know:

  1. The Impending Doom of Expiring Root CAs and Legacy Clients
  2. The Complexities of Chain Building and CA Infrastructure
  3. Cross-Signing and Alternate Trust Paths; How They Work
  4. Finding alternate trust paths the easy way; Introducing Chain Builder

With this new, cross-signed certificate in place, websites and servers that use Let's Encrypt certificates will now need to serve a new certificate chain, but that shouldn't have caused any problems at all, yet it did.
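To make the balancing act a little more concrete, here's a toy model of the idea described above. It is only a sketch and not how any real TLS library builds chains: the `Root` and `Client` classes, the `can_validate` function and the `checks_root_expiry` flag are all hypothetical names made up for illustration, and the ISRG Root X1 expiry date is an assumption that doesn't come from this post.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Root:
    name: str
    not_after: datetime


@dataclass(frozen=True)
class Client:
    trusted: frozenset          # names of the roots in the client's trust store
    checks_root_expiry: bool    # legacy clients are assumed not to check


# DST Root CA X3 expiry as given in this post; the ISRG Root X1 date is an
# assumption added purely for illustration.
DST_X3 = Root("DST Root CA X3", datetime(2021, 9, 30, 14, 1, 15, tzinfo=timezone.utc))
ISRG_X1 = Root("ISRG Root X1", datetime(2035, 6, 4, tzinfo=timezone.utc))


def can_validate(client, anchors, now):
    """Return True if at least one candidate anchor is acceptable to the client."""
    for root in anchors:
        if root.name not in client.trusted:
            continue  # the client has never heard of this root
        if client.checks_root_expiry and now > root.not_after:
            continue  # the client knows the root but rejects it as expired
        return True
    return False


after_expiry = datetime(2021, 10, 1, tzinfo=timezone.utc)
anchors = [ISRG_X1, DST_X3]  # the cross-signed chain can end at either root

legacy = Client(frozenset({"DST Root CA X3"}), checks_root_expiry=False)
modern = Client(frozenset({"ISRG Root X1", "DST Root CA X3"}), checks_root_expiry=True)
stuck = Client(frozenset({"DST Root CA X3"}), checks_root_expiry=True)

for label, client in [("legacy", legacy), ("modern", modern), ("stuck", stuck)]:
    print(label, can_validate(client, anchors, after_expiry))
```

In this simplified model the legacy and modern clients both keep working after the expiry, while a client that only knows the old root and does check its expiry date is left with nothing to anchor on.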

The First Issue - 29 Sep 2021 19:21:40 UTC

When you get a new server certificate for your website or service, you also get the intermediate certificate/s that you need to make it work. This is commonly referred to as the certificate bundle and if you've handled certificates before, it will be familiar to you. This new change to support legacy devices by Let's Encrypt required everyone to use a new certificate bundle, but nobody actually had to do anything. You can see in this update from Let's Encrypt about Production Chain Changes that the new default chain, starting 4 May 2021, "will remain compatible with many Android devices, thanks to the cross-sign!".

If we consider that all Let's Encrypt certificates are valid for 90 days, that means that all certificates issued prior to 4 May 2021 will have expired by 2 Aug 2021 and will have been replaced. That means that everyone, everywhere, should have received the brand new certificate bundle with their new certificate long before the Sep 2021 deadline. That means everyone will be compatible with legacy devices and modern devices before the deadline, but that's not what happened.
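The date arithmetic above is easy to check for yourself. This is a minimal sketch using only the dates mentioned in this post; the variable names are my own.

```python
from datetime import date, timedelta

CHAIN_SWITCH = date(2021, 5, 4)    # new default chain starts being served
VALIDITY = timedelta(days=90)      # lifetime of a Let's Encrypt certificate
R3_EXPIRY = date(2021, 9, 29)      # old R3 Intermediate Certificate expires

# The last certificate issued with the old chain expires 90 days after the switch.
last_old_chain_expiry = CHAIN_SWITCH + VALIDITY
margin = R3_EXPIRY - last_old_chain_expiry

print(last_old_chain_expiry)   # 2021-08-02
print(margin.days)             # 58
```

So even a certificate issued at the last possible moment with the old chain should have been renewed, with the new bundle, almost two months before the intermediate expired.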

We saw a wide selection of servers and services continuing to serve the old Intermediate Certificate with their new server certificate, a bit of an odd mismatch and something that wasn't expected because frankly, it doesn't make much sense to do this...



I tweeted about the Intermediate Certificate expiring thinking that it would be a total non-event because this should have been fixed long ago without anyone having to do anything, but the reports of things breaking soon started to come in. It turned out that there were a load of servers out there still serving their new server certificate with the old Intermediate Certificate that had now expired?! The fact is that most clients aren't capable of dealing with this scenario, so we started to see many reports of browsers showing a certificate expiry error for these sites.

Without digging a little deeper it does look like the server certificate has expired, and indeed I saw many organisations being told exactly that across social media, but it's actually the Intermediate Certificate in the certificate chain that has expired, not the server certificate! A small selection of modern clients can work around this issue for you, and indeed my Chrome on Windows 10 combination is good enough to figure it out. You can try your client here: https://expired-r3-test.scotthelme.co.uk/
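If you want to see which link in a chain is the problem, the check itself is simple once you have the expiry date of each certificate. The sketch below assumes you've already extracted those dates by some other means. The `expired_links` function is a hypothetical helper, and the server certificate's expiry date is made up for the example; only the R3 date comes from this post.

```python
from datetime import datetime, timezone


def expired_links(chain, now):
    """Return the names of the certificates in the chain that have expired."""
    return [name for name, not_after in chain if now > not_after]


chain = [
    # Hypothetical, freshly renewed server certificate.
    ("server certificate", datetime(2021, 12, 1, tzinfo=timezone.utc)),
    # The old R3 intermediate, with the expiry time given in this post.
    ("old R3 intermediate", datetime(2021, 9, 29, 19, 21, 40, tzinfo=timezone.utc)),
]

now = datetime(2021, 9, 29, 20, 0, tzinfo=timezone.utc)
print(expired_links(chain, now))  # ['old R3 intermediate']
```

The server certificate is perfectly valid here, yet the chain as a whole fails, which is why the browser error is so misleading.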

The problem is that most clients won't be able to deal with this gracefully and although there is a way to fix it, by finding an alternate trust path using AIA fetching or something like a local Intermediate Certificate cache, it's a complicated process that few clients do.

As to 'why' sites were still serving the expired Intermediate Certificate, well, I just don't know... For those sites running on IIS, which is not my area of expertise, it appears that IIS would continue to use the expired R3 Intermediate Certificate as it didn't realise it was expired. A restart of the server would trigger the new certificate chain to be served, but the need for manual intervention, and not knowing what you needed to do, did cause prolonged issues. As for other platforms, as I say, I just don't know. Treating an Intermediate Certificate like it's a static part of your configuration is extremely bad and will eventually cause issues. The whole certificate bundle should be updated each and every time you get a new server certificate!
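One way to catch a stale intermediate is to compare the chain your server is configured with against the bundle that came with your latest certificate. The following is a rough sketch using only the Python standard library. The `fingerprints` and `stale_intermediates` functions are hypothetical helpers, and it assumes both inputs are PEM text with the server certificate first.

```python
import base64
import hashlib
import re

PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", re.DOTALL
)


def fingerprints(pem_text):
    """Return the SHA-256 fingerprint of each certificate in a PEM bundle, in order."""
    prints = []
    for body in PEM_BLOCK.findall(pem_text):
        # Strip the line breaks inside the block before decoding to DER bytes.
        der = base64.b64decode("".join(body.split()))
        prints.append(hashlib.sha256(der).hexdigest())
    return prints


def stale_intermediates(configured_pem, issued_pem):
    """List configured chain certificates that are missing from the issued bundle."""
    issued = set(fingerprints(issued_pem))
    # Skip index 0: by assumption the first certificate is the server certificate.
    return [fp for fp in fingerprints(configured_pem)[1:] if fp not in issued]
```

You would feed it the contents of the chain file your server actually uses and the bundle from your most recent issuance; an empty list means every intermediate you serve also appears in the new bundle. It doesn't validate anything, it only spots a mismatch.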

The solution for this issue is to update the certificate chain being served by the server, which most organisations seemed to do by simply renewing their certificate and using the certificate bundle that came with it. This did overwhelm Let's Encrypt a little, where I saw a slowdown in the issuance process during this time, but nothing too drastic and no unavailability.

The Second Issue - 30 Sep 2021 14:01:15 UTC

The main event was of course the expiration of the DST Root CA X3 and that came not long after the expiration of the R3 Intermediate Certificate.

This was when I really expected to start seeing issues but at this stage there were already regular reports of problems created by the intermediate expiring, so they had to be separated.

The first major problem that was reported widely was that Google Cloud Monitoring was reporting down across the board for any server using a Let's Encrypt certificate. This was quickly followed by reports that Microsoft Azure Application Gateway was no longer connecting to servers using a Let's Encrypt certificate either. Both of these problems caused outages and I was getting a flurry of tweets and DMs with screenshots of dashboards saying that everything was reported as offline. There was then a regular stream of reports from various different platforms about these issues surfacing in a variety of different ways. I've tried to keep the Links section further down updated with companies and status pages that detail the problems that were seen, but it seems that most of these issues were caused by devices/servers running outdated software, mainly OpenSSL.

Unfortunately, the fix for this issue is usually a little more involved, ranging from a software update through to a firmware or OS update.

The problem comes back to a similar concern around the client being outdated in some way and not having the new ISRG Root X1 installed, meaning it can no longer validate certificate chains as it has no Root CA to anchor on.
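If you're not sure whether a particular client has the new root, you can look through its trust store. Here's a small sketch for Python's own default trust store. The `has_root` function is a hypothetical helper that walks the list of dictionaries returned by `SSLContext.get_ca_certs()`. Be aware that on systems where certificates are loaded lazily from a directory, that list can come back empty even though the root is available, so treat a negative result with caution.

```python
import ssl


def has_root(ca_certs, common_name):
    """Return True if any certificate in the list has the given subject common name."""
    for cert in ca_certs:
        # The subject is a tuple of RDNs, each a tuple of (key, value) pairs.
        for rdn in cert.get("subject", ()):
            for key, value in rdn:
                if key == "commonName" and value == common_name:
                    return True
    return False


if __name__ == "__main__":
    store = ssl.create_default_context().get_ca_certs()
    print("ISRG Root X1 present:", has_root(store, "ISRG Root X1"))
```

This only tells you about the trust store Python is using on that machine. Other software on the same device may ship its own store and could give a different answer.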

Summary

Overall, I think the expiration of the Let's Encrypt CA certificates went really quite well, largely due to the work Let's Encrypt did around arranging for a new cross-signed chain to be available beyond the expiration of the IdenTrust root. That said, there were far more issues in areas we didn't anticipate. Modern devices, all the way through to the latest versions of iOS and macOS, hit issues when connecting to servers that had a misconfigured certificate chain, and seeing quite serious issues from huge companies like Google and Microsoft, whose cloud products could no longer validate certificate chains, was surprising to say the least. In all, I think this just highlights something that many of us that work in this space have known for some time: TLS/PKI are complex and fragile systems that often go overlooked for long periods of time because they 'just work' most of the time. It's very similar to the huge Facebook/WhatsApp/Instagram outage that just spread across the entire world for several hours because of an issue with BGP. Many people look at that and wonder how it can be so fragile, or how it can be so simple to totally break stuff.

One thing that's certain is that this event is coming again. Over the next few years we're going to see a wide selection of Root Certificates expiring for all of the major CAs and we're likely to keep experiencing the exact same issues unless something changes in the wider ecosystem.

Here are the highlights of companies/products that had issues which appear to have been caused by the R3 Intermediate Certificate expiring, including links to incidents where available:

Fortinet | Cisco Umbrella | Catchpoint | BlueCoat | PaloAlto | Guardian Firewall | Monday.com | OPNsense

Here are the highlights of companies/products that had issues which appear to have been caused by the DST Root CA X3 expiring, including links to incidents where available:

OVH | Shopify | Xero | InstaPage | Heroku | Netlify | Cloudflare | MailGun | Sophos | cPanel | AWS | DigitalOcean | OpenSSL (<1.0.2) | Wikimedia

If you have any links to incidents or information about other companies that had issues because of these expiring certificates, please drop them in the comments below!