It's been 6 months since my last crawl of the Alexa Top 1 Million, so it's time to dust off my servers and fire them up again! Here are my latest observations of security on the top 1 million sites on the Web.
What I'm looking for
I use these crawls to get a feel for how security is progressing on the web and how quickly different technologies are being adopted and deployed. To do this, I analyse the top 1 million sites, as ranked by Alexa, and check for the following:
I refer to a particular set of HTTP response headers as 'security headers' and they include:
These headers allow sites to configure important, and in some cases essential, security protections for visitors.
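As a rough illustration of what checking a response for these headers can look like, here is a minimal sketch. The header list is an assumption inferred from the abbreviations used later in this post (CSP, STS, PKP, XCTO, XXSSP), and `find_security_headers` is a hypothetical helper, not the crawler's actual code.

```python
# Assumption: abbreviation -> header name, inferred from the abbreviations
# that appear in this post. The real crawl may track more headers than this.
SECURITY_HEADERS = {
    "CSP": "Content-Security-Policy",
    "STS": "Strict-Transport-Security",
    "PKP": "Public-Key-Pins",
    "XCTO": "X-Content-Type-Options",
    "XXSSP": "X-XSS-Protection",
}


def find_security_headers(response_headers):
    """Return {abbreviation: header value, or None if the header is absent}."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    lowered = {name.lower(): value for name, value in response_headers.items()}
    return {
        abbr: lowered.get(name.lower())
        for abbr, name in SECURITY_HEADERS.items()
    }


# Hypothetical response headers from a single site.
example = {
    "content-type": "text/html",
    "strict-transport-security": "max-age=31536000",
    "X-Content-Type-Options": "nosniff",
}
found = find_security_headers(example)
present = sorted(abbr for abbr, value in found.items() if value is not None)
print(present)  # ['STS', 'XCTO']
```

Counting how many sites have each abbreviation present would then give per-header totals like the ones discussed in the results.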
The use of HTTPS on the web becomes more and more important every day. My crawl checks for sites that redirect users to HTTPS when visiting them.
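One way to express the "redirects to HTTPS" check is to compare where a request started with where it ended up. The sketch below assumes the crawler has already recorded the list of URLs visited while following redirects; `redirects_to_https` is a hypothetical helper and not necessarily how my crawl does it.

```python
from urllib.parse import urlsplit


def redirects_to_https(redirect_chain):
    """Return True if a request that began over HTTP finished on HTTPS.

    redirect_chain is the ordered list of URLs visited, starting with the
    original http:// URL and ending with the final URL after all redirects.
    """
    if not redirect_chain:
        return False
    first = urlsplit(redirect_chain[0])
    last = urlsplit(redirect_chain[-1])
    # urlsplit lower-cases the scheme, so "HTTP://" and "http://" match.
    return first.scheme == "http" and last.scheme == "https"


print(redirects_to_https(["http://example.com/", "https://example.com/"]))
print(redirects_to_https(["http://example.com/", "http://www.example.com/"]))
```

With the `requests` library, for instance, the chain could be built from `[r.url for r in response.history] + [response.url]` after a normal `GET` that follows redirects.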
Whilst I already track the use of various HTTP response headers during the crawl, I also get the score for the site from the securityheaders.io API to give a simple grade that's easier to digest.
These numbers were really promising but I was still seeing incredibly low usage of some of the security features. For example, there were only 2,764 sites using CSP in February 2016 out of 1,000,000. This is incredibly low at only 0.2942% of the top 1 million and only double what I'd seen 6 months earlier. I was really interested to see how the rate of adoption had changed in the last 6 months and was hoping to see significant progress.
August 2016 Results
After firing up my crawlers and feeding them the latest copy of the Alexa Top 1 Million the results are in for August 2016. Some of the trends and oddities found in previous scans were still present and it's good to see that adoption has increased across the board for everything again. That said, there is one worrying thing that stood out.
As you can see, the numbers do look good and headers like CSP, STS and PKP have seen a healthy increase in deployment. The only problem is the rate at which they're being deployed has slowed by quite a large amount. This is a worrying thing to note when some of them haven't even cracked 1% of the top 1 million sites. We're still seeing new deployments, just nowhere near as many as before.
The one thing that did buck this trend was the deployment of HTTPS. In fact, HTTPS was deployed at an even faster rate than it was in the previous scan period. The results showed 62,043 sites with HTTPS in September 2015, then 88,199 sites with HTTPS in February 2016 and 129,149 sites with HTTPS in August 2016!
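To put numbers on that acceleration, here is a quick bit of arithmetic on the three counts quoted above. The `period_growth` helper is just an illustrative name for the calculation.

```python
# HTTPS site counts from the three scans quoted above.
https_sites = [
    ("September 2015", 62_043),
    ("February 2016", 88_199),
    ("August 2016", 129_149),
]


def period_growth(counts):
    """Return (scan label, sites added, percentage growth) for each scan after the first."""
    results = []
    for (_, prev), (label, curr) in zip(counts, counts[1:]):
        added = curr - prev
        results.append((label, added, round(100 * added / prev, 1)))
    return results


for label, added, pct in period_growth(https_sites):
    print(f"{label}: +{added:,} sites ({pct}% growth)")
```

This works out to 26,156 new sites (about 42.2%) between the first two scans and 40,950 new sites (about 46.4%) between the last two, so growth was higher in both absolute and relative terms.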
Beyond the obvious privacy and security gains of using HTTPS there are many other reasons that sites are migrating. I covered a few in my article Still think you don't need HTTPS? and I have another article in the works covering the huge push towards HTTPS so keep your eyes out for that!
The wider view is still quite good and the numbers are going in the right direction. We can still see the huge spike in adoption of these technologies at the very top end of the Alexa ranking, followed by a sharp decline and then a steady tail off towards the lower end. Have a look at the results of all 3 scans for comparison.
The Y axis has had to scale up as all of the metrics are improving, but some are clearly making better progress than others, HTTPS being the most notable improvement. It's also interesting to see that the trend with XCTO and XXSSP is still present in that their usage increases as you go further down the Alexa ranking in stark contrast to all of the other metrics. I still don't have a conclusive explanation for this!
Just like the last scan, I also ran the crawl through the securityheaders.io API to see what kind of scores the top 1 million sites get.
Whilst there has been a marginal improvement in these results, they're pretty much just as bad as they were 6 months ago. The overwhelming majority, at almost 86%, scored the worst possible grade of F. We still have some room for improvement! If you haven't seen securityheaders.io, it's my free HTTP response header scanning service. Check it out!
Just like I did with my previous scans, I'm releasing all of the raw data for you to use should you want it. It's licensed under the same CC BY-SA 4.0 license as my blog. The Google Sheet can be found here and contains all of the tables and graphs you see throughout all of my articles. If you want to download the raw scan data, grab it here:
As always, feedback and comments are welcome below!