It's time for the 4th instalment of my Alexa Top 1 Million scan and I've added a heap of new metrics to the crawler for analysis. On top of this there are also some other exciting announcements. Let's dig in!
I've done 3 previous crawls before now and they were Aug 2015, Feb 2016 and Aug 2016. They've shown some awesome trends in our adoption of security headers and HTTPS and I've also made several improvements to my crawlers along the way. These latest results are literally fresh off the press as I'm now running my crawl every single day, but more on that later, so let's dig in.
It's really awesome to see that we are still making great progress towards securing the web and the most recent results back that up. Here is a quick glance at the headline figures from the latest scan.
There are some really good numbers in here and the first and hopefully most obvious is the absolutely enormous jump in the adoption of Content Security Policy. In the last 6 months there has been a 166% increase in the number of sites deploying CSP in the Alexa Top 1 Million which represents a great success. There are countless great things that you can do with CSP but it looks like the most popular things right now are helping sites migrate to HTTPS / fixing mixed-content and framing/clickjacking protection (see data later). On the subject of HTTPS and migrations, the other really important metric I'm tracking is how many of the sites actively redirect from HTTP to HTTPS and I'm glad to say we're still seeing significant progress being made there too! With a 45% increase in the number of sites redirecting to HTTPS we're now just a whisker away from having 20% of the Top 1 Million sites using HTTPS. This is how that progress looks over the last 2 years.
Of course, the original purpose of me starting these scans was to track the use of security headers across the Top 1 Million and we're still seeing positive indicators across the board there too.
All of the familiar trends from previous scans are still present, with high adoption at the top end of the ranking and then a drop in adoption as you move down through the lower ranked sites. There's also the same trend breakers present in this scan, the XXP and XCTO headers, that buck the trend and actually increase in use as you move down the ranking. There's still no solid explanation for that!
A year ago I also added code to track the usage of Let's Encrypt as a CA when I was performing the crawl. At the time they did have a pretty small presence but that has change, a lot!
The lower usage at the top end seems to be fairly expected with all of the really large sites probably having commercial agreements with 'big' CAs but with growth like that, I look forward to their August 2017 results to see what they can do in the next 6 months!
Crawler overhaul and regular scans
I've tweaked and improved the crawler as I've run each of these scans to add new features and make it more efficient. Over the Christmas break though I wanted to take it up a notch and completely re-wrote my crawler with the intention being to run a crawl of the Alexa Top 1 Million every day. Since December I've been refining and improving this process and I'm now reliably crawling the entire 1 million sites and storing the data every single day! The crawlers log basically everything they do to a MySQL database and once they're all done it produces a nice summary of the crawl for me:
Total Rows: 938133 Security Headers Grades: A 1763 A+ 678 B 26910 C 716 D 56509 E 81973 F 769504 R 80 Sites using content-security-policy: 11010 Sites using content-security-policy-report-only: 1435 Sites using x-webkit-csp: 368 Sites using x-content-security-policy: 882 Sites using public-key-pins: 501 Sites using public-key-pins-report-only: 74 Sites using x-content-type-options: 90333 Sites using x-frame-options: 95774 Sites using x-xss-protection: 71966 Sites using x-download-options: 6952 Sites using x-permitted-cross-domain-policies: 6935 Sites using access-control-allow-origin: 27840 Sites redirecting to HTTPS: 187245 Sites using Let's Encrypt certificate: 31032 Top 10 Server headers: Apache 206396 nginx 138860 cloudflare-nginx 86344 Microsoft-IIS/7.5 36266 Microsoft-IIS/8.5 31501 LiteSpeed 20643 GSE 18443 nginx/1.10.2 14625 nginx/1.10.3 13944 Apache/2.2.15 (CentOS) 13013 Top 10 TLDs: .com 459161 .net 49756 .ru 48977 .org 45171 .de 24226 .jp 18958 .uk 15296 .br 13627 .ir 13352 .in 12784 Top 10 Certificate Issuers: Let's Encrypt Authority X3 31030 COMODO RSA Domain Validation Secure Server CA 27372 COMODO ECC Domain Validation Secure Server CA 2 19094 Go Daddy Secure Certificate Authority - G2 18606 RapidSSL SHA256 CA 8991 GeoTrust SSL CA - G3 4876 AlphaSSL CA - SHA256 - G2 4404 Symantec Class 3 Secure Server CA - G4 4271 Symantec Class 3 EV SSL CA - G3 4200 RapidSSL SHA256 CA - G3 4067 Top 10 Protocols: TLSv1.2 171723 TLSv1 7945 TLSv1.1 208 SSLv3 1 NULL 0 Top 10 Cipher Suites: ECDHE-RSA-AES256-GCM-SHA384 75448 ECDHE-RSA-AES128-GCM-SHA256 50357 ECDHE-ECDSA-AES128-GCM-SHA256 19963 ECDHE-RSA-AES256-SHA384 11152 DHE-RSA-AES256-GCM-SHA384 3631 DHE-RSA-AES256-SHA 3036 ECDHE-RSA-AES256-SHA 2538 AES256-SHA256 1882 AES128-SHA 1872 AES256-SHA 1843 Top 10 PFS Key Exchange Params: ECDH, P-256, 256 bits 154384 DH, 1024 bits 6059 ECDH, P-521, 521 bits 3508 ECDH, P-384, 384 bits 3291 DH, 2048 bits 1172 DH, 4096 bits 159 ECDH, B-571, 570 bits 50 ECDH, brainpoolP512r1, 512 bits 4 DH, 768 bits 3 DH, 3072 bits 2 Top 10 Key Sizes: RSA 2048 bit 146817 ECDSA 256 bit 20046 RSA 4096 bit 11233 RSA 1024 bit 237 RSA 3072 bit 86 ECDSA 384 bit 55 RSA 8192 bit 8 RSA 3248 bit 5 RSA 2049 bit 4 RSA 4056 bit 3
Alongside this summary I get individual files listing every site that uses each feature, like CSP, and their configurations, so I can see what people are commonly configuring.
For example, csp-values.txt shows me the most popular CSP configuration at a glance and the first few entries give an indication to what a lot of sites are using CSP for.
Values for content-security-policy: upgrade-insecure-requests 838 frame-ancestors 'self' 377 frame-ancestors 'self' ; 286 frame-ancestors 'self'; 233 default-src https: data: 'unsafe-inline' 'unsafe-eval' 97 frame-ancestors 'none' 86
The caList.txt file also shows all of the certificate providers I encountered during the crawl and identifies the name of the intermediate that signed the leaf.
Certificate Issuers: Let's Encrypt Authority X3 31030 COMODO RSA Domain Validation Secure Server CA 27372 COMODO ECC Domain Validation Secure Server CA 2 19094 Go Daddy Secure Certificate Authority - G2 18606 RapidSSL SHA256 CA 8991 GeoTrust SSL CA - G3 4876
As you can see, Let's Encrypt are now the largest issuing intermediate in the Alexa Top 1 Million but if you combine the total count for COMODO's RSA and ECC intermediates then they are the largest by count. I wonder how long that will last with the raging success that Let's Encrypt are seeing?
Opening up the data
Just like all of my previous scans I've put the data for this scan into my publicly available Google Doc Spreadsheet, Alexa Top 1 Million Scan Results. That's a great reference and quickly/easily used by most people but now the crawler is running daily and I'm logging a whole bunch of data I wanted to open things up even more. To that end I'm making available a zip file containing the entire output from a crawl, specifically the one this blog is based upon, the 7th Feb 2017. Now, the file is pretty large, weighing in at 1.2Gb compressed, but it contains everything!
Download Raw Data
Data for all scans is now available here.
I'm opening it up under the same license as my blog and I'd love to see what else people can do with it. There is a lot of data in there and I'm sure there are all kinds of interesting things that people could do with just some basic SQL query skills. Please keep me apprised in the comments below if you do anything awesome with it. As I'm running these scans daily now I could also do with a good solution to host the data if there's demand for access to it. Right now it rsyncs down to my local NAS server at home where I have a lot of free space, but I can't host that publicly for easy access. At ~1.2Gb per day it's an interesting problem but I am happy to open this data up to the world, if you can help out with hosting it, please get in touch.