almanac.httparchive.org
almanac.httparchive.org copied to clipboard
Security 2024 Queries
Queries for the 2024 Security Chapter
General
- [ ] Maybe add subfolders for the sections or prefix each query with their subsection to improve the structure (50 queries in one folder are quite a mess)?
- [ ] Maybe remove some unused queries as they can distract?
Existing Queries
- [x] Copy all queries from the 2022 chapter
- [x] Update all of them to use only
httparchive.all - [x] Add annotations to queries missing annotations
- [x] Other structure/style improvements (preamble:
Section: ? Question: ?) - [x] Fix security.txt query to contain all documented fields (and not to heavily over report! 2022 said ~5% sites use security.txt, however most of these files are HTML pages and it is more like 1% of sites that use it which is similar to https://www.uriports.com/blog/security-txt/)
- [ ] ~~Fix sensitive values in robots.txt query~~ Query is already functional?
New Queries
- [x] List all new queries (according to the planning document as of July 16)
- [x] Implement all new queries
Attack Preventions
Security Header Adoption
- [x] Usage of
document.domainsetters and getters (Blink Feature Usage + Origin Agent Cluster) - [x] General Usage of
Timing-Allow-Origin - [x] Value Distribution of
Timing-Allow-Origin
Preventing Attacks using Cross-Origin Policies
- [x] ~~Value Distribution
CORP~~ Already existed - [x] Value Distribution
COEP - [x] Value Distribution
COOP - Also added queries for XFO (CSP already existed)
HTML Sanitization
- [x] General adoption of
setHTMLUnsafeandparseHTMLUnsafe(Blink Feature Usage) - [ ] ~~% Sites using above APIs of all sites that use shadow DOM~~ Usage is too rare: 6x parse, 2x set in total!
Drivers of Security Mechanism Adoption
Website Categories
- [x] % Sites use HTTPS per website category
- [x] % Sites use security headers per website category
- [ ] ~~% Sites use SRI per website category~~ We currently do not even have queries for SRI related to rank, country, ...; should we add website category here?
- [ ] ~~% Sites use bot protection services per website category~~ We currently do not even have queries for bot protection related to rank, country, ...; should we add website category here?
Security Misconfiguration and Oversights
- [x] ~~Unsupported policies in
<meta>tag~~ Already exists:meta_policies_allowed_vs_disallowed- [x] CSP, CSP-ro, XFO, HSTS, ...
- [ ] Discrepancies for policies definition in both header and
<meta>tag - [ ] More advanced query looking into the values (i.e., check for e.g., frame-ancestors, sandbox, report-uri, ...
- [ ] CSP vs XFO?
- [x]
Server-Timingheader: general usage,durproperty, ....
Thanks for the implementations! I took a look over the queries, and here's a few of my thoughts (also as response to the notes here and there)
-
bot detection: I would say we can just keep HSTS and Really Simple SSL and just select the values we'd like during writing. Same goes for entries like "bugcrowd" etc.
-
cryptominer usage: A year or two is usually used in previous editions as far as I know. Two years would be great imo because the last almanac used data up until July 2022, which we can continue from. Usually the usage is very low but sometimes the usage spikes, which might be interesting and if that would be the case we may find a good explanation (like last edition). I would propose to still run this query for 2 years worth of data and to then see what the result is before deciding whether to include it into the chapter or not.
-
csp allowed host frequency: The CSP header allows hostnames without scheme (or other schemes such as wss://), we should think whether we want to include those or not. It is my understanding that this currently does not happen and I think we should at least mention that in case we use the current version of the query. Parsing will be more difficult, but there might be hosts that are included much more frequently without scheme. Example: github.com sends a CSP response header in which none of the domains have a scheme apart from wss://.
It is not unthinkable that certain sites/frameworks/plugins/platforms/... in general do not include schemes. I wonder for how many sites "https?://" is in the CSP header at all. Not having "https?://" could also mean they just don't use allowlists of hosts but better options like strict-dynamic and nonce/hash (which is recommended).
-
csp number of allowed hosts: Same remark as previous (sheme-less hosts in CSP).
-
documentdomain usage: I'm not sure I understand the note: "the header is very rare and yet this value is very low". Would that not be normal? Rare header == low number of urls == low usage value?
-
security adoption by category: We should LOWER the header names so we don't accidentally miss some headers.
-
security adoption by rank: Same remark as previous (LOWER header names)
-
server timing usage values: It would be interesting to have the fractions of properties besides the 'dur' property to gauge which info may be included in the header.
For all other queries I had no remarks really. Let me know what you think. (CC @GJFR)
- Bot Detection: yes we can simply filter them out later (no change)
- Cryptominer Usage: The usage in June 2024 is a total of 53 pages on mobile and 41 pages on desktop which is nothing (total pages: 15M+) (https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=2083415086#gid=2083415086) (~~I can change the query to start in 2022~~ done)
- CSP hosts: Yes, we could come up with another regex/parsing to allow for hosts that do not start with https (Maybe change parsing of CSP hosts)
- Documentdomain: The note might be confusing. Here is the data: https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=150039899#gid=150039899 DocumentSetDomain has a count of
484310, DocumentDomainSettingWithoutOriginAgentClusterHeader has a count of only804, this would suggest that most pages that use DocumentSetDomain have an OriginAgentCluster. However, this is not the case, as we can see from other queries that almost 0 pages have an OriginAgentCluster header. I checked the data and it seems like starting from20230201they only count DocumentDomainSettingWithoutOriginAgentClusterHeader if the page has an OriginAgentCluster that does not allow for using SetDomain, in all other cases including no-header, they do not count that counter up. (~~Improve the note text~~ done) - Security by category/rank: True, they should be lowercased as was done in the other queries. (~~Lowercase header-names x2~~ done)
- Server Timing: Here are the results of the query: https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=1339089790#gid=1339089790 This could also be tracked by the privacy team as it no a security header per se? (Maybe extract/count other properties of the server-timing header)
Thanks a lot for writing the queries @JannisBush, going over the results was already very interesting! And especially thanks for updating all existing queries to use the new tables.
Some remarks:
- CSP hosts: I just performed a small test query with regex
(?i)(wss*://[^\s;]+)[\s;]instead of thehttps*one. Frequency ofws*hosts is very low compared tohttps*and imo wouldn't make sense to use in the same query (wss*hosts would almost be truncated out of the resulting table anyway). Highest relative frequency:wss://*.intercom.ioandwss://*.hotjar.comwith both +/- 0.07%. Might be interesting to still mention separately and would warrant a separate query forwss*, though? - Server Timing: I agree with @vikvanderlinden that it would make more sense to consider the properties alone, without timing values (they undermine the
group by). Would it be feasible to extract every property? If not, a cheap improvement would be to just omit all numeric values from the string (remaining caveat is the different order of properties in headers)? What do you both think? - Cookie Age Negative: Results indeed differ quite a lot from 2022. I looked at the query in detail, but I can't find any issues. Other cookie-related queries seem to differ less from 2022 results.
- It appears from
Iframe Attributes Usagethat usage ofallowdoubled compared to 2022. It would be nice to check the change in usage of time. Could you add such a query similar toHttps Request Over Timefor the usage ofallowandsandbox?
I've added some additional derived tables to the results sheets to provide more insights into the data. When running queries again, could you avoid overwriting the full tabs? It's only a few tabs, though.
I will finish reviewing the last few queries tomorrow.
Went though the remainder of queries:
- I think I found a typo: two queries contain
date = 2022-06-09while it should bedate = 2022-06-01?- https://github.com/JannisBush/almanac.httparchive.org/blob/cbcb2a7ea833de7fad881ca7afa689112bf6ce27/sql/2024/security/security_headers_prevalence.sql#L22
- https://github.com/JannisBush/almanac.httparchive.org/blob/cbcb2a7ea833de7fad881ca7afa689112bf6ce27/sql/2024/security/server_information_header_prevalence.sql#L21
- There is something off about the results of Tls Ca Issuers Pages: second largest issuer is an empty string (accounts for about 13% of all pages), which is quite a lot (this was 0.03% in 2022). I also found additional empty string issuers on row 151 and 180, which is weird because
GROUP BYshould have merged all empty string issuers. Maybe something went wrong during transfer of results from bigquery to sheets? - What does
count_foundmean in the results of Well Known Security? Since there is also ahas_security_txt, I assume it doesn't represent all websites with asecurity.txtendpoint?
CSP hosts: We can add another query for WSS or non-https hostsServer Timing: We can split at "," first and then take everything before the first ";", then take anything until the first "=" and that should give us all the metric names. If you also want to extract the descriptions and values, it will get more complex.Cookie Age Negative: I started to run the new query on the old data to check if there might be an issue with the queryIframe attribute usage: We can add a new query that tracks usage of allow and sandbox over time2022-06-09: This is on purpose, the2022-06-01entries forall.requestsis broken and the correct entries for that year/month are2022-06-09TLs Ca Issuers: As written in the notes, the query currently includes HTTP Pages (empty issuers, 13%) as well. We could filter that out (for the comparison to the old data), however, I think for the future including the amount of HTTP Pages in that query directly makes sense. Row 151 and 180 have 0.02% and are probably HTTPS Pages without an issuer (the same as the old 0.03% entries). I guess in bigquery, the HTTP Pages have a NULL issuer and the HTTPS Pages without an issuer have an empty string, so they are not merged together. However, when copying the data to sheets the difference between NULL and the empty string is lost.count_found: Count found is what the custom metric reports as found which is the final response to./well-known/security.txtreturns a status code of 200, this includes a lot of false positives that return non-security.txt files with code 200 at that URL (probably a similar number as https://almanac.httparchive.org/en/2022/security#detecting-status-code-reliability).has_security_txthas the additional constraint that thecontent-typeheader of the response has to start withtext/plain(which it MUST according to the spec: https://datatracker.ietf.org/doc/html/rfc9116#section-3). I did quite some experiments (see the outcommented queries in the file: https://github.com/HTTPArchive/almanac.httparchive.org/blob/cbcb2a7ea833de7fad881ca7afa689112bf6ce27/sql/2024/security/well-known_security.sql). My verdict of the results are that requiringtext/plaingets rid of the vast majority of FPs while at the same time only introducing a negligible amount of FNs.- I am not automatically overwriting any data in the sheets.
- Data for the new cookie query on the old data (
2022-06-09): - The numbers for Max-Age are basically the same. The differences is in
Expires(which then transfers toReal Age) - I had the note
Some of the percentages are quite different to the old query; one of both might be broken (difficult to compare as both cannot operate on a shared dataset)in the query. So yeah, something is going on here.
client count_negative_max_age pct_negative_max_age num_max_age_pages pct_max_age_pages num_max_age_hosts pct_max_age_hosts count_negative_expires pct_negative_expires num_expires_pages pct_expires_pages num_expires_hosts pct_expires_hosts count_negative_real_age pct_negative_real_age num_real_age_pages pct_real_age_pages num_real_age_hosts pct_real_age_hosts
desktop 3027635 0.085079712693773929 391368 0.18646223068797993 164603 0.14661195391159976 5920452 0.06190244069067101 1110544 0.29486378523190565 303714 0.16427975276414852 5761149 0.056308908960075628 1069189 0.28305686557625642 264131 0.13890066271136348
mobile 2752659 0.061759115713496793 481595 0.16319871445790932 200027 0.1244405596580835 6068461 0.049396407189040661 1456445 0.27821213917970833 347559 0.13971620115235245 5938723 0.0449079900197829 1409598 0.26856166505420204 300906 0.11727684617708099
Maybe CAST(JSON_QUERY(summary, '$.startedDateTime') AS NUMERIC) in all.requests is not the same as startedDateTime in almanac.requests 🤔
Seems like there was some post-processing: https://github.com/HTTPArchive/almanac.httparchive.org/blob/a6042fe74ef09ff3fc71549a0de780c28e7c97a1/sql/util/requests.sql#L121 I tried to do the same post-processing, but got very different results. I will ask in slack.
For Server-Timing I think the metric names are most interesting. Additionally it would be interesting to check the count for the dur property.
Sort of like: total # pages, # with header, # with >= 1 dur, # with 1x dur, # with 2x dur, # with 3x dur, # with >3 dur.
And if we can get the property names, a table that shows the number of pages that have at least one of each property. I wouldn't prioritize this, but for instance:
| property | # pages with that property at least once |
|---|---|
| dur | 200 |
| desc | 150 |
| cache | 100 |
If that makes sense?
Maybe
CAST(JSON_QUERY(summary, '$.startedDateTime') AS NUMERIC)inall.requestsis not the same asstartedDateTimeinalmanac.requests🤔Seems like there was some post-processing:
https://github.com/HTTPArchive/almanac.httparchive.org/blob/a6042fe74ef09ff3fc71549a0de780c28e7c97a1/sql/util/requests.sql#L121
I tried to do the same post-processing, but got very different results. I will ask in slack.
Tested more and they both seem very similar and have the same format, so that might not be the reason for the drastic differences.
The only difference I could discover is that in all.requests around 0.8% of requests have an extracted startedDateTime that is nullwhereas this is never the case for almanac.requests. However, as this only applies to such a small number of requests, I don't think this can be the (only) reason for the large difference.
- Average startedDateTime: almanac: f0_ 1654908665.0546837, all: f0_ 1654883339.0342844
SELECT
AVG(startedDateTime),
AVG(CAST(JSON_QUERY(r2.summary, '$.startedDateTime') AS INT64))
FROM
`httparchive.almanac.requests` as r1,
`httparchive.all.requests` as r2
WHERE
r1.date = '2022-06-01'
AND r2.date = '2022-06-09'
AND is_root_page
- 0.8% of
all.requestshave astartedDateTimethat is NULL
SELECT
COUNTIF(CAST(JSON_QUERY(summary, '$.startedDateTime') AS INT64) IS NULL) / count(0) as startedDateTime
FROM
`httparchive.all.requests` TABLESAMPLE SYSTEM (10 PERCENT)
WHERE
date = '2022-06-09' AND
is_root_page
- 0% of
almanac.requestshas a startedDateTime that is NULL
SELECT
COUNTIF(startedDateTime IS NULL) / count(0)
FROM
`httparchive.almanac.requests` TABLESAMPLE SYSTEM (100 PERCENT)
WHERE
date = '2022-06-01'
- I added the CSP WSS host query (and added the results to the sheets)
- Iframes attributes: updated the query and added the results to sheets (only for some months as the query is huge), seems like the allow attributes only surged in 2024-06-01 🤔
- Server Timing: I have the parsing step implemented. Now, we only need to create the aggregate statistics on it.
I finished the server-timing query and it now contains the information you wanted. The data is below the original data in the sheet: https://docs.google.com/spreadsheets/d/1b9IEGbfQjKCEaTBmcv_zyCyWEsq35StCa-dVOe6V1Cs/edit?gid=1339089790#gid=1339089790
Thanks a lot for the additional queries and clarifications @JannisBush! Interesting to see the additional data on CSP hosts and iframe allow values.
-
About the discrepancy for
Expiresvalue for cookies: I also don't seem to find a cause for it. I think it's best to leave it as is now. Should we still report on this data in our article, we will also explicitly mention the issue. -
I'm not 100% sure, but I think @vikvanderlinden is mainly interested in the
Server-Timingattributes (not sure if this is the right term). For example:- If the header is
cfRequestDuration;dur=22.000074, then we should disregardcfRequestDurationand count this header towards the group of headers where only onedurattribute is used. So the value ofdur(22.000074) would also be disregarded. - If the header is
cache;desc=hit, varnish;desc=hit_hit, dc;desc=fastly_g, this will be part of the group of headers with threedescattributes.
- If the header is
@vikvanderlinden, please correct met if I'm wrong about your intention for the Server-Timing header.
The new query counts the number of dur and desc attributes. It reports the average number of these attributes per host, as well as the exact number of hosts with exactly 1, 2, 3 or more dur attributes.
Yes, that and if there is for instance
db;dur=4cache;dur=0;desc=hit, varnish;desc=hit_hit, dc;desc=fastly_g;loc=eu-w-1cache;dur=hit, varnish;desc=hit_hit, dc;desc=fastly_g
on different pages, we could also count:
| property | frequency |
|---|---|
| dur | 3 |
| desc | 2 |
| loc | 1 |
And truncate to 50 results of so to gauge how many site use certain properties, which may or may not show so often occurring properties besides dur and desc.
@GJFR is this good to merge now?
Yes, should be complete 👍