
Considerations for fighting spam and bots with ua-client-hints

summercms opened this issue 5 years ago • 22 comments

I've seen a few issues in this repo mention spam, fraud, and protection, but none actually gives a proper example.

Below is a screenshot of a real example, taken a few days ago. As you can see from the user-agent, it's quite clear this is not a real person but a bot, accessing the website once a day and using the exact same user-agent from different IP addresses:

[Screenshot: access log showing the same user-agent hitting the home page once a day, from a different IP address each time]

Note: On day 4 our system flagged it as a hacker and our firewall then blocks that pattern to stop future attacks. The bot went to the home page only and used the exact same user-agent, once a day! Easy to spot this robotic behaviour.
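The pattern in that screenshot can be expressed as a simple log heuristic. A sketch (hypothetical log schema, not our actual firewall code):

```python
from collections import defaultdict

def flag_daily_ua_rotation(log, min_days=3):
    """Flag user-agents seen on several distinct days, once per day,
    from a different IP each time, always requesting the same path.
    `log` is a list of (day, ip, path, user_agent) tuples -- a made-up
    schema for illustration."""
    by_ua = defaultdict(list)
    for day, ip, path, ua in log:
        by_ua[ua].append((day, ip, path))
    flagged = set()
    for ua, hits in by_ua.items():
        days = {d for d, _, _ in hits}
        ips = {i for _, i, _ in hits}
        paths = {p for _, _, p in hits}
        # one visit per day, a new IP each time, always the same single path
        if (len(days) >= min_days and len(days) == len(hits)
                and len(ips) == len(hits) and len(paths) == 1):
            flagged.add(ua)
    return flagged
```

In a real firewall this would be one weak signal among many, not a block rule on its own.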

I understand, and I love to hear that things to do with privacy are being improved.

BUT

Reducing information to better protect users' privacy MUST be matched by tools to fight spam and bots. Likewise with all the blackhats flooding our websites every day scanning for vulnerabilities!

My point is that Sec-CH-UA should still provide information that will help us fight spam and vulnerability scanning!

Here's another real-life example of a Chinese botnet that is currently flooding the internet:

Mozilla/5.0 (Linux; Android 7.0; FRD-AL00 Build/HUAWEIFRD-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043602 Safari/537.36 MicroMessenger/6.5.16.1120 NetType/WIFI Language/zh_CN

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36 Mb2345Browser/9.0

Mozilla/5.0(Linux;Android 5.1.1;OPPO A33 Build/LMY47V;wv) AppleWebKit/537.36(KHTML,link Gecko) Version/4.0 Chrome/43.0.2357.121 Mobile Safari/537.36 LieBaoFast/4.51.3

Mozilla/5.0(Linux;U;Android 5.1.1;zh-CN;OPPO A33 Build/LMY47V) AppleWebKit/537.36(KHTML,like Gecko) Version/4.0 Chrome/40.0.2214.89 UCBrowser/11.7.0.953 Mobile Safari/537.36

Mozilla/5.0 (Linux; U; Android 8.1.0; zh-CN; EML-AL00 Build/HUAWEIEML-AL00) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/57.0.2987.108 baidu.sogo.uc.UCBrowser/11.9.4.974 UWS/2.13.1.48 Mobile Safari/537.36 AliApp(DingTalk/4.5.11) com.alibaba.android.rimet/10487439 Channel/227200 language/zh-CN

You can clearly see patterns in the above user-agents. The Sec-CH-UA spec needs to allow developers to still spot such patterns, yet help keep end-users safe from fingerprinting.
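As an illustration of the patterns above, here's a rough detector keyed on two of them: the malformed "AppleWebKit/537.36(KHTML,link Gecko)" token (missing spaces, "link" instead of "like") and the browser tokens seen in this botnet. Note that MQQBrowser and MicroMessenger are also legitimate Chinese browsers, so in practice these markers would feed a score, not a hard block:

```python
import re

# Sketch only: markers visible in the sample user-agents above.
BOTNET_MARKERS = re.compile(
    r"AppleWebKit/537\.36\(KHTML"    # missing space before "(KHTML"
    r"|KHTML,\s*link Gecko"          # "link" instead of "like"
    r"|Mb2345Browser/|LieBaoFast/|MQQBrowser/|MicroMessenger/"
)

def looks_like_botnet_ua(ua: str) -> bool:
    """True if the UA string carries any of the botnet markers."""
    return bool(BOTNET_MARKERS.search(ua))
```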

summercms · Jan 16 '20 05:01

The conformity of a User-Agent string to a large cohort of User-Agents representative of the web, combined with the IP address, is a major feature of all anti-fraud and security solutions.

As one example: when a new version of a browser or app becomes available for a device, the version number changes. Fraud toolkits don't stay current with the latest versions, whereas most devices upgrade automatically within a few days of release. Therefore, an outdated version on a specific model of device is a suspicious indicator.

This change will remove this important feature in fighting fraud without offering a replacement.

jwrosewell · Jan 16 '20 06:01

@ayumi-cloud to make sure I'm understanding the scenario here, am I correct in saying that most of the concern in the fraud detection space comes from the fact that UA client hints won't be delegated to subresource requests unless top-level domains set the appropriate Feature Policy?

Based on my current understanding, top-level domains will still be able to request additional browser information via the Accept-CH header, but any third-party scripts will be unable to access this information without an appropriate Feature Policy set.

If there are other concerns I'm missing, please let me know!

scottlow · Jan 17 '20 17:01

@scottlow my concern with my first example (with the above image), is as follows:

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.1.2222.33 Safari/537.36

Now becomes:

Sec-CH-UA: "Chrome 71"

Removing the semantic versioning information will make fraud detection even harder! My point in the first example was showing that Chrome 78.0.3904.87 was used every time. These patterns are important for our firewall to discover bad users.

In the spec it only shows major browser versions; I can't see anything being mentioned about MINOR or PATCH versions.

See here: https://wicg.github.io/ua-client-hints/#http-ua-hints

It only talks about the major version and not about the rest of the semantic version.

P.S. I can't remember where I saw it, but there was an example written by Yoav Weiss saying this spec will show the major version only and freeze browser versions.

I think I saw something like Sec-CH-Major: "73". I was hoping the spec could do something like this:

Sec-CH-Major: "78"
Sec-CH-Minor: "0"
Sec-CH-Patch: "3904"
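For reference, today's UA string already carries all four components. A quick sketch (Python, illustrative only) of splitting Chrome/78.0.3904.87 into the pieces those hypothetical headers would carry; note Chrome's scheme is actually MAJOR.MINOR.BUILD.PATCH, so "3904" is the build number rather than the patch:

```python
import re

def split_chrome_version(ua: str):
    """Extract (major, minor, build, patch) from a Chrome UA token,
    or None if the string carries no four-part Chrome version."""
    m = re.search(r"Chrome/(\d+)\.(\d+)\.(\d+)\.(\d+)", ua)
    return m.groups() if m else None
```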

summercms · Jan 17 '20 19:01

It becomes Sec-CH-UA: "Chrome 71" only if the site in question doesn't send an Accept-CH: UA header though, right?

So if there was a first-party domain that required more browser information to perform fraud detection, it could send this header and receive the additional information. Based on my read of the explainer, the fact that Client Hints are delegated via Feature Policy would have more impact here, since it means that embedded third-party resources used for fraud prevention would no longer be able to collect additional browser information unless the first-party site they're embedded in sets the appropriate Feature Policy.
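Roughly, the opt-in flow I'm describing would look like this (the full-version hint name and the Feature Policy token are illustrative; exact spellings have changed as the spec evolved):

```http
GET / HTTP/1.1
Host: example.com
Sec-CH-UA: "Chrome 71"

HTTP/1.1 200 OK
Accept-CH: UA, UA-Full-Version
Feature-Policy: ch-ua-full-version https://fraud-check.example

GET /next-page HTTP/1.1
Host: example.com
Sec-CH-UA: "Chrome 71"
Sec-CH-UA-Full-Version: "71.1.2222.33"
```

The first response opts in to the extra hint and delegates it to a (hypothetical) third-party fraud-check origin; subsequent requests then carry the full version.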

scottlow · Jan 17 '20 22:01

Correct me if I'm wrong, but this seems like a duplicate of #11.

In cases of suspected fraud, you'd be able to request the required hint, get the full version and use its entropy to deduce the user is the same one you've seen before.

Note though that this use-case is indistinguishable from user fingerprinting (even if slightly more coarse).

yoavweiss · Jan 27 '20 10:01

In cases of suspected fraud, you'd be able to request the required hint, get the full version and use its entropy to deduce the user is the same one you've seen before.

This assumes that the fraud/bot/spam detection system has the ability to send a second request and modify headers. Maybe @ayumi-cloud can elaborate on how flexible such systems are in terms of how much of the logic is server side vs. client side. Lack of information on the first navigation request will certainly degrade the quality of such systems.

jonarnes · Feb 07 '20 13:02

This assumes that the fraud/bot/spam detection system has the ability to send a second request and modify headers. Maybe @ayumi-cloud can elaborate on how flexible such systems are in terms of how much of the logic is server side vs. client side.

@jonarnes sure, I'll give some extra feedback. Our firewall and many other firewall platforms all try to do basically the same thing: balance security and performance. So things like TTFB (time to first byte) are very important to us! Creating extra requests is something we try to avoid.

Right now this spec seems quite buggy, and it's hard to test how things should behave. An embarrassing example is the Opera browser: testing Opera version 79 gives the following result:

sec-ch-ua: Chromium 79

Feedback from other people says there's nothing to stop bad bots and malware browsers (such as the Cheetah Security Browser - https://www.liebao.cn/index.html) from faking the sec-ch-ua result.

Plus, this time next year we're going to be processing double the amount of data (not less), as this spec suggests using user-agents for backwards compatibility and client hints for modern browsers. We will have no choice, as the user-agent will be frozen to force people into using client hints. Take note that IE11's end of life is in 2025, for example.

I understand the idea behind this is to deal with fingerprinting, but it seems to fall short on the security side of things.

Likewise, many people and organizations have doubts about Client Hints as a whole; here's an example: Brave's Concerns with the Client-Hints Proposal.

I would love to hear from professionals like Mike West or Troy Hunt on their opinions of this spec and how it affects firewall security companies.

summercms · Feb 07 '20 16:02

As @mikewest wrote the initial proposal, I'm assuming he's supportive.

yoavweiss · Feb 08 '20 08:02

Thanks @ayumi-cloud. So, it's fair to say that this proposal will likely have a negative impact on security and/or performance in the firewall/spam/bot/security business...

jonarnes · Feb 08 '20 09:02

Feedback from other people say there's nothing to stop bad bots and malware browsers such as (Cheetah Security Browser - https://www.liebao.cn/index.html as an example) from faking the sec-ch-ua result.

What are you looking for here? An April Fools evil bit? How could a web spec possibly prevent the browser from faking client hints the same way they currently fake existing user agent strings?

mcatanzaro · Feb 08 '20 15:02

@mcatanzaro

What are you looking for here? An April Fools evil bit? How could a web spec possibly prevent the browser from faking client hints the same way they currently fake existing user agent strings?

Neither user-agents nor the Client Hints UA are a valid solution to stop people from faking them. At some point, it would be good to address this issue and create a solution.

As I wrote in another post, it would be nice if this spec could also add support for reverse DNS lookups, so for example:

Sec-CH-UA: Googlebot; v=2

And reverse dns data:

Sec-CH-UA-DNS: Google LLC.

Above indicates a real GoogleBot.

Sec-CH-UA-DNS: Amazon.

Above indicates a fake GoogleBot.

  • To add another level of verification to exclude fake entities.
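For context, Google's documented way of verifying Googlebot is forward-confirmed reverse DNS rather than a new header: a PTR lookup on the requesting IP, a hostname suffix check, then a forward lookup that must resolve back to the same IP. A rough sketch (the network calls are shown for shape; the hostname check is the pure part):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_hostname(host: str) -> bool:
    # PTR answers sometimes carry a trailing dot; strip it first.
    return host.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, then the
    PTR name must resolve back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # reverse (PTR) lookup
    except socket.herror:
        return False
    if not is_google_hostname(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward confirmation
    except socket.gaierror:
        return False
```

The bot in the earlier example that claims to be Googlebot from an Amazon IP fails the suffix check, whatever headers it sends.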

summercms · Feb 08 '20 20:02

By what technical mechanism would the fake GoogleBot be prevented from sending the HTTP header Sec-CH-UA-DNS: Google LLC.?

mcatanzaro · Feb 09 '20 17:02

By what technical mechanism would the fake GoogleBot be prevented from sending the HTTP header Sec-CH-UA-DNS: Google LLC.?

I'll leave you to research reverse DNS lookups; your question is well covered by the existing documentation on them.

summercms · Feb 09 '20 19:02

@ayumi-cloud I think what @mcatanzaro is saying is that these are just headers sent by UAs, which can send whatever they want. You'd have to do an actual reverse-DNS lookup to verify anyway, so what would be the point of the extra header?

amtunlimited · Feb 10 '20 00:02

@ayumi-cloud: ...I understand the idea behind this, is to deal with fingerprinting but it seems to fall short on the security side of things.

I think the issue here is that much of the security side is identical to fingerprinting. I could be wrong, I'm not an expert. Can you provide some differentiators between fingerprinting and the security usefulness that the user-agent header provides, and we can work from there?

amtunlimited · Feb 10 '20 00:02

I believe this is addressed by the following use case, which is one we want to enable. Closing, but please let me know if more discussion is required and I'll re-open.

yoavweiss · Apr 15 '20 14:04

Re-opening due to a request

yoavweiss · Apr 15 '20 18:04

@yoavweiss thank you for reopening.

Anti-fraud solutions rely to a great extent on features being removed by this proposal. The advertising and publishing sectors in turn, particularly smaller publishers and advertisers, rely on these solutions to provide the scale and accuracy they require. @jonarnes, @ayumi-cloud and I have highlighted just some of the issues. The TAG review process would benefit from engaging with the CTOs of companies such as Impact, Neustar, TrustMetrics, Confiant, White Ops, Oracle, IAS (Integral Ad Science), Pixalate and others.

The client hints specification needs to contain considerably more detail about the alternative solution before this issue can be considered closed, unless it is the intention of the W3C to increase web fraud. The impact of increased fraud will be particularly harsh for smaller publishers and advertisers not operating within the walled gardens of a handful of US oligopolies.

All must recognise that robust engineering mandates phrases such as "something like" or "hope" be replaced with clarity before an issue can be considered closed.

There is a group of stakeholders starting to form in the W3C Web Advertising group. Engagement with this group concerning the business impact, justifications and alternatives associated with changes such as this one, which seek to remove long-established features from the web, will provide the TAG with invaluable insights.

jwrosewell · Apr 16 '20 14:04

Can you please outline why opting in to Client Hints to receive the exact same signals that the User-Agent string currently provides is not sufficient to counter web fraud?

yoavweiss · Apr 16 '20 14:04

I’ll provide a detailed example concerning the implementation of anti-fraud in the advertising funded web using a relatively small publisher picked at random.

Go to the following URL with developer tools -> network tab enabled. Make sure you’re not using an ad blocker.

https://www.givemesport.com/1563044-quiz-24-questions-that-only-proper-champions-league-fans-will-know

As you scroll down the page hundreds of additional HTTP requests are made to a myriad of third parties to support the provision of advertising. Some are provided by the publisher; others are provided by the advertiser who won the real time auction to display the advertising. Have a look at the nature of the various requests to see the sheer number of domains involved.

When I went to the page an advert was displayed for Procter and Gamble (P&G). As part of the advert payload P&G included a tracking pixel from a company called Moat, now part of Oracle. Here’s the URL for the tracking pixel I was sent.

https://px.moatads.com/pixel.gif?e=17&i=SMG_PROCTERGAMBLE_UKVIDEO1&hp=1&kq=1&hq=0&hs=0&hu=0&hr=0&ht=1&dnt=0&bq=8&f=1&nh=1&j=https%3A%2F%2Fwww.givemesport.com&lp=https%3A%2F%2Fwww.givemesport.com&t=1587055615704&de=185751837751&m=0&ar=b63606d9a9-clean&iw=b83b934&q=9&cb=0&ym=0&cu=1587055615704&ll=4&lm=1&ln=1&r=0&em=0&en=0&d=120936%3A9622%3A4403301%3Aundefined&zMoatGSR=1&ph=&pj=standard&zGSRC=1&gu=https%3A%2F%2Fwww.givemesport.com%2F1563044-quiz-24-questions-that-only-proper-champions-league-fans-will-know&id=1&bo=givemesport.com&bd=givemesport.com&zMoatOrigSlicer1=undefined&zMoatOrigSlicer2=undefined&gw=smgproctergambleukftvideo936432277912&fd=1&ac=1&it=500&ti=0&ih=1&pe=0%3A-%3A-%3A1725%3A754&fs=177454&na=1075323498&cs=0

There are a lot of complex rules concerning whether GiveMeSport will be paid by P&G for that advert. At a high level P&G will require the tracking pixel to have been displayed AND for Moat to consider the activity to be genuine.

Moat will use HTTP headers like User-Agent, specifically the diversity of devices, browser currency, operating system and explicit crawler identifiers as well as other information such as IP addresses to inform their decision. They will be capturing such information from thousands of similar websites and will have additional information about the click through rates for different devices, browsers, etc. Their secret sauce algorithms will then determine whether this particular advert was or was not a human and therefore whether GiveMeSport will be paid.
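Moat's actual algorithms are secret sauce; purely as an illustration of the "diversity of devices" signal, one could score the Shannon entropy of the user-agent distribution for a traffic source (a toy sketch, not anyone's real model):

```python
import math
from collections import Counter

def ua_entropy(user_agents):
    """Shannon entropy (in bits) of the user-agent distribution.
    Legitimate traffic from many visitors shows high diversity;
    a botnet replaying one UA string scores 0."""
    counts = Counter(user_agents)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A real system would combine this with IP diversity, click-through rates and the other data points described above.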

(Cookies also play a role but in the interests of brevity I’m skipping over that.)

In many of the examples on the page mentioned there is no second request. In fact the request is merely for a 1x1 pixel. Any solution involving second HTTP requests will not work. Ad operations teams spend a huge amount of time working out what was and was not legitimate with their customers and suppliers. They then have to reconcile and agree payments. For a smaller publisher this is the major source of revenue.

There is also the complexity of modifying existing solutions. The solutions used across the industry have a single field for User-Agent. These data models will need to change to embrace the client-hints data model. Just like OpenRTB (which I raised in another related issue), this is not a trivial activity.

If this change is implemented, then advertisers will no longer be able to verify their adverts were served to humans when displayed in this manner by publishers. Advertisers will direct their advertising spend directly to publishers and platforms that can provide that verification. Only extremely large publishers with enough traffic volume and financial / ad-ops muscle will be able to provide that certainty. As I understand it, this scenario was confirmed by brand representatives at a recent meeting of the IAB Tech Lab. Therefore, this change will almost certainly reduce the revenue of small and medium-sized publishers unless an alternative solution is made available in parallel with the proposal being implemented. It will also increase revenues for the largest players, including Google. This is just one of the reasons why the W3C should pause this and all related proposals and insist on a clear and robust way forward for all stakeholders.

Governance

I’ve now spoken to an increasing number of people from across the industry. The vast majority are unwilling to participate in this debate. They have share holders and share holders want to be reassured everything will be okay. If they were to engage in these debates and suggest there might be a problem investors will get wobbly and that’s not a good thing. Better to stay silent, say it’ll all be okay, and “hope” someone else will sort it out and everything will be alright.

I’m encouraged by the W3C Web Advertising Business group. I know a number of people are working to update the explainers on how ad tech works for the benefit of the W3C and I’d encourage TAG to review those documents asthey evolve. Some are out of date or inaccurate.

jwrosewell · Apr 16 '20 17:04

If this change is implemented, then advertisers will no longer be able to verify their adverts were served to humans when displayed in this manner by publishers.

So the advertising industry's security model for fraud detection depends on the attacker being nice and sending a truthful user agent header...? Then your business comes crashing down if the bad guys ever figure this out, or notice this GitHub thread? If your secret sauce fraud prevention algorithms actually seriously rely on the UA header, then it is trivial for a malicious attacker to generate fraudulent ad impressions by simply changing the header. Yes? Am I missing something?

If you really rely on the UA header like this, it really seems like the writing is on the wall regardless of this proposal.

mcatanzaro · Apr 16 '20 17:04

Chief Brand Officer at P&G Marc Pritchard gave the following presentation in 2017.

Marc Pritchard, P&G, on Better Advertising Enabled by Media Transparency at IAB ALM

In summary he was not happy about the state of the digital programmatic advertising model.

Fast forward three years and the model I described is being used by P&G and tens of thousands of others. P&G are the world's largest advertiser. P&G pay a proportion of many of the salaries of people reading this comment. They pay the bills, and in aggregate it works for them.

Technically, the fraud detection models assume bad actors will pretend to be something they're not. Often toolkits such as PhantomJS will form components of the tools used. Other indicators also come into play, such as the third-party cookie. The fate of the third-party cookie is the subject of another proposal and debate.

It is very hard for bad actors to represent the diversity of IP addresses, devices, OS and browsers associated with legitimate traffic. Those bad actors will be reading this thread and rubbing their hands with glee that a component of the tools used to thwart them could be removed from the web.

However, these bad actors are short-sighted. Ultimately P&G will divert their advertising to the publishing platforms of a handful of US oligopolies that do have the scale to monitor fraud using a wider diversity of tools and data points, thanks to their scale and, importantly, first-party user consent.

Already the uncertainty that has been created is extremely disruptive, and an advertising-funded open web is under threat.

It is for these reasons that this proposal, and its like, should be paused by the W3C TAG to allow proper consultation and justification, with robust engineering options created and challenged against one another and against the current situation, so that a way forward that considers the needs of all stakeholders can be adopted.

jwrosewell · Apr 16 '20 19:04