[Statistics] Many user-agents are marked as unknown
We need to analyze which user agents are not identified, and update the list.
This commit shows how to add a user-agent to the list.
For anyone else who wanders into this thread in the future, this YAML file seems to have been moved to the NuGetGallery repo.
And for anyone else confused about why they see clients listed on NuGet.org that aren't represented in this YAML, the script that consumes this YAML has a fallback that uses a generic list of well-known user agents.
(I was almost going to suggest that updating that list by updating ua-parser might help resolve this issue, but it turns out that list has not been updated in a very long time other than a handful of unreleased changes.)
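To illustrate the fallback behaviour described above, here is a minimal sketch, not the actual gallery code. The names `CUSTOM_PATTERNS`, `GENERIC_PATTERNS`, and `identify_client` are hypothetical, and the two pattern lists are made-up stand-ins for the YAML file and the generic list respectively.

```python
import re

# Hypothetical stand-in for the patterns loaded from the gallery's YAML file.
CUSTOM_PATTERNS = [
    (re.compile(r"NuGet Command Line/(\d+)\.(\d+)"), "NuGet Command Line"),
]

# Hypothetical stand-in for the generic list of well-known user agents.
GENERIC_PATTERNS = [
    (re.compile(r"Firefox/(\d+)"), "Firefox"),
]


def identify_client(user_agent):
    """Try the custom patterns first, then the generic ones."""
    for patterns in (CUSTOM_PATTERNS, GENERIC_PATTERNS):
        for regex, name in patterns:
            if regex.search(user_agent):
                return name
    # Nothing matched in either list, so the client is reported as unknown.
    return "Unknown"
```

With a lookup shaped like this, a client can show up by name on the stats page even though the YAML has no entry for it, because the generic list matched it instead.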
A thought I had after leaving that comment: could all the unknown clients be from the Chinese CDN? The user agent parser has logic, whose purpose I don't fully understand, that rewrites the regular expressions in the YAML file to use + in place of spaces:
https://github.com/NuGet/NuGetGallery/blob/4b37d4d6bba949d81768f914bf99ea14e31168db/python/StatsLogParser/loginterpretation/useragentparser.py#L34-L46
However, no similar logic exists for the previously mentioned fallback. Presumably the rewriting is done because the logs from the Chinese CDN are in a different format. I don't know if the statistics include both CDNs or if the Chinese gallery is isolated from the main one, but the missing rewrite might explain the high number of unknown UAs on some packages.
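A rough sketch of the kind of mismatch I mean, under the assumption that some logs record user agents with + where the client sent a space. This is not the code from the linked file. `plus_encode_pattern` is a hypothetical helper, and it only handles patterns with plain, unescaped spaces.

```python
import re


def plus_encode_pattern(pattern):
    """Return a copy of the pattern that expects a literal + wherever the
    original expected a space. Assumes plain, unescaped spaces only."""
    return pattern.replace(" ", r"\+")


# Example pattern, made up for illustration.
original = r"NuGet Command Line/(\d+)\.(\d+)"
rewritten = plus_encode_pattern(original)

plain_ua = "NuGet Command Line/5.8.0"
plus_ua = "NuGet+Command+Line/5.8.0"

# The original pattern only matches the space-separated form.
original_matches_plain = re.search(original, plain_ua) is not None
original_matches_plus = re.search(original, plus_ua) is not None

# The rewritten pattern only matches the plus-separated form.
rewritten_matches_plus = re.search(rewritten, plus_ua) is not None
```

If the fallback patterns are never rewritten this way, any plus-separated user agent that only the fallback would recognize falls through to unknown.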
(I don't have a horse in this race, just thought I'd share what I noticed. I was just looking into where the names on the stats page come from.)