atlas
State of content language as of June 2024
Curated videos
The problem is simple: only 4 videos have been hand-picked by the CWG (as of 19 June).
Rest of the feed
Our language detection approach wasn't good enough for the home feed. Overall accuracy on the initial sample was fairly good, but it sometimes filtered out videos from high-value channels.
We also cannot rely on the language property: too many non-English videos are marked as English, so using it as a filter is pointless.
Solutions
I know of three possible solutions:
- Ship interaction tracking as soon as possible so we can start gathering data for an ML solution, and release Gleev with the ML system once we have collected a decent amount of it. The geolocation of interactions could then serve as a way around the poor accuracy of the language property.
- Find a service that offers video-based language detection, integrate it into Orion, and use it for high-value channels.
- Try text-based language detection with a more reliable approach. We could use a paid service to detect the language from the video title, which should be fairly cheap.
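As a rough illustration of the title-based option, here is a minimal sketch of a feed filter that never drops a high-value channel on the strength of title detection alone. Everything in it is an assumption made for the example: the `Video`, `Detection` and `FilterOptions` shapes, the `keepInFeed` function, and the `stubDetect` detector are hypothetical and do not reflect Orion's actual schema or any specific paid service. The detector is passed in as a function so that a real service could be swapped in later.

```typescript
// Hypothetical shapes for illustration only; not Orion's real schema.
interface Video {
  id: string;
  title: string;
  channelId: string;
}

interface Detection {
  language: string;   // ISO 639-1 code, e.g. "en"
  confidence: number; // 0..1, as reported by the detector
}

// The detector is injected, so a paid API or a local library can be plugged in.
type TitleDetector = (title: string) => Detection;

interface FilterOptions {
  targetLanguage: string;
  minConfidence: number;
  highValueChannelIds: Set<string>;
}

function keepInFeed(video: Video, detect: TitleDetector, opts: FilterOptions): boolean {
  // High-value channels bypass title detection entirely.
  if (opts.highValueChannelIds.has(video.channelId)) return true;

  const result = detect(video.title);

  // Assumption: a low-confidence result is not enough to drop a video.
  if (result.confidence < opts.minConfidence) return true;

  return result.language === opts.targetLanguage;
}

// Stand-in detector for the example: treats Cyrillic titles as Russian, everything else as English.
const stubDetect: TitleDetector = (title) =>
  /[а-яё]/i.test(title)
    ? { language: "ru", confidence: 0.9 }
    : { language: "en", confidence: 0.9 };

const options: FilterOptions = {
  targetLanguage: "en",
  minConfidence: 0.8,
  highValueChannelIds: new Set(["channel-hv"]),
};

const videos: Video[] = [
  { id: "1", title: "Intro to staking", channelId: "channel-a" },
  { id: "2", title: "Введение в стейкинг", channelId: "channel-a" },
  { id: "3", title: "Введение в стейкинг", channelId: "channel-hv" },
];

const keptIds = videos.filter((v) => keepInFeed(v, stubDetect, options)).map((v) => v.id);
```

Keeping videos when the detector is unsure trades some non-English leakage for fewer wrongly hidden videos; whether that is the right trade-off for the home feed is an open question. High-value channels could instead be routed to video-based detection rather than being exempted outright.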