Consider providing a way to verify the code in this repository against production
Is your feature request related to a problem? Please describe.
Hey! I really like the concept of Community Notes and I appreciate that it is open-source. However, there doesn't currently appear to be a way to verify that the code in this repository is the code actually running on the web experience. This is filed as a feature request, but it could also be considered a security issue.
Describe the solution you'd like
A way for any user to technically verify that the code in the releases of the Community Notes mobile and web applications is the same as the code in this repository.
Thanks for any information!
Aside from displaying the git commit timestamps and commit hashes on the website/app, I don't know of a good solution.
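A minimal sketch of that commit-hash idea, as a build-time step: capture the current commit and timestamp and write them to a file the app can serve and display. This assumes a Python build step and that `git` is available on the build machine; the `build_info.json` file name is just illustrative, not part of any real deployment.

```python
# Sketch: record the deployed commit at build time so the website/app can show it.
# Assumes git is available on the build machine; output file name is illustrative.
import json
import subprocess
from datetime import datetime, timezone

def write_build_info(path: str = "build_info.json") -> None:
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    commit_time = subprocess.check_output(
        ["git", "show", "-s", "--format=%cI", "HEAD"], text=True
    ).strip()
    info = {
        "commit": commit,
        "commit_timestamp": commit_time,
        "built_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "w") as f:
        json.dump(info, f, indent=2)

if __name__ == "__main__":
    write_build_info()
```

Of course, this only tells users which commit the deployer claims to have built, so it still relies on trusting the deployment pipeline.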
The only other feasible option would be a third-party, independent DevOps verification team that could verify these deployments and publish status reports.
I think it's on us to trust that the algo is what's being used in production.
Love this. Very open to suggestions.

Currently, we do release a note_status_history.tsv file which contains note status histories. It is possible (although very high-effort) to run the public code on the data at each time step: since in production we only run the algorithm once per hour, on the hour, you can pass each individual hour as a timestamp to main.py to filter out all ratings newer than that timestamp. Then you'd have manually verified that the public code, run on the public data, can re-create note_status_history.tsv (which you can verify matches production via screenshots or a web-archive-type approach). Be aware that caching and eventual-consistency delays can mean it takes up to an hour in some cases for new note scores to propagate everywhere.

So it is currently possible to do this verification, although it is way too much effort, and I'd love to think of something much simpler that would allow external verification, more akin to a checksum.
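To make the replay idea concrete, here is a rough sketch of what that hour-by-hour verification loop could look like. It assumes the public ratings TSV has a `createdAtMillis` column and note_status_history.tsv has `noteId`/status-style columns; the `score()` call is a stand-in for however you invoke the public scorer (e.g. main.py) on the filtered data, not the repository's actual API.

```python
# Illustrative replay loop for the verification described above.
# Assumptions (not the repo's real API): ratings carry createdAtMillis,
# note_status_history.tsv carries noteId + a status column, and score()
# is a placeholder for running the public scoring code on filtered inputs.
import pandas as pd

HOUR_MS = 60 * 60 * 1000

def score(notes: pd.DataFrame, ratings: pd.DataFrame) -> pd.DataFrame:
    """Placeholder: run the public scoring code (e.g. main.py) on these inputs
    and return one row per note with assumed columns noteId and status."""
    raise NotImplementedError

def replay(notes_path: str, ratings_path: str, nsh_path: str,
           start_ms: int, end_ms: int) -> None:
    notes = pd.read_csv(notes_path, sep="\t")
    ratings = pd.read_csv(ratings_path, sep="\t")
    nsh = pd.read_csv(nsh_path, sep="\t")

    for cutoff in range(start_ms, end_ms + 1, HOUR_MS):
        # Mirror the "filter out all ratings newer than that timestamp" step.
        ratings_at_t = ratings[ratings["createdAtMillis"] <= cutoff]
        scored = score(notes, ratings_at_t)

        # Compare against the published history; column names are assumptions,
        # and a mismatch within ~1 hour may just be propagation delay.
        merged = scored.merge(nsh, on="noteId", suffixes=("_replay", "_published"))
        mismatches = merged[merged["status_replay"] != merged["status_published"]]
        print(f"cutoff={cutoff}: {len(mismatches)} mismatching notes")
```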
I understand a lot has changed since Aug 14, 2023. It's a completely different scale now. Do you still use the exact same public code in production?
We run the exact same code internally. In order to speed up serving time in production, we split it into separate phases, e.g. so that we don't have to re-run all the preprocessing simply to re-score a note after a new rating arrives. See this function for how it's split: https://github.com/twitter/communitynotes/blob/main/sourcecode/scoring/run_scoring.py#L1833

One other difference vs. 1.5 years ago is that we now run final_note_scoring much more often than every hour (every few minutes).
But we also repeatedly re-run the full sequential version (exactly the same as the public code) to verify that it works and that the outputs match.
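For readers who don't want to dig through run_scoring.py, the shape of that split is roughly the following. The function names echo the ones mentioned above, but the signatures and the state passed between phases are simplified assumptions, not the actual code.

```python
# Simplified sketch of the production phase split described above (not the
# actual run_scoring.py API). The idea: expensive preprocessing/prescoring runs
# infrequently, while final note scoring re-runs every few minutes, reusing the
# prescoring output instead of recomputing everything for each new rating.
from dataclasses import dataclass
import pandas as pd

@dataclass
class PrescoringState:
    """Whatever intermediate artifacts prescoring produces (assumed shape)."""
    rater_factors: pd.DataFrame
    note_factors: pd.DataFrame

def run_prescoring(notes: pd.DataFrame, ratings: pd.DataFrame) -> PrescoringState:
    """Heavy, infrequent phase: preprocessing plus model fitting."""
    ...

def run_final_note_scoring(notes: pd.DataFrame, ratings: pd.DataFrame,
                           state: PrescoringState) -> pd.DataFrame:
    """Cheap, frequent phase (every few minutes in production): re-score notes
    using the cached prescoring state so new ratings take effect quickly."""
    ...

def run_scoring_sequential(notes: pd.DataFrame, ratings: pd.DataFrame) -> pd.DataFrame:
    """The full sequential run (what the public code does end to end), used to
    check that the phased production path produces matching outputs."""
    state = run_prescoring(notes, ratings)
    return run_final_note_scoring(notes, ratings, state)
```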