speed up autosuggestions generation by building bigram counts in a single pass
Contributor checklist
- [x] This pull request is on a separate branch and not the main branch
- [x] I have tested my code with the `pytest` command as directed in the testing section of the contributing guide
Description
This PR proposes an optimization to how autosuggestions are generated.
Instead of iterating through all articles multiple times (once for each top word), this version processes the corpus only once to build a word → next-word frequency map (bigrams).
This reduces the generation time for 500 top words across ~700k articles from tens of hours to around 2 hours, while producing nearly identical results.
I compared the top 10 autosuggestions generated by both methods, and the outputs were similar.
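For illustration, here is a minimal sketch of the single-pass bigram idea described above (not the exact code in this PR); the `articles` iterable, the `top_words` list, and the whitespace tokenization are assumptions made for the example:

```python
from collections import Counter, defaultdict


def build_bigram_counts(articles):
    """Count word -> next-word occurrences in a single pass over all articles."""
    bigram_counts = defaultdict(Counter)
    for article in articles:
        tokens = article.lower().split()  # illustrative tokenization
        for current_word, next_word in zip(tokens, tokens[1:]):
            bigram_counts[current_word][next_word] += 1
    return bigram_counts


def autosuggestions(bigram_counts, top_words, n=10):
    """Return the n most frequent following words for each top word."""
    return {
        word: [w for w, _ in bigram_counts[word].most_common(n)]
        for word in top_words
        if word in bigram_counts
    }
```

The key point is that the corpus is read once to populate the frequency map, and each top word's suggestions are then a cheap `most_common` lookup rather than another full pass over the articles.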
Related issue
- #ISSUE_NUMBER
Thank you for the pull request! ❤️
The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app, and definitely join the General and Data rooms once you're in. Also consider attending our bi-weekly Saturday dev syncs. It'd be great to meet you 😊
Maintainer Checklist
The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)
Thanks so much for the PR here, @catreedle! Sorry for the wait on the review. We'll get to this in the coming days! 😊