
speed up autosuggestions generation by building bigram counts in a si…

Open catreedle opened this issue 2 months ago • 3 comments

Contributor checklist


Description

This PR proposes an optimization to how autosuggestions are generated.

Instead of iterating over all articles once per top word, this version makes a single pass over the corpus and builds a word → next-word frequency map (bigram counts). Autosuggestions for every top word are then read from that map.
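The single-pass idea can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function names `build_bigram_counts` and `top_suggestions` are hypothetical, and it assumes each article is already a cleaned string that can be tokenized on whitespace.

```python
from collections import Counter, defaultdict


def build_bigram_counts(articles):
    """Make one pass over the corpus and count, for each word, the words that follow it."""
    next_word_counts = defaultdict(Counter)
    for text in articles:
        # Assumption: lowercase whitespace tokenization is good enough for the sketch.
        words = text.lower().split()
        for current, following in zip(words, words[1:]):
            next_word_counts[current][following] += 1
    return next_word_counts


def top_suggestions(next_word_counts, top_words, n=3):
    """Look up the n most frequent next words for each top word, with no further corpus pass."""
    return {
        word: [w for w, _ in next_word_counts.get(word, Counter()).most_common(n)]
        for word in top_words
    }


articles = ["the cat sat on the mat", "the cat ran", "the dog sat"]
counts = build_bigram_counts(articles)
suggestions = top_suggestions(counts, ["the", "sat"], n=2)
```

With this shape, the cost of reading the corpus is paid once, and adding more top words only adds cheap dictionary lookups instead of another full pass over the articles.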

This reduces the generation time for 500 top words across ~700k articles from tens of hours to around 2 hours, while producing nearly identical results.

I compared the top 10 autosuggestions generated by both methods, and the outputs were similar.
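One possible way to quantify such a comparison is a per-word overlap score between the two outputs. The sketch below is an assumption about how this could be done, not a description of how the comparison in this PR was actually run; `overlap_at_k` is a hypothetical helper, and both inputs are assumed to be dicts mapping a top word to its ranked list of suggestions.

```python
def overlap_at_k(old, new, k=10):
    """Jaccard overlap of the first k suggestions, for each word present in both outputs."""
    scores = {}
    for word in old.keys() & new.keys():
        a, b = set(old[word][:k]), set(new[word][:k])
        # Guard against two empty suggestion lists.
        scores[word] = len(a & b) / max(len(a | b), 1)
    return scores


old_output = {"the": ["cat", "dog"], "a": ["few", "lot"]}
new_output = {"the": ["cat", "mat"], "a": ["few", "lot"]}
scores = overlap_at_k(old_output, new_output)
```

A score of 1.0 means both methods suggest the same set of words for that top word (ignoring order), and lower scores flag words worth inspecting by hand.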

Related issue

  • #ISSUE_NUMBER

catreedle avatar Oct 15 '25 09:10 catreedle

Thank you for the pull request! ❤️

The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app, and definitely join the General and Data rooms once you're in. Also consider attending our bi-weekly Saturday dev syncs. It'd be great to meet you 😊

github-actions[bot] avatar Oct 15 '25 09:10 github-actions[bot]

Maintainer Checklist

The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

  • [ ] The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • [ ] The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

github-actions[bot] avatar Oct 15 '25 09:10 github-actions[bot]

Thanks so much for the PR here, @catreedle! Sorry for the wait on the review. We'll get to this in the coming days! 😊

andrewtavis avatar Oct 23 '25 20:10 andrewtavis