
Convert all query processes to use `LIMIT` and `OFFSET`

Open andrewtavis opened this issue 1 year ago • 0 comments

Terms

Description

Related to the work happening in #124: in the last dev sync we decided on a new method for breaking down queries that are too large to return results because of timeout restrictions. The first version will be implemented in #124, and the other queries should then be changed to use the new method, where every query has a `LIMIT` and `OFFSET` set within it that can be changed programmatically. The method will be:

  • For each query, we'll first run a basic query to find out how many Wikidata items will be returned
  • This count will then be used to derive a chunk size for the LIMIT and OFFSET
    • Say there are 100K items to return data for: we could have a LIMIT of 50K and programmatically set the OFFSET to 0 and then to 50K
    • update_data.py would then loop over these versions of the query and append the results to a common output
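
A minimal sketch of the loop described above, to make the chunking arithmetic concrete. The function names (`chunk_offsets`, `paginate_query`, `run_in_chunks`) and the `run_query` callable are hypothetical and are not taken from update_data.py. The sketch also assumes the base query has no `LIMIT` or `OFFSET` of its own, and that the item count has already been obtained by a separate counting query.

```python
from __future__ import annotations

from typing import Callable


def chunk_offsets(total_items: int, chunk_size: int) -> list[int]:
    """Return the OFFSET values needed to cover total_items in chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return list(range(0, total_items, chunk_size))


def paginate_query(base_query: str, limit: int, offset: int) -> str:
    """Append LIMIT and OFFSET clauses to a query that has neither."""
    return f"{base_query.rstrip()}\nLIMIT {limit}\nOFFSET {offset}"


def run_in_chunks(
    base_query: str,
    total_items: int,
    chunk_size: int,
    run_query: Callable[[str], list],
) -> list:
    """Run one query per chunk and append all results to a common list.

    run_query is a stand-in for whatever sends the query to Wikidata
    and returns its rows; it is an assumption of this sketch.
    """
    results: list = []
    for offset in chunk_offsets(total_items, chunk_size):
        results.extend(run_query(paginate_query(base_query, chunk_size, offset)))
    return results


# The 100K example from above: two chunks of 50K each.
print(chunk_offsets(100_000, 50_000))
```

One thing to check when testing: the SPARQL specification notes that `LIMIT` and `OFFSET` only select predictable subsets when the order of results is fixed with `ORDER BY`, so the chunked queries may need one to avoid duplicated or missing items across chunks.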

Note: In the sync I mentioned that we'd also switch all of the _1, _2, etc. queries over to work like this. This may not be possible: if memory serves, part of the reason for splitting them was that Wikidata has a character limit on what you can pass to it (this is why all the queries are written with very short abbreviations). We can test this and see if we can convert these queries into a single common one as well 😊

Contribution

Happy to work on this with people as far as planning the scope of the work and helping with implementation! 🚀

andrewtavis avatar Apr 21 '24 20:04 andrewtavis