Scribe-Data
Investigate and implement `LIMIT` and `OFFSET` within queries
Terms
- [X] I have searched open and closed feature requests
- [X] I agree to follow Scribe-Data's Code of Conduct
Description
This issue is a new version of the deleted #130, which came from #124, and is also related to #68. Scribe will likely at some point need `LIMIT` and `OFFSET` within its queries so that they can finish. A solution was found for the issue in #124, but there could come a time when the queries no longer finish. Figuring this out would give us confidence that the query process for Scribe-Data is robust regardless of the size of the Wikidata Query Service response.
Contribution
I'd be very happy to investigate this going forward and help implement it. The general idea is that we would query the total number of results for a language and word type pair and then break the query down, iterating `LIMIT` and `OFFSET` based on that total (a sketch of this follows below). Keeping each response to ~50,000 results should be fine, but we can also test this with different queries.
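A minimal sketch of this pagination idea, assuming the SPARQLWrapper library and the public Wikidata Query Service endpoint (which predefines the `dct:`, `wikibase:`, and `wd:` prefixes). The query itself (English nouns, `wd:Q1860` / `wd:Q1084`), the `PAGE_SIZE` value, and all function names are illustrative, not Scribe-Data's actual implementation:

```python
from SPARQLWrapper import JSON, SPARQLWrapper

ENDPOINT = "https://query.wikidata.org/sparql"
PAGE_SIZE = 50_000  # keep each response to ~50,000 results

# Illustrative query: English (wd:Q1860) nouns (wd:Q1084). Scribe-Data's
# real per-language, per-word-type queries would replace these values.
BASE_QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q1860 ;
          wikibase:lexicalCategory wd:Q1084 ;
          wikibase:lemma ?lemma .
}
"""

COUNT_QUERY = """
SELECT (COUNT(*) AS ?total) WHERE {
  ?lexeme dct:language wd:Q1860 ;
          wikibase:lexicalCategory wd:Q1084 .
}
"""


def run_query(query: str) -> dict:
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(query)
    return sparql.query().convert()


def fetch_page(offset: int) -> list[dict]:
    # A stable ORDER BY keeps OFFSET windows consistent between requests.
    paged = BASE_QUERY + f"ORDER BY ?lexeme\nLIMIT {PAGE_SIZE}\nOFFSET {offset}"
    return run_query(paged)["results"]["bindings"]


def query_all() -> list[dict]:
    # First query the total so we know how many pages to iterate over.
    total = int(run_query(COUNT_QUERY)["results"]["bindings"][0]["total"]["value"])

    results: list[dict] = []
    for offset in range(0, total, PAGE_SIZE):
        results.extend(fetch_page(offset))
    return results
```

The `ORDER BY` matters here: without a stable ordering, the endpoint is free to return rows in any order, so consecutive `OFFSET` windows could overlap or skip results.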
Note that this issue is not high priority, but it could be something that we look at later :)
CC @wkyoshida 😊
I am interested in this!
Hey @henrikth93 👋 Let's maybe hold off on this one until GSoC's all done, as there's no real need for it now :) We can discuss in the sync or a call between the two of us what might be the best next thing to work on!
Hi, if this issue is still unresolved, I'd like to work on it.
Hey @ItsAbhinavM 👋 Thanks so much for your interest in taking this on! It's an interesting question of what we want to do with this, and I'd like to also get @axif0's opinion. As of now we're splitting all the queries so that each contains at most six forms, which is reducing the issues, but issues could still come up later as the data expands. @axif0, do we want to set it up such that we cycle through queries by the thousand or so and append the results? We could even potentially parallelize this to make it a bit quicker, though likely only when running on the dumps, as we'd get rate limited by the Wikidata Query Service (see the sketch below).
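A hedged sketch of that parallel variant, reusing the hypothetical `fetch_page` and `PAGE_SIZE` names from the earlier comment; as noted, this would only make sense against a local dump endpoint, since the public Wikidata Query Service would likely rate limit concurrent requests:

```python
from concurrent.futures import ThreadPoolExecutor


def query_all_parallel(total: int, workers: int = 4) -> list[dict]:
    # One offset per page of PAGE_SIZE results, as in the sequential sketch.
    offsets = range(0, total, PAGE_SIZE)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields pages in offset order, so appending them
        # reproduces the sequential result.
        pages = pool.map(fetch_page, offsets)
        return [row for page in pages for row in page]
```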
Happy to assign for now, @ItsAbhinavM, and then you're free to get started once we have the above decisions finalized 😊