
Investigate and implement `LIMIT` and `OFFSET` within queries

Open · andrewtavis opened this issue 1 year ago · 3 comments

Description

This issue is a new version of the deleted #130, which came out of #124 and is also related to #68. Scribe will likely need `LIMIT` and `OFFSET` within the queries at some point so that they can finish. A solution was found for the issue in #124, but there could come a time when the queries would not finish without pagination. Figuring this out would give us confidence that the query process for Scribe-Data is robust regardless of the size of the Wikidata Query Service response.

Contribution

Would be very happy to investigate this going forward and help implement it. The general idea is that we would first query the total number of results for a language and word-type pair, and then break the main query down by iterating `LIMIT` and `OFFSET` over that total. Keeping each page to ~50,000 results should be fine, but we can also test this with different queries.
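A minimal sketch of that idea, assuming SPARQLWrapper and the public Wikidata Query Service endpoint; the French-noun query, `PAGE_SIZE`, and `run_query` are illustrative placeholders here, not existing Scribe-Data code:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"
PAGE_SIZE = 50_000  # ~50,000 results per page, as suggested above

# Wikimedia asks for a descriptive User-Agent; this string is a placeholder.
sparql = SPARQLWrapper(ENDPOINT, agent="Scribe-Data pagination sketch")
sparql.setReturnFormat(JSON)


def run_query(query: str) -> list[dict]:
    """Send a SPARQL query and return its result bindings."""
    sparql.setQuery(query)
    return sparql.query().convert()["results"]["bindings"]


# 1. Query the total for a language and word-type pair (French nouns here).
count_query = """
SELECT (COUNT(DISTINCT ?lexeme) AS ?total) WHERE {
  ?lexeme dct:language wd:Q150 ;               # French
          wikibase:lexicalCategory wd:Q1084 .  # noun
}
"""
total = int(run_query(count_query)[0]["total"]["value"])

# 2. Break the main query down, iterating LIMIT and OFFSET over the total.
results = []
for offset in range(0, total, PAGE_SIZE):
    page_query = f"""
    SELECT ?lexeme ?lemma WHERE {{
      ?lexeme dct:language wd:Q150 ;
              wikibase:lexicalCategory wd:Q1084 ;
              wikibase:lemma ?lemma .
    }}
    ORDER BY ?lexeme  # a stable sort keeps OFFSET pages from overlapping
    LIMIT {PAGE_SIZE} OFFSET {offset}
    """
    results.extend(run_query(page_query))
```

Note that the `ORDER BY` matters: without a stable sort the service is free to return rows in any order, so consecutive `OFFSET` pages could overlap or skip results.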

Note that this issue is not of high priority, but could be something that we look at later :)

andrewtavis avatar Jun 15 '24 19:06 andrewtavis

CC @wkyoshida 😊

andrewtavis avatar Jun 15 '24 19:06 andrewtavis

I am interested in this!

henrikth93 avatar Jul 23 '24 18:07 henrikth93

Hey @henrikth93 👋 Let's maybe hold off on this one until GSoC's all done, as there's no real need for it right now :) We can discuss in the sync, or in a call between the two of us, what the best next thing to work on might be!

andrewtavis avatar Jul 24 '24 20:07 andrewtavis

Hi, if this issue is still unresolved, I'd like to work on it.

ItsAbhinavM avatar May 30 '25 13:05 ItsAbhinavM

Hey @ItsAbhinavM 👋 Thanks so much for your interest in taking this on! It's an interesting question of what we want to do with this, and I'd also like to get @axif0's opinion. As of now we're splitting all the queries so that each contains at most six forms, which is reducing the issues, but problems could still come up later as the data expands. @axif0, do we want to set it up such that we cycle through queries by the thousand or so and append the results? We could even parallelize this to make it a bit quicker (though likely only when running on the dumps, as we'd get rate limited by the Wikidata Query Service).
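For reference, a hedged sketch of what that parallel cycling could look like, reusing the illustrative French-noun query from the earlier comment; `fetch_page` and `fetch_all` are hypothetical names, and against the live Wikidata Query Service this would likely be rate limited, so it fits the dump-based path better:

```python
from concurrent.futures import ThreadPoolExecutor

from SPARQLWrapper import SPARQLWrapper, JSON

PAGE_SIZE = 1_000  # cycle "by the thousand or so", per the comment above


def fetch_page(offset: int) -> list[dict]:
    """Run one LIMIT/OFFSET page of the illustrative French-noun query."""
    # One client per call so threads don't share connection state.
    sparql = SPARQLWrapper(
        "https://query.wikidata.org/sparql", agent="Scribe-Data pagination sketch"
    )
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        SELECT ?lexeme ?lemma WHERE {{
          ?lexeme dct:language wd:Q150 ;
                  wikibase:lexicalCategory wd:Q1084 ;
                  wikibase:lemma ?lemma .
        }}
        ORDER BY ?lexeme
        LIMIT {PAGE_SIZE} OFFSET {offset}
    """)
    return sparql.query().convert()["results"]["bindings"]


def fetch_all(total: int, max_workers: int = 4) -> list[dict]:
    """Fetch all pages concurrently and append the results in offset order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields pages in submission order, so results stay ordered.
        pages = list(pool.map(fetch_page, range(0, total, PAGE_SIZE)))
    results: list[dict] = []
    for page in pages:
        results.extend(page)
    return results
```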

Happy to assign for now, @ItsAbhinavM, and then you're free to get started once we have the above decisions finalized 😊

andrewtavis avatar May 31 '25 12:05 andrewtavis