Query generation is including the same query form in more than one query
Terms
- [x] I have searched all open bug reports
- [x] I agree to follow Scribe-Data's Code of Conduct
Behavior
An issue discovered in https://github.com/scribe-org/Scribe-Data/pull/601 is that at times the same query form is generated in more than one query. An example of this is for English verbs:
Each of the above query files returns the `englishPastParticiple` form, which should not happen.
One potential fix for this: when we find that there are new forms that haven't been included, it would be best to regenerate all queries to make sure that no form is repeated. That way the list of available forms translates one to one into queries.
CC @axif0 and @catreedle for the issue that we're writing up in the call :)
Hey, @andrewtavis @axif0 , can I be assigned this issue? I'd like to work on this
Thanks so much for taking this on, @harikrishnatp! Please let us know if there's anything we can do to help
Hey, @andrewtavis @axif0
I wanted to give you an update on the work I've been doing to address this issue and to clarify a few things before moving forward with regenerating the queries.
Progress made so far:

- Duplicate detection: I wrote a script to scan all `.sparql` files in each language folder and identify forms that are being queried in multiple files (a sketch of this scan is below).
- Master list generation: from your suggestion, I wrote a script that generates a `forms_master_list.txt` for each language. This lists all unique grammatical forms found in the files along with their locations.

Next steps: Regenerate all SPARQL files from the master list, ensuring that each form is only extracted in one file per language.
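For reference, a minimal sketch of that duplicate scan, assuming queries live under a per-language folder tree (the directory path, the variable regex, and the skipped base variables are my assumptions, not Scribe-Data internals):

```python
# Hypothetical sketch: find form variables that appear in more than one
# .sparql file within the same folder. Path and regex are assumptions.
import re
from collections import defaultdict
from pathlib import Path

QUERY_DIR = Path("src/scribe_data/language_data_extraction")  # assumed location
BASE_VARS = {"lexeme", "lexemeID", "lastModified"}  # shared by design, skipped

def find_duplicate_forms(query_dir: Path) -> dict[tuple[str, str], set[str]]:
    """Map (folder, form variable) to the set of files querying that form."""
    locations: dict[tuple[str, str], set[str]] = defaultdict(set)
    for sparql_file in query_dir.rglob("*.sparql"):
        for var in set(re.findall(r"\?(\w+)", sparql_file.read_text())):
            if var not in BASE_VARS:
                locations[(sparql_file.parent.name, var)].add(sparql_file.name)
    return {key: files for key, files in locations.items() if len(files) > 1}

for (folder, form), files in find_duplicate_forms(QUERY_DIR).items():
    print(f"{folder}: ?{form} appears in {sorted(files)}")
```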
Doubts and clarifications:

- Is it correct to exclude variables like `lexeme`, `lexemeID` and `lastModified`, as well as lemma variables like `noun`, `verb`, `adjective`, etc. (when used as the base word), from the master list of forms? Or are there cases where you want these included as "forms"?
- When regenerating queries, is there a preferred way to assign forms to files? Is there a mapping you want me to follow, or should I just keep the one-form-per-file approach?
- Should variables like `lexeme`, `lexemeID` and `lastModified` always be included in every file regardless of the forms being extracted?
Hey @harikrishnatp! Thanks for your efforts here so far! I'm wondering whether it would make sense to check the query generation logic to see what's causing the duplicates. We already have a workflow to generate the queries, and if it gets fixed, then the logic of how many forms are in each query and which forms go in which file would be solved by the original process. My assumption is that somewhere in the process we're not using a unique list of identifiers, or maybe an identifier is being added in again.
The query check process can be found in `src/scribe_data/check/check_missing_forms`. Maybe you could run this process and then find where the repeats are coming from?
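To illustrate the unique-list hypothesis (purely a guess at the shape of the bug, not the actual Scribe-Data logic): if the collected form identifiers contain a repeat, the same form lands in more than one query, and deduplicating while preserving order would prevent it:

```python
# Illustrative only: deduplicate form QIDs while preserving their order.
def unique_in_order(identifiers: list[str]) -> list[str]:
    seen: set[str] = set()
    unique = []
    for qid in identifiers:
        if qid not in seen:
            seen.add(qid)
            unique.append(qid)
    return unique

# A repeated identifier like the first one here would otherwise be queried twice.
print(unique_in_order(["Q3482678", "Q30619513", "Q3482678"]))
# ['Q3482678', 'Q30619513']
```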
Thanks for the feedback! I'll check it out and find what's causing the issue
Looking forward to your findings, @harikrishnatp! Will be really great to get this finalized
Hey @harikrishnatp! Quick check-in to see if there's anything we can do to support here. @axif0 and I both have more capacity right now, so do let us know if we can help
Hey @andrewtavis, @axif0. I was almost done with this, but while testing I found a few bugs that took a while to fix. Also, my end-of-semester exams are currently going on. Apologies for the delay, and thank you for your patience!
No stress on this, @harikrishnatp! We heard that you'll be at the hackathon coming up, so we figured you can work on this then :) Good luck with the end of the semester!
Thanks a lot! I'll definitely continue the work on this at the hackathon
From a conversation with @harikrishnatp and @axif0 just now, the process that we want to create is:
- A default SPARQL query with placeholders for the language QID and data type QID will be created (similar to the profanity query)
- This query will be loaded into the query generation process with `LANGUAGE_QID` and `DATA_TYPE_QID` being replaced based on the languages and data types that we're generating queries for
- The query will be run with SPARQLWrapper and the results will be returned as a full dictionary structured like the one below (a sketch of this step follows this list)
  - See query: https://w.wiki/F8vD
```python
{
    "LANGUAGE_QID_0": {  # English, German, etc.
        "DATA_TYPE_0": [  # nouns, verbs, etc.
            ["Q3482678"], ["Q3482678", "Q30619513"], ...  # all query combinations returned from the query as a list of lists
        ],
        ...
    },
    ...
}
```
- Sort the array of arrays based on the query form metadata file
  - Based on `lexeme_form_metadata.json`
  - There is a function to sort this, so let's use that to sort the array of arrays within the dictionary
  - We should also remove forms and properties that are not included within the metadata, but also list them so that we can update the metadata from time to time
    - Things like noun genders still need to be written in, but there's a place in the code where including them in the generated queries has already been handled
- To generate queries, we would loop through the above dictionary and use the same logic that we have now to create queries, but the query form property combinations would be determined by the array of arrays for each language and data type pair
- Open question: whether to include a threshold such that we do not include form combinations of properties if a given combination is only present on a few lexemes
  - We would assume that a combination that is only on two of 2,000 adjectives is rather dirty data
  - We'll need to test this a bit, but the general agreement is that we should filter these out so that we don't have fields being generated in the queries for data that was applied incorrectly
  - For right now, we don't have to do this in the first PR, and we can make another issue after
    - It would be good to think about it as we work on this
    - A filter process for this could be included and just set to 100% for the first PR
    - If a form property combination is in <1% of the lexeme items, then maybe don't include it
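A minimal sketch of the placeholder replacement and SPARQLWrapper steps above, assuming a local `query_template.sparql` file and that the query returns the feature QIDs pipe-joined in a `?features` binding (both are assumptions; see the linked query for the real output shape):

```python
# A minimal sketch, assuming a template with LANGUAGE_QID and DATA_TYPE_QID
# placeholders and one result row per form combination.
from pathlib import Path

from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)
template = Path("query_template.sparql").read_text()  # assumed file name

def get_form_combinations(language_qid: str, data_type_qid: str) -> list[list[str]]:
    """Replace the placeholders, run the query and return the array of arrays."""
    query = template.replace("LANGUAGE_QID", language_qid).replace(
        "DATA_TYPE_QID", data_type_qid
    )
    sparql.setQuery(query)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [row["features"]["value"].split("|") for row in bindings]

# Building the dictionary structured like the above, e.g. for English verbs:
results = {"Q1860": {"verbs": get_form_combinations("Q1860", "Q24905")}}
```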
@andrewtavis, @axif0, quick update on this
I have finished the implementation of the query service using the query template given by @axif0. The implementation is working well, but I found some performance issues while testing that I'd like your input on:
- Arabic adjectives: found 274 combinations vs. the 24 found in the backup
- Same with Basque and Bengali; most of the categories are working right
- Arabic nouns/verbs: timeout after multiple retry attempts, and the same with Czech adjectives
- Complex languages hit computational limits with the current query

@andrewtavis, your suggestions about cutoffs and removing labels should work for this. Because the current query processes all combinations at once, it's failing for complex languages.
Should I implement the frequency cutoffs and label removal you mentioned?
Sorry for the delay in getting back, @harikrishnatp :)
What kind of timeouts are we dealing with? Are you saying that the version of this query that you're using is timing out, or that the queries it generates are timing out? If it's the former, then maybe there's some kind of fallback we can do
And yes, I think proceeding with the frequency cutoff makes sense :)
Yes, it's the query that you linked that's timing out, not the generated queries. It works for the simpler cases but hits the limits for other, more complex languages
I've implemented frequency cutoffs, with which we're getting more practical results (Arabic adjectives went from 274 combinations to 57).
I also removed labels and only use QIDs, like you mentioned on Matrix.
I tweaked the frequency to filter by data type: nouns/verbs have a min_frequency of 50+, adjectives 30 and others 5 (a post-processing sketch of this is below). I also added LIMIT clauses to prevent huge results from timing out.
After all of that, many languages started working right, including Arabic nouns, which previously timed out. But Arabic verbs still aren't working, even with extreme cutoffs.
I just found out why Arabic verbs are causing problems:
- Arabic adjectives: 417,732 forms
- Arabic verbs: 4,655,094 forms

The `GROUP BY ?form` on all those forms is what's causing the timeout.
Maybe using the dump as a fallback for an edge case like this would be a good idea?
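A minimal sketch of the per-data-type cutoff described above. The thresholds mirror this comment; whether the filtering happens in the SPARQL itself or in post-processing is an implementation detail, and this shows a post-processing variant with assumed data shapes:

```python
# Illustrative sketch: drop form combinations that occur fewer times than a
# per-data-type minimum frequency. Combinations are lists of feature QIDs.
from collections import Counter

MIN_FREQUENCY = {"nouns": 50, "verbs": 50, "adjectives": 30}  # others: 5

def filter_rare_combinations(
    combinations: list[list[str]], data_type: str
) -> list[list[str]]:
    """Keep only the combinations that meet the data type's cutoff."""
    cutoff = MIN_FREQUENCY.get(data_type, 5)
    counts = Counter(tuple(c) for c in combinations)
    return [list(c) for c, n in counts.items() if n >= cutoff]
```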
Let's maybe just skip it for now. I feel like the above is impossible unless the data is very, very corrupted. Do you want to set it up to skip queries for outliers that your conditions find, and then we can open a PR and begin the review process, @harikrishnatp? :) Happy to figure out a time for a call with @axif0 to bring this all in and run the query generation workflow!
```sparql
SELECT (COUNT(*) AS ?count)  # a SELECT is needed to run this; counting rows is one option
WHERE {
  ?lexeme dct:language wd:Q13955;
    wikibase:lexicalCategory wd:Q24905;  # changed to Q24905 (verbs)
    ontolex:lexicalForm ?form.
  ?form wikibase:grammaticalFeature ?feature.
}
```
This is the query I ran for Arabic verbs to get those counts.
Working through this with @harikrishnatp right now:
- The up-to-date query that just returns QIDs, which we can then apply our own labels to, is https://w.wiki/FesH
- If the above query does not work, then we make the assumption that the underlying data is not of a quality that we're able to generate queries from
  - Example: @harikrishnatp and I just did a test of Arabic verbs, and we're getting back that a sample of 1,000 lexemes will have a maximum of 35 instances of a common combination
  - The query that we should generate in this case is https://w.wiki/FesZ
    - Arabic verbs in this case would have one query returned, which is `query_verbs.sparql`, and the contents would be the above query with the language being Arabic and the data type being verbs
- It might also be nice if we get a message to the terminal from the `try`/`except` that says that queries will not be generated for the language and data type combination, and then we can look into this further :) (a sketch of this follows below)

The thinking here is that this is the best that we can do in a situation where some of the data is so dirty that it doesn't make sense for us to generate queries for it (i.e. Arabic verbs).
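A sketch of that fallback flow; every helper name here is a hypothetical stand-in rather than an existing Scribe-Data function:

```python
# Hypothetical sketch of the suggested try/except: if the combinations query
# fails, print a terminal message and write a single generic query (in the
# style of https://w.wiki/FesZ) for that language and data type instead.
def generate_queries(language: str, data_type: str) -> None:
    try:
        combinations = get_form_combinations_for(language, data_type)  # hypothetical
        write_combination_queries(language, data_type, combinations)  # hypothetical
    except Exception as err:
        print(
            f"Queries will not be generated for {language} {data_type} "
            f"({err}). Writing the generic fallback query instead."
        )
        write_generic_query(language, data_type)  # hypothetical
```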
@andrewtavis, so while I was testing generating all the SPARQL files using the query service, I noticed that Arabic verbs and other data types with dirty data were being queried properly. Then I checked from the service with this query and found this:
Did they clean up the data, or what?
That was without `LIMIT` though, just wanted to let you know :)
Looks like they likely did clean up the data! Nice! So we should be able to include them then :)
In the future maybe we can also use this process to alert the Wikidata community of dirty data
Making great progress here :) I just did an update of the dependencies, as the workflow was failing because they were a bit out of date. We now use the production requirements within the dev requirements as well for easier maintainability. From here we need to update the workflow given the new functionality, as it's failing because of the changes in arguments.
@harikrishnatp, do you want to send along a PR for updating the `check_and_update_missing_query_forms.yaml` workflow? From there we can merge that in and run the workflow to check the queries it generates
After that we should move to #637 to further improve maintainability, and with that we can then finalize a few other issues and switch the focus to writing tests and Wiktionary-based translation. For the tests we can write multiple individual issues and switch #623 over to an epic with sub-issues.