Scribe-Data
Expand Swedish data queries
Terms
- [X] I have searched open and closed feature requests
- [X] I agree to follow Scribe-Data's Code of Conduct
Description
This issue would look into expanding the src/scribe_data/language_data_extraction/Swedish files with as much data as possible from what is currently on Wikidata. We can reuse the code for getting data from other languages, and from there we can check the Swedish data on Wikidata to see which forms (conjugations, for example) are available. We can then expand each query with optional selections of certain forms, as is done in other SPARQL queries. The query can be tried on the Wikidata Query Service UI during development :)
Data types to include:
- [x] Nouns
- [x] Verbs
- [ ] Adjectives
- [ ] Adverbs
- [ ] Prepositions
- [x] Emoji keywords
Contribution
This is a feature I am willing to work on
@GicharuElvis shall I work on this?
@GicharuElvis made this for themselves, @Khushalsarode, but you can look for another language and make a similar issue for yourself. Maybe German or Spanish still needs adjectives?
Thank you @andrewtavis. I'll work on it. I'll reach out in case of anything
Sure! No problem, I didn't see anyone assigned, that's why I commented! I will take other issues @andrewtavis
No stress, @Khushalsarode! Appreciate your interest! :)
Just added a list of data types that we want to include to this issue :) Have marked those that are already done or have PRs open, and we can work on the others 😊 If the data type can't work, then we can move to the others and open up specific issues later :)
Thank you, I'll have a look at them
Some suggestions on how to improve the Swedish adjectives query that was just merged, @GicharuElvis, taken from a post I made in the Data channel on Matrix. This is the process I follow when working on a query, starting from the base query that just returns the Wikidata lexeme ID and lemma:
- Let's take for example the query for Slovak adjectives src/scribe_data/language_data_extraction/Slovak/adjectives/query_adjectives.sparql
- Let's take this query over to query.wikidata.org, but with one edit to return the lexeme URI via ?lexeme
- You can see the edited query in the Wikidata Query Service here
- Note that it includes ?lexeme
- Run the query and look at the results
- Click on the first result, which will be random, and for me was wikidata.org/wiki/Lexeme:L238355
- This is the Slovak adjective slovenský (the adjective conveniently means Slovak)
- This is the base adjective, but in Slovak specifically we also need to get the forms of the adjective, as they differ based on whether the thing described is masculine, feminine, etc.
- You can see all of these other forms on the Wikidata page when you scroll down
- There are forms based on masculine vs. feminine, singular vs. plural and the case that's used
- These are the forms that we also want to include in our data outputs
- Not all data types for languages have forms on them, but it helps to check if the forms are there beforehand so we don't miss them
- An example of a query with tons of forms is src/scribe_data/language_data_extraction/Estonian/adverbs/query_adverbs_1.sparql
- Slovak is similar to Estonian in these regards, that there are many forms and they're complex (combinations of four properties like feminine, singular, etc)
- We need to construct forms with an optional selection that includes all the properties that are on the form
- For the first form in the adjective above, slovenský, we need to find the Wikidata QIDs for masculine, nominative case, singular, positive
- Then put these within the optional selection to get the form in a way that the returned value is unique
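The last two steps can be sketched roughly as follows. This is only an illustration, written in Python so the string building is easy to check: `optional_form` and the variable names are hypothetical helpers, not part of Scribe-Data, and the QIDs used (Q499327 masculine, Q131105 nominative case, Q110786 singular, Q3482678 positive) are assumptions from memory that should be confirmed on Wikidata before they go into a real query.

```python
from __future__ import annotations


def optional_form(var_name: str, feature_qids: list[str]) -> str:
    """Build one SPARQL OPTIONAL selection for a single form of ?lexeme.

    All grammatical features found on the form are listed, so the selection
    should match only that form and keep the result at one row per lexeme.
    """
    features = ", ".join(f"wd:{qid}" for qid in feature_qids)
    return (
        "OPTIONAL {\n"
        f"    ?lexeme ontolex:lexicalForm ?{var_name}Form .\n"
        f"    ?{var_name}Form ontolex:representation ?{var_name} ;\n"
        f"        wikibase:grammaticalFeature {features} .\n"
        "}"
    )


# Assumed QIDs: masculine, nominative case, singular, positive.
print(
    optional_form(
        "masculineNominativeSingularPositive",
        ["Q499327", "Q131105", "Q110786", "Q3482678"],
    )
)
```

The printed block can be pasted into the WHERE clause of the query in the Wikidata Query Service, with the new variable added to the SELECT line. One such block would be needed per form.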
Note that in queries we ideally would not have SELECT DISTINCT, GROUP BY or anything else that combines the results of the query, i.e. the query itself should return one row per lexeme. We can do this by making sure we include all the properties that are on a form so that selections don't overlap, and we can also filter based on a language or use FILTER NOT EXISTS to say that we don't want a specific property.
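One way to check the one-row-per-lexeme goal is to look for lexeme IDs that repeat in the downloaded results. The sketch below assumes the results have been loaded as a list of dictionaries with a `lexemeID` key; that key name, the sample rows and both helper functions are hypothetical, and the QID passed to `exclude_feature` (Q146786, which I believe is plural) is an assumption to verify on Wikidata.

```python
from collections import Counter


def duplicated_lexemes(rows):
    """Return the lexeme IDs that appear in more than one result row."""
    counts = Counter(row["lexemeID"] for row in rows)
    return sorted(lexeme for lexeme, n in counts.items() if n > 1)


def exclude_feature(form_var, qid):
    """Build a FILTER NOT EXISTS clause rejecting forms that carry a feature."""
    return f"FILTER NOT EXISTS {{ ?{form_var} wikibase:grammaticalFeature wd:{qid} . }}"


# Made-up rows: L2 appears twice, so one of its selections matches two forms.
rows = [
    {"lexemeID": "L1", "adjective": "a"},
    {"lexemeID": "L2", "adjective": "b"},
    {"lexemeID": "L2", "adjective": "b"},
]
print(duplicated_lexemes(rows))

# Clause to place inside the OPTIONAL block that selects ?singularForm.
print(exclude_feature("singularForm", "Q146786"))
```

If the first call reports any IDs, the selection for those lexemes is probably missing a property, and either adding it or adding a FILTER NOT EXISTS clause like the second one may remove the overlap.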
Hope this helps! Please let us know if there are any questions :) Would be great if you took the above and tried to expand the Swedish adjectives query 😊
Thank you @andrewtavis. Question: at what point do we use the second option over the lexeme?
Seems everything is all set here
That it is, @GicharuElvis 😊 Thanks for the work here!
Sure, anytime. Let me look for more!