Expand Swedish data queries

Open GicharuElvis opened this issue 1 year ago • 7 comments

Terms

Description

This issue looks into expanding the src/scribe_data/language_data_extraction/Swedish files with as much data as possible from what is currently on Wikidata. We can start from the code used to get data for other languages, and from there check the Swedish data on Wikidata to see what conjugations are available. We can then expand the query with optional selections of certain forms, as is done in other SPARQL queries. The query can be tried on the Wikidata Query Service UI during development :)

Data types to include:

  • [x] Nouns
  • [x] Verbs
  • [ ] Adjectives
  • [ ] Adverbs
  • [ ] Prepositions
  • [x] Emoji keywords

Contribution

This is a feature I am willing to work on

GicharuElvis avatar Oct 08 '24 09:10 GicharuElvis

@GicharuElvis shall I work on this?

Khushalsarode avatar Oct 08 '24 11:10 Khushalsarode

@GicharuElvis made this for themselves, @Khushalsarode, but you can look for another language and make a similar issue for yourself. Maybe German or Spanish still needs adjectives?

andrewtavis avatar Oct 08 '24 14:10 andrewtavis

Thank you, @andrewtavis. I'll work on it. I'll reach out in case of anything

GicharuElvis avatar Oct 08 '24 14:10 GicharuElvis

Sure! No problem, I didn't see anyone assigned, that's why I commented! I will take other issues, @andrewtavis

Khushalsarode avatar Oct 08 '24 15:10 Khushalsarode

No stress, @Khushalsarode! Appreciate your interest! :)

andrewtavis avatar Oct 08 '24 21:10 andrewtavis

Just added a list of data types that we want to include to this issue :) Have marked those that are already done or have PRs open, and we can work on the others 😊 If the data type can't work, then we can move to the others and open up specific issues later :)

andrewtavis avatar Oct 09 '24 08:10 andrewtavis

Thank you, I'll have a look at them

GicharuElvis avatar Oct 09 '24 09:10 GicharuElvis

Some suggestions on how to improve the Swedish adjectives query that was just merged, @GicharuElvis, taken from a post I made in the Data channel on Matrix. This is the process I follow when working on queries, starting from the base query that just returns the Wikidata lexeme ID and lemma:

  • Let's take for example the query for Slovak adjectives src/scribe_data/language_data_extraction/Slovak/adjectives/query_adjectives.sparql
  • Let's take this query over to query.wikidata.org, but with one edit to return the lexeme URI via ?lexeme
    • You can see the edited query in the Wikidata Query Service here
    • Note that it includes ?lexeme
  • Run the query and look at the results
  • Click on the first result, which will be random, and for me was wikidata.org/wiki/Lexeme:L238355
    • This is the Slovak adjective slovenský (the adjective conveniently means Slovak)
    • This is the base adjective, but in Slovak specifically we need to also get forms for the adjectives as they're different based on if the thing is masculine, feminine, etc
    • You can see all of these other forms on the Wikidata page when you scroll down
    • There are forms based on masculine vs. feminine, singular vs. plural and the case that's used
    • These are the forms that we also want to include in our data outputs
  • Not all data types for languages have forms on them, but it helps to check if the forms are there beforehand so we don't miss them
  • An example of a query with tons of forms is src/scribe_data/language_data_extraction/Estonian/adverbs/query_adverbs_1.sparql
    • Slovak is similar to Estonian in this regard: there are many forms and they're complex (combinations of four properties like feminine, singular, etc.)
    • We need to construct forms with an optional selection that includes all the properties that are on the form
    • For the first form in the adjective above slovenský we need to find the Wikidata QIDs for masculine, nominative case, singular, positive
    • Then put these within the optional selection to get the form in a way that the returned value is unique
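
As a rough sketch of the last few steps, the Python below assembles a query with one optional selection for the first form of slovenský. The helper names `optional_form` and `build_query` are made up for illustration and are not part of Scribe-Data, and the QIDs are the ones I'd expect for Slovak, adjective, masculine, nominative case, singular and positive, so please double check them on Wikidata before reusing them.

```python
# Sketch only: helper names are hypothetical, and the QIDs below are
# assumptions that should be verified on Wikidata before real use.
SLOVAK = "Q9058"      # assumed QID for the Slovak language
ADJECTIVE = "Q34698"  # assumed QID for the lexical category "adjective"

# Assumed QIDs for the properties on the first form of slovenský.
FEATURES = {
    "masculine": "Q499327",
    "nominative": "Q131105",
    "singular": "Q110786",
    "positive": "Q3482678",
}


def optional_form(var_name, feature_qids):
    """Return an OPTIONAL block that binds ?var_name to one specific form."""
    features = ", ".join(f"wd:{qid}" for qid in feature_qids)
    return (
        "  OPTIONAL {\n"
        f"    ?lexeme ontolex:lexicalForm ?{var_name}Form .\n"
        f"    ?{var_name}Form ontolex:representation ?{var_name} ;\n"
        f"      wikibase:grammaticalFeature {features} .\n"
        "  }"
    )


def build_query(language_qid, category_qid, lemma_var, forms):
    """Build a query with one column per form; `forms` maps variable name to QIDs."""
    select_vars = "\n".join(f"  ?{name}" for name in forms)
    optionals = "\n\n".join(
        optional_form(name, qids) for name, qids in forms.items()
    )
    return (
        "SELECT\n"
        "  ?lexeme\n"
        f"  ?{lemma_var}\n"
        f"{select_vars}\n\n"
        "WHERE {\n"
        f"  ?lexeme dct:language wd:{language_qid} ;\n"
        f"    wikibase:lexicalCategory wd:{category_qid} ;\n"
        f"    wikibase:lemma ?{lemma_var} .\n\n"
        f"{optionals}\n"
        "}"
    )


query = build_query(
    SLOVAK,
    ADJECTIVE,
    "adjective",
    {"masculineNominativeSingularPositive": list(FEATURES.values())},
)
print(query)
```

The printed query can be pasted into query.wikidata.org, where the wd:, wikibase:, ontolex: and dct: prefixes are already defined. Listing all four properties in the optional selection is what keeps the returned form unique; adding more forms is a matter of adding more entries to the `forms` mapping.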

Note that in queries we ideally would not have SELECT DISTINCT, GROUP BY or anything else to combine the results of the query, i.e. it should be one row per lexeme. We can do this by making sure we include all properties on the form so the selections don't overlap, and we can also filter based on a language or use FILTER NOT EXISTS to say that we don't want a specific property.
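
A small sketch of how this could be checked in practice: the first helper writes a FILTER NOT EXISTS line for a form variable, and the second flags lexemes that come back in more than one row, which usually means an optional selection is missing a property. Both function names are hypothetical, the rows are made-up placeholders rather than real query results, and the plural QID is an assumption to verify on Wikidata.

```python
from collections import Counter

PLURAL = "Q146786"  # assumed QID for "plural"; verify on Wikidata


def exclude_feature(form_var, feature_qid):
    """Return a FILTER NOT EXISTS line rejecting forms that carry feature_qid."""
    return (
        f"FILTER NOT EXISTS {{ ?{form_var} "
        f"wikibase:grammaticalFeature wd:{feature_qid} . }}"
    )


def repeated_lexemes(rows, key="lexeme"):
    """Return the lexeme IDs that appear in more than one result row."""
    counts = Counter(row[key] for row in rows)
    return sorted(lexeme for lexeme, n in counts.items() if n > 1)


# Made-up placeholder rows standing in for downloaded query results.
rows = [
    {"lexeme": "L1", "form": "form-a"},
    {"lexeme": "L1", "form": "form-b"},  # second row for L1: selection not unique
    {"lexeme": "L2", "form": "form-c"},
]

filter_line = exclude_feature("singularForm", PLURAL)
duplicates = repeated_lexemes(rows)
print(filter_line)
print(duplicates)
```

If `repeated_lexemes` returns anything, it's worth opening one of those lexemes on Wikidata and comparing the properties on its forms against the ones listed in the optional selection.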

Hope this helps! Please let us know if there are any questions :) Would be great if you took the above and tried to expand the Swedish adjectives query 😊

andrewtavis avatar Oct 17 '24 22:10 andrewtavis

Thank you, @andrewtavis. Question: at what point do we use the second option over the lexeme?

GicharuElvis avatar Oct 18 '24 08:10 GicharuElvis

Seems everything is all set here

GicharuElvis avatar Oct 19 '24 12:10 GicharuElvis

That it is, @GicharuElvis 😊 Thanks for the work here!

andrewtavis avatar Oct 20 '24 17:10 andrewtavis

Sure, anytime. Let me look for more!

GicharuElvis avatar Oct 20 '24 22:10 GicharuElvis