Update languages metadata file and use of it throughout project

Open andrewtavis opened this issue 1 year ago • 10 comments
Description

As of now, the Scribe-Data CLI options are determined by the language_metadata.json file. To make the package easier to maintain, it would be great if the CLI options were instead determined by the directory structure of src/scribe_data/language_data_extraction, so that the code doesn't need to be updated each time new queries are added.

Of key importance is also that the options of the CLI would allow for dialects as well, so for Norwegian we'd like to see Norwegian - Bokmål and Norwegian - Nynorsk, for example. How this will be achieved is open for discussion!

Contribution

Happy to discuss how best to read in dialect sub directories and review the changes here when the PR is up!

andrewtavis avatar Oct 09 '24 10:10 andrewtavis
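
Editorial sketch of the idea described above, not code from the Scribe-Data repository: one way CLI options could be derived from a directory tree, including dialects. The function name `list_cli_languages` and the heuristic for telling a dialect directory from a data-type directory are assumptions made for illustration; the issue itself leaves the approach open.

```python
from pathlib import Path


def list_cli_languages(root: Path) -> list[str]:
    """Return language names found under root, expanding dialects.

    Assumed layout (hypothetical):
        root/English/nouns/<query files>
        root/Norwegian/Bokmål/nouns/<query files>
    """
    names = []
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Assumed heuristic: a child directory that itself contains
        # directories is a dialect; a child holding only files is a
        # data type such as nouns or verbs.
        dialects = [
            child
            for child in sorted(lang_dir.iterdir())
            if child.is_dir() and any(g.is_dir() for g in child.iterdir())
        ]
        if dialects:
            names.extend(f"{lang_dir.name} - {d.name}" for d in dialects)
        else:
            names.append(lang_dir.name)
    return names
```

Under these assumptions, calling `list_cli_languages(Path("src/scribe_data/language_data_extraction"))` would list a language with dialect subdirectories as entries such as "Norwegian - Bokmål", and every other language by its directory name.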

I'm interested in this issue. 😃

OmarAI2003 avatar Oct 09 '24 19:10 OmarAI2003

This would be a really good one for you, @OmarAI2003 😊 Let us know if you have any questions!

andrewtavis avatar Oct 09 '24 19:10 andrewtavis

Replacing the dependency on language_metadata.json for getting the language names by using the language_data_extraction folder structure seems feasible. However, I'm not sure how to handle the other properties in the JSON file, like iso, qid, remove-words, etc. Would it make sense to include these properties somehow, or should we consider another approach? I'm not sure if this is right, but I would love to get more input!

OmarAI2003 avatar Oct 10 '24 11:10 OmarAI2003

After talking about this a bit, @OmarAI2003, we might not be able to do this. @SethiShreya and I were talking and, as you said, we need the QIDs too so that the CLI can make calls based on them. Without a central store of languages and their QIDs, maybe it can't work?

andrewtavis avatar Oct 10 '24 12:10 andrewtavis

Maybe we could use the directory structure just for language names, but still keep language_metadata.json for properties like QIDs? Not sure if this would help, but happy to hear your thoughts!

OmarAI2003 avatar Oct 10 '24 14:10 OmarAI2003

It's an interesting idea, but say we rely on the directory structure and a QID doesn't get added: then some functionality is broken 🤔

andrewtavis avatar Oct 10 '24 17:10 andrewtavis

So will this issue be closed, or is there anything that still needs to be addressed?

OmarAI2003 avatar Oct 11 '24 16:10 OmarAI2003

I'm thinking that for this one we can convert the functionality of the languages metadata file? I don't think we need the header key for it or the "languages" key where all the languages are? You can remove the header and put all the language objects at the top level. You can also remove all of the keys that aren't the language name, iso and qid? Then from there we need to rework the references to this metadata file throughout the project and fix the tests 😇

How does this sound, @OmarAI2003? :)

andrewtavis avatar Oct 11 '24 16:10 andrewtavis

Sounds nice @andrewtavis, but I will need to engage in several discussions here and there along the way to make sure I'm on the same page.

OmarAI2003 avatar Oct 11 '24 17:10 OmarAI2003

Sure thing, @OmarAI2003! Just start with getting the file down to just objects with languages, ISO-2s and QIDs at the base level, and then we can discuss from there. Happy to help as needed!

andrewtavis avatar Oct 11 '24 23:10 andrewtavis
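
Editorial sketch of the flattening step discussed above. The thread does not show the original file, so the old layout here (a header key named `description`, and entries under `languages` with a `language` name field) is an assumption, as is the choice to key the flattened result by language name. Only `iso`, `qid`, and `remove-words` are property names mentioned in the thread.

```python
import json


def flatten_metadata(old: dict) -> dict:
    """Drop the header and the "languages" wrapper, keeping only iso and qid."""
    flat = {}
    for entry in old["languages"]:
        # Key each object by its language name (assumed field: "language").
        flat[entry["language"].lower()] = {
            "iso": entry["iso"],
            "qid": entry["qid"],
        }
    return flat


# Hypothetical old-style metadata, for illustration only.
old_metadata = {
    "description": "illustrative header",
    "languages": [
        {"language": "english", "iso": "en", "qid": "Q1860", "remove-words": ["of", "the"]},
        {"language": "german", "iso": "de", "qid": "Q188", "remove-words": ["der", "die"]},
    ],
}

new_metadata = flatten_metadata(old_metadata)
print(json.dumps(new_metadata, indent=2))
```

With the file in this shape, a lookup such as `new_metadata["german"]["qid"]` replaces a search through a list of language objects, which is the kind of reference that would then need updating throughout the project and its tests.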

Hi @andrewtavis, with the languages header removed, what will the full language_metadata file look like?

catreedle avatar Oct 15 '24 11:10 catreedle

@OmarAI2003, can you send along a snippet of the current version of the file so we can all take a look? :)

andrewtavis avatar Oct 15 '24 12:10 andrewtavis

This is the current version of the JSON file: language_metadata.json. Don't worry about the file paths for sub-languages: there will be a format_sublanguage_name function in utils.py that returns a language's name relative to its directory. For example, calling format_sublanguage_name("Bokmål", language_metadata) for the Norwegian sub-language Bokmål will return the capitalized directory path Norwegian/Bokmål, while regular languages will be returned as they are, but capitalized. There will also be a list_all_languages function for listing all queryable languages and sub-languages.

OmarAI2003 avatar Oct 15 '24 15:10 OmarAI2003
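
Editorial sketch of the two helpers described above. The function names and the Norwegian/Bokmål behavior come from the comment; the bodies, the `sub_languages` key, and the sample metadata values are assumptions for illustration and are not the code merged for this issue.

```python
# Hypothetical metadata shape: sub-languages nested under their parent.
LANGUAGE_METADATA = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},  # illustrative values
            "nynorsk": {"iso": "nn", "qid": "Q25164"},  # illustrative values
        }
    },
}


def format_sublanguage_name(lang: str, language_metadata: dict) -> str:
    """Return 'Parent/Sub' for a sub-language, else the capitalized name."""
    target = lang.lower()
    for main_lang, data in language_metadata.items():
        if main_lang == target:
            return main_lang.capitalize()
        if target in data.get("sub_languages", {}):
            return f"{main_lang.capitalize()}/{target.capitalize()}"
    raise ValueError(f"{lang} is not a known language or sub-language.")


def list_all_languages(language_metadata: dict) -> list[str]:
    """List every queryable language, replacing parents by their sub-languages."""
    names = []
    for main_lang, data in language_metadata.items():
        subs = data.get("sub_languages")
        names.extend(subs if subs else [main_lang])
    return sorted(names)
```

In this sketch, `format_sublanguage_name("Bokmål", LANGUAGE_METADATA)` yields the directory-style name, and a parent such as Norwegian does not appear in `list_all_languages` because only its sub-languages are treated as queryable.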

Thank you. Sounds great 😃

catreedle avatar Oct 16 '24 07:10 catreedle

Closed by #402 :) Thanks for the great work @OmarAI2003 and for the great conversation all!

andrewtavis avatar Oct 18 '24 01:10 andrewtavis

You're welcome! It was a great experience working on this, and I appreciate all the valuable feedback and discussions.

OmarAI2003 avatar Oct 18 '24 05:10 OmarAI2003