Scribe-Data Update languages metadata file and use of it throughout project

Terms

[X] I have searched open and closed feature requests
[X] I agree to follow Scribe-Data's Code of Conduct

Description

As of now the Scribe-Data CLI options are determined based on the language_metadata.json file. To make maintenance of the package easier, it would be great if the options of the CLI were instead determined by the directory structure of src/scribe_data/language_data_extraction so that the code doesn't need to be updated each time new queries are being added in.

Of key importance is also that the options of the CLI would allow for dialects as well, so for Norwegian we'd like to see Norwegian - Bokmål and Norwegian - Nynorsk, for example. How this will be achieved is open for discussion!
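As a starting point for the discussion, here is a minimal sketch of how the options could be read from the directory tree. It assumes a layout like `English/nouns/...` and `Norwegian/Bokmål/nouns/...`, and it treats a child directory as a dialect only if it contains further directories. The function name `list_cli_language_options` and that heuristic are hypothetical, not something already in Scribe-Data.

```python
import tempfile
from pathlib import Path


def list_cli_language_options(extraction_dir: Path) -> list[str]:
    """Derive CLI language options from a directory tree (hypothetical helper).

    Assumption: a child of a language directory is a dialect if it contains
    at least one sub-directory itself (e.g. Norwegian/Bokmål/nouns), while
    data type directories (e.g. English/nouns) only contain files.
    """
    options = []
    for language_dir in sorted(p for p in extraction_dir.iterdir() if p.is_dir()):
        dialect_dirs = sorted(
            child
            for child in language_dir.iterdir()
            if child.is_dir() and any(g.is_dir() for g in child.iterdir())
        )
        if dialect_dirs:
            # One option per dialect, in the "Language - Dialect" form.
            options.extend(f"{language_dir.name} - {d.name}" for d in dialect_dirs)
        else:
            options.append(language_dir.name)
    return options


if __name__ == "__main__":
    # Build a throwaway tree that mimics the assumed layout.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel in ("English/nouns", "Norwegian/Bokmål/nouns", "Norwegian/Nynorsk/verbs"):
            (root / rel).mkdir(parents=True)
        print(list_cli_language_options(root))
        # ['English', 'Norwegian - Bokmål', 'Norwegian - Nynorsk']
```

The "contains further directories" check is only one possible way to tell dialects from data types; an explicit marker would be less fragile if the layout ever changes.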

Contribution

Happy to discuss how best to read in dialect sub directories and review the changes here when the PR is up!

Oct 09 '24 10:10 andrewtavis

I'm interested in this issue. 😃

Oct 09 '24 19:10 OmarAI2003

This would be a really good one for you, @OmarAI2003 😊 Let us know if you have any questions!

Oct 09 '24 19:10 andrewtavis

Replacing the dependency on language_metadata.json for getting the language names by using the language_data_extraction folder structure seems feasible. However, I’m not sure how to handle the other properties in the JSON file like iso, qid, remove-words, etc. Would it make sense to include these properties somehow, or should we consider another approach? I'm not sure if this is right, but I would love to get more input!

Oct 10 '24 11:10 OmarAI2003

In talking about this a bit, @OmarAI2003, we might not be able to do this. @SethiShreya and I were talking and as you said we need the QIDs as well so that we can do calls for the CLI based on QIDs as well. Without a central store of languages and their QIDs, maybe it can't work?

Oct 10 '24 12:10 andrewtavis

Maybe we could use the directory structure just for language names, but still keep language_metadata.json for properties like QIDs? Not sure if this would help, but happy to hear your thoughts!

Oct 10 '24 14:10 OmarAI2003

It's an interesting idea, but say we rely on the directory structure and a QID doesn't get added for a new language: then some functionality is broken 🤔
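To make the concern concrete, a check along these lines could flag directory languages that have no QID. This is only a sketch: it assumes the metadata is a dict keyed by lowercase language name with a `qid` field, and `find_languages_missing_qids` is a hypothetical name.

```python
def find_languages_missing_qids(directory_languages, metadata):
    """Return directory-derived language names that lack a QID (hypothetical helper).

    Assumption: metadata maps lowercase language names to dicts with a "qid" key.
    """
    return sorted(
        lang
        for lang in directory_languages
        if not metadata.get(lang.lower(), {}).get("qid")
    )


metadata = {
    "english": {"iso": "en", "qid": "Q1860"},
    "french": {"iso": "fr", "qid": "Q150"},
}

# "Basque" has a directory but no metadata entry, so it gets reported.
print(find_languages_missing_qids(["English", "French", "Basque"], metadata))
# ['Basque']
```

A check like this would catch the gap in tests, but it doesn't remove the need for a central store of QIDs.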

Oct 10 '24 17:10 andrewtavis

So will this issue be closed, or is there anything that still needs to be addressed?

Oct 11 '24 16:10 OmarAI2003

I'm thinking that for this one we can convert the functionality of the languages metadata file? I don't think we need the header key for it or the "languages" key where all the languages are? You can remove the header and put all the language objects at the top level. You can also remove all of the keys that aren't the language name, iso and qid? Then from there we need to rework the references to this metadata file throughout the project and fix the tests 😇
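Roughly, the conversion could look like the sketch below. The old layout shown here (a header plus a "languages" list whose entries have "language", "iso", "qid" and extra keys) is an assumption for illustration, and `flatten_language_metadata` is a hypothetical helper name.

```python
def flatten_language_metadata(old_metadata: dict) -> dict:
    """Move language objects to the top level, keeping only iso and qid.

    Assumption: old_metadata["languages"] is a list of dicts that each
    have "language", "iso" and "qid" keys, plus keys we want to drop.
    """
    return {
        entry["language"]: {"iso": entry["iso"], "qid": entry["qid"]}
        for entry in old_metadata["languages"]
    }


# Assumed shape of the old file, trimmed to two languages.
old_metadata = {
    "description": {"entry": "hypothetical header content"},
    "languages": [
        {"language": "english", "iso": "en", "qid": "Q1860", "remove-words": ["of"]},
        {"language": "french", "iso": "fr", "qid": "Q150", "remove-words": []},
    ],
}

print(flatten_language_metadata(old_metadata))
# {'english': {'iso': 'en', 'qid': 'Q1860'}, 'french': {'iso': 'fr', 'qid': 'Q150'}}
```

With the language name as the key, lookups like `metadata["english"]["qid"]` no longer need to loop over a list.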

How does this sound, @OmarAI2003? :)

Oct 11 '24 16:10 andrewtavis

Sounds nice @andrewtavis, but I will need to engage in several discussions here and there along the way to make sure I'm on the same page.

Oct 11 '24 17:10 OmarAI2003

Sure thing, @OmarAI2003! Just start with getting the file down to just objects with languages, ISO-2s and QIDs at the base level, and then we can discuss from there. Happy to help as needed!

Oct 11 '24 23:10 andrewtavis

hi @andrewtavis with the languages header removed, what will the full language_metadata file look like?

Oct 15 '24 11:10 catreedle

@OmarAI2003, can you send along a snippet of the current version of the file so we can all take a look? :)

Oct 15 '24 12:10 andrewtavis

This is the current version of the JSON file. Don't worry about the sub-language file paths: there will be a format_sublanguage_name function in utils.py that returns a language's name relative to its directory. For example, for a Norwegian sub-language like 'Bokmål', calling format_sublanguage_name("Bokmål", language_metadata) will return the capitalized language directory, Norwegian/Bokmål, while regular languages will be returned as they are but capitalized. There will also be a list_all_languages function for listing all queryable languages and sub-languages. language_metadata.json
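For anyone following along, here is a rough sketch of how the two functions might behave. It is not the code from the PR: the "sub_languages" key, the error handling and the sample ISO/QID values for the Norwegian entries are assumptions made for illustration.

```python
def format_sublanguage_name(lang: str, language_metadata: dict) -> str:
    """Return "Main/Sub" for a sub-language, else the capitalized language name.

    Assumption: sub-languages live under a "sub_languages" key of their
    main language, and all metadata keys are lowercase.
    """
    wanted = lang.lower()
    for main_lang, lang_data in language_metadata.items():
        if main_lang == wanted:
            return lang.capitalize()
        if wanted in lang_data.get("sub_languages", {}):
            return f"{main_lang.capitalize()}/{lang.capitalize()}"
    raise ValueError(f"{lang} is not a supported language or sub-language.")


def list_all_languages(language_metadata: dict) -> list[str]:
    """List queryable languages, expanding main languages into sub-languages."""
    languages = []
    for main_lang, lang_data in language_metadata.items():
        sub_languages = lang_data.get("sub_languages")
        if sub_languages:
            languages.extend(sub_languages)
        else:
            languages.append(main_lang)
    return sorted(languages)


# Illustrative metadata; the Norwegian values are sample data, not verified here.
language_metadata = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
}

print(format_sublanguage_name("Bokmål", language_metadata))   # Norwegian/Bokmål
print(format_sublanguage_name("english", language_metadata))  # English
print(list_all_languages(language_metadata))  # ['bokmål', 'english', 'nynorsk']
```

In this sketch a main language that only exists through its sub-languages (like "norwegian") is not listed on its own, which matches the idea that only the sub-languages are queryable.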

Oct 15 '24 15:10 OmarAI2003

Thank you. Sounds great 😃

Oct 16 '24 07:10 catreedle

Closed by #402 :) Thanks for the great work @OmarAI2003 and for the great conversation all!

Oct 18 '24 01:10 andrewtavis

You're welcome! It was a great experience working on this, and I appreciate all the valuable feedback and discussions.

Oct 18 '24 05:10 OmarAI2003
