Scribe-Data
Update languages metadata file and use of it throughout project
Terms
- [X] I have searched open and closed feature requests
- [X] I agree to follow Scribe-Data's Code of Conduct
Description
As of now the Scribe-Data CLI options are determined based on the language_metadata.json file. To make maintenance of the package easier, it would be great if the options of the CLI were instead determined by the directory structure of src/scribe_data/language_data_extraction so that the code doesn't need to be updated each time new queries are being added in.
Of key importance is also that the options of the CLI would allow for dialects as well, so for Norwegian we'd like to see Norwegian - Bokmål and Norwegian - Nynorsk, for example. How this will be achieved is open for discussion!
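As a starting point for that discussion, here is a minimal sketch of how CLI options could be read from the directory tree. The function name list_cli_languages is hypothetical, and the layout is an assumption: each language directory holds data-type directories (for example nouns), while a language with dialects holds one directory per dialect, each with its own data-type directories. A child directory is treated as a dialect only if it contains further directories, which is a heuristic and not something the project defines.

```python
from pathlib import Path


def list_cli_languages(extraction_dir: Path) -> list[str]:
    """Infer CLI language options from a directory tree (hypothetical helper).

    Assumed layout (not confirmed by the project):
        <extraction_dir>/English/nouns/...
        <extraction_dir>/Norwegian/Bokmål/nouns/...
    """
    options = []
    for lang_dir in sorted(p for p in extraction_dir.iterdir() if p.is_dir()):
        # Heuristic: a child directory counts as a dialect only if it
        # contains further directories (its own data-type directories).
        dialects = [
            child
            for child in sorted(lang_dir.iterdir())
            if child.is_dir() and any(g.is_dir() for g in child.iterdir())
        ]
        if dialects:
            options.extend(f"{lang_dir.name} - {d.name}" for d in dialects)
        else:
            options.append(lang_dir.name)
    return options
```

With that assumed layout, a Norwegian directory containing Bokmål and Nynorsk would yield the two options Norwegian - Bokmål and Norwegian - Nynorsk. Note that nothing in this sketch supplies ISO codes or QIDs, so it only covers the naming side of the problem.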
Contribution
Happy to discuss how best to read in dialect sub directories and review the changes here when the PR is up!
I'm interested in this issue. 😃
This would be a really good one for you, @OmarAI2003 😊 Let us know if you have any questions!
Replacing the dependency on language_metadata.json for getting the language names by using the language_data_extraction folder structure seems feasible. However, I'm not sure how to handle the other properties in the JSON file, like iso, qid, remove-words, etc. Would it make sense to include these properties somehow, or should we consider another approach? I'm not sure if this is right, but I would love to get more input!
In talking about this a bit, @OmarAI2003, we might not be able to do this. @SethiShreya and I were talking and as you said we need the QIDs as well so that we can do calls for the CLI based on QIDs as well. Without a central store of languages and their QIDs, maybe it can't work?
Maybe we could use the directory structure just for language names, but still keep language_metadata.json for properties like QIDs? Not sure if this would help, but happy to hear your thoughts!
It's an interesting idea, but say that we rely on the directory structure and a QID then doesn't get added for a new language: some functionality would be broken 🤔
So will this issue be closed, or is there anything that still needs to be addressed?
I'm thinking that for this one we can rework the languages metadata file instead? I don't think we need the header key for it or the "languages" key where all the languages are. You can remove the header and put all the language objects at the top level. You can also remove all of the keys that aren't the language name, iso and qid. Then from there we need to rework the references to this metadata file throughout the project and fix the tests 😇
How does this sound, @OmarAI2003? :)
Sounds nice @andrewtavis, but I will need to engage in several discussions here and there along the way to make sure I'm on the same page.
Sure thing, @OmarAI2003! Just start with getting the file down to just objects with languages, ISO-2s and QIDs at the base level, and then we can discuss from there. Happy to help as needed!
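A rough sketch of that first step, under stated assumptions: the old file is assumed to have a "languages" list whose entries carry a "language" name key next to iso, qid and other keys such as remove-words. That layout and the function name flatten_language_metadata are guesses for illustration, not taken from the repository.

```python
import json

# Keys that should survive the clean-up, per the discussion above.
KEPT_KEYS = ("iso", "qid")


def flatten_language_metadata(old: dict) -> dict:
    """Move language objects to the top level, keeping only iso and qid.

    Assumes the old layout is {"languages": [{"language": ..., ...}, ...]};
    this is a hypothetical layout, adjust it to the real file.
    """
    flat = {}
    for entry in old.get("languages", []):
        name = entry["language"].lower()
        flat[name] = {key: entry[key] for key in KEPT_KEYS if key in entry}
    return flat


# Illustrative input only; the header text and word list are made up.
old_metadata = {
    "description": "Header text that would be dropped",
    "languages": [
        {"language": "English", "iso": "en", "qid": "Q1860", "remove-words": ["of"]},
    ],
}

flat = flatten_language_metadata(old_metadata)
print(json.dumps(flat, indent=2))  # {"english": {"iso": "en", "qid": "Q1860"}}
```

Keying the top level by language name makes lookups by name direct, while lookups by QID or ISO code remain a simple scan over the values.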
Hi @andrewtavis, with the languages header removed, what will the full language_metadata file look like?
@OmarAI2003, can you send along a snippet of the current version of the file so we can all take a look? :)
This is the current version of the JSON file. Don't worry about the sub-language file paths: there will be a format_sublanguage_name function in utils.py that returns a language's name relative to its directory. For example, passing a Norwegian sub-language such as 'Bokmål' to format_sublanguage_name('Bokmål', language_metadata) will return the capitalized directory path Norwegian/Bokmål, while regular languages will be returned as they are, but capitalized. There will also be a list_all_languages function that lists all queryable languages and sub-languages.
language_metadata.json
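To make the two helpers described above concrete, here is a minimal sketch of how they might behave. It is not the code merged in the project: the "sub_languages" key, the sample metadata values and the error handling are all assumptions for illustration, and only the function names and the Norwegian/Bokmål behavior come from the comment above.

```python
def format_sublanguage_name(lang: str, language_metadata: dict) -> str:
    """Return 'Main/Sub' for a sub-language, else the capitalized language."""
    target = lang.lower()
    for main_lang, props in language_metadata.items():
        if target == main_lang.lower():
            return main_lang.capitalize()
        # "sub_languages" is an assumed key name for nested dialect entries.
        for sub_lang in props.get("sub_languages", {}):
            if target == sub_lang.lower():
                return f"{main_lang.capitalize()}/{sub_lang.capitalize()}"
    raise ValueError(f"{lang!r} is not in the language metadata")


def list_all_languages(language_metadata: dict) -> list[str]:
    """List every queryable language, expanding sub-languages."""
    names = []
    for main_lang, props in language_metadata.items():
        sub_languages = props.get("sub_languages")
        if sub_languages:
            names.extend(sub_languages)
        else:
            names.append(main_lang)
    return sorted(names)


# Illustrative metadata, not copied from the attached file.
sample_metadata = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb", "qid": "Q25167"},
            "nynorsk": {"iso": "nn", "qid": "Q25164"},
        }
    },
}

print(format_sublanguage_name("Bokmål", sample_metadata))   # Norwegian/Bokmål
print(format_sublanguage_name("english", sample_metadata))  # English
print(list_all_languages(sample_metadata))  # ['bokmål', 'english', 'nynorsk']
```

In this sketch the parent language itself is not listed when it has sub-languages, on the assumption that only the dialects are queryable; whether that matches the real behavior is an open question.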
Thank you. Sounds great 😃
Closed by #402 :) Thanks for the great work @OmarAI2003 and for the great conversation all!
You're welcome! It was a great experience working on this, and I appreciate all the valuable feedback and discussions.