Open-Assistant
Language detection CLI
Method to run language classification on a JSON list file via the CLI. The output is an already-fixed dictionary (use with caution: the model is not 100% accurate, especially on content such as mathematical notation) and a list of dictionaries, each containing:
- predicted lang, the predicted language
- confidence, the confidence of the classification as produced by the feature-subsampling process
- expected lang, the language that was set in the original message
- message_id, the message id associated with that message
- text, the text of that message
The CLI is built to allow either training a new model and running inference on the JSON list, or loading an existing model from a file. To use the CLI you have to set:
- "--model" which is the save/load location for the model
- "--test_data" which is where the json you want to evaluate lies
- "--num_words" which is how many words are supposed to be used for the character-level statistics. Setting this higher will generally give better results, but it will also filter out more examples, since I skip messages with fewer than "num_words" words. Set it to something between 5 and 10.
For training you further need to set "--data" which is where the training data lives (I'm using https://lukelindemann.com/wiki_corpus.html). If you just want to load an existing model, use "--load" which will skip the training step and instead use the pretrained model in "--model".
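To make the flag combinations above concrete, here is a minimal argparse sketch. It only mirrors the flags named in this description; the function names `build_parser` and `parse_args`, the help strings, and the rule that "--data" is mandatory unless "--load" is given are my assumptions, not the script's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative only: flag names come from the description above,
    # everything else (types, help texts) is assumed.
    parser = argparse.ArgumentParser(description="Language detection CLI (sketch)")
    parser.add_argument("--model", required=True,
                        help="save/load location for the model")
    parser.add_argument("--test_data", required=True,
                        help="path to the JSON list to evaluate")
    parser.add_argument("--num_words", type=int, required=True,
                        help="words used for the character-level statistics (5-10 suggested)")
    parser.add_argument("--data", default=None,
                        help="location of the training data (needed for training)")
    parser.add_argument("--load", action="store_true",
                        help="skip training and use the model stored at --model")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed rule: training needs --data, loading does not.
    if not args.load and args.data is None:
        parser.error("--data is required unless --load is given")
    return args
```

With this sketch, `parse_args(["--model", "m.bin", "--test_data", "t.json", "--num_words", "7", "--load"])` would describe the load-only path, while dropping "--load" and adding "--data" would describe the training path.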
The CLI outputs a formatted list of all mismatched languages.
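As a rough illustration of what "mismatched" means here, the sketch below filters a list of result dictionaries down to those where the predicted language differs from the expected one. The key names (`predicted_lang`, `expected_lang`, `confidence`, `message_id`, `text`) are assumed spellings of the fields listed above, and `find_mismatches` and `format_mismatches` are hypothetical helpers, not functions from this PR.

```python
from typing import Dict, List


def find_mismatches(records: List[Dict]) -> List[Dict]:
    # Keep only records whose predicted language differs from the one
    # set in the original message. Key names are assumed.
    return [r for r in records if r["predicted_lang"] != r["expected_lang"]]


def format_mismatches(records: List[Dict]) -> str:
    # One line per mismatch: message id, both languages, confidence.
    lines = [
        f'{r["message_id"]}: expected {r["expected_lang"]}, '
        f'predicted {r["predicted_lang"]} (confidence {r["confidence"]:.2f})'
        for r in find_mismatches(records)
    ]
    return "\n".join(lines)


# Made-up example records, for illustration only.
example = [
    {"predicted_lang": "en", "confidence": 0.91, "expected_lang": "en",
     "message_id": "a1", "text": "Hello there, how are you doing today?"},
    {"predicted_lang": "de", "confidence": 0.64, "expected_lang": "en",
     "message_id": "b2", "text": "Ich habe eine Frage zu diesem Thema."},
]
```

Because the model is not fully accurate, a mismatch is a candidate for review rather than proof that the original label was wrong.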
If you want to use this in another script, use infere_names(model,json_path,converter), where "model" is the loaded model (a convenience method load(modelname) exists), "json_path" is the location of the data to be checked, and "converter" converts full-length language names, such as "English", to their language codes, such as "en".
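For the "converter" argument, something along these lines might work. This is a sketch under assumptions: the description does not say whether the converter should be a callable or a mapping, so `name_to_code` and the `LANGUAGE_CODES` table are hypothetical and only cover a few languages.

```python
from typing import Optional

# Hypothetical lookup table; extend it with whatever languages the corpus uses.
LANGUAGE_CODES = {
    "English": "en",
    "German": "de",
    "French": "fr",
    "Spanish": "es",
}


def name_to_code(name: str) -> Optional[str]:
    # Normalise whitespace and casing before the lookup;
    # returns None for names that are not in the table.
    return LANGUAGE_CODES.get(name.strip().title())


# Assumed usage inside another script (not verified against this PR):
#   model = load("path/to/model")
#   results = infere_names(model, "messages.json", name_to_code)
```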
:x: pre-commit failed.
Please run pre-commit run --all-files locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md
I think this code might be dead - a file with the same name was added, which does things differently. So now the merge conflict is as large as the changes added :(
Might need to close this?
Yeah, this version now conflicts with a different version I already merged before this one was accepted. I will reopen this with the necessary merges done on my end.