Open-Assistant Language detection CLI

Adds a way to run language classification on a JSON list file via the CLI. The output is a dictionary with the languages already corrected (use with caution: the model is not 100% accurate, especially on input such as mathematical notation), and a list of dictionaries containing:

predicted lang, the predicted language
confidence, the confidence of the classification as produced by the feature-subsampling process
expected lang, the language that was set in the original message
message_id, the message id associated with that message
text, the text of that message

The CLI supports either training a new model and then running inference on the JSON list, or loading an existing model from a file. To use the CLI you have to set:

"--model" which is the save/load location for the model
"--test_data" which is where the json you want to evaluate lies
"--num_words" which is how many words are used for the character-level statistics. Setting this higher will generally give better results, but will also filter out more examples, since I skip messages with fewer than "num_words" words. Set it to something between 5 and 10.
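To illustrate the filtering effect of "--num_words", here is a minimal sketch. It assumes words are counted by splitting on whitespace, which may differ from what the code in this PR does; the helper name has_enough_words and the sample messages are made up for illustration.

```python
def has_enough_words(text: str, num_words: int) -> bool:
    """Return True if the text has at least `num_words` whitespace-separated words.

    Assumption: words are counted by splitting on whitespace; the PR's
    actual counting rule may differ.
    """
    return len(text.split()) >= num_words


# Hypothetical messages, shaped like entries of the JSON list.
messages = [
    {"message_id": "a", "text": "Hi there"},
    {"message_id": "b", "text": "This message has clearly more than five words"},
]

# Messages shorter than num_words are skipped before classification.
kept = [m for m in messages if has_enough_words(m["text"], num_words=5)]
print([m["message_id"] for m in kept])
```

With num_words=5 only the second message survives, which is why higher values shrink the evaluated set.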

For training you further need to set "--data" which is where the training data lives (I'm using https://lukelindemann.com/wiki_corpus.html). If you just want to load an existing model, use "--load" which will skip the training step and instead use the pretrained model in "--model".
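The flag handling described above could be sketched with argparse roughly as follows. Only the flag names come from this description; the argument types, treating "--load" as a boolean switch, and the check that "--data" is present when training are my assumptions, not necessarily how the script in this PR is written.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names follow the PR description; types and required-ness are assumptions.
    parser = argparse.ArgumentParser(description="Language detection CLI (sketch)")
    parser.add_argument("--model", required=True,
                        help="save/load location for the model")
    parser.add_argument("--test_data", required=True,
                        help="path to the JSON list to evaluate")
    parser.add_argument("--num_words", type=int, required=True,
                        help="number of words used for the character-level statistics")
    parser.add_argument("--data", default=None,
                        help="training data location (only needed for training)")
    parser.add_argument("--load", action="store_true",
                        help="skip training and load the model from --model")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Training needs a corpus; loading a pretrained model does not.
    if not args.load and args.data is None:
        parser.error("--data is required unless --load is given")
    return args


# Example: load an existing model instead of training (file names are placeholders).
args = parse_args(["--model", "model.pkl", "--test_data", "messages.json",
                   "--num_words", "7", "--load"])
print(args.load, args.num_words, args.data)
```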

The CLI outputs a formatted list of all mismatched languages. If you want to use this in another script, use infere_names(model, json_path, converter), where "model" is the loaded model (a convenience method load(modelname) exists), "json_path" is the path to the data to be checked, and "converter" maps full-length language names such as "English" to their language codes such as "en".
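For reference, a "converter" can be as simple as a dictionary lookup. The sketch below is standalone and does not call infere_names; the table NAME_TO_CODE, the helper find_mismatches, and the key names predicted_lang and expected_lang are assumptions for illustration and may not match the exact keys the script emits.

```python
# Hypothetical lookup table from full language names to ISO 639-1 codes.
NAME_TO_CODE = {
    "English": "en",
    "German": "de",
    "French": "fr",
    "Spanish": "es",
}


def converter(name: str) -> str:
    """Map a full language name to its code; unknown names pass through unchanged."""
    return NAME_TO_CODE.get(name, name)


def find_mismatches(records: list) -> list:
    """Keep only records whose converted prediction differs from the expected code."""
    return [
        r for r in records
        if converter(r["predicted_lang"]) != r["expected_lang"]
    ]


# Made-up records shaped like the list entries described earlier.
records = [
    {"message_id": "1", "text": "Guten Morgen zusammen",
     "predicted_lang": "German", "expected_lang": "en", "confidence": 0.9},
    {"message_id": "2", "text": "Good morning everyone",
     "predicted_lang": "English", "expected_lang": "en", "confidence": 0.8},
]

for r in find_mismatches(records):
    print(r["message_id"], converter(r["predicted_lang"]), r["expected_lang"])
```

Passing unknown names through unchanged is a design choice of this sketch; raising an error instead would surface missing table entries earlier.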

Feb 04 '23 14:02 MattAlexMiracle

:x: pre-commit failed. Please run pre-commit run --all-files locally and commit the changes. Find more information in the repository's CONTRIBUTING.md

Feb 04 '23 14:02 github-actions[bot]

I think this code might be dead - a file with the same name was added, which does things differently. So now the merge conflict is as large as the changes added :(

Might need to close this?

Feb 20 '23 02:02 bitplane

Yeah, this version now conflicts with a different version I already merged before this one was accepted. Will reopen this with the necessary merges done on my end

Feb 20 '23 10:02 MattAlexMiracle
