NeuroNER

Steps to utilize NeuroNER for other languages

Open sooheon opened this issue 7 years ago • 10 comments

It appears that BRAT at least is pretty language-agnostic. The English-specific parts of NeuroNER (as far as I can tell) are the recommended glove.6B.100d word vectors and all of the spaCy-related tokenizing code, which is used to translate BRAT format into CoNLL format (correct?)

Am I correct that if I:

  1. Supply Korean word vectors in /data/word_vectors
  2. Supply CoNLL-formatted train, valid, and test data using BRAT-labeled Korean text which I run through my own tokenizer

I will be able to train and use NeuroNER for Korean text?

sooheon avatar Jul 04 '17 02:07 sooheon
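As a concrete illustration of step 2, the CoNLL-style file could be produced roughly like this. The column layout below (token and BIO label per line, blank line between sentences) is an assumption for illustration; compare it against NeuroNER's own example data before training, since NeuroNER expects specific columns:

```python
# Sketch: serialize tokenized, BRAT-labeled Korean text into a
# CoNLL-style layout. Columns here are an assumption (token + BIO label);
# check NeuroNER's bundled example dataset for the exact format it reads.

def to_conll(sentences):
    """sentences: list of sentences, each a list of (token, bio_label);
    a blank line separates sentences."""
    lines = []
    for sent in sentences:
        for token, label in sent:
            lines.append(f"{token} {label}")
        lines.append("")  # sentence boundary
    return "\n".join(lines)

# Toy example with tokens from a hypothetical Korean tokenizer:
example = [[("삼성", "B-ORG"), ("전자", "I-ORG"), ("는", "O")]]
print(to_conll(example))
```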

Correct! Note that providing word vectors is optional (it's typically better if you have some), and that I haven't tested NeuroNER with languages other than English. I know someone successfully used it in French (after an encoding fix PR :)), and someone was supposed to try with Bangladeshi but I haven't heard back from him.


Franck-Dernoncourt avatar Jul 04 '17 02:07 Franck-Dernoncourt
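On the word-vectors step: GloVe-style files are plain text, one token per line followed by its embedding values, space-separated. A quick sanity check before dropping a Korean file into data/word_vectors (the helper name and toy vectors below are illustrative, not part of NeuroNER):

```python
# Sketch: verify a GloVe-style word-vector file has a consistent
# dimension, which should then match token_embedding_dimension in the
# NeuroNER configuration.

def check_vectors(lines, expected_dim=None):
    """lines: iterable of 'token v1 v2 ... vd' strings; returns d."""
    dims = set()
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank/malformed lines
        dims.add(len(parts) - 1)
    assert len(dims) == 1, f"inconsistent dimensions: {dims}"
    dim = dims.pop()
    if expected_dim is not None:
        assert dim == expected_dim, f"expected {expected_dim}, got {dim}"
    return dim

# Toy 3-dimensional Korean vectors (illustrative values only):
lines = ["한국 0.1 0.2 0.3", "서울 0.4 0.5 0.6"]
print(check_vectors(lines))  # 3
```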

Hi (I'm the guy who uses NeuroNER in French)! Those 2 steps are right, but you also need spaCy (or NLTK) working in Korean. To explain a bit more for the spaCy case: you need a spaCy Korean model, which consists of a tokenizer and a POS tagging model. Someone asked exactly this question: https://github.com/explosion/spaCy/issues/929 Then you will have to change spacylanguage in parameters.ini. I hope I'm clear; if not, feel free to ask.

Steps (for spaCy) for a language X:

  • Check whether NLTK or spaCy fully supports your language X (https://github.com/explosion/spaCy#spacy-industrial-strength-nlp)
    • If not, add your language: https://spacy.io/docs/usage/adding-languages (1-2 weeks)
  • Supply X word vectors in /data/word_vectors
  • Supply CoNLL-formatted (or BRAT-formatted, in a directory) train, valid, and test data using BRAT-labeled X text run through your own tokenizer
  • Change parameters.ini: token_pretrained_embedding_filepath, token_embedding_dimension, spacylanguage, dataset_text_folder
  • Run main.py

Gregory-Howard avatar Jul 06 '17 08:07 Gregory-Howard
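The parameters.ini step above might look roughly like this. The key names come from the comment; the paths, the `ko` language code, and the dimension value are placeholders, and each key should stay in whatever section your copy of parameters.ini already puts it in:

```ini
; Illustrative values only -- adapt paths/values to your setup and keep
; the keys in their original sections of parameters.ini.
dataset_text_folder = ../data/my_language_dataset
token_pretrained_embedding_filepath = ../data/word_vectors/my_vectors.txt
token_embedding_dimension = 100
spacylanguage = ko
```

Note that token_embedding_dimension must match the dimension of the supplied word vectors, and spacylanguage only helps if spaCy actually ships a model for that language.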

Thanks for the additional detail! That looks perfectly doable.

sooheon avatar Jul 06 '17 14:07 sooheon

I don't understand what exactly spaCy (or NLTK) does in NeuroNER. I think spaCy is used as a tokenizer. Do we need a language-specific tokenizer? And why do we need a POS tagging model? Can't we just use NLTK for tokenization?

ersinyar avatar Feb 22 '18 12:02 ersinyar

spaCy is used in this file: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20 The problem here is for span in document.sents: this method needs a model to work. I think if we change the code a bit, we might only need a tokenizer.

Gregory-Howard avatar Feb 23 '18 16:02 Gregory-Howard
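To make the suggestion above concrete: what the BRAT-to-CoNLL conversion really needs from document.sents is sentence groups of tokens with character offsets, so BRAT's stand-off annotations can be aligned with the text. A model-free stand-in could look like this (the splitting rules are naive placeholders, not NeuroNER code; a real replacement should also split punctuation off tokens):

```python
import re

# Sketch: replace spaCy's model-dependent `document.sents` with a
# regex-based splitter that still reports character offsets per token.

def sentences_with_offsets(text):
    """Yield one list of (token, start, end) per sentence."""
    # Naive sentence split on ., !, ? -- placeholder only.
    for sent in re.finditer(r"[^.!?]+[.!?]?", text):
        tokens = [(m.group(),
                   sent.start() + m.start(),
                   sent.start() + m.end())
                  for m in re.finditer(r"\S+", sent.group())]
        if tokens:
            yield tokens

text = "Hello world. Second sentence."
for sent in sentences_with_offsets(text):
    print(sent)
```

The offsets are global into the original text, which is the property the stand-off alignment depends on: text[start:end] must equal the token.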

Hey all! I'm trying to get NeuroNER to work for some Hindi data, but from what I understand spaCy does not support Hindi.

Would you recommend I use NLTK instead? From what I gather, spaCy (or NLTK) is primarily used for sentence splitting and tokenizing here: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20

Killthebug avatar Mar 04 '18 14:03 Killthebug

Hi! As you seem to be the people who have the most experience in using NeuroNER for languages other than English, could I please ask you to take a look at my query regarding Icelandic?

Unfortunately Spacy, Stanford and NLTK don't support Icelandic, so we need to find a way to use NeuroNER by relying on available NLP tools for Icelandic. Thanks a lot! Issue: #126

svanhvitlilja avatar Oct 26 '18 15:10 svanhvitlilja

Can we use the NeuroNER model for the Urdu language? spaCy doesn't support Urdu. Also, can we use other word embeddings, like Facebook's fastText?

Peacelover01 avatar Dec 14 '19 14:12 Peacelover01

You can use your own tokenizer and bypass spaCy by changing a few lines in the source code; we did that for Icelandic. I can give you some pointers if you want. I don't know about the other embeddings; I'd like to know :)

svanhvitlilja avatar Dec 14 '19 14:12 svanhvitlilja
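On the fastText question: fastText's text-format .vec files use the same token-then-values layout as GloVe, except for a header line containing the vocabulary size and dimension. Stripping that header should yield a file a GloVe-style loader can read. This is untested with NeuroNER itself, so verify against its embedding loader:

```python
# Sketch: make a fastText .vec file GloVe-compatible by dropping its
# "<vocab_size> <dimension>" header line, if present.

def strip_fasttext_header(lines):
    lines = list(lines)
    first = lines[0].split()
    if len(first) == 2 and all(p.isdigit() for p in first):
        return lines[1:]  # drop the header
    return lines  # already GloVe-style

# Toy .vec content (illustrative values only):
vec = ["2 3", "urdu 0.1 0.2 0.3", "text 0.4 0.5 0.6"]
print(strip_fasttext_header(vec))
```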

Thank you @svanhviti16 for your reply. It will be highly appreciated.

Peacelover01 avatar Dec 15 '19 12:12 Peacelover01