NeuroNER

Steps to utilize NeuroNER for other languages

Open sooheon opened this issue 7 years ago • 10 comments

It appears that BRAT at least is pretty language-agnostic. The English-specific parts of NeuroNER (as far as I can tell) are the recommended glove.6B.100d word vectors and all of the spaCy-related tokenizing code, which is used to translate BRAT format into CoNLL format (correct?)

Am I correct that if I:

  1. Supply Korean word vectors in /data/word_vectors
  2. Supply CoNLL-formatted train, valid, and test data using BRAT-labeled Korean text which I run through my own tokenizer

I will be able to train and use NeuroNER for Korean text?

sooheon avatar Jul 04 '17 02:07 sooheon
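As a concrete illustration of step 2, the CoNLL-style file could be produced roughly like this. The column layout below (token and BIO label per line, blank line between sentences) is an assumption for illustration; compare it against NeuroNER's own example data before training, since NeuroNER expects specific columns:

```python
# Sketch: serialize tokenized, BRAT-labeled Korean text into a
# CoNLL-style layout. Columns here are an assumption (token + BIO label);
# check NeuroNER's bundled example dataset for the exact format it reads.

def to_conll(sentences):
    """sentences: list of sentences, each a list of (token, bio_label);
    a blank line separates sentences."""
    lines = []
    for sent in sentences:
        for token, label in sent:
            lines.append(f"{token} {label}")
        lines.append("")  # sentence boundary
    return "\n".join(lines)

# Toy example with tokens from a hypothetical Korean tokenizer:
example = [[("삼성", "B-ORG"), ("전자", "I-ORG"), ("는", "O")]]
print(to_conll(example))
```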

Correct! Note that providing word vectors is optional (it's typically better if you have some), and that I haven't tested NeuroNER with languages other than English. I know someone successfully used it in French (after an encoding fix PR :)), and someone was supposed to try with Bangladeshi but I haven't heard back from him.


Franck-Dernoncourt avatar Jul 04 '17 02:07 Franck-Dernoncourt
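On the word-vectors step: GloVe-style files are plain text, one token per line followed by its embedding values, space-separated. A quick sanity check before dropping a Korean file into data/word_vectors (the helper name and toy vectors below are illustrative, not part of NeuroNER):

```python
# Sketch: verify a GloVe-style word-vector file has a consistent
# dimension, which should then match token_embedding_dimension in the
# NeuroNER configuration.

def check_vectors(lines, expected_dim=None):
    """lines: iterable of 'token v1 v2 ... vd' strings; returns d."""
    dims = set()
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank/malformed lines
        dims.add(len(parts) - 1)
    assert len(dims) == 1, f"inconsistent dimensions: {dims}"
    dim = dims.pop()
    if expected_dim is not None:
        assert dim == expected_dim, f"expected {expected_dim}, got {dim}"
    return dim

# Toy 3-dimensional Korean vectors (illustrative values only):
lines = ["한국 0.1 0.2 0.3", "서울 0.4 0.5 0.6"]
print(check_vectors(lines))  # 3
```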

Hi (I'm the guy who uses NeuroNER in French)! Those 2 steps are right, but you also need spaCy (or NLTK) working in Korean. To explain a bit more for the spaCy case: you need a spaCy Korean model, which consists of a tokenizer and a POS tagging model. Someone asked exactly this question: https://github.com/explosion/spaCy/issues/929 Then you will have to change spacylanguage in parameters.ini. I hope I'm clear; if not, feel free to ask.

Steps (for spaCy) for a language X:

  • Check whether NLTK or spaCy fully supports your language X (https://github.com/explosion/spaCy#spacy-industrial-strength-nlp)
    • If not, add your language: https://spacy.io/docs/usage/adding-languages (1-2 weeks)
  • Supply X word vectors in /data/word_vectors
  • Supply CoNLL-formatted (or BRAT-formatted, in a directory) train, valid, and test data using BRAT-labeled X text run through your own tokenizer
  • Change parameters.ini: token_pretrained_embedding_filepath, token_embedding_dimension, spacylanguage, dataset_text_folder
  • Run main.py

Gregory-Howard avatar Jul 06 '17 08:07 Gregory-Howard
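The parameters.ini step above might look roughly like this. The key names come from the comment; the paths, the `ko` language code, and the dimension value are placeholders, and each key should stay in whatever section your copy of parameters.ini already puts it in:

```ini
; Illustrative values only -- adapt paths/values to your setup and keep
; the keys in their original sections of parameters.ini.
dataset_text_folder = ../data/my_language_dataset
token_pretrained_embedding_filepath = ../data/word_vectors/my_vectors.txt
token_embedding_dimension = 100
spacylanguage = ko
```

Note that token_embedding_dimension must match the dimension of the supplied word vectors, and spacylanguage only helps if spaCy actually ships a model for that language.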

Thanks for the additional detail! That looks perfectly doable.

sooheon avatar Jul 06 '17 14:07 sooheon

I don't understand what exactly spaCy (or NLTK) does in NeuroNER. I think spaCy is used as a tokenizer. Do we need a language-specific tokenizer? And why do we need a POS tagging model? Can't we just use NLTK for tokenization?

ersinyar avatar Feb 22 '18 12:02 ersinyar

spaCy is used in this file: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20 The problem here is for span in document.sents: this method needs a model to work. I think if we change the code a bit, we might only need a tokenizer.

Gregory-Howard avatar Feb 23 '18 16:02 Gregory-Howard
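To make the suggestion above concrete: what the BRAT-to-CoNLL conversion really needs from document.sents is sentence groups of tokens with character offsets, so BRAT's stand-off annotations can be aligned with the text. A model-free stand-in could look like this (the splitting rules are naive placeholders, not NeuroNER code; a real replacement should also split punctuation off tokens):

```python
import re

# Sketch: replace spaCy's model-dependent `document.sents` with a
# regex-based splitter that still reports character offsets per token.

def sentences_with_offsets(text):
    """Yield one list of (token, start, end) per sentence."""
    # Naive sentence split on ., !, ? -- placeholder only.
    for sent in re.finditer(r"[^.!?]+[.!?]?", text):
        tokens = [(m.group(),
                   sent.start() + m.start(),
                   sent.start() + m.end())
                  for m in re.finditer(r"\S+", sent.group())]
        if tokens:
            yield tokens

text = "Hello world. Second sentence."
for sent in sentences_with_offsets(text):
    print(sent)
```

The offsets are global into the original text, which is the property the stand-off alignment depends on: text[start:end] must equal the token.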

Hey all! I'm trying to get NeuroNER to work for some Hindi data, but from what I understand spaCy does not support Hindi.

Would you recommend I use NLTK instead? From what I gather, spaCy (or NLTK) is primarily used for sentence splitting and tokenizing here: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20

Killthebug avatar Mar 04 '18 14:03 Killthebug

Hi! As you seem to be the people who have the most experience in using NeuroNER for languages other than English, could I please ask you to take a look at my query regarding Icelandic?

Unfortunately Spacy, Stanford and NLTK don't support Icelandic, so we need to find a way to use NeuroNER by relying on available NLP tools for Icelandic. Thanks a lot! Issue: #126

svanhvitlilja avatar Oct 26 '18 15:10 svanhvitlilja

Can we use the NeuroNER model for the Urdu language? spaCy doesn't support Urdu. Also, can we use other word embeddings, like Facebook's fastText?

Peacelover01 avatar Dec 14 '19 14:12 Peacelover01

You can use your own tokenizer and bypass spaCy by changing a few lines in the source code; we did that for Icelandic. I can give you some pointers if you want. I don't know about the other embeddings; I'd like to know :)

svanhvitlilja avatar Dec 14 '19 14:12 svanhvitlilja
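On the fastText question: fastText's text-format .vec files use the same token-then-values layout as GloVe, except for a header line containing the vocabulary size and dimension. Stripping that header should yield a file a GloVe-style loader can read. This is untested with NeuroNER itself, so verify against its embedding loader:

```python
# Sketch: make a fastText .vec file GloVe-compatible by dropping its
# "<vocab_size> <dimension>" header line, if present.

def strip_fasttext_header(lines):
    lines = list(lines)
    first = lines[0].split()
    if len(first) == 2 and all(p.isdigit() for p in first):
        return lines[1:]  # drop the header
    return lines  # already GloVe-style

# Toy .vec content (illustrative values only):
vec = ["2 3", "urdu 0.1 0.2 0.3", "text 0.4 0.5 0.6"]
print(strip_fasttext_header(vec))
```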

Thank you @svanhviti16 for your reply. It will be highly appreciated.

Peacelover01 avatar Dec 15 '19 12:12 Peacelover01