NER-using-Deep-Learning
Fixing Imbalanced Data
The NER corpus includes many more 'O' labels than entity labels. How can we fix this using Keras? I tried sample_weight to adjust the loss function during training, but it does not appear to fix the problem fully. What would you suggest? Thanks
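For reference, here is roughly what I tried (a minimal sketch; `model`, `X_train`, and `y_train` are placeholders, and I assume tag index 0 is 'O'). Keras supports per-timestep weighting for sequence labelling via `sample_weight_mode='temporal'` plus a 2D weight matrix passed to `fit`:

```python
import numpy as np

# Minimal sketch of per-timestep sample weighting in Keras.
# Assumes y_train is one-hot with shape (n_sentences, max_len, n_tags)
# and that tag index 0 is 'O'.
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'],
              sample_weight_mode='temporal')    # enables 2D sample weights

tag_indices = y_train.argmax(axis=-1)           # (n_sentences, max_len)
weights = np.where(tag_indices == 0, 0.1, 1.0)  # down-weight 'O' timesteps

model.fit(X_train, y_train,
          sample_weight=weights,                # shape (n_sentences, max_len)
          batch_size=32, nb_epoch=10)           # use epochs= on Keras 2
```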
In the case of Hindi data, there are certainly many 'O' entries. Fixing this completely is not really feasible, since we would have to go through the entire dataset or create a new one (an extreme task). We can only apply some heuristics, like keeping only sentences that contain at least a certain number of named entities, or sentences with max_len <= some threshold, etc., as in the sketch below. I don't understand what you mean by fixing this with Keras. Can you explain more?
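For instance, something like this filtering heuristic (a sketch only; `sentences` is assumed to be a list of `(token, tag)` pair lists, and the thresholds are arbitrary):

```python
# Heuristic filtering sketch: keep only sentences that contain at least
# `min_entities` non-'O' tags and are at most `max_len` tokens long.
def filter_sentences(sentences, min_entities=2, max_len=75):
    kept = []
    for sent in sentences:
        n_entities = sum(1 for _, tag in sent if tag != 'O')
        if n_entities >= min_entities and len(sent) <= max_len:
            kept.append(sent)
    return kept
```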
Actually, that was unclear on my part. When I try to train the model on the English CoNLL dataset, the classifier only predicts the 'O' label, and this still yields a high accuracy (around 97%).
Maybe I'm just doing something wrong, but I don't see what. I have already encountered class imbalance in other ML settings, but I'm wondering if there is a preferred solution for NER specifically. There are many general ways of addressing the problem (such as oversampling, undersampling, or SMOTE), as well as options within Keras, such as setting class weights in the loss function; a sketch of the latter is below.
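By "class weights in the loss function" I mean something along these lines (a sketch, not this repo's code; the weight vector is hypothetical, with 'O' at index 0). This is useful because Keras's `class_weight` argument in `fit` does not support 3D sequence targets:

```python
import numpy as np
from keras import backend as K

# Weighted categorical crossentropy sketch: each timestep's loss is
# scaled by the weight of its true class, so the frequent 'O' tag can
# be down-weighted relative to entity tags.
def weighted_categorical_crossentropy(class_weights):
    w = K.variable(np.asarray(class_weights, dtype='float32'))
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        return -K.sum(y_true * K.log(y_pred) * w, axis=-1)
    return loss

# e.g., with 'O' at index 0 and four entity classes:
# model.compile(optimizer='rmsprop',
#               loss=weighted_categorical_crossentropy([0.1, 1, 1, 1, 1]))
```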
That image suggests that you are definitely doing something wrong. What did the output look like while you were training the model with Keras? Was the validation accuracy increasing steadily (at a reasonable rate) and the loss decreasing at a good rate? For handling class imbalance, you can do something like I described in my previous comment.
I tried to run the script with the default settings, as found in english_NER.ipynb. The accuracy (and log loss) is stuck at 97.3% from the first epoch. I'm trying to figure out what is going wrong.
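To confirm that the model has collapsed to predicting only 'O', I am checking accuracy restricted to entity tokens (a quick sketch; `model`, `X_test`, and `y_test` are placeholders for the notebook's variables, and I assume tag index 0 is 'O'):

```python
import numpy as np

# Accuracy on non-'O' tokens only. If the model predicts 'O' everywhere,
# this will be ~0 even though overall accuracy is ~97%.
pred = model.predict(X_test).argmax(axis=-1)   # (n_sentences, max_len)
true = y_test.argmax(axis=-1)
mask = true != 0                               # entity timesteps only
print('accuracy on entity tokens: %.3f' % (pred[mask] == true[mask]).mean())
```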
Sorry for the late reply. Were you also getting a very low loss (on the order of negative powers of 10) and NaN values during training?
Hello Divesh,
I get a very low loss from the first epoch; I attached a capture of the training logs:
The only thing I changed was adding a few parentheses to the print functions, since I'm running your scripts with Python 3. Maybe I'm also using different versions of Keras and TensorFlow: I have keras 2.0.0 and tensorflow 1.0.1 installed on Windows 64-bit. Which versions did you use initially? Thanks for helping.
The problem is the version numbers; I should have made a requirements.txt. I used Keras==1.2.1 and tensorflow-gpu==0.12.1. Though I had TensorFlow with GPU support, you can avoid that by installing just tensorflow==0.12.1. Try this in a new env and let me know.
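In other words, a requirements.txt along these lines (CPU-only variant; swap in tensorflow-gpu==0.12.1 if you have a GPU):

```
Keras==1.2.1
tensorflow==0.12.1
```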
As for Python 3, some problems can occur when handling Unicode strings, but in our case the chances are low.
I ran into a similar issue where everything is predicted as 'O' on the English dataset, but mine is even worse: the losses are NaN from the very beginning. I will try to match the versions of Keras and TensorFlow. Do you have any other advice on this issue? Thanks.
A follow-up on that: matching the versions of TensorFlow and Keras does not seem to solve my loss: nan issue. I am wondering if it is due to running on GPU vs. CPU?
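In the meantime I am trying common nan-loss mitigations, such as clipping gradient norms and lowering the learning rate (a general sketch, not something from this repo; the values are guesses):

```python
from keras.optimizers import RMSprop

# Clip gradient norms and use a smaller learning rate; both are common
# first steps when the loss turns NaN early in training.
opt = RMSprop(lr=1e-4, clipnorm=1.0)
model.compile(optimizer=opt,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```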
It turns out I am now getting the same result as @ArmandGiraud. @pandeydivesh15 what accuracy did you get?
I trained a model just now. The output in my case:
Still having this issue.