
Not using entire data in MultiLabelClassification

Open · Sanjeet-panda-ssp opened this issue on Nov 11, 2021 · 12 comments

I prepared 54,000 data points to train a multilabel classifier, but it is only using 110 data points for training. To verify, I tried this with other datasets and example code available on the net; in each case I observed that only part of the data was being used.
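
For reference, a minimal sketch of the kind of setup being described (the column names and args follow the simpletransformers multilabel examples; the toy DataFrame here merely stands in for the real 54,000-row dataset):

import pandas as pd
from simpletransformers.classification import MultiLabelClassificationModel

# Toy stand-in: the real train_df would have 54,000 rows with "text" and "labels" columns,
# where "labels" is a list of 0/1 flags per class.
train_df = pd.DataFrame(
    [["example sentence one", [1, 0, 0]], ["example sentence two", [0, 1, 1]]],
    columns=["text", "labels"],
)

model = MultiLabelClassificationModel(
    "roberta",
    "roberta-base",
    num_labels=3,
    use_cuda=False,  # set True on a GPU machine
    args={"num_train_epochs": 1, "overwrite_output_dir": True},
)

model.train_model(train_df)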


Sanjeet-panda-ssp avatar Nov 11 '21 05:11 Sanjeet-panda-ssp

Did you find the reason for this? I am facing the same issue.

IS5882 avatar Nov 22 '21 08:11 IS5882

> Did you find the reason for this? I am facing the same issue.

Not yet

Sanjeet-panda-ssp avatar Nov 22 '21 08:11 Sanjeet-panda-ssp

When I added those two lines to my args, I no longer get the red bar stuck at 0%, so supposedly it is training and evaluating on all the data. But the performance (output results) is the same, which makes me skeptical. Did this actually fix the issue? Is there a way to double-check that my model is training on all the data I provided?

 "use_multiprocessing":False,
 "use_multiprocessing_for_evaluation":False,

IS5882 avatar Nov 22 '21 09:11 IS5882

> When I added those two lines to my args, I no longer get the red bar stuck at 0%, so supposedly it is training and evaluating on all the data. But the performance (output results) is the same, which makes me skeptical. Did this actually fix the issue? Is there a way to double-check that my model is training on all the data I provided?
>
> "use_multiprocessing":False,
> "use_multiprocessing_for_evaluation":False,

It's highly likely not working, because the dataset loading time is almost the same in my case: I have 54,000 examples and that part takes almost the same time as before. The 0% red bar no longer showing doesn't mean it has considered the entire dataset.

Sanjeet-panda-ssp avatar Nov 23 '21 06:11 Sanjeet-panda-ssp

> It's highly likely not working, because the dataset loading time is almost the same in my case: I have 54,000 examples and that part takes almost the same time as before. The 0% red bar no longer showing doesn't mean it has considered the entire dataset.

Yes, I am also skeptical. The thing is, I had that 0% red bar on both model.train and model.evaluate. It shows that it is evaluating on 4/1946 sentences and gives an F-measure of 99%, yet it also reports 10 False Positives and 9 False Negatives (the rest of the 1946 are either TN or TP). So it does mean that it is evaluating on the whole test set even with the 0% bar (I would assume the same for training, although I can't verify that).

To further verify the 99% F-measure, I evaluated my test set sentence by sentence using model.predict(sentence) and calculated the total number of label mismatches. I got 19 mismatches in total (the same as the FN + FP I got from model.evaluate), so the evaluation using model.evaluate was correct even with the red bar at 0% showing 4/1946.
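
A sketch of that sentence-by-sentence check (test_sentences and gold_labels are placeholders for the actual test data; model.predict also accepts the whole list at once):

# test_sentences: list of raw strings; gold_labels: list of 0/1 label vectors of the same length
predictions, raw_outputs = model.predict(test_sentences)

mismatches = 0
for pred, gold in zip(predictions, gold_labels):
    # count every individual label that disagrees with the gold annotation
    mismatches += sum(int(p != g) for p, g in zip(pred, gold))

print(f"Total label mismatches: {mismatches}")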

Yet I am not comfortable using SimpleTransformers, because I am still skeptical that something might be wrong.

IS5882 avatar Nov 23 '21 08:11 IS5882

I've never encountered this issue myself. If I had to guess, it's probably something to do with the Jupyter environment, tqdm (the progress bar library), and multiprocessing not playing well together. But it seems to be a problem with the progress bar updating rather than with the training/evaluation itself.

ThilinaRajapakse avatar Nov 28 '21 23:11 ThilinaRajapakse

I'm getting the exact same issue on Google Colab.

[Screenshot: feature conversion progress bar, Dec 6 2021, 11:11 AM]
from simpletransformers.classification import ClassificationModel

# model_args, df_train_90, and validation_df are defined earlier in the notebook
model = ClassificationModel('roberta', 'roberta-base', num_labels=2, args=model_args)
print('Training on {:,} samples...'.format(len(df_train_90)))
# Train the model, testing against the validation set periodically.
model.train_model(df_train_90, eval_df=validation_df)

My model_args are all default for multiprocessing, etc. Considering the results, the plotted WandB outputs (number of FN, TN, etc.), the fact that all mentions of this issue have been on Google Colab/Jupyter, and the significant size of the cached file, I find it very likely that, as @ThilinaRajapakse says, it's a display problem. It would be fantastic to have definite proof, though!

Note that I tested on two different datasets on my Google Colab, and the displayed progress bar stopped at 0.2% both times. This is exactly the same as @IS5882 (4 out of 1946). That is unlikely to be a coincidence, and unless there's something in the code about 0.2%, it does indeed seem to be a notebook display issue.

Out of curiosity, and as a quick sanity check for peace of mind: for those who have run this without encountering the issue, how long would you expect the feature conversion to take for around 10,000 features? A few seconds, or 30 minutes, as shown in the screenshot?

I looked at it more, and the cached file that is created (e.g. cached_train_roberta_128_2_2) does contain the entire dataset. One can test this by downloading the cached file and doing:

import torch

# load the cached features file and inspect the first example's input_ids
data = torch.load('cached_train_roberta_128_2_2')
print(data[0]['input_ids'].size())
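
Under the same assumption about the cache layout (a sequence of per-example feature records), the total count can also be checked, reusing the data object loaded above:

print(len(data))  # number of cached examples; should match the size of the full training set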

glher avatar Dec 06 '21 19:12 glher

@glher As @ThilinaRajapakse said, it is just a display issue; I would assume that your model is training fine. What I did to validate that (as mentioned in my comment above) is to use model.predict on each sentence and manually count the number of FP, FN, TP, and TN, which all match what model.eval reports, so it is training and testing correctly.

IS5882 avatar Dec 11 '21 22:12 IS5882

Have you actually checked the GPU stats when it stalls? I checked mine and saw that GPU power consumption drops to idle while the model is still in GPU memory. I do not think it will continue training no matter how long I wait.
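
For anyone wanting to reproduce that check from Python (power/utilization itself is easiest to watch externally, e.g. with nvidia-smi), a minimal sketch:

import torch

# is the model still occupying GPU memory even though utilization looks idle?
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB currently allocated on the default GPU")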


tarikaltuncu avatar Jan 09 '22 06:01 tarikaltuncu

Experiencing the same issue here.

The whole dataset is ~48k words, but the epoch progress bar only shows 6k.
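
One possible explanation, assuming the ~48k figure refers to training examples: the epoch bar counts batches rather than individual examples, and with simpletransformers' default train_batch_size of 8 that works out to roughly 6k steps per epoch:

num_examples = 48_000
train_batch_size = 8                      # simpletransformers' default
print(num_examples // train_batch_size)   # 6000 batches shown on the epoch bar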

HponeMK avatar Mar 24 '22 04:03 HponeMK

@tarikaltuncu Your issue seems to be different. The estimated time for tokenization is 131 hours for some reason. The GPU is idle because training hasn't started yet.

ThilinaRajapakse avatar Mar 24 '22 09:03 ThilinaRajapakse

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 12 '22 17:06 stale[bot]