bingjie3216

Results 10 comments of bingjie3216

Met the same issue (deduplicated traceback; each frame was printed twice, likely once per process):

```
smart_tokenizer_and_embedding_resize(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/x-gpu1/code/Users/x/code/ColossalAI/applications/Chat/coati/utils/tokenizer_utils.py", line 68, in smart_tokenizer_and_embedding_resize
    model.resize_token_embeddings(len(tokenizer))
  File "/anaconda/envs/coati/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1614, in __getattr__
...
```
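For context, the line that fails here, `model.resize_token_embeddings(len(tokenizer))`, grows the embedding matrix after special tokens are added to the tokenizer; helpers like `smart_tokenizer_and_embedding_resize` commonly initialize the new rows to the mean of the existing rows so the new tokens start in-distribution. A minimal pure-Python sketch of that idea (illustrative only, with a hypothetical `resize_embeddings` helper, not the actual ColossalAI or transformers code):

```python
def resize_embeddings(embeddings, num_new_tokens):
    """Append rows for newly added tokens, initialized to the mean of
    the existing rows. `embeddings` is a list of equal-length rows.
    Sketch of the mean-init trick only, not the real implementation."""
    if num_new_tokens <= 0:
        return embeddings
    dim = len(embeddings[0])
    vocab = len(embeddings)
    # Column-wise mean over the current vocabulary.
    mean_row = [sum(row[d] for row in embeddings) / vocab for d in range(dim)]
    return embeddings + [list(mean_row) for _ in range(num_new_tokens)]

old = [[1.0, 2.0], [3.0, 4.0]]
new = resize_embeddings(old, 2)
# new has 4 rows; each appended row equals the mean [2.0, 3.0]
```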

I also created a WeChat group, since my QQ was lost for a while: ![image](https://user-images.githubusercontent.com/1332679/227734622-2a007a13-ee4d-4b9a-93dc-7fa97a992998.png)

Thanks a lot for sharing the knowledge. I have tried a bunch of things; let's see how it goes: I lowered the batch size to 1 and also modified...

An update on V100: I am running on eight V100-32G GPUs on one node. The last step shows: {'eval_loss': 1.3447265625, 'eval_runtime': 25.5031, 'eval_samples_per_second': 39.211, 'eval_steps_per_second': 2.47, 'epoch': 1.0}...

I am not sure whether it is a bug in the code. The log says: `Deleting older checkpoint [/home/azureuser/cloudfiles/code/Users/jbing/code/dolly/local_output_dir_0325/checkpoint-1400] due to args.save_total_limit`, but my latest checkpoint is checkpoint-1400, and the one that should be...

I am retrying after changing the following parameters in the trainer code: `save_total_limit=3, load_best_model_at_end=False`. I think it might be a bug in transformers. @yinwangsong Do you mean you met the...
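For reference, `save_total_limit` triggers checkpoint rotation: the trainer keeps only the newest N `checkpoint-<step>` directories and deletes the oldest. A minimal stdlib sketch of that rotation, with a hypothetical `rotate_checkpoints` helper (not the transformers implementation), including the existence guard that would avoid the "could not find the file to delete" failure discussed here:

```python
import os
import re
import shutil

def rotate_checkpoints(output_dir, save_total_limit):
    """Keep only the newest `save_total_limit` checkpoint-* directories,
    deleting the oldest ones. Illustrative sketch only."""
    ckpts = []
    for name in os.listdir(output_dir):
        m = re.fullmatch(r"checkpoint-(\d+)", name)
        if m:
            ckpts.append((int(m.group(1)), os.path.join(output_dir, name)))
    ckpts.sort()  # lowest step number (oldest checkpoint) first
    excess = max(0, len(ckpts) - save_total_limit)
    for _, path in ckpts[:excess]:
        if os.path.isdir(path):  # guard: skip dirs already gone
            shutil.rmtree(path)
```

With five checkpoints and `save_total_limit=3`, the two lowest-numbered ones are removed and the newest three survive.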

@srowen it is not about permissions; the failure it reports is that it could not find the file it wants to delete. I believe it is a transformers bug....

Update: both 1 and 2 could solve the problem I met earlier.

Same here, please see my attached screenshot: ![image](https://user-images.githubusercontent.com/1332679/233803399-4f981fb9-388c-4d66-a83e-54aec4b45a72.png)