fasterrcnn-pytorch-training-pipeline

model save error

Open KavitaHoude opened this issue 1 year ago • 7 comments

I am getting the following error when training the model on a custom dataset. It occurs at the last epoch, and it looks like a model save error. Please suggest how to fix it.

SAVING BEST MODEL FOR EPOCH: 10

Traceback (most recent call last):
  File "/content/drive/MyDrive/Tree_Detect_Faster_RCNN/fasterrcnn-pytorch-training-pipeline/train.py", line 571, in <module>
    main(args)
  File "/content/drive/MyDrive/Tree_Detect_Faster_RCNN/fasterrcnn-pytorch-training-pipeline/train.py", line 566, in main
    wandb_save_model(OUT_DIR)
  File "/content/drive/MyDrive/Tree_Detect_Faster_RCNN/fasterrcnn-pytorch-training-pipeline/utils/logging.py", line 225, in wandb_save_model
    wandb.save(os.path.join(model_dir, 'best_model.pth'))
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 361, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1852, in save
    return self._save(glob_str, base_path, policy)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1906, in _save
    os.symlink(abs_path, wandb_path)
OSError: [Errno 95] Operation not supported: '/content/drive/MyDrive/Tree_Detect_Faster_RCNN/fasterrcnn-pytorch-training-pipeline/outputs/training/fasterrcnn_mobilenetv3_large_fpn_noaug_40e/best_model.pth' -> '/content/drive/MyDrive/Tree_Detect_Faster_RCNN/fasterrcnn-pytorch-training-pipeline/wandb/offline-run-20240110_114154-ui6uadqd/files/outputs/training/fasterrcnn_mobilenetv3_large_fpn_noaug_40e/best_model.pth'

KavitaHoude avatar Jan 10 '24 11:01 KavitaHoude
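
Editor's note: the traceback ends in os.symlink, called from wandb.save, so the exception comes from the step that hands best_model.pth to Weights & Biases rather than from writing the checkpoint itself (the source path in the error message suggests the file had already been written, though the thread does not confirm this). A minimal sketch of treating that upload as best-effort is below. The helper name best_effort_upload and its upload parameter are hypothetical and not part of the pipeline; this is an illustration under those assumptions, not the repository's fix.

```python
import errno


def best_effort_upload(upload, checkpoint_path):
    """Call upload(checkpoint_path); return False instead of raising when the
    filesystem reports "Operation not supported".

    `upload` is any callable taking a path (hypothetical parameter).
    """
    try:
        upload(checkpoint_path)
    except OSError as exc:
        # On Linux, errno.EOPNOTSUPP is 95, the code shown in the traceback.
        if exc.errno != errno.EOPNOTSUPP:
            raise  # unrelated failures should still surface
        print(f"Skipping upload of {checkpoint_path}: {exc}")
        return False
    return True
```

Under these assumptions, the call seen in the traceback could be wrapped as best_effort_upload(wandb.save, os.path.join(model_dir, 'best_model.pth')). This only keeps the run from ending in a crash; it does not upload the file.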

Hello. Can you please provide the command that you are using?

sovit-123 avatar Jan 10 '24 13:01 sovit-123

Hello Sir, this is the command:

!python train.py --model fasterrcnn_mobilenetv3_large_fpn --data data_configs/custom_data.yaml --epochs 10 --name fasterrcnn_mobilenetv3_large_fpn_noaug_40e --seed 42

KavitaHoude avatar Jan 10 '24 13:01 KavitaHoude

Okay. If you are training on Colab and trying to save to Google Drive, please use the --project-dir argument instead of the --name argument for saving the project.

sovit-123 avatar Jan 11 '24 11:01 sovit-123

That did not work. I get the same error again.

KavitaHoude avatar Jan 11 '24 13:01 KavitaHoude

Can you please let me know where the code files are? Is the repository cloned to Colab, or is it somewhere on Google Drive? It may not work if it is on Google Drive.

sovit-123 avatar Jan 11 '24 13:01 sovit-123

It is cloned to Google Drive using the command:

!git clone https://github.com/sovit-123/fasterrcnn-pytorch-training-pipeline.git

KavitaHoude avatar Jan 11 '24 13:01 KavitaHoude

Most probably it won't run from Google Drive. Please try cloning directly to the Colab drive and running it from there.

sovit-123 avatar Jan 11 '24 13:01 sovit-123
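
Editor's note: the advice above matches the traceback, where os.symlink fails with Errno 95 for paths under /content/drive/MyDrive. Before starting a long run, it may help to check whether the directory the code will run from accepts symlinks at all. The probe below is a minimal sketch using only the Python standard library; the function name supports_symlinks is hypothetical, and which Colab paths pass or fail is something to verify in your own session, not something this thread confirms.

```python
import os
import tempfile


def supports_symlinks(directory):
    """Return True if a symlink can be created inside an existing `directory`."""
    # Work inside a throwaway subdirectory that is removed afterwards.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "target.txt")
        link = os.path.join(scratch, "link.txt")
        with open(target, "w") as fh:
            fh.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # Raised when the filesystem or platform rejects symlinks.
            return False
        return os.path.islink(link)
```

Calling supports_symlinks on the directory containing train.py before launching training would show whether the os.symlink call in the traceback is likely to fail there.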