pytorch-image-models

[FEATURE/ISSUE for timm] Can we use a symbolic link instead of a hard link when saving the checkpoint?

Open SleepyTT opened this issue 4 years ago • 1 comments

Hi rwightman,

Thank you for providing timm, which is very useful!

When I use CheckpointSaver from timm.utils to save a checkpoint, I run into this error:

os.link(last_save_path, save_path)
OSError: [Errno 38] Function not implemented: './output/train/20210105-001029-tf_efficientdet_d0/last.pth.tar' -> './output/train/20210105-001029-tf_efficientdet_d0/checkpoint-0.pth.tar'

It seems my filesystem does not support creating hard links with os.link(). However, I have verified that os.symlink() works in my environment. I cannot change the filesystem because it is configured by the cloud training platform, so I am wondering whether timm could use a symbolic link instead of a hard link.

Looking forward to your reply! Thanks in advance!

@rwightman

SleepyTT • Jan 05 '21 22:01
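
The fallback requested above could be sketched roughly as follows. This is only an illustration: `link_or_symlink` is a hypothetical helper, not part of timm, and the choice of which error codes count as "hard links unsupported" (ENOSYS, i.e. Errno 38 on Linux as in the report, plus EPERM) is an assumption.

```python
import errno
import os


def link_or_symlink(src, dst):
    """Hypothetical helper (not timm API): hard-link src to dst,
    falling back to a symlink if the filesystem rejects hard links."""
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError as exc:
        # Only fall back for "not supported" style errors; re-raise anything else.
        if exc.errno not in (errno.ENOSYS, errno.EPERM):
            raise
    # Use a target relative to dst's directory so the link survives
    # the output directory being moved as a whole.
    target = os.path.relpath(src, os.path.dirname(dst) or ".")
    os.symlink(target, dst)
    return "symlink"
```

One caveat with this approach: a symlink refers to a path rather than to the file's data. If the source file is later replaced, the symlink follows the new file, whereas a hard link would keep pointing at the old contents.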

@SleepyTT I'm curious which cloud filesystem or setup you use that doesn't support hard links. I'm not aware of many modern setups, including file-sharing protocols, that don't support them these days.

Hard links were chosen because the overhead is low and they can be created atomically, without risking dangling links or inconsistent state in the event of a crash. I could look at a symlink option at some point, but I doubt it would be as robust, and any filesystem that doesn't support hard links is unlikely to be very robust across a crash/restart either.

For cloud, I plan to support writing directly to buckets at some point, as I am doing more of that in other (non-open-source) projects.

rwightman • Jan 06 '21 16:01
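
On the atomicity concern raised in the reply: one common way to update a symlink without a window in which it is missing is to create it under a temporary name and rename it into place. The sketch below assumes a POSIX-style filesystem where rename within one directory is atomic; whether the filesystem from the report honors that is not known. `atomic_symlink` is a hypothetical name, not a timm function.

```python
import os


def atomic_symlink(target, link_path):
    """Hypothetical helper: point link_path at target by creating a
    temporary symlink and renaming it over link_path."""
    # Temporary name in the same directory, so the rename stays on one filesystem.
    tmp_path = f"{link_path}.tmp-{os.getpid()}"
    os.symlink(target, tmp_path)
    try:
        # os.replace renames the symlink itself (it does not follow it)
        # and overwrites an existing link_path in a single step.
        os.replace(tmp_path, link_path)
    except OSError:
        os.unlink(tmp_path)  # do not leave the temporary link behind
        raise
```

This only addresses the update step. It does not remove the other difference from hard links: if the target file is deleted, the symlink dangles.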