Saving checkpoints on network drive fails due to symlinks

Open: eldarkurtic opened this issue 1 year ago • 1 comment

Hi folks, I am using llm-foundry to train some LLMs and am trying to save checkpoints directly to a network drive (AWS on-prem storage). The error I am hitting looks like this:

[Errno 524] Unknown error 524: 'ep0-ba2-rank0.pt' -> '/network/eldar/llmfoundry_checkpoints/test_x/latest-rank0.pt'

at the line:

File "/usr/local/lib/python3.10/dist-packages/composer/callbacks/checkpoint_saver.py", line 352, in _save_checkpoint
    os.symlink(os.path.relpath(src_path, os.path.dirname(symlink)), symlink)

FYI: saving to a local disk works just fine. I think the problem is that symlinks cannot be created on the network drive. For example, running `touch test1.txt && ln -s test1.txt test2.txt` on that drive results in the same Unknown error 524.
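The manual `ln -s` check above can also be run from Python before training starts. The following is a minimal sketch, not part of Composer or llm-foundry: the helper name `supports_symlinks` is hypothetical, and it only assumes that the target directory is writable. Errno 524 appears to correspond to the Linux kernel-internal ENOTSUPP code, which surfaces as an `OSError`, so the probe treats any `OSError` as "symlinks not supported".

```python
import os
import tempfile


def supports_symlinks(directory: str) -> bool:
    """Return True if a relative symlink can be created inside `directory`.

    Hypothetical helper for illustration; it mirrors the manual
    `touch test1.txt && ln -s test1.txt test2.txt` check.
    """
    # Work in a throwaway subdirectory so the probe leaves nothing behind.
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as f:
            f.write("probe")
        try:
            # Relative link target, as in the failing Composer call.
            os.symlink("target.txt", link)
        except (OSError, NotImplementedError):
            return False
        return os.path.islink(link)
```

Calling `supports_symlinks("/network/eldar/llmfoundry_checkpoints")` before launching a run would show whether the checkpoint folder can hold the "latest" symlink at all.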

Do you have any suggestions for working around this restriction on creating symlinks on network drives? If not, is there a straightforward way to save checkpoints to the network drive but keep the symlinks on a local disk? After digging a bit through the Composer library, I feel this could be hacked in relatively easily, but I'm wondering whether you think it might break other parts of Composer or llm-foundry.

eldarkurtic, Jan 31 '24 15:01

You can specify save_latest_filename to keep the symlink on your local disk if that works for you. That seems like the easiest solution.

For object stores, we emulate a symlink by creating a file that has the path to the checkpoint in its contents. We could try building a similar solution for a network drive -- this seems like the "right" solution. Unfortunately, it's not something we will be able to build, since we don't have access to network drives to test it, but I'm happy to work with you and give some guidance if you're interested.
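To make the pointer-file idea concrete, here is a rough sketch of how a symlink could be emulated on a filesystem that rejects `os.symlink`. This is not Composer's implementation: the function names `write_latest_pointer` and `resolve_latest_pointer` are hypothetical, and the sketch assumes the drive supports ordinary file writes and renames.

```python
import os
import tempfile


def write_latest_pointer(checkpoint_path: str, pointer_path: str) -> None:
    """Write a small text file whose contents are the path to the checkpoint.

    Hypothetical stand-in for a symlink; the stored path is relative to the
    pointer file's directory, like the relative symlink Composer creates.
    """
    pointer_dir = os.path.dirname(os.path.abspath(pointer_path))
    rel_target = os.path.relpath(checkpoint_path, pointer_dir)
    # Write to a temporary file first, then rename it into place so readers
    # never observe a half-written pointer.
    fd, tmp_path = tempfile.mkstemp(dir=pointer_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(rel_target)
    os.replace(tmp_path, pointer_path)


def resolve_latest_pointer(pointer_path: str) -> str:
    """Return the absolute checkpoint path stored in a pointer file."""
    pointer_dir = os.path.dirname(os.path.abspath(pointer_path))
    with open(pointer_path) as f:
        rel_target = f.read().strip()
    return os.path.normpath(os.path.join(pointer_dir, rel_target))
```

With this approach, code that resumes training would call `resolve_latest_pointer` on the pointer file instead of following a symlink. Any loader that expects the "latest" path to be a real checkpoint file would need to be adapted accordingly.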

mvpatel2000, Jan 31 '24 15:01