
Remote aim server leftover checkpoints

Open vanhumbeecka opened this issue 2 years ago • 2 comments

🐛 Bug

I have the following setup:

  • On my Synology NAS, I have Aim running as a Docker container.
  • I can reach the server via its (local) IP address, 192.168.0.117:53800
  • I can configure my PyTorch Lightning models to use the AimLogger with repo=aim://192.168.0.117:53800 (see the sketch after this list)
  • Everything works, and runs are stored correctly.
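
For reference, the logger is wired up roughly like this (the experiment name is a placeholder):

import lightning.pytorch as pl
from aim.pytorch_lightning import AimLogger

# Point the logger at the remote Aim server via the aim:// scheme
aim_logger = AimLogger(repo='aim://192.168.0.117:53800', experiment='my_experiment')
trainer = pl.Trainer(logger=aim_logger)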

However, the 'latest checkpoint' is always stored locally instead of on the server. Inside the directory I start my code from, a folder named aim: is created (yes, including the colon). You can see the result in the screenshot.

These look like leftover checkpoints from Aim, but I'm not sure. Inspecting the runs in Aim shows no sign of issues; everything appears to be in order.

[Screenshot 2023-02-25 at 12:58:07: the locally created aim: folder]

To reproduce

See above

Expected behavior

I expect nothing to be logged locally; everything should be stored on the remote Aim server.

Environment

  • Aim version 3.16.0, running in server mode inside a Docker container (using the official Aim Docker image)
  • Python version 3.10.8
  • pip version 22.2.2
  • OS macOS Monterey 12.6.3

vanhumbeecka avatar Feb 25 '23 12:02 vanhumbeecka

Hey @vanhumbeecka! Thanks for submitting the issue. In fact, Aim does not support storing checkpoints just yet (as there's no artifact support). On the other hand, the Lightning trainer implementation has some fairly involved logic for selecting the save_dir. You can check it here.
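
To illustrate the failure mode (a simplified sketch, not the actual trainer code): when no explicit checkpoint dirpath is set, the trainer falls back to the logger's save_dir and treats it as a local filesystem path, so a remote URI gets created as nested local directories, the first of which is literally named aim::

import os

# Simplified sketch; assumes the logger's save_dir resolves to the remote URI
save_dir = 'aim://192.168.0.117:53800'
os.makedirs(os.path.join(save_dir, 'checkpoints'), exist_ok=True)
# On a POSIX filesystem this creates ./aim:/192.168.0.117:53800/checkpoints locally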

@tmynn, @mahnerak I recall you had some ideas on how this could be worked around? Please share your thoughts.

alberttorosyan avatar Feb 27 '23 07:02 alberttorosyan

@vanhumbeecka I handle this by explicitly setting up a Lightning ModelCheckpoint callback. With an explicit dirpath, Lightning doesn't try to interpret AimLogger's save_dir as a local path.

import lightning.pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint

# An explicit dirpath stops the trainer from falling back to logger.save_dir
callbacks = [ModelCheckpoint(dirpath='my/local/chkpts', filename='{epoch}',
                             monitor='val_loss', mode='min')]  # the .ckpt extension is appended automatically
trainer = pl.Trainer(callbacks=callbacks)  # plus your AimLogger and other Trainer arguments
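
With this in place, metrics still flow to the remote Aim server through the AimLogger, while checkpoints land in the local directory you chose.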

sbatchelder avatar Dec 02 '24 05:12 sbatchelder