torchtune Move away from using `/tmp` directories

Using /tmp to store temporary outputs such as model checkpoints, tokenizers, logs, etc. is not good practice because /tmp is often deleted and shared across users in a remote environment with stricter permissions. We should update all example configs and download scripts to move away from using /tmp and instead use something similar to ~/.cache/torch/hub or $TORCH_HOME: https://pytorch.org/docs/stable/hub.html#where-are-my-downloaded-models-saved
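A minimal sketch of what torch hub style resolution could look like for a dedicated folder. It mirrors the precedence torch hub documents for `$TORCH_HOME` (explicit variable first, then `$XDG_CACHE_HOME`, then `~/.cache`). The `TORCHTUNE_HOME` variable and both function names are hypothetical and are not part of torchtune.

```python
import os
from pathlib import Path


def resolve_cache_dir(env=None) -> Path:
    """Pick a per-user cache directory instead of /tmp.

    Precedence (modeled on torch hub's $TORCH_HOME logic):
      1. $TORCHTUNE_HOME           (hypothetical variable, assumed for illustration)
      2. $XDG_CACHE_HOME/torchtune
      3. ~/.cache/torchtune
    """
    env = os.environ if env is None else env

    explicit = env.get("TORCHTUNE_HOME")
    if explicit:
        return Path(explicit).expanduser()

    cache_root = env.get("XDG_CACHE_HOME") or "~/.cache"
    return Path(cache_root).expanduser() / "torchtune"


def ensure_cache_dir(env=None) -> Path:
    """Resolve the cache directory and create it if it does not exist yet."""
    path = resolve_cache_dir(env)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Example configs could then point at a subfolder of the resolved directory rather than a hard-coded /tmp path, and users on shared machines could redirect everything by setting a single variable.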

cc @NicolasHug @kartikayk

Feb 03 '24 01:02 RdoubleA

@joecummings we probably need to address this as part of `tune download` to make sure we aren't downloading to /tmp. I can look at this on the checkpointing side as part of recipe UX.

Feb 25 '24 18:02 kartikayk

Tracking this in #691

Apr 21 '24 16:04 kartikayk

@kartikayk this isn't related to testing and this isn't a "quality of life" thing, this is user-facing. It should probably not be part of https://github.com/pytorch/torchtune/issues/691.

Apr 29 '24 09:04 NicolasHug

OK, yeah. I think I read this as addressing the /tmp/artifacts issue, but that shouldn't be a problem anymore. For downloading the models and files, I think we can find a better placeholder, but I haven't seen any issues or complaints about this.

Apr 29 '24 15:04 kartikayk

Why not create a dedicated cache folder and set it as an environment variable on install?

Apr 29 '24 15:04 RdoubleA

@RdoubleA how would we set an env variable on install in a persistent way?

@kartikayk should this issue be re-opened?

I'm not sure torchtune needs to do much more than what is already done by https://pytorch.org/docs/stable/hub.html#where-are-my-downloaded-models-saved

Apr 29 '24 15:04 NicolasHug

Why is what hub does a good idea? The models etc. that we're downloading all come from HF, so why do we need to do what torch hub does for this? IIUC, HF already does a bunch of caching in the background.
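For reference, a minimal sketch of how the default HF hub cache location is resolved, based on the environment variables documented for `huggingface_hub` (`HF_HUB_CACHE`, `HF_HOME`, `XDG_CACHE_HOME`). The function name is hypothetical. It reimplements the documented precedence for illustration and does not call the library, so treat it as an approximation rather than the library's own code.

```python
import os


def hf_hub_cache_dir(env=None) -> str:
    """Approximate the default Hugging Face hub cache path.

    Documented precedence, as assumed here:
      1. $HF_HUB_CACHE
      2. $HF_HOME/hub
      3. $XDG_CACHE_HOME/huggingface/hub
      4. ~/.cache/huggingface/hub
    """
    env = os.environ if env is None else env

    hub_cache = env.get("HF_HUB_CACHE")
    if hub_cache:
        return os.path.expanduser(hub_cache)

    hf_home = env.get("HF_HOME")
    if not hf_home:
        cache_root = env.get("XDG_CACHE_HOME") or os.path.join("~", ".cache")
        hf_home = os.path.join(cache_root, "huggingface")

    return os.path.join(os.path.expanduser(hf_home), "hub")
```

Under these assumptions, a download that does not pass an explicit output directory lands under the user's home cache, not /tmp.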

Apr 29 '24 15:04 kartikayk

I don't understand.

> Using /tmp to store temporary outputs such as model checkpoints, tokenizers, logs, etc. is not good practice because /tmp is often deleted and shared across users in a remote environment with stricter permissions. We should update all example configs and download scripts to move away from using /tmp and instead use something similar to ~/.cache/torch/hub or $TORCH_HOME: https://pytorch.org/docs/stable/hub.html#where-are-my-downloaded-models-saved

The original comment in this issue makes sense to me, i.e. if torchtune is downloading or storing stuff, it should probably not be in /tmp and rather be in a dedicated folder with a similar configuration logic to that of torchhub. Is this not relevant anymore?

Apr 29 '24 15:04 NicolasHug

> if torchtune is downloading or storing stuff, it should probably not be in /tmp and rather be in a dedicated folder with a similar configuration logic to that of torchhub

Sorry, I don't understand this comment. Can you share more? Like I mentioned, we use the HF download API, so I'm not sure why I should set this up using torch hub. But maybe I'm missing something here.

On runpod, all of the volume storage is on /workspace, and that's where I download to. I don't see the value in setting up some of the stuff mentioned above, but again, maybe I'm missing something.

Apr 29 '24 16:04 kartikayk

The premise of this issue is that torchtune is storing stuff somewhere, and that:

- "somewhere" is /tmp
- "stuff" denotes models, temporary outputs, logs, checkpoints, etc.

Is this not the case?

Apr 29 '24 16:04 NicolasHug

Closing this as stale, since no issues have been raised about it.

Jul 19 '24 04:07 RdoubleA