
Better Checkpoint Management

Open dirkgr opened this issue 2 years ago • 6 comments

What happens now

Our runs produce "checkpoint directories". You might have seen them. Checkpoint directories contain a bunch of debris from a run, including between 0 and n actual sharded checkpoints.

LUMI runs write checkpoint directories to LUMI's "flash" directory. We have only 20TB of space there, and once a week we fill it up. When this happens, all jobs start failing with no message. Dirk then moves all the checkpoints from the flash directory to the "scratch" directory. There, he runs a little script that makes each directory into a tar.bz2 file. Then, these get uploaded to gs://ai2-olmo/unsorted-checkpoints. At this point, their filename corresponds to the slurm run id from LUMI.

MosaicML runs write checkpoint directories directly to S3. Ask @epwalsh for details.

These are the problems with this setup:

  • Checkpoint directories are big because they contain sharded checkpoints. We should unshard them all, both to get the file size down and because that makes them usable in other contexts.
  • Many of these directories don't actually contain any checkpoints, because the run failed during startup. The directories are still big because we write indices and potentially a checkpoint for step 0, but they are useless. We should delete any of these that don't contain a checkpoint beyond step 0 (see the sketch after this list).
  • The filenames of these correspond to the slurm run id from LUMI. Instead they should be stored under the WandB run path, which looks like this: ai2-llm/gantry-runtime-scripts/ipzq9ab9.
  • We are storing these in Google Cloud Storage, but Google Cloud (and AWS) are super expensive when you want to move data out of them. We want to store these in Cloudflare's R2 instead, because they don't charge egress fees. We have an account there. Dirk will give you access.
  • The logs from the runs are stored somewhere else. We should add the logs into the checkpoint directories.
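
For the "useless directories" problem, a minimal sketch of the keep-or-delete decision. It assumes checkpoints live in subdirectories named like step<N> (the naming convention and any -unsharded suffix are assumptions), and it leaves out tar files and remote URLs:

```python
# Sketch: decide whether a checkpoint directory is worth keeping. Assumes
# checkpoints live in subdirectories named like "step<N>" (possibly with a
# suffix such as "-unsharded"); the naming convention is an assumption.
import re
from pathlib import Path

STEP_RE = re.compile(r"^step(\d+)")

def has_real_checkpoint(checkpoint_dir: Path) -> bool:
    """True if the directory contains at least one checkpoint beyond step 0."""
    for sub in checkpoint_dir.iterdir():
        match = STEP_RE.match(sub.name)
        if sub.is_dir() and match and int(match.group(1)) > 0:
            return True
    return False

# Usage: delete checkpoint_dir if not has_real_checkpoint(checkpoint_dir).
```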

These are not problems with this setup:

  • Dirk has to run stuff manually every once in a while to clear out the directories. We could fix that at some point, but it's not a concern at the moment, as long as running the stuff is easy.

What should happen

To be clear, I am fine with any solution that solves the problems above. Here is how I would do it:

  1. Write a script that deletes checkpoint directories that don't actually contain any checkpoints. It should work with local paths and with gs://, s3://, and r2:// URLs, and it should work whether or not the checkpoint is contained in a tar file. We need to be able to run a lot of these processes in parallel, though parallelism could be achieved just with GNU parallel. I want to run this script on LUMI before uploading anything; that's why it's separate.
  2. Write another script that unshards all checkpoints contained in a checkpoint directory. It should also be able to do this wherever the checkpoint directory is stored, tar file or not. This script would replace the directory it's given, but optionally it can write its output somewhere else (so we can use this to copy everything to R2).
  3. Write a tool that can look into WandB and create a mapping from LUMI run ids to WandB run paths. This sounds messy, but I have found the WandB API super easy to use through their Python client (see the sketch after this list).
  4. Use that script to store the checkpoint directories under a name that corresponds to their WandB run path.
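
For item 3, a minimal sketch using the WandB Python client, assuming the SLURM job id was logged into each run's config under a key like slurm_job_id (the key name and project path are assumptions):

```python
# Sketch: build a mapping from LUMI SLURM job ids to WandB run paths.
# Assumes the SLURM job id was logged into each run's config under
# "slurm_job_id"; the real key name may differ.
import wandb

def slurm_to_wandb_paths(project: str = "ai2-llm/gantry-runtime-scripts") -> dict:
    api = wandb.Api()
    mapping = {}
    for run in api.runs(project):
        slurm_id = run.config.get("slurm_job_id")  # hypothetical config key
        if slurm_id is not None:
            # run.path is [entity, project, run_id],
            # e.g. ["ai2-llm", "gantry-runtime-scripts", "ipzq9ab9"]
            mapping[str(slurm_id)] = "/".join(run.path)
    return mapping

if __name__ == "__main__":
    for slurm_id, wandb_path in slurm_to_wandb_paths().items():
        print(slurm_id, wandb_path)
```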

In the end, all checkpoints should be stored under r2://olmo-checkpoints/<wandb run path>/, uncompressed, i.e., not as tar files.

Other than the unsharder itself, this all sounds pretty trivial. It only becomes difficult once we run it on many TB, under the constraint that it has to finish faster than LUMI and MosaicML can produce more.

End-to-end workflow

  1. On LUMI: We run the checkpoint directory cleaner script. A bunch of checkpoints get deleted.
  2. On LUMI: We tar up the remaining checkpoint directories and upload them to r2://olmo-checkpoints/unsorted/<runid>.tar.gz. There is no script for this. It's not necessary.
  3. On GCloud (probably the lm-datasets machine): For each unsorted checkpoint directory in R2, we do, in parallel:
    1. We run the mass unsharding script. Because we pass the right flags to the mass unsharder, the output ends up on a local drive in a directory, uncompressed.
    2. We upload these checkpoints directly to the right path under r2://olmo-checkpoints/. Some combination of the previous scripts needs to figure out what that right path is and do the upload (a sketch of this step follows the list).
    3. We delete the tar file from the unsorted checkpoints, and the local directory.
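
A minimal sketch of the upload step (item 2 of the per-directory loop), assuming R2 credentials are available as environment variables and using boto3 against R2's S3-compatible endpoint; the environment variable names, bucket name, and key layout are assumptions:

```python
# Sketch: upload an unsharded checkpoint directory to R2 under its WandB run
# path. R2 speaks the S3 API, so boto3 works when pointed at the account's R2
# endpoint. The env var names, bucket, and key layout here are assumptions.
import os
from pathlib import Path

import boto3

def upload_checkpoint_dir(local_dir: str, wandb_run_path: str, bucket: str = "olmo-checkpoints") -> None:
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["R2_ENDPOINT_URL"],  # e.g. https://<account-id>.r2.cloudflarestorage.com
        aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
    )
    root = Path(local_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = f"{wandb_run_path}/{path.relative_to(root)}"
            s3.upload_file(str(path), bucket, key)

# e.g. upload_checkpoint_dir("/data/unsharded-output",
#                            "ai2-llm/gantry-runtime-scripts/ipzq9ab9")
```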

dirkgr avatar Oct 02 '23 19:10 dirkgr

One design constraint: Data transfer out of GCloud and AWS is expensive. All other data transfer is free. But the machine doing the data processing (so far) lives in GCloud.

dirkgr avatar Oct 02 '23 19:10 dirkgr

Progress:

  • Cleaning, renaming to the WandB run path, and unsharding are all integrated into one script, but the script can be improved now that I have tried using it.
  • I have uploaded smaller runs from LUMI to clean up LUMI, but the bad ones take too long and are in progress at best. More discussion on this below.
  • I have cleaned some of the unsorted checkpoints in R2. A small number of runs have unsharding issues. The most common problem is flaky downloads/uploads, especially since I'm processing several runs in parallel and these runs have thousands of files.

Comments:

  • Deleting bad runs has had a relatively negligible effect on LUMI, probably because people have been doing this themselves (due to storage constraints) or they just don't produce bad runs often. Also, the only bad runs we detect are those without nontrivial checkpoints, which means they are generally relatively small anyway.
  • Tarring on LUMI is very slow. Even without compression, it appeared to be faster to just directly upload individual files. This may be because tarring isn't parallel (maybe there is a parallel option out there), whereas uploading individual files is inherently parallelizable.
  • Uploading individual files can run into the failure botocore.exceptions.ClientError: An error occurred (ServiceUnavailable) when calling the CreateMultipartUpload operation: Reduce your concurrent request rate for the same object. I haven't had the chance to update the script to deal with this, but it causes the whole upload process to end (rather than simply skipping to the next file). To partially mitigate this, I have been using a local workaround where uploads are skipped for existing files rather than overwriting them, which gives a substantial speedup when I retry (a sketch of that workaround follows this list). This is the only flaky error I have seen, and it is the major bottleneck for cleaning out LUMI.
  • ~~Storage cleaning (particularly, the uploading) in its perfect form will almost certainly not outpace the rate at which large LUMI runs are using up LUMI file limits. We may need to think about a solution for this. I believe Pete is already making runs upload directly to S3.~~
  • When unarchiving is needed for large files, I believe it is notably more performant to do this manually (with tools like pigz, lbzip2, etc.) than to rely on the storage cleaning script's built-in unarchiver. Similarly, I am inclined to believe the aws CLI is better for downloading and uploading.
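
A sketch of the skip-existing workaround from the third bullet, assuming an S3/R2 boto3 client like the one in the earlier sketch; head_object is used to detect keys that already exist, so a retry only re-uploads missing files:

```python
# Sketch: skip files that already exist in the bucket, and keep going on
# per-file failures instead of aborting the whole upload. Assumes an S3/R2
# boto3 client like the one in the earlier sketch.
from botocore.exceptions import ClientError

def upload_if_missing(s3, local_path: str, bucket: str, key: str) -> bool:
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return False  # already uploaded on a previous attempt, skip
    except ClientError as e:
        if e.response["Error"]["Code"] not in ("404", "NoSuchKey"):
            raise
    try:
        s3.upload_file(local_path, bucket, key)
        return True
    except ClientError as e:
        # e.g. ServiceUnavailable: "Reduce your concurrent request rate"
        print(f"Upload failed for {key}: {e}; continuing with the next file.")
        return False
```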

2015aroras avatar Dec 22 '23 23:12 2015aroras

> will almost certainly not outpace the rate at which large LUMI runs are using up LUMI file limits

Why are large LUMI runs using up so many files?

dirkgr avatar Dec 27 '23 07:12 dirkgr

I know that gzip on the command line is faster than doing it from Python. I was hoping we could get away without the extra complication of shelling out, though, by just using more CPUs (i.e., compressing many run directories in parallel, not relying on some magical parallel gzip implementation).
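
Something like the following sketch, where the parallelism comes from handling many run directories at once rather than from the compressor itself (the scratch path and worker count are placeholders):

```python
# Sketch: compress many run directories in parallel, one process per directory,
# using plain Python tarfile + gzip. The scratch path and worker count are
# placeholders.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import tarfile

def tar_one(run_dir: Path) -> Path:
    out = run_dir.parent / (run_dir.name + ".tar.gz")
    with tarfile.open(out, "w:gz") as tf:
        tf.add(run_dir, arcname=run_dir.name)
    return out

if __name__ == "__main__":
    run_dirs = [p for p in Path("/scratch/checkpoints").iterdir() if p.is_dir()]
    with ProcessPoolExecutor(max_workers=16) as pool:
        for tarball in pool.map(tar_one, run_dirs):
            print("wrote", tarball)
```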

dirkgr avatar Dec 27 '23 07:12 dirkgr

Looking at things with a fresher perspective, I believe I was wrong about large LUMI runs using up the file limit quicker than uploading can keep up. A step has 3000-6000 files, which we can keep up with. We just happen to have several runs with 10+ steps, which have added up to nearly 2M files (the limit).

2015aroras avatar Jan 08 '24 20:01 2015aroras

We can probably avoid using the command line for unarchiving in general. I was trying to make progress as quickly as possible due to constraints (like time and the LUMI file limit), but these won't be issues normally.

2015aroras avatar Jan 08 '24 20:01 2015aroras

Marking the items prior to Feb 29th as "closed".

dumitrac avatar Apr 30 '24 21:04 dumitrac