Dirk Groeneveld

Search results: 200 comments of Dirk Groeneveld

> will almost certainly not outpace the rate at which large LUMI runs are using up LUMI file limits

Why are large LUMI runs using up so many files?

I know that `gzip` on the command line is faster than doing it from Python. I was hoping we could get away without the extra complication of doing it with...

Something like that. Or we move the checkpoint to another drive. The checkpoint itself will likely be multiple files, so hard links won't work.

That run failed because `torch.compile()` is 💩.

On Nvidia GPUs they are reported by default, but not on AMD.

Currently running here: https://wandb.ai/ai2-llm/olmo-medium/runs/s0jpyzm0

I have three instances running in AWS that are downloading the most recent three checkpoints from CC: https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#Instances:v=3;$case=tags:true%5C,client:false;$regex=tags:false%5C,client:false They don't have proper AI2 users configured. You can log in as...

How do the token counts fall off when we add more snapshots?

Ah, also, we've been counting tokens in the other data sources using the Unicode universal tokenizer. https://uniseg-py.readthedocs.io/en/latest/index.html is a Python version, but there are versions for C++ and Rust...

This is how I count tokens using `uniseg`: https://github.com/allenai/c5/blob/main/wet_path_to_pages.py#L17
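For readers without `uniseg` installed, here is a rough stdlib stand-in for the idea, not the linked code: count runs of word characters with a regex. `uniseg`'s word-breaking follows the full Unicode (UAX #29) rules, e.g. apostrophes inside words and CJK handling, so its counts will differ from this approximation.

```python
import re

def count_tokens(text: str) -> int:
    """Approximate word-segmentation token count: one token per run of
    Unicode word characters. A crude proxy for uniseg's UAX #29 breaks."""
    return len(re.findall(r"\w+", text))
```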