Dirk Groeneveld

Search results: 200 comments of Dirk Groeneveld

> will almost certainly not outpace the rate at which large LUMI runs are using up LUMI file limits

Why are large LUMI runs using up so many files?

I know that `gzip` on the command line is faster than doing it from Python. I was hoping we could get away without the extra complication of doing it with...

Something like that. Or we move the checkpoint to another drive. The checkpoint itself will likely be multiple files, so hard links won't work.

That run failed because `torch.compile()` is 💩.

On Nvidia GPUs they are reported by default, but not on AMD.

Currently running here: https://wandb.ai/ai2-llm/olmo-medium/runs/s0jpyzm0

I have three instances running in AWS that are downloading the most recent three checkpoints from CC: https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#Instances:v=3;$case=tags:true%5C,client:false;$regex=tags:false%5C,client:false They don't have proper AI2 users configured. You can log in as...

How do the token counts fall off when we add more snapshots?

Ah, also, we've been counting tokens in the other data sources using the Unicode universal tokenizer. https://uniseg-py.readthedocs.io/en/latest/index.html is a Python version, but there are versions for C++ and Rust...

This is how I count tokens using `uniseg`: https://github.com/allenai/c5/blob/main/wet_path_to_pages.py#L17
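For readers without `uniseg` installed, here is a rough stdlib stand-in for the idea, not the linked code: count runs of word characters with a regex. `uniseg`'s word-breaking follows the full Unicode (UAX #29) rules, e.g. apostrophes inside words and CJK handling, so its counts will differ from this approximation.

```python
import re

def count_tokens(text: str) -> int:
    """Approximate word-segmentation token count: one token per run of
    Unicode word characters. A crude proxy for uniseg's UAX #29 breaks."""
    return len(re.findall(r"\w+", text))
```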