German Abramov
For example, if I'm training on 2 nodes, should I have checkpoints for both rank 0 and rank 1? I have `save_filename: ep{epoch}-ba{batch}-rank{rank}.pt`, but checkpoints are only being saved for node 0 with rank...
Do you know why I get this problem with `pretrain_gpt_single_node.sh`? I'm setting `N_GPUS=1` and get ``` File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 191, in _get_group_rank raise RuntimeError("The given group does not exist") RuntimeError:...
Hey! I'd like to train some ResNet models using the ImageNet dataset, so I've git cloned imagenetloader.torch to my PC (OS: Windows 10). But when I launch the **valprep.sh** file, it...
Hi, Llama 3 trains like this > We trained the models on sequences of 8,192 tokens, using a mask to ensure self-attention does not cross document boundaries. I see you...
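The masking the quote describes can be sketched as a causal mask intersected with a same-document mask over a packed sequence. This is an illustrative standalone sketch, not code from any particular training library:

```python
def doc_boundary_mask(doc_ids):
    """Causal attention mask that also blocks attention across
    document boundaries: query position q may attend to key
    position k only if k <= q AND both tokens come from the
    same packed document."""
    n = len(doc_ids)
    return [[k <= q and doc_ids[q] == doc_ids[k] for k in range(n)]
            for q in range(n)]

# Two packed documents of lengths 3 and 2 in one 5-token sequence.
mask = doc_boundary_mask([0, 0, 0, 1, 1])
```

Here `mask[3][2]` is False even though position 2 precedes position 3, because the tokens belong to different documents.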
Hi, it looks like recent versions of llm-foundry (updated from master) have had issues in the last week or two. I get an error with a config like this ``` train_loader: dataset: max_seq_len: 2048 shuffle: true shuffle_seed: 17...
Hi! I'm trying to merge the index.jsons into one, so I have a folder ``` dataset/ part.00000/ train/ index.json shard.00000.mds … val/ index.json shard.00000.mds … part.00001/ train/ index.json shard.00000.mds … val/ index.json...
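One way to approach a layout like the one above is to concatenate the per-part indexes manually, re-pointing each shard at its original subdirectory. This sketch assumes the MDS v2 index layout (`{"version": 2, "shards": [...]}`, each shard naming its files under `raw_data`/`zip_data`); the function name and arguments are hypothetical, and newer `streaming` releases may ship their own merge utility that should be preferred:

```python
import json
import os

def merge_mds_indexes(root, split, parts, out_path):
    """Combine index.json files from several MDS part directories
    into one index whose shard entries point back at the original
    part folders (so no shard files need to be moved)."""
    merged = {"version": 2, "shards": []}
    for part in parts:  # e.g. ["part.00000", "part.00001"]
        part_dir = os.path.join(root, part, split)
        with open(os.path.join(part_dir, "index.json")) as f:
            index = json.load(f)
        for shard in index["shards"]:
            # Re-point shard files at their original subdirectory.
            for key in ("raw_data", "zip_data"):
                if shard.get(key):
                    shard[key]["basename"] = os.path.join(
                        part, split, shard[key]["basename"])
            merged["shards"].append(shard)
    with open(out_path, "w") as f:
        json.dump(merged, f)
```

The merged index then lives at the dataset root while the shards stay where they are.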
Hi! Your benchmarks are functioning well with version 0.3.0 of lm-evaluation-harness. Are there any plans to update and support version 0.4.0?
Hello, I'm currently training LLaMA PRO. Initially, I expanded the model from 32 layers to 40 layers and proceeded to train only the newly added 8 layers (every fifth layer)....
Hi! Do you support the fill-in-the-middle (FIM) technique in pretraining pipelines? If yes, do you have some documentation about this? Thanks!
Hello, I'm running a 7B model with a 32k context size and seeing unexpected memory scaling behavior. Here's the situation: - **Config**: same overall setup, only changing `global_batch_size`. - **Case...
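A back-of-envelope model helps frame questions like this: parameter and optimizer memory are constant, while activation memory grows linearly with per-GPU micro-batch size times sequence length, so at 32k context activations usually dominate. All constants below are illustrative assumptions, not measurements of any particular stack:

```python
def rough_activation_gib(micro_batch, seq_len, hidden, n_layers,
                         bytes_per_el=2, acts_per_layer=16):
    """Very rough transformer activation estimate (no activation
    checkpointing): assume ~acts_per_layer tensors of shape
    (micro_batch, seq_len, hidden) are kept per layer for backward.
    acts_per_layer=16 and bytes_per_el=2 (bf16) are assumptions."""
    els = micro_batch * seq_len * hidden * n_layers * acts_per_layer
    return els * bytes_per_el / 2**30

# 7B-ish shapes (hidden 4096, 32 layers) at 32k context:
a1 = rough_activation_gib(1, 32768, 4096, 32)  # → 128.0 GiB
a2 = rough_activation_gib(2, 32768, 4096, 32)  # → 256.0 GiB, linear in micro-batch
```

Under this model, raising `global_batch_size` only changes memory if it changes the per-GPU micro-batch; if gradient accumulation absorbs the increase, per-step memory should stay flat.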