Aflah
I just realized the error message and this tutorial (https://lightning.ai/docs/fabric/stable/guide/multi_node/slurm.html) seem to imply I should use srun. Running with this now - ``` sbatch --partition=a100 --nodes=2 --gres=gpu:8 --cpus-per-task=32 --mem=244G --exclude=sws-3a100grid-01...
This command works - ``` sbatch --partition=a100 --nodes=2 --gres=gpu:8 --ntasks-per-node=8 --mem=244G --exclude=sws-3a100grid-01 --time=8-00:00 --output=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.out --error=/NS/llm-1/work/afkhan/USC_Collab/litgpt/SLURM_Runs/logs/litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192.err --job-name=litgpt-olmo-pretrain-slimpajama-2x8xA100-GBS-192 --wrap "srun litgpt pretrain --config config_hub/pretrain/slimolmo.yaml --data.data_path /NS/llm-1/static00/data/" ``` But when I look at...
Hi @Andrei-Aksionov @rasbt, I was trying to figure out the best batch size for pretraining OLMo 1B on A100 machines. I tried a lot of different batch sizes, but everything...
I looked at the numbers from the Pythia paper, and while training the 1B model they were able to use a batch size of 16 on a 40 GB A100, but...
Here's the WANDB GPU Usage Chart for Batch Size 16 - 
I do plan to, but I think even if the entire batch were this big it should still not OOM, as Pythia had the same seq length and a GPU...
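To sanity-check the numbers I'm juggling, here's the rough arithmetic I have in mind, assuming the usual relation global batch = per-GPU micro batch × devices × gradient-accumulation steps (the concrete values below are just illustrative, not taken from the actual run):

```python
# Back-of-the-envelope check; all values are illustrative.
micro_batch_size = 16        # per-GPU batch, the value discussed above
num_devices = 2 * 8          # 2 nodes x 8 A100s from the sbatch command
global_batch_size = 192      # the GBS in the job name

samples_per_optimizer_step = micro_batch_size * num_devices   # 256
grad_accum_steps = max(1, global_batch_size // samples_per_optimizer_step)

print(samples_per_optimizer_step, grad_accum_steps)
# 256 already exceeds 192, so with 16 GPUs a smaller micro batch (e.g. 12)
# would reach GBS 192 with no gradient accumulation at all.
```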
Thanks, I'll do that. Also, is there a simple way to use the profiler when pretraining, or do I need to modify pretrain.py and add the profiler in manually?
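In case it helps future readers, this is the kind of manual addition I had in mind: a minimal torch.profiler sketch wrapped around a few training iterations. `run_training_step` and `profiled_loop` are placeholders I made up for illustration, not litgpt's real API.

```python
import torch
from torch.profiler import (
    ProfilerActivity,
    profile,
    schedule,
    tensorboard_trace_handler,
)

# Placeholder for whatever the real training loop does in pretrain.py.
def run_training_step(model, batch, optimizer):
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def profiled_loop(model, dataloader, optimizer, num_steps=10):
    # Profile a short window: skip 1 step, warm up 1, record 3.
    prof_schedule = schedule(wait=1, warmup=1, active=3, repeat=1)
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=prof_schedule,
        on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
        profile_memory=True,   # track allocator stats to chase the OOM
        record_shapes=True,
    ) as prof:
        for step, batch in enumerate(dataloader):
            if step >= num_steps:
                break
            run_training_step(model, batch, optimizer)
            prof.step()        # advance the profiler schedule each iteration
```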
Any updates on this? I think this would be a much-needed feature for almost all chatbots
@chaosddp Thanks, this looks like an elegant way. Did you use this more? Any issues that might arise down the line? Also, could you get this working with text streaming?...
For future readers, following this worked for me - https://discuss.streamlit.io/t/disable-st-input-chat-during-conversation/50258/2?u=aflah1
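Roughly, the pattern from that thread as I understood it looks like this: keep a "busy" flag in session state, pass it to `st.chat_input(disabled=...)`, and rerun so the input greys out while the reply is being generated. A minimal sketch; `generate_reply` is just a placeholder for the actual model call:

```python
import streamlit as st

# Placeholder for the real chatbot backend.
def generate_reply(prompt: str) -> str:
    return f"Echo: {prompt}"

if "messages" not in st.session_state:
    st.session_state.messages = []
if "busy" not in st.session_state:
    st.session_state.busy = False

# Replay the conversation so far.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

prompt = st.chat_input(
    "Say something",
    disabled=st.session_state.busy,   # input is disabled while a reply is pending
)

if prompt and not st.session_state.busy:
    st.session_state.messages.append({"role": "user", "content": prompt})
    st.session_state.busy = True
    st.rerun()                        # redraw immediately with the input disabled

if st.session_state.busy:
    reply = generate_reply(st.session_state.messages[-1]["content"])
    st.session_state.messages.append({"role": "assistant", "content": reply})
    st.session_state.busy = False
    st.rerun()                        # redraw with the input enabled again
```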