Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
Please check that this issue hasn't been reported before.
- [X] I searched previous bug reports and didn't find any similar reports.
Expected Behavior
The training task should start without errors.
Current behaviour
Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
[2024-04-17 00:08:54,225] [INFO] [axolotl.load_model:354] [PID:808742] [RANK:2] patching with flash attention for sample packing
[2024-04-17 00:08:54,225] [INFO] [axolotl.load_model:354] [PID:808744] [RANK:4] patching with flash attention for sample packing
[2024-04-17 00:08:54,230] [INFO] [axolotl.load_model:354] [PID:808743] [RANK:3] patching with flash attention for sample packing
[2024-04-17 00:08:54,246] [INFO] [axolotl.scripts.load_datasets:415] [PID:808746] [RANK:6] printing prompters...
[2024-04-17 00:08:54,249] [INFO] [axolotl.replace_llama_attn_with_flash_attn:133] [PID:808743] [RANK:3] optimized flash-attention RMSNorm not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=dropout_layer_norm&subdirectory=csrc/layer_norm'`)
[2024-04-17 00:08:54,249] [INFO] [axolotl.replace_llama_attn_with_flash_attn:133] [PID:808742] [RANK:2] optimized flash-attention RMSNorm not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=dropout_layer_norm&subdirectory=csrc/layer_norm'`)
[2024-04-17 00:08:54,249] [INFO] [axolotl.replace_llama_attn_with_flash_attn:133] [PID:808744] [RANK:4] optimized flash-attention RMSNorm not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=dropout_layer_norm&subdirectory=csrc/layer_norm'`)
[2024-04-17 00:08:54,249] [INFO] [axolotl.replace_llama_attn_with_flash_attn:133] [PID:808741] [RANK:1] optimized flash-attention RMSNorm not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=dropout_layer_norm&subdirectory=csrc/layer_norm'`)
[2024-04-17 00:08:54,249] [INFO] [axolotl.replace_llama_attn_with_flash_attn:133] [PID:808745] [RANK:5] optimized flash-attention RMSNorm not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=dropout_layer_norm&subdirectory=csrc/layer_norm'`)
[2024-04-17 00:08:54,249] [INFO] [axolotl.replace_llama_attn_with_flash_attn:133] [PID:808748] [RANK:7] optimized flash-attention RMSNorm not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=dropout_layer_norm&subdirectory=csrc/layer_norm'`)
[2024-04-17 00:08:54,249] [INFO] [axolotl.load_model:403] [PID:808742] [RANK:2] patching _expand_mask
[2024-04-17 00:08:54,249] [INFO] [axolotl.load_model:403] [PID:808743] [RANK:3] patching _expand_mask
[2024-04-17 00:08:54,249] [INFO] [axolotl.load_model:403] [PID:808745] [RANK:5] patching _expand_mask
[2024-04-17 00:08:54,249] [INFO] [axolotl.load_model:403] [PID:808744] [RANK:4] patching _expand_mask
[2024-04-17 00:08:54,249] [INFO] [axolotl.load_model:403] [PID:808741] [RANK:1] patching _expand_mask
[2024-04-17 00:08:54,249] [INFO] [axolotl.load_model:403] [PID:808748] [RANK:7] patching _expand_mask
[2024-04-17 00:08:54,287] [DEBUG] [axolotl.load_tokenizer:277] [PID:808746] [RANK:6] EOS: 2 / </s>
[2024-04-17 00:08:54,287] [DEBUG] [axolotl.load_tokenizer:278] [PID:808746] [RANK:6] BOS: 1 / <s>
[2024-04-17 00:08:54,287] [DEBUG] [axolotl.load_tokenizer:279] [PID:808746] [RANK:6] PAD: 2 / </s>
[2024-04-17 00:08:54,287] [DEBUG] [axolotl.load_tokenizer:280] [PID:808746] [RANK:6] UNK: 0 / <unk>
[2024-04-17 00:08:54,294] [INFO] [axolotl.load_model:354] [PID:808740] [RANK:0] patching with flash attention for sample packing
[2024-04-17 00:08:54,295] [INFO] [axolotl.replace_llama_attn_with_flash_attn:133] [PID:808740] [RANK:0] optimized flash-attention RMSNorm not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=dropout_layer_norm&subdirectory=csrc/layer_norm'`)
[2024-04-17 00:08:54,295] [INFO] [axolotl.load_model:403] [PID:808740] [RANK:0] patching _expand_mask
[2024-04-17 00:08:54,342] [INFO] [axolotl.load_model:354] [PID:808746] [RANK:6] patching with flash attention for sample packing
[2024-04-17 00:08:54,343] [INFO] [axolotl.replace_llama_attn_with_flash_attn:133] [PID:808746] [RANK:6] optimized flash-attention RMSNorm not found (run `pip install 'git+https://github.com/Dao-AILab/flash-attention.git#egg=dropout_layer_norm&subdirectory=csrc/layer_norm'`)
[2024-04-17 00:08:54,343] [INFO] [axolotl.load_model:403] [PID:808746] [RANK:6] patching _expand_mask
[2024-04-17 00:09:05,981] [INFO] [partition_parameters.py:349:__exit__] finished initializing model - num_params = 723, num_elems = 68.98B
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 29/29 [00:32<00:00, 1.12s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 29/29 [00:32<00:00, 1.12s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 29/29 [00:32<00:00, 1.12s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 29/29 [00:32<00:00, 1.12s/it]
[2024-04-17 00:09:38,492] [INFO] [axolotl.load_model:597] [PID:808743] [RANK:3] patching with SwiGLU
[2024-04-17 00:09:38,493] [INFO] [axolotl.load_model:597] [PID:808741] [RANK:1] patching with SwiGLU
[2024-04-17 00:09:38,495] [INFO] [axolotl.load_model:597] [PID:808744] [RANK:4] patching with SwiGLU
[2024-04-17 00:09:38,495] [INFO] [axolotl.load_model:597] [PID:808742] [RANK:2] patching with SwiGLU
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 29/29 [00:32<00:00, 1.12s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 29/29 [00:32<00:00, 1.12s/it]
[2024-04-17 00:09:38,511] [INFO] [axolotl.load_model:597] [PID:808745] [RANK:5] patching with SwiGLU
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 29/29 [00:32<00:00, 1.12s/it]
[2024-04-17 00:09:38,513] [INFO] [axolotl.load_model:597] [PID:808748] [RANK:7] patching with SwiGLU
[2024-04-17 00:09:38,518] [INFO] [axolotl.load_model:597] [PID:808746] [RANK:6] patching with SwiGLU
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 29/29 [00:32<00:00, 1.12s/it]
[2024-04-17 00:09:38,563] [INFO] [axolotl.load_model:597] [PID:808740] [RANK:0] patching with SwiGLU
[2024-04-17 00:14:54,032] [INFO] [axolotl.load_model:715] [PID:808741] [RANK:1] GPU memory usage after model load: 0.625GB (+1.723GB cache, +2.514GB misc)
[2024-04-17 00:14:54,036] [INFO] [axolotl.load_model:775] [PID:808741] [RANK:1] converting modules to torch.bfloat16 for flash attention
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[2024-04-17 00:14:54,466] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808741] [RANK:1] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:54,538] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808741] [RANK:1] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:54,615] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808741] [RANK:1] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:54,836] [INFO] [axolotl.load_model:715] [PID:808748] [RANK:7] GPU memory usage after model load: 0.625GB (+1.723GB cache, +2.373GB misc)
[2024-04-17 00:14:54,841] [INFO] [axolotl.load_model:775] [PID:808748] [RANK:7] converting modules to torch.bfloat16 for flash attention
[2024-04-17 00:14:54,870] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808741] [RANK:1] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[2024-04-17 00:14:55,271] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808748] [RANK:7] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:55,342] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808748] [RANK:7] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:55,414] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808748] [RANK:7] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:55,650] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808748] [RANK:7] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:55,666] [INFO] [axolotl.load_model:715] [PID:808745] [RANK:5] GPU memory usage after model load: 0.625GB (+1.723GB cache, +2.514GB misc)
[2024-04-17 00:14:55,670] [INFO] [axolotl.load_model:775] [PID:808745] [RANK:5] converting modules to torch.bfloat16 for flash attention
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[2024-04-17 00:14:55,997] [INFO] [axolotl.load_model:715] [PID:808740] [RANK:0] GPU memory usage after model load: 0.625GB (+1.723GB cache, +3.498GB misc)
[2024-04-17 00:14:56,002] [INFO] [axolotl.load_model:775] [PID:808740] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-04-17 00:14:56,051] [INFO] [axolotl.load_model:715] [PID:808744] [RANK:4] GPU memory usage after model load: 0.625GB (+1.723GB cache, +2.514GB misc)
[2024-04-17 00:14:56,055] [INFO] [axolotl.load_model:775] [PID:808744] [RANK:4] converting modules to torch.bfloat16 for flash attention
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[2024-04-17 00:14:56,110] [WARNING] [accelerate.utils.other.log:61] [PID:808740] Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[2024-04-17 00:14:56,122] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808745] [RANK:5] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,143] [INFO] [axolotl.train.log:61] [PID:808740] [RANK:0] Starting trainer...
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[2024-04-17 00:14:56,195] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808745] [RANK:5] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,271] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808745] [RANK:5] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,467] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808740] [RANK:0] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,518] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808744] [RANK:4] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,526] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808745] [RANK:5] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,538] [INFO] [axolotl.load_model:715] [PID:808746] [RANK:6] GPU memory usage after model load: 0.625GB (+1.723GB cache, +2.514GB misc)
[2024-04-17 00:14:56,542] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808740] [RANK:0] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,543] [INFO] [axolotl.load_model:775] [PID:808746] [RANK:6] converting modules to torch.bfloat16 for flash attention
[2024-04-17 00:14:56,599] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808744] [RANK:4] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,617] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808740] [RANK:0] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[2024-04-17 00:14:56,676] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808744] [RANK:4] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,704] [INFO] [axolotl.load_model:715] [PID:808742] [RANK:2] GPU memory usage after model load: 0.625GB (+1.723GB cache, +2.514GB misc)
[2024-04-17 00:14:56,709] [INFO] [axolotl.load_model:775] [PID:808742] [RANK:2] converting modules to torch.bfloat16 for flash attention
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[2024-04-17 00:14:56,872] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808740] [RANK:0] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,932] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808744] [RANK:4] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:56,934] [WARNING] [engine.py:1179:_do_optimizer_sanity_check] **** You are using ZeRO with an untested optimizer, proceed with caution *****
[2024-04-17 00:14:56,975] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808746] [RANK:6] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:57,051] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808746] [RANK:6] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:57,123] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808746] [RANK:6] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:57,164] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808742] [RANK:2] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:57,240] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808742] [RANK:2] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:57,315] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808742] [RANK:2] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
Parameter Offload: Total persistent parameters: 1318912 in 321 params
[2024-04-17 00:14:57,363] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808746] [RANK:6] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:57,570] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808742] [RANK:2] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:58,305] [INFO] [axolotl.load_model:715] [PID:808743] [RANK:3] GPU memory usage after model load: 0.625GB (+1.723GB cache, +2.514GB misc)
[2024-04-17 00:14:58,310] [INFO] [axolotl.load_model:775] [PID:808743] [RANK:3] converting modules to torch.bfloat16 for flash attention
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
[2024-04-17 00:14:58,743] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808743] [RANK:3] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:58,817] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808743] [RANK:3] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:58,890] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808743] [RANK:3] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
[2024-04-17 00:14:59,135] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:808743] [RANK:3] packing_efficiency_estimate: 0.9 total_num_tokens per device: 8223767
Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
Error invalid configuration argument at line 119 in file /src/csrc/ops.cu
[2024-04-17 00:15:20,046] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0
Steps to reproduce
I trained the CodeLlama-70b model using 8x A100 80GB GPUs. I performed a full fine-tune and used the following shell command to start the training process:
accelerate launch -m axolotl.cli.train examples/code-llama/70b/fft_optimized.yml --debug
Config yaml
base_model: /mnt/models/CodeLlama-70b-hf
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: false
strict: false
datasets:
- path: xxx
type:
field_instruction: instruction
field_output: response
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: /mnt/output
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
chat_template: chatml
adapter:
lora_model_dir:
lora_r:
lora_alpha:
lora_dropout:
lora_target_linear:
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.00005
train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
flash_attn_cross_entropy: false
flash_attn_rms_norm: true
flash_attn_fuse_qkv: false
flash_attn_fuse_mlp: true
warmup_steps: 200
evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_params.json # multi-gpu only
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.11.5
axolotl branch-commit
main/132eb740f036eff0fa8b239ddaf0b7a359ed1732
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Try pip install -U deepspeed.
This solved a similar problem with Mistral 7B.
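For anyone else hitting this, a minimal sketch of the suggested upgrade (which deepspeed release, if any, actually resolves the error is not confirmed in this thread):
# Upgrade deepspeed and confirm the installed version
pip install -U deepspeed
python -c "import deepspeed; print(deepspeed.__version__)"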
@jaywongs , did the above solve it for you? I find this issue dependent on machine. It may also be bitsandbytes issue.
@jaywongs , did the above solve it for you? I find this issue dependent on machine. It may also be bitsandbytes issue.
Yes, that solved it for me!
@jaywongs , did the above solve it for you? I find this issue dependent on machine. It may also be bitsandbytes issue.
Apologies for the delayed response. I have tried using the latest version of deepspeed, but the error persists.
@jaywongs , did upgrading deepspeed work for you?
@jaywongs , did upgrading deepspeed work for you?
It did not work for me; I am using deepspeed 0.14.2.
@jaywongs , did upgrading deepspeed work for you?
It did not work for me; I am using deepspeed 0.14.2.
Hello, have you solved it? I also encountered the same problem.
@jaywongs , did upgrading deepspeed work for you?
It did not work for me; I am using deepspeed 0.14.2.
Hello, have you solved it? I also encountered the same problem.
Unfortunately, I was unable to solve it in the end.
Same error here.
Error invalid configuration argument at line 218 in file /src/csrc/ops.cu
I used the winglian/axolotl:main-latest Docker image, and my configuration is shown below:
**** Axolotl Dependency Versions *****
accelerate: 0.33.0
peft: 0.12.0
transformers: 4.44.0
trl: 0.9.6
torch: 2.3.1+cu121
bitsandbytes: 0.43.3
****************************************
deepspeed: 0.15.0
Hey everyone, apologies for taking so long to circle back to this. Unfortunately, I could not reproduce this issue on RunPod nodes. I used winglian/axolotl-cloud:main-latest on 2x A40 and did not encounter this issue with QLoRA configs.
Are these all from local systems or from cloud systems? If the latter, have you tried provisioning another node? Secondly, does it only happen with certain configs (large models / small models, full fine-tune / adapter)?
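If anyone can still reproduce this, here is a minimal diagnostic sketch to help narrow it down; it assumes the /src/csrc/ops.cu in the error belongs to bitsandbytes (the config above uses the adamw_bnb_8bit optimizer), which this thread does not confirm:
# Show the driver/CUDA the node exposes and the CUDA build PyTorch was compiled against
nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"
# Run the bitsandbytes self-check, which prints its CUDA setup diagnostics
python -m bitsandbytes
As a further experiment (a suggestion, not a confirmed fix), switching optimizer: adamw_bnb_8bit to a non-bitsandbytes optimizer such as adamw_torch in the config would show whether the failure is isolated to the 8-bit optimizer kernels.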