(torchscale) yehuicheng@bdp-gpu04:~/torchscale/examples/fairseq$ torchrun --nproc_per_node=8 --master_port 29501 --nnodes=1 train.py /home/data/dataset/yehuicheng/LongNet_example/DNA_example/longnet_example --num-workers 0 --activation-fn gelu --share-decoder-input-output-embed --validate-interval-updates 1000 --save-interval-updates 1000 --no-epoch-checkpoints --memory-efficient-fp16 --fp16-init-scale 4 --arch transformer --task language_modeling --sample-break-mode none --tokens-per-sample 4096 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 --clip-norm 0.0 --lr 5e-4 --lr-scheduler polynomial_decay --warmup-updates 750 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 --batch-size 4 --update-freq 1 --required-batch-size-multiple 1 --total-num-update 50000 --max-update 50000 --seed 1 --ddp-backend=c10d --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779]
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779] *****************************************
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779] *****************************************
usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE]
[--tensorboard-logdir TENSORBOARD_LOGDIR] [--wandb-project WANDB_PROJECT] [--azureml-logging] [--seed SEED] [--cpu] [--tpu]
[--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads]
[--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE]
[--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR]
[--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE]
[--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--reset-logging] [--suppress-crashes] [--use-plasma-view]
[--plasma-path PLASMA_PATH] [--log-nvidia-smi]
[--criterion {sentence_ranking,moe_cross_entropy,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,cross_entropy,wav2vec,composite_loss,nat_loss,model,sentence_prediction,legacy_masked_lm_loss,squad,adaptive_loss,masked_lm,ctc,vocab_parallel_cross_entropy,masked_lm_moe_cross_entropy}]
[--tokenizer {nltk,space,moses}] [--bpe {hf_byte_bpe,gpt2,fastbpe,bytes,characters,sentencepiece,byte_bpe,subword_nmt,bert}]
[--optimizer {sgd,composite,adafactor,adamax,lamb,nag,adadelta,adam,adam8bit,cpu_adam,adagrad}]
[--lr-scheduler {inverse_sqrt,manual,tri_stage,pass_through,fixed,cosine,polynomial_decay,reduce_lr_on_plateau,triangular}]
[--scoring {wer,chrf,sacrebleu,bleu}] [--task TASK] [--num-workers NUM_WORKERS] [--num-workers-valid NUM_WORKERS_VALID]
[--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE]
[--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE] [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE]
[--dataset-impl {raw,lazy,cached,mmap,fasta}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET]
[--valid-subset VALID_SUBSET] [--combine-valid-subsets] [--ignore-unused-valid-subsets] [--validate-interval VALIDATE_INTERVAL]
[--validate-interval-updates VALIDATE_INTERVAL_UPDATES] [--validate-after-updates VALIDATE_AFTER_UPDATES]
[--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID]
[--batch-size-valid BATCH_SIZE_VALID] [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET]
[--num-shards NUM_SHARDS] [--shard-id SHARD_ID] [--distributed-world-size DISTRIBUTED_WORLD_SIZE]
[--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND]
[--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID]
[--distributed-no-spawn] [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slow_mo}]
[--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync]
[--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers] [--slowmo-momentum SLOWMO_MOMENTUM]
[--slowmo-algorithm SLOWMO_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE]
[--pipeline-model-parallel] [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES]
[--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE]
[--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE]
[--pipeline-decoder-devices PIPELINE_DECODER_DEVICES] [--pipeline-checkpoint {always,never,except_last}]
[--zero-sharding {none,os}] [--no-reshard-after-forward] [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state]
[--arch ARCH] [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE] [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM]
[--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR] [--stop-min-lr STOP_MIN_LR] [--use-bmuf] [--save-dir SAVE_DIR]
[--restore-file RESTORE_FILE] [--finetune-from-model FINETUNE_FROM_MODEL] [--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer] [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL]
[--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES]
[--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints]
[--no-last-checkpoints] [--no-best-checkpoints] [--no-save-optimizer-state] [--no-save-optimizer-state-on-training-finished]
[--symlink-best-and-last-checkpoints] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric]
[--patience PATIENCE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--load-checkpoint-on-all-dp-ranks] [--write-checkpoints-asynchronously] [--s3-upload-path S3_UPLOAD_PATH]
[--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D]
[--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N]
[--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR]
[--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos]
[--decoder-normalize-before] [--decoder-output-dim N] [--share-decoder-input-output-embed] [--share-all-embeddings]
[--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--layernorm-embedding]
[--no-scale-embedding] [--checkpoint-activations] [--offload-activations] [--no-cross-attention] [--cross-self-attention]
[--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
[--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--quant-noise-pq D] [--quant-noise-pq-block-size D] [--quant-noise-scalar D]
[--min-params-to-wrap D] [--moe-freq D] [--encoder-moe-freq D] [--decoder-moe-freq D] [--moe-expert-count D]
[--moe-gating-use-fp32] [--moe-second-expert-policy MOE_SECOND_EXPERT_POLICY] [--moe-normalize-gate-prob-before-dropping]
[--moe-expert-ffn-dim MOE_EXPERT_FFN_DIM] [--moe-top1-expert]
[--moe-eval-capacity-token-fraction MOE_EVAL_CAPACITY_TOKEN_FRACTION] [--moe-normalize-expert-grad MOE_NORMALIZE_EXPERT_GRAD]
[--use-moe-pad-mask] [--alternate-ffn-embed-dim ALTERNATE_FFN_EMBED_DIM] [--sample-break-mode {none,complete,complete_doc,eos}]
[--tokens-per-sample TOKENS_PER_SAMPLE] [--output-dictionary-size OUTPUT_DICTIONARY_SIZE] [--self-target] [--future-target]
[--past-target] [--add-bos-token] [--max-source-positions MAX_SOURCE_POSITIONS] [--max-target-positions MAX_TARGET_POSITIONS]
[--shorten-method {none,truncate,random_crop}] [--shorten-data-split-list SHORTEN_DATA_SPLIT_LIST] [--pad-to-fixed-length]
[--pad-to-fixed-bsz] [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS] [--weight-decay WEIGHT_DECAY] [--use-old-adam]
[--fp16-adam-stats] [--block-wise] [--warmup-updates WARMUP_UPDATES] [--force-anneal FORCE_ANNEAL]
[--end-learning-rate END_LEARNING_RATE] [--power POWER] [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
[--unk UNK]
data
train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
[--arch ARCH] [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE] [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM]
[--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR] [--stop-min-lr STOP_MIN_LR] [--use-bmuf] [--save-dir SAVE_DIR]
[--restore-file RESTORE_FILE] [--finetune-from-model FINETUNE_FROM_MODEL] [--reset-dataloader] [--reset-lr-scheduler]
[--reset-meters] [--reset-optimizer] [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL]
[--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES]
[--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints]
[--no-last-checkpoints] [--no-best-checkpoints] [--no-save-optimizer-state] [--no-save-optimizer-state-on-training-finished]
[--symlink-best-and-last-checkpoints] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric]
[--patience PATIENCE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--load-checkpoint-on-all-dp-ranks] [--write-checkpoints-asynchronously] [--s3-upload-path S3_UPLOAD_PATH]
[--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D]
[--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N]
[--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR]
[--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos]
[--decoder-normalize-before] [--decoder-output-dim N] [--share-decoder-input-output-embed] [--share-all-embeddings]
[--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--layernorm-embedding]
[--no-scale-embedding] [--checkpoint-activations] [--offload-activations] [--no-cross-attention] [--cross-self-attention]
[--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP]
[--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--quant-noise-pq D] [--quant-noise-pq-block-size D] [--quant-noise-scalar D]
[--min-params-to-wrap D] [--moe-freq D] [--encoder-moe-freq D] [--decoder-moe-freq D] [--moe-expert-count D]
[--moe-gating-use-fp32] [--moe-second-expert-policy MOE_SECOND_EXPERT_POLICY] [--moe-normalize-gate-prob-before-dropping]
[--moe-expert-ffn-dim MOE_EXPERT_FFN_DIM] [--moe-top1-expert]
[--moe-eval-capacity-token-fraction MOE_EVAL_CAPACITY_TOKEN_FRACTION] [--moe-normalize-expert-grad MOE_NORMALIZE_EXPERT_GRAD]
[--use-moe-pad-mask] [--alternate-ffn-embed-dim ALTERNATE_FFN_EMBED_DIM] [--sample-break-mode {none,complete,complete_doc,eos}]
[--tokens-per-sample TOKENS_PER_SAMPLE] [--output-dictionary-size OUTPUT_DICTIONARY_SIZE] [--self-target] [--future-target]
[--past-target] [--add-bos-token] [--max-source-positions MAX_SOURCE_POSITIONS] [--max-target-positions MAX_TARGET_POSITIONS]
[--shorten-method {none,truncate,random_crop}] [--shorten-data-split-list SHORTEN_DATA_SPLIT_LIST] [--pad-to-fixed-length]
[--pad-to-fixed-bsz] [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS] [--weight-decay WEIGHT_DECAY] [--use-old-adam]
[--fp16-adam-stats] [--block-wise] [--warmup-updates WARMUP_UPDATES] [--force-anneal FORCE_ANNEAL]
[--end-learning-rate END_LEARNING_RATE] [--power POWER] [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS]
[--unk UNK]
data
train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
W1108 21:43:16.641655 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273819 closing signal SIGTERM
W1108 21:43:16.642491 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273820 closing signal SIGTERM
W1108 21:43:16.642741 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273821 closing signal SIGTERM
W1108 21:43:16.643247 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273822 closing signal SIGTERM
W1108 21:43:16.643435 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273823 closing signal SIGTERM
E1108 21:43:16.708592 140431967650432 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 2273818) of binary: /home/yehuicheng/miniconda3/envs/torchscale/bin/python3.8
Traceback (most recent call last):
  File "/home/yehuicheng/miniconda3/envs/torchscale/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
[1]:
time : 2024-11-08_21:43:16
host : bdp-gpu04.bdp.biosino.org
rank : 6 (local_rank: 6)
exitcode : 2 (pid: 2273828)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-11-08_21:43:16
host : bdp-gpu04.bdp.biosino.org
rank : 7 (local_rank: 7)
exitcode : 2 (pid: 2273830)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-11-08_21:43:16
host : bdp-gpu04.bdp.biosino.org
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 2273818)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
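
The `exitcode : 2` reported for each rank is the status Python's `argparse` exits with when it meets a flag that no parser has registered, which matches the `unrecognized arguments` lines in the log. The sketch below reproduces that behaviour with a small stand-in parser. The parser, the `build_parser` and `try_parse` helpers, and the `register_longnet_flags` switch are hypothetical and only illustrate the mechanism; this is not fairseq's or torchscale's actual argument-handling code.

```python
import argparse

# The flags taken from the failing command line, reduced to a few entries.
ARGV = [
    "data-dir",
    "--arch", "transformer",
    "--flash-attention",
    "--segment-length", "[2048,4096]",
    "--dilated-ratio", "[1,2]",
]


def build_parser(register_longnet_flags: bool) -> argparse.ArgumentParser:
    """Build a stand-in parser (hypothetical; not fairseq's real parser)."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("data")
    parser.add_argument("--arch")
    if register_longnet_flags:
        # Assumption for illustration: some component registers these options.
        parser.add_argument("--flash-attention", action="store_true")
        parser.add_argument("--segment-length")
        parser.add_argument("--dilated-ratio")
    return parser


def try_parse(register_longnet_flags: bool) -> int:
    """Return 0 if ARGV parses, else the exit status argparse requested."""
    try:
        build_parser(register_longnet_flags).parse_args(ARGV)
        return 0
    except SystemExit as exc:
        # argparse prints the usage text plus the error, then exits.
        return exc.code


if __name__ == "__main__":
    print("flags not registered:", try_parse(False))
    print("flags registered:    ", try_parse(True))
```

Since fairseq adds model-specific options based on the selected `--arch`, one plausible reading of the log is that `--arch transformer` does not register `--flash-attention`, `--segment-length`, or `--dilated-ratio`; that is an inference from the error message, not something the log itself confirms.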