train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]

Open github2657529567 opened this issue 11 months ago • 0 comments

(torchscale) yehuicheng@bdp-gpu04:~/torchscale/examples/fairseq$ torchrun --nproc_per_node=8 --master_port 29501 --nnodes=1 train.py /home/data/dataset/yehuicheng/LongNet_example/DNA_example/longnet_example --num-workers 0 --activation-fn gelu --share-decoder-input-output-embed --validate-interval-updates 1000 --save-interval-updates 1000 --no-epoch-checkpoints --memory-efficient-fp16 --fp16-init-scale 4 --arch transformer --task language_modeling --sample-break-mode none --tokens-per-sample 4096 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-08 --clip-norm 0.0 --lr 5e-4 --lr-scheduler polynomial_decay --warmup-updates 750 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 --batch-size 4 --update-freq 1 --required-batch-size-multiple 1 --total-num-update 50000 --max-update 50000 --seed 1 --ddp-backend=c10d --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779]
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779] *****************************************
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1108 21:43:12.431143 140431967650432 torch/distributed/run.py:779] *****************************************

usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--wandb-project WANDB_PROJECT] [--azureml-logging] [--seed SEED] [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE] [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--reset-logging] [--suppress-crashes] [--use-plasma-view] [--plasma-path PLASMA_PATH] [--log-nvidia-smi] [--criterion {sentence_ranking,moe_cross_entropy,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,cross_entropy,wav2vec,composite_loss,nat_loss,model,sentence_prediction,legacy_masked_lm_loss,squad,adaptive_loss,masked_lm,ctc,vocab_parallel_cross_entropy,masked_lm_moe_cross_entropy}] [--tokenizer {nltk,space,moses}] [--bpe {hf_byte_bpe,gpt2,fastbpe,bytes,characters,sentencepiece,byte_bpe,subword_nmt,bert}] [--optimizer {sgd,composite,adafactor,adamax,lamb,nag,adadelta,adam,adam8bit,cpu_adam,adagrad}] [--lr-scheduler {inverse_sqrt,manual,tri_stage,pass_through,fixed,cosine,polynomial_decay,reduce_lr_on_plateau,triangular}] [--scoring {wer,chrf,sacrebleu,bleu}] [--task TASK] [--num-workers NUM_WORKERS] [--num-workers-valid NUM_WORKERS_VALID] [--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE] [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE] [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE] [--dataset-impl {raw,lazy,cached,mmap,fasta}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET] [--valid-subset 
VALID_SUBSET] [--combine-valid-subsets] [--ignore-unused-valid-subsets] [--validate-interval VALIDATE_INTERVAL] [--validate-interval-updates VALIDATE_INTERVAL_UPDATES] [--validate-after-updates VALIDATE_AFTER_UPDATES] [--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID] [--batch-size-valid BATCH_SIZE_VALID] [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS] [--shard-id SHARD_ID] [--distributed-world-size DISTRIBUTED_WORLD_SIZE] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--distributed-no-spawn] [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slow_mo}] [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync] [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers] [--slowmo-momentum SLOWMO_MOMENTUM] [--slowmo-algorithm SLOWMO_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE] [--pipeline-model-parallel] [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES] [--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE] [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE] [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES] [--pipeline-checkpoint {always,never,except_last}] [--zero-sharding {none,os}] [--no-reshard-after-forward] [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state] [--arch ARCH] [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE] [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM] [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR] [--stop-min-lr STOP_MIN_LR] [--use-bmuf] [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE] [--finetune-from-model 
FINETUNE_FROM_MODEL] [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer] [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL] [--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES] [--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints] [--no-last-checkpoints] [--no-best-checkpoints] [--no-save-optimizer-state] [--no-save-optimizer-state-on-training-finished] [--symlink-best-and-last-checkpoints] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric] [--patience PATIENCE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT] [--load-checkpoint-on-all-dp-ranks] [--write-checkpoints-asynchronously] [--s3-upload-path S3_UPLOAD_PATH] [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D] [--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N] [--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR] [--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos] [--decoder-normalize-before] [--decoder-output-dim N] [--share-decoder-input-output-embed] [--share-all-embeddings] [--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--layernorm-embedding] [--no-scale-embedding] [--checkpoint-activations] [--offload-activations] [--no-cross-attention] [--cross-self-attention] [--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP] [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--quant-noise-pq D] [--quant-noise-pq-block-size D] [--quant-noise-scalar D] [--min-params-to-wrap D] [--moe-freq D] [--encoder-moe-freq D] 
[--decoder-moe-freq D] [--moe-expert-count D] [--moe-gating-use-fp32] [--moe-second-expert-policy MOE_SECOND_EXPERT_POLICY] [--moe-normalize-gate-prob-before-dropping] [--moe-expert-ffn-dim MOE_EXPERT_FFN_DIM] [--moe-top1-expert] [--moe-eval-capacity-token-fraction MOE_EVAL_CAPACITY_TOKEN_FRACTION] [--moe-normalize-expert-grad MOE_NORMALIZE_EXPERT_GRAD] [--use-moe-pad-mask] [--alternate-ffn-embed-dim ALTERNATE_FFN_EMBED_DIM] [--sample-break-mode {none,complete,complete_doc,eos}] [--tokens-per-sample TOKENS_PER_SAMPLE] [--output-dictionary-size OUTPUT_DICTIONARY_SIZE] [--self-target] [--future-target] [--past-target] [--add-bos-token] [--max-source-positions MAX_SOURCE_POSITIONS] [--max-target-positions MAX_TARGET_POSITIONS] [--shorten-method {none,truncate,random_crop}] [--shorten-data-split-list SHORTEN_DATA_SPLIT_LIST] [--pad-to-fixed-length] [--pad-to-fixed-bsz] [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS] [--weight-decay WEIGHT_DECAY] [--use-old-adam] [--fp16-adam-stats] [--block-wise] [--warmup-updates WARMUP_UPDATES] [--force-anneal FORCE_ANNEAL] [--end-learning-rate END_LEARNING_RATE] [--power POWER] [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS] [--unk UNK] data
train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--wandb-project WANDB_PROJECT] [--azureml-logging] [--seed SEED] [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE] [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--reset-logging] [--suppress-crashes]
[--use-plasma-view] [--plasma-path PLASMA_PATH] [--log-nvidia-smi] [--criterion {sentence_ranking,moe_cross_entropy,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,cross_entropy,wav2vec,composite_loss,nat_loss,model,sentence_prediction,legacy_masked_lm_loss,squad,adaptive_loss,masked_lm,ctc,vocab_parallel_cross_entropy,masked_lm_moe_cross_entropy}] [--tokenizer {nltk,space,moses}] [--bpe {hf_byte_bpe,gpt2,fastbpe,bytes,characters,sentencepiece,byte_bpe,subword_nmt,bert}] [--optimizer {sgd,composite,adafactor,adamax,lamb,nag,adadelta,adam,adam8bit,cpu_adam,adagrad}] [--lr-scheduler {inverse_sqrt,manual,tri_stage,pass_through,fixed,cosine,polynomial_decay,reduce_lr_on_plateau,triangular}] [--scoring {wer,chrf,sacrebleu,bleu}] [--task TASK] [--num-workers NUM_WORKERS] [--num-workers-valid NUM_WORKERS_VALID] [--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE] [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE] [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE] [--dataset-impl {raw,lazy,cached,mmap,fasta}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET] [--combine-valid-subsets] [--ignore-unused-valid-subsets] [--validate-interval VALIDATE_INTERVAL] [--validate-interval-updates VALIDATE_INTERVAL_UPDATES] [--validate-after-updates VALIDATE_AFTER_UPDATES] [--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID] [--batch-size-valid BATCH_SIZE_VALID] [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS] [--shard-id SHARD_ID] [--distributed-world-size DISTRIBUTED_WORLD_SIZE] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--distributed-no-spawn] [--ddp-backend 
{c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slow_mo}] [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync] [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers] [--slowmo-momentum SLOWMO_MOMENTUM] [--slowmo-algorithm SLOWMO_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE] [--pipeline-model-parallel] [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES] [--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE] [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE] [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES] [--pipeline-checkpoint {always,never,except_last}] [--zero-sharding {none,os}] [--no-reshard-after-forward] [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state] [--arch ARCH] [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE] [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM] [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR] [--stop-min-lr STOP_MIN_LR] [--use-bmuf] [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE] [--finetune-from-model FINETUNE_FROM_MODEL] [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer] [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL] [--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES] [--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints] [--no-last-checkpoints] [--no-best-checkpoints] [--no-save-optimizer-state] [--no-save-optimizer-state-on-training-finished] [--symlink-best-and-last-checkpoints] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric] [--patience PATIENCE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT] [--load-checkpoint-on-all-dp-ranks] 
[--write-checkpoints-asynchronously] [--s3-upload-path S3_UPLOAD_PATH] [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D] [--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N] [--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR] [--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos] [--decoder-normalize-before] [--decoder-output-dim N] [--share-decoder-input-output-embed] [--share-all-embeddings] [--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--layernorm-embedding] [--no-scale-embedding] [--checkpoint-activations] [--offload-activations] [--no-cross-attention] [--cross-self-attention] [--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP] [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--quant-noise-pq D] [--quant-noise-pq-block-size D] [--quant-noise-scalar D] [--min-params-to-wrap D] [--moe-freq D] [--encoder-moe-freq D] [--decoder-moe-freq D] [--moe-expert-count D] [--moe-gating-use-fp32] [--moe-second-expert-policy MOE_SECOND_EXPERT_POLICY] [--moe-normalize-gate-prob-before-dropping] [--moe-expert-ffn-dim MOE_EXPERT_FFN_DIM] [--moe-top1-expert] [--moe-eval-capacity-token-fraction MOE_EVAL_CAPACITY_TOKEN_FRACTION] [--moe-normalize-expert-grad MOE_NORMALIZE_EXPERT_GRAD] [--use-moe-pad-mask] [--alternate-ffn-embed-dim ALTERNATE_FFN_EMBED_DIM] [--sample-break-mode {none,complete,complete_doc,eos}] [--tokens-per-sample TOKENS_PER_SAMPLE] [--output-dictionary-size OUTPUT_DICTIONARY_SIZE] [--self-target] [--future-target] [--past-target] [--add-bos-token] [--max-source-positions MAX_SOURCE_POSITIONS] [--max-target-positions MAX_TARGET_POSITIONS] [--shorten-method {none,truncate,random_crop}] [--shorten-data-split-list 
SHORTEN_DATA_SPLIT_LIST] [--pad-to-fixed-length] [--pad-to-fixed-bsz] [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS] [--weight-decay WEIGHT_DECAY] [--use-old-adam] [--fp16-adam-stats] [--block-wise] [--warmup-updates WARMUP_UPDATES] [--force-anneal FORCE_ANNEAL] [--end-learning-rate END_LEARNING_RATE] [--power POWER] [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS] [--unk UNK] data train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2] train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2] usage: train.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--wandb-project WANDB_PROJECT] [--azureml-logging] [--seed SEED] [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE] [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--reset-logging] [--suppress-crashes] [--use-plasma-view] [--plasma-path PLASMA_PATH] [--log-nvidia-smi] [--criterion {sentence_ranking,moe_cross_entropy,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,cross_entropy,wav2vec,composite_loss,nat_loss,model,sentence_prediction,legacy_masked_lm_loss,squad,adaptive_loss,masked_lm,ctc,vocab_parallel_cross_entropy,masked_lm_moe_cross_entropy}] [--tokenizer {nltk,space,moses}] [--bpe {hf_byte_bpe,gpt2,fastbpe,bytes,characters,sentencepiece,byte_bpe,subword_nmt,bert}] [--optimizer 
{sgd,composite,adafactor,adamax,lamb,nag,adadelta,adam,adam8bit,cpu_adam,adagrad}] [--lr-scheduler {inverse_sqrt,manual,tri_stage,pass_through,fixed,cosine,polynomial_decay,reduce_lr_on_plateau,triangular}] [--scoring {wer,chrf,sacrebleu,bleu}] [--task TASK] [--num-workers NUM_WORKERS] [--num-workers-valid NUM_WORKERS_VALID] [--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE] [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE] [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE] [--dataset-impl {raw,lazy,cached,mmap,fasta}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET] [--combine-valid-subsets] [--ignore-unused-valid-subsets] [--validate-interval VALIDATE_INTERVAL] [--validate-interval-updates VALIDATE_INTERVAL_UPDATES] [--validate-after-updates VALIDATE_AFTER_UPDATES] [--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID] [--batch-size-valid BATCH_SIZE_VALID] [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS] [--shard-id SHARD_ID] [--distributed-world-size DISTRIBUTED_WORLD_SIZE] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--distributed-no-spawn] [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slow_mo}] [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync] [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers] [--slowmo-momentum SLOWMO_MOMENTUM] [--slowmo-algorithm SLOWMO_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE] [--pipeline-model-parallel] [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES] [--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance 
PIPELINE_ENCODER_BALANCE] [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE] [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES] [--pipeline-checkpoint {always,never,except_last}] [--zero-sharding {none,os}] [--no-reshard-after-forward] [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state] [--arch ARCH] [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE] [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM] [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR] [--stop-min-lr STOP_MIN_LR] [--use-bmuf] [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE] [--finetune-from-model FINETUNE_FROM_MODEL] [--reset-dataloader] [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer] [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL] [--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES] [--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints] [--no-last-checkpoints] [--no-best-checkpoints] [--no-save-optimizer-state] [--no-save-optimizer-state-on-training-finished] [--symlink-best-and-last-checkpoints] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric] [--patience PATIENCE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT] [--load-checkpoint-on-all-dp-ranks] [--write-checkpoints-asynchronously] [--s3-upload-path S3_UPLOAD_PATH] [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D] [--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N] [--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR] [--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos] 
[--decoder-normalize-before] [--decoder-output-dim N] [--share-decoder-input-output-embed] [--share-all-embeddings] [--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--layernorm-embedding] [--no-scale-embedding] [--checkpoint-activations] [--offload-activations] [--no-cross-attention] [--cross-self-attention] [--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP] [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--quant-noise-pq D] [--quant-noise-pq-block-size D] [--quant-noise-scalar D] [--min-params-to-wrap D] [--moe-freq D] [--encoder-moe-freq D] [--decoder-moe-freq D] [--moe-expert-count D] [--moe-gating-use-fp32] [--moe-second-expert-policy MOE_SECOND_EXPERT_POLICY] [--moe-normalize-gate-prob-before-dropping] [--moe-expert-ffn-dim MOE_EXPERT_FFN_DIM] [--moe-top1-expert] [--moe-eval-capacity-token-fraction MOE_EVAL_CAPACITY_TOKEN_FRACTION] [--moe-normalize-expert-grad MOE_NORMALIZE_EXPERT_GRAD] [--use-moe-pad-mask] [--alternate-ffn-embed-dim ALTERNATE_FFN_EMBED_DIM] [--sample-break-mode {none,complete,complete_doc,eos}] [--tokens-per-sample TOKENS_PER_SAMPLE] [--output-dictionary-size OUTPUT_DICTIONARY_SIZE] [--self-target] [--future-target] [--past-target] [--add-bos-token] [--max-source-positions MAX_SOURCE_POSITIONS] [--max-target-positions MAX_TARGET_POSITIONS] [--shorten-method {none,truncate,random_crop}] [--shorten-data-split-list SHORTEN_DATA_SPLIT_LIST] [--pad-to-fixed-length] [--pad-to-fixed-bsz] [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS] [--weight-decay WEIGHT_DECAY] [--use-old-adam] [--fp16-adam-stats] [--block-wise] [--warmup-updates WARMUP_UPDATES] [--force-anneal FORCE_ANNEAL] [--end-learning-rate END_LEARNING_RATE] [--power POWER] [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS] [--unk UNK] data train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2] usage: train.py [-h] 
[--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format {json,none,simple,tqdm}] [--log-file LOG_FILE] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--wandb-project WANDB_PROJECT] [--azureml-logging] [--seed SEED] [--cpu] [--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE] [--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE] [--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE] [--model-parallel-size MODEL_PARALLEL_SIZE] [--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile] [--reset-logging] [--suppress-crashes] [--use-plasma-view] [--plasma-path PLASMA_PATH] [--log-nvidia-smi] [--criterion {sentence_ranking,moe_cross_entropy,label_smoothed_cross_entropy,label_smoothed_cross_entropy_with_alignment,cross_entropy,wav2vec,composite_loss,nat_loss,model,sentence_prediction,legacy_masked_lm_loss,squad,adaptive_loss,masked_lm,ctc,vocab_parallel_cross_entropy,masked_lm_moe_cross_entropy}] [--tokenizer {nltk,space,moses}] [--bpe {hf_byte_bpe,gpt2,fastbpe,bytes,characters,sentencepiece,byte_bpe,subword_nmt,bert}] [--optimizer {sgd,composite,adafactor,adamax,lamb,nag,adadelta,adam,adam8bit,cpu_adam,adagrad}] [--lr-scheduler {inverse_sqrt,manual,tri_stage,pass_through,fixed,cosine,polynomial_decay,reduce_lr_on_plateau,triangular}] [--scoring {wer,chrf,sacrebleu,bleu}] [--task TASK] [--num-workers NUM_WORKERS] [--num-workers-valid NUM_WORKERS_VALID] [--skip-invalid-size-inputs-valid-test] [--max-tokens MAX_TOKENS] [--batch-size BATCH_SIZE] [--required-batch-size-multiple REQUIRED_BATCH_SIZE_MULTIPLE] [--required-seq-len-multiple REQUIRED_SEQ_LEN_MULTIPLE] [--dataset-impl {raw,lazy,cached,mmap,fasta}] [--data-buffer-size DATA_BUFFER_SIZE] [--train-subset TRAIN_SUBSET] [--valid-subset VALID_SUBSET] 
[--combine-valid-subsets] [--ignore-unused-valid-subsets] [--validate-interval VALIDATE_INTERVAL] [--validate-interval-updates VALIDATE_INTERVAL_UPDATES] [--validate-after-updates VALIDATE_AFTER_UPDATES] [--fixed-validation-seed FIXED_VALIDATION_SEED] [--disable-validation] [--max-tokens-valid MAX_TOKENS_VALID] [--batch-size-valid BATCH_SIZE_VALID] [--max-valid-steps MAX_VALID_STEPS] [--curriculum CURRICULUM] [--gen-subset GEN_SUBSET] [--num-shards NUM_SHARDS] [--shard-id SHARD_ID] [--distributed-world-size DISTRIBUTED_WORLD_SIZE] [--distributed-rank DISTRIBUTED_RANK] [--distributed-backend DISTRIBUTED_BACKEND] [--distributed-init-method DISTRIBUTED_INIT_METHOD] [--distributed-port DISTRIBUTED_PORT] [--device-id DEVICE_ID] [--distributed-no-spawn] [--ddp-backend {c10d,fully_sharded,legacy_ddp,no_c10d,pytorch_ddp,slow_mo}] [--bucket-cap-mb BUCKET_CAP_MB] [--fix-batches-to-gpus] [--find-unused-parameters] [--fast-stat-sync] [--heartbeat-timeout HEARTBEAT_TIMEOUT] [--broadcast-buffers] [--slowmo-momentum SLOWMO_MOMENTUM] [--slowmo-algorithm SLOWMO_ALGORITHM] [--localsgd-frequency LOCALSGD_FREQUENCY] [--nprocs-per-node NPROCS_PER_NODE] [--pipeline-model-parallel] [--pipeline-balance PIPELINE_BALANCE] [--pipeline-devices PIPELINE_DEVICES] [--pipeline-chunks PIPELINE_CHUNKS] [--pipeline-encoder-balance PIPELINE_ENCODER_BALANCE] [--pipeline-encoder-devices PIPELINE_ENCODER_DEVICES] [--pipeline-decoder-balance PIPELINE_DECODER_BALANCE] [--pipeline-decoder-devices PIPELINE_DECODER_DEVICES] [--pipeline-checkpoint {always,never,except_last}] [--zero-sharding {none,os}] [--no-reshard-after-forward] [--fp32-reduce-scatter] [--cpu-offload] [--use-sharded-state] [--arch ARCH] [--max-epoch MAX_EPOCH] [--max-update MAX_UPDATE] [--stop-time-hours STOP_TIME_HOURS] [--clip-norm CLIP_NORM] [--sentence-avg] [--update-freq UPDATE_FREQ] [--lr LR] [--stop-min-lr STOP_MIN_LR] [--use-bmuf] [--save-dir SAVE_DIR] [--restore-file RESTORE_FILE] [--finetune-from-model FINETUNE_FROM_MODEL] 
[--reset-dataloader] [--reset-lr-scheduler] [--reset-meters] [--reset-optimizer] [--optimizer-overrides OPTIMIZER_OVERRIDES] [--save-interval SAVE_INTERVAL] [--save-interval-updates SAVE_INTERVAL_UPDATES] [--keep-interval-updates KEEP_INTERVAL_UPDATES] [--keep-last-epochs KEEP_LAST_EPOCHS] [--keep-best-checkpoints KEEP_BEST_CHECKPOINTS] [--no-save] [--no-epoch-checkpoints] [--no-last-checkpoints] [--no-best-checkpoints] [--no-save-optimizer-state] [--no-save-optimizer-state-on-training-finished] [--symlink-best-and-last-checkpoints] [--best-checkpoint-metric BEST_CHECKPOINT_METRIC] [--maximize-best-checkpoint-metric] [--patience PATIENCE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT] [--load-checkpoint-on-all-dp-ranks] [--write-checkpoints-asynchronously] [--s3-upload-path S3_UPLOAD_PATH] [--activation-fn {relu,gelu,gelu_fast,gelu_accurate,tanh,linear}] [--dropout D] [--attention-dropout D] [--activation-dropout D] [--encoder-embed-path STR] [--encoder-embed-dim N] [--encoder-ffn-embed-dim N] [--encoder-layers N] [--encoder-attention-heads N] [--encoder-normalize-before] [--encoder-learned-pos] [--decoder-embed-path STR] [--decoder-embed-dim N] [--decoder-ffn-embed-dim N] [--decoder-layers N] [--decoder-attention-heads N] [--decoder-learned-pos] [--decoder-normalize-before] [--decoder-output-dim N] [--share-decoder-input-output-embed] [--share-all-embeddings] [--no-token-positional-embeddings] [--adaptive-softmax-cutoff EXPR] [--adaptive-softmax-dropout D] [--layernorm-embedding] [--no-scale-embedding] [--checkpoint-activations] [--offload-activations] [--no-cross-attention] [--cross-self-attention] [--encoder-layerdrop D] [--decoder-layerdrop D] [--encoder-layers-to-keep ENCODER_LAYERS_TO_KEEP] [--decoder-layers-to-keep DECODER_LAYERS_TO_KEEP] [--quant-noise-pq D] [--quant-noise-pq-block-size D] [--quant-noise-scalar D] [--min-params-to-wrap D] [--moe-freq D] [--encoder-moe-freq D] [--decoder-moe-freq D] 
[--moe-expert-count D] [--moe-gating-use-fp32] [--moe-second-expert-policy MOE_SECOND_EXPERT_POLICY] [--moe-normalize-gate-prob-before-dropping] [--moe-expert-ffn-dim MOE_EXPERT_FFN_DIM] [--moe-top1-expert] [--moe-eval-capacity-token-fraction MOE_EVAL_CAPACITY_TOKEN_FRACTION] [--moe-normalize-expert-grad MOE_NORMALIZE_EXPERT_GRAD] [--use-moe-pad-mask] [--alternate-ffn-embed-dim ALTERNATE_FFN_EMBED_DIM] [--sample-break-mode {none,complete,complete_doc,eos}] [--tokens-per-sample TOKENS_PER_SAMPLE] [--output-dictionary-size OUTPUT_DICTIONARY_SIZE] [--self-target] [--future-target] [--past-target] [--add-bos-token] [--max-source-positions MAX_SOURCE_POSITIONS] [--max-target-positions MAX_TARGET_POSITIONS] [--shorten-method {none,truncate,random_crop}] [--shorten-data-split-list SHORTEN_DATA_SPLIT_LIST] [--pad-to-fixed-length] [--pad-to-fixed-bsz] [--adam-betas ADAM_BETAS] [--adam-eps ADAM_EPS] [--weight-decay WEIGHT_DECAY] [--use-old-adam] [--fp16-adam-stats] [--block-wise] [--warmup-updates WARMUP_UPDATES] [--force-anneal FORCE_ANNEAL] [--end-learning-rate END_LEARNING_RATE] [--power POWER] [--total-num-update TOTAL_NUM_UPDATE] [--pad PAD] [--eos EOS] [--unk UNK] data
train.py: error: unrecognized arguments: --flash-attention --segment-length [2048,4096] --dilated-ratio [1,2]
W1108 21:43:16.641655 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273819 closing signal SIGTERM
W1108 21:43:16.642491 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273820 closing signal SIGTERM
W1108 21:43:16.642741 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273821 closing signal SIGTERM
W1108 21:43:16.643247 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273822 closing signal SIGTERM
W1108 21:43:16.643435 140431967650432 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2273823 closing signal SIGTERM
E1108 21:43:16.708592 140431967650432 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 2) local_rank: 0 (pid: 2273818) of binary: /home/yehuicheng/miniconda3/envs/torchscale/bin/python3.8
Traceback (most recent call last):
  File "/home/yehuicheng/miniconda3/envs/torchscale/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/yehuicheng/miniconda3/envs/torchscale/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
  time      : 2024-11-08_21:43:16
  host      : bdp-gpu04.bdp.biosino.org
  rank      : 6 (local_rank: 6)
  exitcode  : 2 (pid: 2273828)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-11-08_21:43:16
  host      : bdp-gpu04.bdp.biosino.org
  rank      : 7 (local_rank: 7)
  exitcode  : 2 (pid: 2273830)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2024-11-08_21:43:16
  host      : bdp-gpu04.bdp.biosino.org
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 2273818)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Nov 08 '24 13:11 github2657529567
