
[BUG] Cuda failure 700 when using DeepCompile with ZeRO stage 3

Open · lantudou opened this issue 4 months ago · 1 comment

My environment is PyTorch 2.7.1 with CUDA 12.8 on H800 GPUs. I have tested ZeRO stage 2 with DeepCompile and ZeRO stage 3 with plain torch.compile, and both work fine. But when I use ZeRO stage 3 together with DeepCompile, I hit the error shown in the log below.
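For context, here is a minimal sketch of the failing combination written as a plain `deepspeed.initialize()` call (my real run goes through Accelerate and the `ds_verify_loss` script; the config values are taken from the user config DeepSpeed prints further down in the log, and the exact `engine.compile()` signature may vary slightly across DeepSpeed versions):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Reduced ds_config matching the user config DeepSpeed prints in the log below.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3, "overlap_comm": True},
    "compile": {"deepcompile": True},   # ZeRO stage 3 + DeepCompile is the combination that fails
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "train_micro_batch_size_per_gpu": 1,
    "zero_allow_untested_optimizer": True,
}

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               optimizer=optimizer,
                                               config=ds_config)

# Triggers DeepCompile because "compile.deepcompile" is true in the config;
# the log below shows "Compiling deepcompile=True backend=inductor".
engine.compile()
```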

The test code is from https://github.com/tohtana/ds_verify_loss/tree/main, and I am confident this bug is not specific to the model. I have also tried installing DeepSpeed from the master branch and still get the same error.

By the way, could anyone share their environment when using DeepCompile with ZeRO stage 3? It would help me narrow down the problem. Thanks!
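For comparison, this is roughly how I dump version info on my side (a quick sketch; the same information also appears in `ds_report` output and in the log below):

```python
import torch
import deepspeed

print("torch     :", torch.__version__)              # 2.7.1 here
print("cuda      :", torch.version.cuda)             # 12.8 here
print("nccl      :", torch.cuda.nccl.version())      # the log reports NCCL 2.26.2
print("deepspeed :", deepspeed.__version__)          # 0.17.5+8aadf6cb (master)
print("device    :", torch.cuda.get_device_name(0))  # H800
```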

`ipex flag is deprecated, will be removed in Accelerate v1.10. From 2.7.0, PyTorch has all needed optimizations for Intel CPU and XPU. [2025-08-14 17:36:32,349] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-14 17:36:34,422] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False W0814 17:36:36.583000 233886 site-packages/torch/distributed/run.py:766] W0814 17:36:36.583000 233886 site-packages/torch/distributed/run.py:766] ***************************************** W0814 17:36:36.583000 233886 site-packages/torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0814 17:36:36.583000 233886 site-packages/torch/distributed/run.py:766] ***************************************** Namespace(model_name='Qwen/Qwen3-0.6B', batch_size=1, num_epochs=5, seq_length=512, learning_rate=1e-06, max_grad_norm=1.0, gradient_accumulation_steps=1, activation_checkpointing=False, eval=False, dataset_name='wikitext', dataset_percentage=10.0, num_layers=0, attn_impl='sdpa', compile=True, passes=None, backend='inductor', offload_opt_states=False, profile=False, deterministic=False, seed=42, profile_dir=None, bench_step=100, warmup_step=15, zero_stage=3, log_interval=10, save_weights=False, load_weights=False, use_wandb=False, wandb_project='ds-verify-loss', wandb_run_name=None, wandb_tags=[])Namespace(model_name='Qwen/Qwen3-0.6B', batch_size=1, num_epochs=5, seq_length=512, learning_rate=1e-06, max_grad_norm=1.0, gradient_accumulation_steps=1, activation_checkpointing=False, eval=False, dataset_name='wikitext', dataset_percentage=10.0, num_layers=0, attn_impl='sdpa', compile=True, passes=None, backend='inductor', offload_opt_states=False, profile=False, deterministic=False, seed=42, profile_dir=None, bench_step=100, warmup_step=15, zero_stage=3, log_interval=10, save_weights=False, load_weights=False, use_wandb=False, wandb_project='ds-verify-loss', wandb_run_name=None, wandb_tags=[]) Namespace(model_name='Qwen/Qwen3-0.6B', batch_size=1, num_epochs=5, seq_length=512, learning_rate=1e-06, max_grad_norm=1.0, gradient_accumulation_steps=1, activation_checkpointing=False, eval=False, dataset_name='wikitext', dataset_percentage=10.0, num_layers=0, attn_impl='sdpa', compile=True, passes=None, backend='inductor', offload_opt_states=False, profile=False, deterministic=False, seed=42, profile_dir=None, bench_step=100, warmup_step=15, zero_stage=3, log_interval=10, save_weights=False, load_weights=False, use_wandb=False, wandb_project='ds-verify-loss', wandb_run_name=None, wandb_tags=[])

[2025-08-14 17:36:46,891] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-14 17:36:46,906] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-14 17:36:46,907] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2025-08-14 17:36:48,546] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False [2025-08-14 17:36:48,546] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False [2025-08-14 17:36:48,546] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False [2025-08-14 17:36:49,767] [INFO] [comm.py:821:init_distributed] cdb=None [2025-08-14 17:36:49,767] [INFO] [comm.py:852:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [2025-08-14 17:36:49,767] [INFO] [comm.py:821:init_distributed] cdb=None [2025-08-14 17:36:49,767] [INFO] [comm.py:821:init_distributed] cdb=None NCCL version 2.26.2+cuda12.2 Running on device: cuda:1 is_deepspeed: True Running on device: cuda:2 is_deepspeed: True Running on device: cuda:0 is_deepspeed: True Loading model and tokenizer... [2025-08-14 17:36:53,276] [INFO] [config.py:684:init] Config mesh_device None world_size = 3 [2025-08-14 17:36:53,276] [INFO] [config.py:684:init] Config mesh_device None world_size = 3 [2025-08-14 17:36:53,276] [INFO] [config.py:684:init] Config mesh_device None world_size = 3 [2025-08-14 17:36:53,746] [INFO] [partition_parameters.py:366:exit] finished initializing model - num_params = 311, num_elems = 0.75B Loading dataset: wikitext (10.0% of data)... Dataset loaded: 180135 examples using column 'text' Tokenizing dataset...

[tokenization progress bars trimmed: Map 0% → 100%, 180135/180135 examples in ~50 s]

[2025-08-14 17:37:56,190] [INFO] [config.py:684:init] Config mesh_device None world_size = 3 [2025-08-14 17:37:56,190] [INFO] [config.py:684:init] Config mesh_device None world_size = 3

Tokenization complete. Dataset ready with 180135 examples. [2025-08-14 17:37:57,000] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.17.5+8aadf6cb, git-hash=8aadf6cb, git-branch=master [2025-08-14 17:37:57,000] [INFO] [config.py:684:init] Config mesh_device None world_size = 3 [2025-08-14 17:37:57,187] [INFO] [engine.py:1343:_configure_distributed_model] ********** distributed groups summary ********** self.dp_world_size=3 self.mp_world_size=1 self.seq_dp_world_size=3 self.sequence_parallel_size=1


[2025-08-14 17:37:57,188] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2025-08-14 17:37:57,189] [INFO] [logging.py:107:log_dist] [Rank 0] Using client Optimizer as basic optimizer [2025-08-14 17:37:57,189] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer [2025-08-14 17:37:57,199] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW [2025-08-14 17:37:57,199] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'> [2025-08-14 17:37:57,199] [INFO] [logging.py:107:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False [2025-08-14 17:37:57,200] [INFO] [logging.py:107:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer [2025-08-14 17:37:57,570] [INFO] [utils.py:781:see_memory_usage] Stage 3 initialize beginning [2025-08-14 17:37:57,575] [INFO] [utils.py:782:see_memory_usage] MA 0.39 GB Max_MA 0.77 GB CA 0.83 GB Max_CA 1 GB [2025-08-14 17:37:57,575] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 300.91 GB, percent = 14.9% [2025-08-14 17:37:57,577] [INFO] [stage3.py:186:init] Reduce bucket size 500000000 [2025-08-14 17:37:57,577] [INFO] [stage3.py:187:init] Prefetch bucket size 50000000 [2025-08-14 17:37:57,863] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] [2025-08-14 17:37:57,864] [INFO] [utils.py:782:see_memory_usage] MA 0.39 GB Max_MA 0.39 GB CA 0.83 GB Max_CA 1 GB [2025-08-14 17:37:57,864] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 300.52 GB, percent = 14.9% Parameter Offload - Persistent parameters statistics: param_count = 113, numel = 65536 [2025-08-14 17:37:58,141] [INFO] [utils.py:781:see_memory_usage] DeepSpeedZeRoOffload initialize [end] [2025-08-14 17:37:58,142] [INFO] [utils.py:782:see_memory_usage] MA 0.39 GB Max_MA 0.39 GB CA 0.83 GB Max_CA 1 GB [2025-08-14 17:37:58,142] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 300.52 GB, percent = 14.9% [2025-08-14 17:37:58,438] [INFO] [utils.py:781:see_memory_usage] Before creating fp16 partitions [2025-08-14 17:37:58,439] [INFO] [utils.py:782:see_memory_usage] MA 0.39 GB Max_MA 0.39 GB CA 0.83 GB Max_CA 1 GB [2025-08-14 17:37:58,439] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 300.56 GB, percent = 14.9% [2025-08-14 17:37:59,480] [INFO] [utils.py:781:see_memory_usage] After creating fp16 partitions: 1 [2025-08-14 17:37:59,481] [INFO] [utils.py:782:see_memory_usage] MA 0.37 GB Max_MA 0.39 GB CA 0.37 GB Max_CA 1 GB [2025-08-14 17:37:59,481] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 300.19 GB, percent = 14.9% [2025-08-14 17:37:59,741] [INFO] [utils.py:781:see_memory_usage] Before creating fp32 partitions [2025-08-14 17:37:59,741] [INFO] [utils.py:782:see_memory_usage] MA 0.37 GB Max_MA 0.37 GB CA 0.37 GB Max_CA 0 GB [2025-08-14 17:37:59,742] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 300.26 GB, percent = 14.9% [2025-08-14 17:37:59,987] [INFO] [utils.py:781:see_memory_usage] After creating fp32 partitions [2025-08-14 17:37:59,988] [INFO] [utils.py:782:see_memory_usage] MA 1.11 GB Max_MA 1.48 GB CA 1.48 GB Max_CA 1 GB [2025-08-14 17:37:59,988] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 300.27 GB, percent = 14.9% [2025-08-14 17:38:00,203] [INFO] [utils.py:781:see_memory_usage] Before initializing 
optimizer states [2025-08-14 17:38:00,203] [INFO] [utils.py:782:see_memory_usage] MA 1.11 GB Max_MA 1.11 GB CA 1.48 GB Max_CA 1 GB [2025-08-14 17:38:00,203] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 300.44 GB, percent = 14.9% [2025-08-14 17:38:00,444] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states [2025-08-14 17:38:00,444] [INFO] [utils.py:782:see_memory_usage] MA 1.11 GB Max_MA 1.85 GB CA 2.22 GB Max_CA 2 GB [2025-08-14 17:38:00,444] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 300.35 GB, percent = 14.9% [2025-08-14 17:38:00,445] [INFO] [stage3.py:554:setup_for_real_optimizer] optimizer state initialized Model prepared: <class 'deepspeed.runtime.engine.DeepSpeedEngine'> optimizer: <class 'accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper'> Model prepared: <class 'deepspeed.runtime.engine.DeepSpeedEngine'> optimizer: <class 'accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper'> [2025-08-14 17:38:00,574] [INFO] [engine.py:3975:compile] Compiling deepcompile=True backend=inductor [2025-08-14 17:38:00,574] [INFO] [engine.py:3975:compile] Compiling deepcompile=True backend=inductor [2025-08-14 17:38:00,806] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer [2025-08-14 17:38:00,806] [INFO] [utils.py:782:see_memory_usage] MA 2.41 GB Max_MA 2.99 GB CA 3.16 GB Max_CA 3 GB [2025-08-14 17:38:00,806] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 300.73 GB, percent = 14.9% [2025-08-14 17:38:00,807] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer_Stage3 [2025-08-14 17:38:00,807] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = None [2025-08-14 17:38:00,807] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed LR Scheduler = None [2025-08-14 17:38:00,807] [INFO] [logging.py:107:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-06], mom=[(0.9, 0.999)] [2025-08-14 17:38:00,808] [INFO] [logging.py:107:log_dist] [Rank 0] [TorchCheckpointEngine] Initialized with serialization = True [2025-08-14 17:38:00,808] [INFO] [config.py:954:print] DeepSpeedEngine configuration: [2025-08-14 17:38:00,808] [INFO] [config.py:958:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2025-08-14 17:38:00,808] [INFO] [config.py:958:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'intra_op_parallelism': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False} [2025-08-14 17:38:00,808] [INFO] [config.py:958:print] amp_enabled .................. False [2025-08-14 17:38:00,808] [INFO] [config.py:958:print] amp_params ................... False [2025-08-14 17:38:00,808] [INFO] [config.py:958:print] autotuning_config ............ 
{ "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] bfloat16_config .............. enabled=True immediate_grad_update=False check_grad_overflow=False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] checkpoint_config ............ {'tag_validation': 'WARN', 'checkpoint_serialization': True, 'writer': None} [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] checkpoint_parallel_write_pipeline False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] checkpoint_tag_validation_enabled True [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] checkpoint_tag_validation_fail False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f2632546440> [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] communication_data_type ...... None [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] compile_config ............... deepcompile=True free_activation=False offload_activation=False offload_opt_states=False double_buffer=True symmetric_memory=False debug_log=False offload_parameters=False sync_before_reduce=False sync_after_reduce=False sync_before_allgather=False sync_after_allgather=False keep_int_input_tensors=True keep_all_input_tensors=False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] curriculum_enabled_legacy .... False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] curriculum_params_legacy ..... False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] data_efficiency_config ....... 
{'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'pin_memory': False, 'curriculum_learning': {'enabled': False}, 'dynamic_batching': {'enabled': False, 'lr_scaling_method': 'linear', 'min_batch_size': 1, 'max_batch_size': None, 'sequence_picking_order': 'dataloader', 'verbose': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] data_efficiency_enabled ...... False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] dataloader_drop_last ......... False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] disable_allgather ............ False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] dump_state ................... False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] eigenvalue_enabled ........... False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] eigenvalue_gas_boundary_resolution 1 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] eigenvalue_layer_name ........ bert.encoder.layer [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] eigenvalue_layer_num ......... 0 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] eigenvalue_max_iter .......... 100 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] eigenvalue_stability ......... 1e-06 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] eigenvalue_tol ............... 0.01 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] eigenvalue_verbose ........... False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] elasticity_enabled ........... False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] float16_config ............... enabled=False auto_cast=False loss_scale=0.0 initial_scale_power=16 loss_scale_window=1000 hysteresis=2 consecutive_hysteresis=False min_loss_scale=1 fp16_master_weights_and_grads=False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] flops_profiler_config ........ { "enabled": false, "recompute_fwd_factor": 0.0, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] global_rank .................. 0 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] grad_accum_dtype ............. None [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] gradient_accumulation_steps .. 1 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] gradient_clipping ............ 1.0 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] gradient_predivide_factor .... 1.0 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] graph_harvesting ............. False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] load_universal_checkpoint .... False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] memory_breakdown ............. False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] mics_hierarchial_params_gather False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] mics_shard_size .............. -1 [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] monitor_config ............... 
tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] optimizer_legacy_fusion ...... False [2025-08-14 17:38:00,809] [INFO] [config.py:958:print] optimizer_name ............... None [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] optimizer_params ............. None [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True} [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] pld_enabled .................. False [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] pld_params ................... False [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] prescale_gradients ........... False [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] scheduler_name ............... None [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] scheduler_params ............. None [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] seq_parallel_communication_data_type torch.float32 [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] sparse_attention ............. None [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] sparse_gradients_enabled ..... False [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] steps_per_print .............. inf [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] tensor_parallel_config ....... dtype=torch.float16 autotp_size=0 tp_overlap_comm=False tensor_parallel=TPConfig(tp_size=1, tp_grain_size=1, mpu=None, tp_group=None) injection_policy_tuple=None keep_module_on_host=False replace_with_kernel_inject=False [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] timers_config ................ enabled=True synchronized=True [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] torch_autocast_dtype ......... None [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] torch_autocast_enabled ....... False [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] torch_autocast_lower_precision_safe_modules None [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] train_batch_size ............. 3 [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] train_micro_batch_size_per_gpu 1 [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] use_data_before_expert_parallel False [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] use_node_local_storage ....... False [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] wall_clock_breakdown ......... False [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] weight_quantization_config ... None [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] world_size ................... 3 [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] zero_allow_untested_optimizer True [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False module_granularity_threshold=0 use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False zeropp_loco_param=None mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True log_trace_cache_warnings=False [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] zero_enabled ................. True [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] zero_force_ds_cpu_optimizer .. True [2025-08-14 17:38:00,810] [INFO] [config.py:958:print] zero_optimization_stage ...... 3 [2025-08-14 17:38:00,810] [INFO] [config.py:944:print_user_config] json = { "bf16": { "enabled": true }, "zero_optimization": { "stage": 3, "overlap_comm": true }, "compile": { "deepcompile": true, "offload_activation": false, "offload_opt_states": false, "double_buffer": true, "symmetric_memory": false, "free_activation": false, "debug_log": false, "sync_before_reduce": false, "sync_after_reduce": false }, "gradient_accumulation_steps": 1, "gradient_clipping": 1.0, "steps_per_print": inf, "train_batch_size": 3, "train_micro_batch_size_per_gpu": 1, "wall_clock_breakdown": false, "fp16": { "enabled": false }, "zero_allow_untested_optimizer": true } Model prepared: <class 'deepspeed.runtime.engine.DeepSpeedEngine'> optimizer: <class 'accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper'> [2025-08-14 17:38:00,812] [INFO] [engine.py:3975:compile] Compiling deepcompile=True backend=inductor Using /root/.cache/torch_extensions/py310_cu128 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py310_cu128/dc/build.ninja... Using /root/.cache/torch_extensions/py310_cu128 as PyTorch extensions root... /hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module dc... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module dc... Using /root/.cache/torch_extensions/py310_cu128 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py310_cu128/dc/build.ninja... /hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. 
warnings.warn( Building extension module dc... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module dc... Loading extension module dc... Time to load dc op: 2.61519718170166 seconds Time to load dc op: 2.847108840942383 seconds Time to load dc op: 2.847123146057129 seconds huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... 
To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either: - Avoid using tokenizers before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false) Launching compile passes: global_steps=0 passes=[<function add_z3_gather_release at 0x7f2667577250>]

[2025-08-14 17:38:34] desktop_9010:234951:234951 [2] enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered' MemoryProfiling error CUDA error: an illegal memory access was encountered Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[rank2]: Traceback (most recent call last): [rank2]: File "/hpfs/syh/ds_verify_loss/verify_loss.py", line 344, in [rank2]: main() [rank2]: File "/hpfs/syh/ds_verify_loss/verify_loss.py", line 239, in main [rank2]: outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids, use_cache=False) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl [rank2]: return self._call_impl(*args, **kwargs) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl [rank2]: return forward_call(*args, **kwargs) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn [rank2]: ret_val = func(*args, **kwargs) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2109, in forward [rank2]: loss = self.module(*inputs, **kwargs) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl [rank2]: return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc] [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn [rank2]: return fn(*args, **kwargs) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl [rank2]: return forward_call(*args, **kwargs) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1432, in call [rank2]: return self._torchdynamo_orig_callable( [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1213, in call [rank2]: result = self._inner_convert( [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 598, in call [rank2]: return _compile( [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1110, in _compile [rank2]: raise InternalTorchDynamoError( [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 1059, in _compile [rank2]: guarded_code = compile_inner(code, one_graph, hooks, transform) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_utils_internal.py", line 97, in wrapper_function [rank2]: return function(*args, **kwargs) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 761, in compile_inner [rank2]: return _compile_inner(code, one_graph, hooks, transform) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 797, in _compile_inner [rank2]: out_code = transform_code_object(code, transform) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1422, in transform_code_object [rank2]: transformations(instructions, code_options) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 278, in _fn [rank2]: torch.cuda.set_rng_state(cuda_rng_state) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/cuda/random.py", line 76, in set_rng_state [rank2]: _lazy_call(cb) [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/cuda/init.py", line 302, in _lazy_call [rank2]: 
callable() [rank2]: File "/hpfs/syh/envs/flux-12.8/lib/python3.10/site-packages/torch/cuda/random.py", line 74, in cb [rank2]: default_generator.set_state(new_state_copy) [rank2]: torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: CUDA error: an illegal memory access was encountered [rank2]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2025-08-14 17:38:37] desktop_9010:234951:242135 [2] misc/strongstream.cc:333 NCCL WARN Cuda failure 'an illegal memory access was encountered'

[2025-08-14 17:38:37] desktop_9010:234951:242135 [2] init.cc:1896 NCCL WARN commDestroySync: comm 0x1555f290 rank 2 sync hostStream error 1

[2025-08-14 17:38:37] desktop_9010:234951:242135 [2] misc/strongstream.cc:333 NCCL WARN Cuda failure 'an illegal memory access was encountered'

[2025-08-14 17:38:37] desktop_9010:234951:242135 [2] init.cc:1899 NCCL WARN commDestroySync: comm 0x1555f290 rank 2 sync deviceStream error 1

`

lantudou · Aug 14 '25 09:08

@tohtana - curious if you have any thoughts on this?

loadams · Sep 19 '25 04:09