distributed-embeddings
Synthetic Model Single GPU Example always gets OOM
The Synthetic Model single-GPU example always runs out of memory (OOM), even on an A100 machine with batch_size=1:
python main.py --model small --optimizer sgd --batch_size 1
root@2437d34894a8:/workspaces/distributed-embeddings/examples/benchmarks/synthetic_models# python main.py --model small --optimizer sgd --batch_size 1
2023-08-10 06:59:19.339639: I tensorflow/core/platform/cpu_feature_guard.cc:183] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-10 06:59:27.281514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1638] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38111 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:07:00.0, compute capability: 8.0
I0810 06:59:27.387234 139766424715712 synthetic_models.py:144] 107 embedding tables created.
I0810 06:59:27.409688 139766424715712 synthetic_models.py:83] Generated 116 categorical inputs for 107 embedding tables
2023-08-10 06:59:30.842389: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1780] (One-time warning): Not using XLA:CPU for cluster.
If you want XLA:CPU, do one of the following:
- set the TF_XLA_FLAGS to include "--tf_xla_cpu_global_jit", or
- set cpu_global_jit to true on this session's OptimizerOptions, or
- use experimental_jit_scope, or
- use tf.function(jit_compile=True).
To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a
proper command-line flag, not via TF_XLA_FLAGS).
/usr/local/lib/python3.10/dist-packages/keras/initializers/initializers.py:120: UserWarning: The initializer RandomUniform is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initalizer instance more than once.
warnings.warn(
2023-08-10 06:59:59.695493: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:655] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-08-10 07:00:13.955161: W tensorflow/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 26.29GiB (rounded to 28224000000)requested by op Fill
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2023-08-10 07:00:13.955235: I tensorflow/tsl/framework/bfc_allocator.cc:1039] BFCAllocator dump for GPU_0_bfc
2023-08-10 07:00:13.955251: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (256): Total Chunks: 244, Chunks in use: 244. 61.0KiB allocated for chunks. 61.0KiB in use in bin. 3.9KiB client-requested in use in bin.
2023-08-10 07:00:13.955261: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (512): Total Chunks: 2, Chunks in use: 2. 1.0KiB allocated for chunks. 1.0KiB in use in bin. 1.0KiB client-requested in use in bin.
2023-08-10 07:00:13.955271: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (1024): Total Chunks: 3, Chunks in use: 2. 3.8KiB allocated for chunks. 2.2KiB in use in bin. 2.0KiB client-requested in use in bin.
2023-08-10 07:00:13.955281: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (2048): Total Chunks: 1, Chunks in use: 1. 2.0KiB allocated for chunks. 2.0KiB in use in bin. 2.0KiB client-requested in use in bin.
2023-08-10 07:00:13.955290: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955299: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (8192): Total Chunks: 1, Chunks in use: 0. 10.0KiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955308: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955317: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955325: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955336: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (131072): Total Chunks: 2, Chunks in use: 1. 381.5KiB allocated for chunks. 128.0KiB in use in bin. 128.0KiB client-requested in use in bin.
2023-08-10 07:00:13.955345: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955356: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (524288): Total Chunks: 2, Chunks in use: 1. 1.55MiB allocated for chunks. 950.2KiB in use in bin. 512.0KiB client-requested in use in bin.
2023-08-10 07:00:13.955365: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955373: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (2097152): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955384: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (4194304): Total Chunks: 1, Chunks in use: 1. 4.93MiB allocated for chunks. 4.93MiB in use in bin. 4.93MiB client-requested in use in bin.
2023-08-10 07:00:13.955394: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (8388608): Total Chunks: 2, Chunks in use: 2. 17.85MiB allocated for chunks. 17.85MiB in use in bin. 15.91MiB client-requested in use in bin.
2023-08-10 07:00:13.955403: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (16777216): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955411: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955421: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955429: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2023-08-10 07:00:13.955440: I tensorflow/tsl/framework/bfc_allocator.cc:1046] Bin (268435456): Total Chunks: 2, Chunks in use: 1. 31.99GiB allocated for chunks. 26.29GiB in use in bin. 26.29GiB client-requested in use in bin.
2023-08-10 07:00:13.955450: I tensorflow/tsl/framework/bfc_allocator.cc:1062] Bin for 26.29GiB was 256.00MiB, Chunk State:
2023-08-10 07:00:13.955466: I tensorflow/tsl/framework/bfc_allocator.cc:1068] Size: 5.70GiB | Requested Size: 64B | in_use: 0 | bin_num: 20, prev: Size: 4.93MiB | Requested Size: 4.93MiB | in_use: 1 | bin_num: -1
2023-08-10 07:00:13.955474: I tensorflow/tsl/framework/bfc_allocator.cc:1075] Next region of size 34359738368
2023-08-10 07:00:13.955485: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7efca0000000 of size 28224000000 next 239
2023-08-10 07:00:13.955495: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f0332481000 of size 10330112 next 356
2023-08-10 07:00:13.955503: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f0332e5b000 of size 5165056 next 354
2023-08-10 07:00:13.955510: I tensorflow/tsl/framework/bfc_allocator.cc:1095] Free at 7f0333348000 of size 6120243200 next 18446744073709551615
2023-08-10 07:00:13.955518: I tensorflow/tsl/framework/bfc_allocator.cc:1075] Next region of size 8388608
2023-08-10 07:00:13.955525: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1247000000 of size 8388608 next 18446744073709551615
2023-08-10 07:00:13.955533: I tensorflow/tsl/framework/bfc_allocator.cc:1075] Next region of size 2097152
2023-08-10 07:00:13.955540: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400000 of size 256 next 1
2023-08-10 07:00:13.955548: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400100 of size 1280 next 2
2023-08-10 07:00:13.955555: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400600 of size 256 next 3
2023-08-10 07:00:13.955565: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400700 of size 256 next 4
2023-08-10 07:00:13.955572: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400800 of size 256 next 5
2023-08-10 07:00:13.955580: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400900 of size 256 next 6
2023-08-10 07:00:13.955588: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400a00 of size 256 next 7
2023-08-10 07:00:13.955596: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400b00 of size 256 next 8
2023-08-10 07:00:13.955605: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400c00 of size 256 next 9
2023-08-10 07:00:13.955613: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711400d00 of size 256 next 10
......
2023-08-10 07:00:13.957322: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f171140eb00 of size 256 next 232
2023-08-10 07:00:13.957329: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f171140ec00 of size 256 next 233
2023-08-10 07:00:13.957336: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f171140ed00 of size 256 next 234
2023-08-10 07:00:13.957344: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f171140ee00 of size 256 next 235
2023-08-10 07:00:13.957352: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f171140ef00 of size 256 next 236
2023-08-10 07:00:13.957359: I tensorflow/tsl/framework/bfc_allocator.cc:1095] Free at 7f171140f000 of size 10240 next 240
2023-08-10 07:00:13.957366: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711411800 of size 256 next 350
2023-08-10 07:00:13.957373: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711411900 of size 256 next 353
2023-08-10 07:00:13.957381: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711411a00 of size 2048 next 351
2023-08-10 07:00:13.957388: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412200 of size 256 next 352
2023-08-10 07:00:13.957395: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412300 of size 256 next 355
2023-08-10 07:00:13.957404: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412400 of size 1024 next 357
2023-08-10 07:00:13.957412: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412800 of size 256 next 360
2023-08-10 07:00:13.957421: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412900 of size 256 next 358
2023-08-10 07:00:13.957433: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412a00 of size 512 next 362
2023-08-10 07:00:13.957442: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412c00 of size 256 next 363
2023-08-10 07:00:13.957450: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412d00 of size 256 next 359
2023-08-10 07:00:13.957651: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711412e00 of size 256 next 368
2023-08-10 07:00:13.957659: I tensorflow/tsl/framework/bfc_allocator.cc:1095] Free at 7f1711412f00 of size 1536 next 369
2023-08-10 07:00:13.957667: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711413500 of size 512 next 371
2023-08-10 07:00:13.957675: I tensorflow/tsl/framework/bfc_allocator.cc:1095] Free at 7f1711413700 of size 259584 next 366
2023-08-10 07:00:13.957683: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711452d00 of size 131072 next 365
2023-08-10 07:00:13.957691: I tensorflow/tsl/framework/bfc_allocator.cc:1095] Free at 7f1711472d00 of size 653824 next 361
2023-08-10 07:00:13.957700: I tensorflow/tsl/framework/bfc_allocator.cc:1095] InUse at 7f1711512700 of size 973056 next 18446744073709551615
2023-08-10 07:00:13.957707: I tensorflow/tsl/framework/bfc_allocator.cc:1100] Summary of in-use Chunks by size:
2023-08-10 07:00:13.957717: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 244 Chunks of size 256 totalling 61.0KiB
2023-08-10 07:00:13.957726: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 2 Chunks of size 512 totalling 1.0KiB
2023-08-10 07:00:13.957734: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 1024 totalling 1.0KiB
2023-08-10 07:00:13.957741: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 1280 totalling 1.2KiB
2023-08-10 07:00:13.957750: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 2048 totalling 2.0KiB
2023-08-10 07:00:13.957759: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 131072 totalling 128.0KiB
2023-08-10 07:00:13.957768: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 973056 totalling 950.2KiB
2023-08-10 07:00:13.957776: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 5165056 totalling 4.93MiB
2023-08-10 07:00:13.957785: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 8388608 totalling 8.00MiB
2023-08-10 07:00:13.957794: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 10330112 totalling 9.85MiB
2023-08-10 07:00:13.957802: I tensorflow/tsl/framework/bfc_allocator.cc:1103] 1 Chunks of size 28224000000 totalling 26.29GiB
2023-08-10 07:00:13.957810: I tensorflow/tsl/framework/bfc_allocator.cc:1107] Sum Total of in-use chunks: 26.31GiB
2023-08-10 07:00:13.957818: I tensorflow/tsl/framework/bfc_allocator.cc:1109] Total bytes in pool: 34370224128 memory_limit_: 39963262976 available bytes: 5593038848 curr_region_allocation_bytes_: 34359738368
2023-08-10 07:00:13.957832: I tensorflow/tsl/framework/bfc_allocator.cc:1114] Stats:
Limit: 39963262976
InUse: 28249055744
MaxInUse: 28249055744
NumAllocs: 874
MaxAllocSize: 28224000000
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2023-08-10 07:00:13.957855: W tensorflow/tsl/framework/bfc_allocator.cc:497] ***********************************************************************************________________*
2023-08-10 07:00:13.957901: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at constant_op.cc:175 : RESOURCE_EXHAUSTED: OOM when allocating tensor with shape[220500000,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/workspaces/distributed-embeddings/examples/benchmarks/synthetic_models/main.py", line 162, in <module>
app.run(main)
File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/workspaces/distributed-embeddings/examples/benchmarks/synthetic_models/main.py", line 135, in main
loss = train_step(numerical_features, cat_features, labels)
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/func_graph.py", line 1200, in autograph_handler
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: in user code:
File "/workspaces/distributed-embeddings/examples/benchmarks/synthetic_models/main.py", line 129, in train_step *
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
File "/usr/local/lib/python3.10/dist-packages/keras/optimizers/optimizer.py", line 1174, in apply_gradients **
return super().apply_gradients(grads_and_vars, name=name)
File "/usr/local/lib/python3.10/dist-packages/keras/optimizers/optimizer.py", line 637, in apply_gradients
self.build(trainable_variables)
File "/usr/local/lib/python3.10/dist-packages/keras/optimizers/sgd.py", line 146, in build
self.add_variable_from_reference(
File "/usr/local/lib/python3.10/dist-packages/keras/optimizers/optimizer.py", line 1106, in add_variable_from_reference
return super().add_variable_from_reference(
File "/usr/local/lib/python3.10/dist-packages/keras/optimizers/optimizer.py", line 507, in add_variable_from_reference
initial_value = tf.zeros(
ResourceExhaustedError: {{function_node __wrapped__Fill_device_/job:localhost/replica:0/task:0/device:GPU:0}} OOM when allocating tensor with shape[220500000,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Fill]
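
For reference, the numbers in the log can be checked with some quick arithmetic. This is only a rough sketch of my reading of the output, not a confirmed diagnosis: the failing tensor has shape [220500000, 32] in float32, the allocator already holds one chunk of exactly that size, and the traceback shows the second request coming from tf.zeros inside the optimizer's build (add_variable_from_reference). The variable names below are made up for illustration; the values are copied from the log.

```python
# Values copied from the log above; variable names are illustrative only.
rows, cols = 220_500_000, 32      # shape reported in the OOM message
bytes_per_float32 = 4
memory_limit = 39_963_262_976     # "Limit" in the allocator stats
in_use = 28_249_055_744           # "InUse" in the allocator stats

# Size of one tensor of that shape, and what is left on the device.
tensor_bytes = rows * cols * bytes_per_float32
free_bytes = memory_limit - in_use

print(f"one tensor:  {tensor_bytes} bytes ({tensor_bytes / 2**30:.2f} GiB)")
print(f"free on GPU: {free_bytes} bytes ({free_bytes / 2**30:.2f} GiB)")
print(f"second copy fits: {tensor_bytes <= free_bytes}")
```

The tensor size comes out to the 28224000000 bytes that the allocator reports, so a second tensor of the same shape cannot fit next to the first one on a 40 GB card. Since the tensor shape does not depend on the batch size, this would also explain why batch_size=1 makes no difference.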