nevakrien
> @nevakrien Current hyperparameters in examples/pretrain_bert/README.md are verified on Intel GPU Max 1550. I'm afraid that it will be Out-Of-Memory on MAX 1100. You can reduce batch size to avoid...
@yitingw1 sorry for being slow to answer, yes that's the case. I am seeing 4 of them:
```
>>> tf.config.list_physical_devices("XPU")
[PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU'), PhysicalDevice(name='/physical_device:XPU:1', device_type='XPU'), PhysicalDevice(name='/physical_device:XPU:2', device_type='XPU'), PhysicalDevice(name='/physical_device:XPU:3', device_type='XPU')]
>>>
```
which is...
Well, I changed to the recommended settings from earlier in this thread (set NUM_GPUS=1) and ran the example:
```
2024-03-14 14:22:49.706624: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform...
```
It did not seem to work, we have the same issue... again it allocates GPU memory, gets stuck in what seems like an infinite loop, and nothing meaningful really happens. I...
Ran the multi-GPU version, exact same issue:
```
(example_pretrain) sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ DATA_DIR=data
(example_pretrain) sdp@gpunode:~/intel-extension-for-tensorflow/examples/pretrain_bert/DeepLearningExamples/TensorFlow2/LanguageModeling/BERT$ bash scripts/run_pretraining_lamb.sh $TRAIN_BATCH_SIZE_PHASE1 $TRAIN_BATCH_SIZE_PHASE2 $EVAL_BATCH_SIZE $LEARNING_RATE_PHASE1 $LEARNING_RATE_PHASE2 $DATATYPE $USE_XLA $NUM_GPUS $WARMUP_STEPS_PHASE1 $WARMUP_STEPS_PHASE2 $TRAIN_STEPS $SAVE_CHECKPOINT_STEPS $NUM_ACCUMULATION_STEPS_PHASE1 $NUM_ACCUMULATION_STEPS_PHASE2...
```
Thank you for clarifying. Yes, so I went off that route and I am using what you recommended with multiple GPUs / limiting visibility; in both cases I am running into...
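For what it's worth, this is roughly what I mean by limiting visibility from inside TensorFlow rather than through the launch script. A minimal sketch, assuming the itex plugin registers the cards under the "XPU" device type as in the listing above; I have not verified that set_visible_devices behaves identically for pluggable devices:

```python
import tensorflow as tf  # assumes intel-extension-for-tensorflow is installed so the XPU plugin loads

# All cards the plugin registered; same call as in the listing above.
xpus = tf.config.list_physical_devices("XPU")
print("found", len(xpus), "XPU(s)")

if xpus:
    # Hide everything except the first card before any device context is created.
    # Assumption on my part: set_visible_devices treats pluggable "XPU" devices
    # the same way it treats CUDA GPUs.
    tf.config.set_visible_devices(xpus[:1], "XPU")
    print(tf.config.get_visible_devices("XPU"))
```

I believe exporting ZE_AFFINITY_MASK before launching does the same thing one level lower, so TensorFlow never sees the other cards at all.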
I used intel_gpu_top; I can see it does allocate memory but it does not use any cores, which is the issue.
```
sdp@gpunode:~$ xpu-smi dump -d 3 -m 0,5,18
Timestamp, DeviceId, GPU Utilization (%), GPU Memory Utilization (%), GPU Memory Used (MiB)
12:39:00.000, 3, N/A, 87.55, 43020.16
12:39:01.000, 3, N/A, 87.55, ...
```
I would really like a general-language tool for doing what this does for C, Rust, C++, etc. I am willing to put in the time to make this happen; is...
https://github.com/tsoding/arena/issues/3#issuecomment-2442563626 I was wondering if alignment would be an issue. It's totally solvable with a bit of elbow grease and some macro work. Alternatively, just make sure to use 1...