The default training script of DLRM v2 does not reach the reported AUC.
Hi team,

I have run the default training script with the following changes, based on the results table:
1. GLOBAL_BATCH_SIZE=16384
2. WORLD_SIZE=4 (4 A100 40GB GPUs)

P.S. I did not use the unique flags, because I could not find the corresponding argument in dlrm_main.py.

The training result is: test AUC 79.86% (target: 80.30%).

Does anyone have any idea what might be wrong?
cc @janekl Any thoughts on this?
Hello, I have two questions first that hopefully will help us to figure this out more effectively:
- Could you share the exact command you tried?
- What are "global_batch_size" and "opt_base_learning_rate" in the logs produced?
I expect that (batch size, learning rate) = (16384, 0.004) should work reasonably well and stably. But bear in mind that results may vary from run to run -- as the model is initialized randomly -- so it's best to run it several times.
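The (batch size, learning rate) pairing above can be sanity-checked with the common linear-scaling heuristic. This is a generic rule of thumb, not something specific to dlrm_main.py; the reference pair (16384, 0.004) is the one from this thread:

```python
def scaled_lr(global_batch_size: int,
              ref_batch_size: int = 16384,
              ref_lr: float = 0.004) -> float:
    """Linear-scaling heuristic: scale the learning rate in
    proportion to the global batch size, relative to a known
    reference (batch size, learning rate) pair."""
    return ref_lr * global_batch_size / ref_batch_size

# Halving the global batch size halves the learning rate:
print(scaled_lr(8192))   # 0.002
print(scaled_lr(16384))  # 0.004
```

If you change GLOBAL_BATCH_SIZE without adjusting the learning rate accordingly, a gap like 79.86% vs. 80.30% AUC is plausible.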
Also, note that the threshold is 0.80275, not 0.803.
Finally, for MLPerf you should look at the "eval_accuracy" logs for the validation set, not the test set (it is better not to use the --evaluate_on_training_end flag, to avoid confusion here).
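To pull the validation "eval_accuracy" values out of the logs, a small sketch like the following can help. It assumes the MLPerf-style `:::MLLOG {json}` line format; the exact keys and metadata fields may differ in your run's output:

```python
import json

def eval_accuracies(log_lines):
    """Extract (epoch_num, value) pairs from MLPerf-style
    ':::MLLOG {json}' log lines whose key is 'eval_accuracy'."""
    results = []
    for line in log_lines:
        if ":::MLLOG" not in line:
            continue
        payload = json.loads(line.split(":::MLLOG", 1)[1])
        if payload.get("key") == "eval_accuracy":
            results.append((payload.get("metadata", {}).get("epoch_num"),
                            payload["value"]))
    return results

# Toy example with a fabricated log line:
log = [':::MLLOG {"key": "eval_accuracy", "value": 0.8021, '
       '"metadata": {"epoch_num": 0.95}}']
print(eval_accuracies(log))  # [(0.95, 0.8021)]
```

With a real run you would pass in the lines of the training log file and check whether the final value crosses 0.80275.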
Did you load the dense part? In my case only the sparse weights were loaded, but none of the dense ones. Would you mind showing me how to check the dense weights?
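One way to check whether the dense part was loaded is to partition the checkpoint's state-dict keys by name and inspect both groups. A minimal sketch: the "dense_arch"/"sparse_arch" substrings are assumptions based on common DLRM module naming, and the toy dict below stands in for a real `torch.load(...)` state dict:

```python
def split_dense_sparse(state_dict_keys,
                       dense_marker="dense_arch",
                       sparse_marker="sparse_arch"):
    """Partition parameter names into dense / sparse / other buckets,
    so you can see at a glance whether dense weights are present."""
    dense = [k for k in state_dict_keys if dense_marker in k]
    sparse = [k for k in state_dict_keys if sparse_marker in k]
    other = [k for k in state_dict_keys
             if dense_marker not in k and sparse_marker not in k]
    return dense, sparse, other

# Toy stand-in for the keys of a real checkpoint state dict:
keys = ["sparse_arch.embedding_bag_collection.weight",
        "dense_arch.model.layers.0.weight",
        "over_arch.model.layers.0.weight"]
dense, sparse, other = split_dense_sparse(keys)
print("dense params:", dense)  # empty list => dense weights missing
```

With a real checkpoint you would pass `state_dict.keys()`, and additionally compare tensor norms before and after loading to confirm the dense weights actually changed.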
@Kevin0624, has this been resolved? Closing, as it has been more than a year since the last activity.