The default training script of DLRM v2 does not reach the reported AUC.
Hi team,

I have run the default training script with the following changes, based on the results table:
1. GLOBAL_BATCH_SIZE=16384
2. WORLD_SIZE=4 (4 A100 40GB GPUs)

P.S. I did not use the unique flags, because I could not find the corresponding argument in dlrm_main.py.

The training result is: test AUC 79.86% (target: 80.30%).

Does anyone have any idea what might be wrong?
cc @janekl Any thoughts on this?
Hello, I have two questions first that hopefully will help us to figure this out more effectively:
- Could you share the exact command you tried?
- What are "global_batch_size" and "opt_base_learning_rate" in the logs produced?
I expect that (batch size, learning rate) = (16384, 0.004) should work reasonably well and stably. But bear in mind that results may vary from run to run -- as the model is initialized randomly -- so it's best to run it several times.
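The (batch size, learning rate) pairing above can be sanity-checked with the common linear-scaling heuristic. This is a generic rule of thumb, not something specific to dlrm_main.py; the reference pair (16384, 0.004) is the one from this thread:

```python
def scaled_lr(global_batch_size: int,
              ref_batch_size: int = 16384,
              ref_lr: float = 0.004) -> float:
    """Linear-scaling heuristic: scale the learning rate in
    proportion to the global batch size, relative to a known
    reference (batch size, learning rate) pair."""
    return ref_lr * global_batch_size / ref_batch_size

# Halving the global batch size halves the learning rate:
print(scaled_lr(8192))   # 0.002
print(scaled_lr(16384))  # 0.004
```

If you change GLOBAL_BATCH_SIZE without adjusting the learning rate accordingly, a gap like 79.86% vs. 80.30% AUC is plausible.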
Also, note that the threshold is 0.80275, not 0.803.
Finally, for MLPerf you should look at the "eval_accuracy" logs for the validation set, not the test set (it is better not to use the --evaluate_on_training_end flag, to avoid confusion here).
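To pull the validation "eval_accuracy" values out of the logs, a small sketch like the following can help. It assumes the MLPerf-style `:::MLLOG {json}` line format; the exact keys and metadata fields may differ in your run's output:

```python
import json

def eval_accuracies(log_lines):
    """Extract (epoch_num, value) pairs from MLPerf-style
    ':::MLLOG {json}' log lines whose key is 'eval_accuracy'."""
    results = []
    for line in log_lines:
        if ":::MLLOG" not in line:
            continue
        payload = json.loads(line.split(":::MLLOG", 1)[1])
        if payload.get("key") == "eval_accuracy":
            results.append((payload.get("metadata", {}).get("epoch_num"),
                            payload["value"]))
    return results

# Toy example with a fabricated log line:
log = [':::MLLOG {"key": "eval_accuracy", "value": 0.8021, '
       '"metadata": {"epoch_num": 0.95}}']
print(eval_accuracies(log))  # [(0.95, 0.8021)]
```

With a real run you would pass in the lines of the training log file and check whether the final value crosses 0.80275.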
Did you load the dense part? In my case only the sparse weights were loaded, but none of the dense ones. Would you mind showing me how to check the dense weights?
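One way to check whether the dense part was loaded is to partition the checkpoint's state-dict keys by name and inspect both groups. A minimal sketch: the "dense_arch"/"sparse_arch" substrings are assumptions based on common DLRM module naming, and the toy dict below stands in for a real `torch.load(...)` state dict:

```python
def split_dense_sparse(state_dict_keys,
                       dense_marker="dense_arch",
                       sparse_marker="sparse_arch"):
    """Partition parameter names into dense / sparse / other buckets,
    so you can see at a glance whether dense weights are present."""
    dense = [k for k in state_dict_keys if dense_marker in k]
    sparse = [k for k in state_dict_keys if sparse_marker in k]
    other = [k for k in state_dict_keys
             if dense_marker not in k and sparse_marker not in k]
    return dense, sparse, other

# Toy stand-in for the keys of a real checkpoint state dict:
keys = ["sparse_arch.embedding_bag_collection.weight",
        "dense_arch.model.layers.0.weight",
        "over_arch.model.layers.0.weight"]
dense, sparse, other = split_dense_sparse(keys)
print("dense params:", dense)  # empty list => dense weights missing
```

With a real checkpoint you would pass `state_dict.keys()`, and additionally compare tensor norms before and after loading to confirm the dense weights actually changed.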
@Kevin0624, has this been resolved? Closing, as it has been more than a year since the last activity.