nv-alicheng

6 comments from nv-alicheng

Proposal for Submission Checker Refactor: 1. MODEL_CONFIG is well over 1,000 lines long and is stored as one enormous nested dictionary. It is also extremely hard to figure out what...
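One direction such a refactor could take is replacing deep dictionary lookups with typed objects. This is a minimal sketch only; the class, field names, and keys below are illustrative assumptions, not the actual MODEL_CONFIG schema from the submission checker.

```python
from dataclasses import dataclass

# Hypothetical typed replacement for one slice of the nested MODEL_CONFIG dict.
# Field names are illustrative, not the real schema.
@dataclass(frozen=True)
class ModelConfig:
    name: str
    accuracy_target: float
    latency_constraint_ns: int

# Instead of config["v3.1"]["models"]["dlrm-v2-99"]["accuracy-target"],
# look up a typed object keyed by (version, model) and read attributes.
MODEL_CONFIGS = {
    ("v3.1", "dlrm-v2-99"): ModelConfig("dlrm-v2-99", 0.99, 60_000_000),
}

cfg = MODEL_CONFIGS[("v3.1", "dlrm-v2-99")]
print(cfg.accuracy_target)  # -> 0.99
```

A dataclass-based layout makes typos fail loudly (AttributeError instead of a silent missing key) and lets tools autocomplete the available fields.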

consolidate_results.py is an optional step that was used to generate a pickle file for manual viewing and data analysis. It is not required in order to run the accuracy script. The pkl...
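Inspecting such a pickle for manual analysis could look like the round trip below. This is a sketch only: the filename and the keys in the dict are assumptions, not the actual schema produced by consolidate_results.py.

```python
import pickle

# Illustrative results dict -- the real consolidated .pkl schema may differ.
results = {"scenario": "Offline", "samples_per_second": 123456.0}

# Write the object the way a consolidation step might.
with open("consolidated_results.pkl", "wb") as f:
    pickle.dump(results, f)

# Load it back for manual viewing / data analysis.
with open("consolidated_results.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded["scenario"])  # -> Offline
```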

Full output:

```
+ python python/main.py --profile dlrm-multihot-pytorch --mlperf_conf ../../../mlperf.conf --model dlrm --model-path /home/model --dataset multihot-criteo --dataset-path /home/data/day23 --output /home/mlcommons/recommendation/dlrm_v2/pytorch/output/pytorch-gpu/dlrm --use-gpu --scenario Offline --samples-to-aggregate-quantile-file=./tools/dist_quantile.txt --max-batchsize=2048 --samples-per-query-offline=204800 --accuracy
INFO:torch.distributed.nn.jit.instantiator:Created a temporary...
```

@arjunsuresh I believe it took around 6.1 TB of disk space when I ran the Criteo preprocessing script. Not sure how much it would take if...

```
=> du -sh /home/mlperf_inf_dlrmv2/criteo/day23
169G    /home/mlperf_inf_dlrmv2/criteo/day23
```

The day23 files are around 169 GB in total. This is the breakdown:

```
8.7G    /home/mlperf_inf_dlrmv2/criteo/day23/day_23_dense.npy
681M    /home/mlperf_inf_dlrmv2/criteo/day23/day_23_labels.npy
143G    /home/mlperf_inf_dlrmv2/criteo/day23/day_23_sparse_multi_hot.npz
18G     /home/mlperf_inf_dlrmv2/criteo/day23/day_23_sparse.npy
```
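As a quick sanity check, the per-file sizes reported by `du` add up to roughly the 169G total; `du -h` rounds each entry, so the sum lands a bit above the rolled-up figure. A small sketch of the arithmetic (sizes copied from the listing above, with the 681M entry converted to GiB):

```python
# Per-file sizes from the du listing, in GiB.
sizes_gib = {
    "day_23_dense.npy": 8.7,
    "day_23_labels.npy": 681 / 1024,  # 681 MiB converted to GiB
    "day_23_sparse_multi_hot.npz": 143.0,
    "day_23_sparse.npy": 18.0,
}

total = sum(sizes_gib.values())
print(round(total, 1))  # -> 170.4, close to the reported 169G once du rounding is accounted for
```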