Matthew Frank issues

Results: 8 issues of Matthew Frank

Proposed updates to documentation for `reader_name` argument of `nvidia.dali.plugin.*.DALI*Iterator`

The [`reader_name` argument of `nvidia.dali.plugin.*.DALI*Iterator()`](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/mxnet_plugin_api.html) has been difficult for us to understand. I'd like to propose a rewording of the documentation, but want to check that what I'm proposing is...

bug

Documentation

Clarify distinction between benchmark names and closed division model names

training_rules.adoc doesn't define what a benchmark _name_ is, nor what a _problem_ is, but Section 4 "Divisions" of the document implies that the benchmark name is given in the Problem...

ResNet-50 should be called ResNet-50 across all docs

https://github.com/mlcommons/training/tree/master/object_detection#4-model uses the term "ResNet50" twice. We are trying to standardize terminology and usage across mlcommons. Please change these to "ResNet-50" (with a dash between ResNet and 50).

object_detection
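A rename like the one requested above could be scripted. The following is a minimal sketch, not part of the original issue: the function name `normalize_resnet50` is hypothetical, and matching case-insensitively (and accepting a single space before "50") is an assumption that goes beyond the two "ResNet50" occurrences the issue mentions.

```python
import re

# Matches "ResNet50" or "ResNet 50" in any letter case as a standalone word.
# The word boundaries (\b) keep identifiers such as "resnet50_v1" untouched,
# and the already-correct "ResNet-50" does not match because a hyphen is not
# part of the pattern.
_RESNET50 = re.compile(r"\bresnet ?50\b", re.IGNORECASE)


def normalize_resnet50(text: str) -> str:
    """Return `text` with standalone ResNet50 spellings rewritten as ResNet-50."""
    return _RESNET50.sub("ResNet-50", text)
```

Because code identifiers are skipped, running this over a README should only touch prose, but the result would still need a manual review before committing.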

Is the Image Classification benchmark ResNet-50 v1 or ResNet-50 v1.5?

https://github.com/mlcommons/training/blob/master/image_classification/README.md#1-problem says "This benchmark uses resnet **_v1.5_** to classify images ...", while https://github.com/mlcommons/training/blob/master/image_classification/README.md#structure--loss says "In brief, this is a 50 layer **_v1_** RNN ...". Please clarify in the...

image_classification

BERT eval set contains 60 empty articles

PR https://github.com/mlcommons/training/pull/435 contains a script, `cleanup_scripts/separate_test_set.py` that is used to randomly extract articles from the training set for use as an evaluation set. A total of 10000 articles are extracted...

language_model
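A check for the kind of problem reported above might look like the sketch below. It is illustrative only: it does not reproduce `cleanup_scripts/separate_test_set.py`, the helper name `find_empty_articles` is hypothetical, and it assumes each article is available as a plain string and that "empty" means containing nothing but whitespace.

```python
from typing import Iterable, List


def find_empty_articles(articles: Iterable[str]) -> List[int]:
    """Return the positions of articles that contain only whitespace.

    Assumption: each article is one string; the real dataset format may differ.
    """
    return [index for index, article in enumerate(articles) if not article.strip()]
```

Applied to an extracted evaluation set, the length of the returned list gives the number of empty articles, and the indices identify which ones to drop or replace.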

The system-desc-id_implementation-id.json file was removed from the training file-system tree in some previous round. We no longer know what the contents were intended to be, and the config checker doesn't know anything about this file and ignores it...

Simplify and make more rigorous the check for 'status' while maintaining backward compatibility

Optional use of MPI instead of Gloo for distributed checkpoint load/save

We've been transiently seeing the error `[E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [/opt/pytorch/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144]` when running at scales of 10k ranks or more. (The error seems to happen with increasing rate...