Matthew Frank
Proposed updates to documentation for `reader_name` argument of nvidia.dali.plugin.*.DALI*Iterator
The [`reader_name` argument of `nvidia.dali.plugin.*.DALI*Iterator()`](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/mxnet_plugin_api.html) has been difficult for us to understand. I'd like to propose a rewording of the documentation, but want to check that what I'm proposing is...
training_rules.adoc doesn't define what a benchmark _name_ is, nor what a _problem_ is, but Section 4 "Divisions" of the document implies that the benchmark name is given in the Problem...
https://github.com/mlcommons/training/tree/master/object_detection#4-model uses the term "ResNet50" twice. We are trying to standardize terminology and usage across mlcommons. Please change these to "ResNet-50" (with a dash between ResNet and 50).
https://github.com/mlcommons/training/blob/master/image_classification/README.md#1-problem says

> This benchmark uses resnet **_v1.5_** to classify images ...

While https://github.com/mlcommons/training/blob/master/image_classification/README.md#structure--loss says

> In brief, this is a 50 layer **_v1_** RNN ...

Please clarify in the...
PR https://github.com/mlcommons/training/pull/435 contains a script, `cleanup_scripts/separate_test_set.py`, that randomly extracts articles from the training set for use as an evaluation set. A total of 10000 articles are extracted...
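For readers unfamiliar with this kind of split, here is a minimal sketch of randomly holding out a fixed number of articles as an evaluation set. It is not the code in `cleanup_scripts/separate_test_set.py`: the function name `split_eval_set`, its parameters, and the use of a fixed seed are all assumptions made for illustration.

```python
import random


def split_eval_set(articles, num_eval, seed=0):
    """Hypothetical helper: hold out `num_eval` randomly chosen articles.

    Returns (train, evals). Both lists preserve the original article order.
    """
    if num_eval > len(articles):
        raise ValueError("num_eval exceeds the number of articles")
    # Assumption: a fixed seed is used so the split can be reproduced.
    rng = random.Random(seed)
    eval_idx = set(rng.sample(range(len(articles)), num_eval))
    train = [a for i, a in enumerate(articles) if i not in eval_idx]
    evals = [a for i, a in enumerate(articles) if i in eval_idx]
    return train, evals
```

Sampling indices without replacement guarantees that no article lands in both sets, and seeding a private `random.Random` instance keeps the split independent of any other random state in the process.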
...raining file-system tree in some previous round, we no longer know what the contents were intended to be, the config checker doesn't know anything about this file, and ignores it...
…ing backward compatibility
We've been transiently seeing the error `[E ProcessGroupGloo.cpp:144] Gloo connectFullMesh failed with [/opt/pytorch/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144]` when running at scales of 10k ranks or more. (The error seems to happen with increasing rate...