
Network Division - BERT QSL dataset pre-processing

Open G4V opened this issue 2 years ago • 12 comments

For the non-Network Division BERT benchmarks the dataset is tokenised outside of the benchmark run, but the inference rules' wording "Text in. No compression allowed." implies that for Network Division the tokenisation instead happens as part of the SUT. Is this the intention?

https://github.com/mlcommons/inference_policies/blob/master/inference_rules.adoc#624-benchmarks-and-qsl-preprocessing

One complication here is that as part of the tokenisation process the original 10570 examples are expanded into 10833 samples. This is because some examples exceed the sequence length of 384 and so are split using a sliding window. Non-network benchmarks treat the 10833 samples independently, so there is a one-to-one mapping of QuerySample -> response. That is not the case if instead a QuerySample represents an example.

https://github.com/mlcommons/inference/blob/master/language/bert/create_squad_data.py#L179
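For reference, the splitting in create_squad_data.py follows the standard BERT doc-stride loop. A minimal sketch of that expansion (simplified: the real code also reserves room in the 384-token budget for the question tokens and the [CLS]/[SEP] markers, and uses a doc stride of 128):

```python
def doc_spans(num_doc_tokens, max_tokens_per_span, doc_stride=128):
    """Yield (start, length) sliding-window spans covering a tokenised document."""
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_per_span)
        yield (start, length)
        if start + length >= num_doc_tokens:
            break  # this window reaches the end of the document
        start += doc_stride

# An example whose context tokenises to 500 tokens becomes two samples:
print(list(doc_spans(500, 384)))  # [(0, 384), (128, 372)]
```

This is why the example count and the sample count diverge: most examples yield one span, but long ones yield two or more.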

If tokenisation is included as part of the SUT, the simplest approach may be to remove from the dataset the examples that can't be tokenised into a single sequence of length 384, keeping the relationship one to one.
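A rough sketch of that filtering, assuming a `tokenize` helper and an `examples` list standing in for the real tokeniser and the SQuAD v1.1 dev set (both names are hypothetical):

```python
MAX_SEQ_LEN = 384

def fits_single_sequence(example, tokenize):
    # [CLS] question [SEP] context [SEP] must fit within MAX_SEQ_LEN tokens
    n_tokens = len(tokenize(example.question)) + len(tokenize(example.context)) + 3
    return n_tokens <= MAX_SEQ_LEN

# Keep only examples that map to exactly one 384-token sample,
# preserving the one-to-one QuerySample -> response mapping.
filtered = [ex for ex in examples if fits_single_sequence(ex, tokenize)]
```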

The other option would be to allow the QSL to supply pre-tokenized data as per the other BERT benchmarks. This has the advantage that the network results can be compared directly with other BERT benchmarks but may not best match the use case the benchmark is looking to represent.
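For comparison, the pre-tokenised option would look much like the QSL the other BERT benchmarks already use. A sketch along those lines, assuming the LoadGen Python bindings and features produced by create_squad_data.py (class and variable names here are illustrative, not from the reference code):

```python
import mlperf_loadgen as lg

class PreTokenizedQSL:
    """QSL serving the 10833 pre-tokenised samples directly."""

    def __init__(self, eval_features, perf_count):
        self.features = eval_features  # tokenised input_ids / masks / segment_ids
        self.qsl = lg.ConstructQSL(
            len(self.features), perf_count,
            self.load_query_samples, self.unload_query_samples)

    def load_query_samples(self, sample_indices):
        # Features are already resident in host memory; nothing to do.
        pass

    def unload_query_samples(self, sample_indices):
        pass
```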

G4V · Nov 25 '22 12:11