
Figure out why BERT in SHARK tank performs much better than BERT from torchbench

Open allieculp opened this issue 3 years ago • 7 comments

allieculp avatar Aug 16 '22 17:08 allieculp

Adding Ramiro's notes here.

Input and vocab size are different

  • Shark: input_size=128, vocab_size=2
  • Torchbench: input_size=512, vocab_size=30522

If I change the torchbench parameters to match Shark, IREE is 2x faster than PyTorch eager.

Using input_size=512 and a vocab_size of the same order as 30522 but 32-aligned makes IREE 2x slower than PyTorch eager.
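For context, "32-aligned" here most likely means rounding the vocab size up to the next multiple of 32; a minimal sketch (the helper name `align_up` is hypothetical, not from the benchmark code):

```python
def align_up(n: int, alignment: int = 32) -> int:
    """Round n up to the nearest multiple of alignment."""
    return ((n + alignment - 1) // alignment) * alignment

# The nearest 32-aligned value at or above the real vocab size of 30522:
aligned_vocab = align_up(30522)  # 30528
```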

allieculp avatar Aug 16 '22 17:08 allieculp

@ThomasRaoux @mattwalsh @rsuderman @KoolJBlack @mariecwhite @silvasean for vis

allieculp avatar Aug 16 '22 17:08 allieculp

A few things to double-check to make sure the comparison is apples-to-apples:

  • Since the Shark model has a smaller input and vocab size, it suggests that the model itself is smaller. For example, BERT-base and BERT-large differ in the number of layers, the number of attention heads, and the hidden size (12 vs 24 layers, 12 vs 16 attention heads, hidden size 768 vs 1024). So even if the input size and vocab size are changed to match, the internals may still differ greatly.
  • Does changing the input size and vocab size change the internals as expected? Some rules of thumb: i) BERT's compute scales quadratically with sequence length (input_size); ii) as the vocab size increases, the matrices in the input embedding and output layers should also grow.
  • BERT latency depends on the input, and thus on how the model was trained. Are the models pre-trained (on a generic dataset) or fine-tuned (for a specific task, e.g. question answering)? We would need to make sure both models were trained similarly and the input data is the same.
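The quadratic rule of thumb above can be made concrete with a back-of-the-envelope sketch. This is an approximation that counts only the self-attention score and context matmuls and ignores the QKV projections and feed-forward layers (which scale linearly in sequence length); the hidden size and layer count are the BERT-base values from the configs in this thread:

```python
def attention_flops(seq_len: int, hidden_size: int = 768, num_layers: int = 12) -> int:
    # Attention score matmul (seq x seq) and context matmul, each ~seq_len^2 * hidden_size
    # multiply-adds per layer, so roughly 2 * seq_len^2 * hidden_size FLOPs per layer.
    return num_layers * 2 * seq_len**2 * hidden_size

# Going from Shark's input_size=128 to torchbench's input_size=512:
ratio = attention_flops(512) / attention_flops(128)  # (512/128)^2 = 16x
```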

mariecwhite avatar Aug 16 '22 21:08 mariecwhite

A correction to the update I gave about the differences between bert-base-uncased in Shark vs. the one used by torchbench. Both models actually use the same vocabulary size, 30522 (Shark does use a smaller input size: 128 vs 512). The difference is that the Shark model is doing sequence classification, outputting classification scores over two labels, while the torchbench model is doing masked language modeling, outputting prediction scores for each of the 30522 vocabulary tokens.

The difference between the two is only at the end of the model. Because the Shark model is only doing classification into two labels, the final op is a matmul between tensors of size 1x768 and 768x2. The model never does a computation involving an entire tensor with a dimension of size 30522 (there is one op that just does a gather). On the other hand, the torchbench model performs a matmul at the end between tensors of size 512x768 and 768x30522, which is the main source of the performance difference between the two. So the two models are actually achieving different goals.
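The gap in the final op is easy to quantify with the standard 2*m*k*n FLOP estimate for an (m x k) @ (k x n) matmul, using the tensor shapes quoted above:

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    # An (m x k) @ (k x n) matmul costs roughly 2 * m * k * n FLOPs
    # (one multiply and one add per inner-product term).
    return 2 * m * k * n

shark_head = matmul_flops(1, 768, 2)            # classification head: ~3K FLOPs
torchbench_head = matmul_flops(512, 768, 30522) # MLM head: ~24 GFLOPs

# The MLM head is several million times more work than the classification head.
ratio = torchbench_head / shark_head
```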

ramiro050 avatar Aug 17 '22 18:08 ramiro050

All configuration parameters are the same between the two models:

Shark:

  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torchscript": true,
  "transformers_version": "4.12.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522

Torchbench:

  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.12.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522

ramiro050 avatar Aug 17 '22 18:08 ramiro050

@powderluv, @monorimet Given the differences, can we update the Shark model to be the same as what is in Torchbench?

mariecwhite avatar Sep 22 '22 21:09 mariecwhite

Sounds good. I can look into this.

monorimet avatar Sep 22 '22 21:09 monorimet

Confirm model in Torch is same as shark tank

erob710 avatar Sep 29 '22 22:09 erob710

Is this issue still valid? Sounds like they're the same now?

benvanik avatar Oct 24 '22 17:10 benvanik

Confirm model in Torch is same as shark tank

Was this a statement or a request?

silvasean avatar Oct 31 '22 09:10 silvasean

That was a request. I've looked into this and SHARK and TorchDynamo now look the same except for the sequence length. See https://github.com/nod-ai/SHARK/issues/324 for details. I'll close this and we can track it in the other issue.

mariecwhite avatar Nov 15 '22 00:11 mariecwhite