[Contributors Wanted] A real-world model benchmark
To this day, there are no reliable benchmarks for "real-world models" across frameworks (Keras, PyTorch, JAX/Flax). A "real-world model" is the kind of model that is actually produced and used by data scientists and engineers (for instance, the models shown at keras.io/examples or in official PyTorch tutorials). This term is used in contrast with models used in the MLPerf benchmark, which require months of optimization work by specialized engineers and do not reflect the reality of the performance of a given framework in the real world.
A real-world model would only feature the kind of performance optimization that a regular developer could achieve in ~30 minutes of work at most. In the case of Keras, that just means:
- Using `jit_compile=True` if applicable
- Finding a good value for `steps_per_execution` if applicable
- Making sure you're loading data via `tf.data`, with prefetching, perhaps caching if applicable, and `num_parallel_calls` when applicable

Nothing else.
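For illustration, here is a minimal sketch of what that level of optimization looks like in Keras, using randomly generated data (the model, sizes, and `steps_per_execution` value below are arbitrary placeholders, not part of any benchmark spec):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Placeholder data standing in for a real dataset.
x_train = np.random.randint(0, 256, size=(10_000, 784), dtype="uint8")
y_train = np.random.randint(0, 10, size=(10_000,), dtype="int32")

# Placeholder model: a small MLP.
model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10),
])

model.compile(
    optimizer="adam",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    jit_compile=True,        # XLA compilation, if applicable
    steps_per_execution=32,  # tune for the workload, if applicable
)

def preprocess(x, y):
    # Cast and rescale inputs on the fly.
    return tf.cast(x, tf.float32) / 255.0, y

# tf.data input pipeline: parallel map, caching, and prefetching.
dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                 # if the dataset fits in memory
    .shuffle(10_000)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)
)

model.fit(dataset, epochs=2)
```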
I'd be interested in seeing such a real-world model benchmark in Keras/PyTorch/Flax for a range of widely used models, e.g.
- An MLP or two
- A couple of standard convnets (e.g. EfficientNet)
- A Transformer built from scratch (e.g. this one)
This could enable us to identify areas for performance improvement.
Hello, I would like to start working on this.
I'm interested in it.
Awesome, thanks for the interest! I'd recommend starting out with a Colab notebook running on a K80 (the default Colab GPU runtime). Perhaps one of you can take the MLP + convnet and another one can take the Transformer? We can unify the infrastructure later on.
Should we just use existing open-source repos, or integrate the benchmarks into a new repo?
We probably want to start a new 3rd party repo to host the benchmarks, in the long term. In the short term we can just use Colab notebooks to prototype the logic.
Hello. Awesome! I can take the MLP+convnet part.
Hello @fchollet, I am a little confused about the requirements. Are we supposed to compare the performance of models in TensorFlow vs. PyTorch, or how exactly should we proceed? Could you give a visual of the expected result, maybe?
https://colab.research.google.com/drive/1cMNz_G6-WTEF-f09afJQUvCFuFL-Uno7?usp=sharing
> Hello @fchollet, I am a little confused about the requirements. Are we supposed to compare the performance of models in TensorFlow vs. PyTorch, or how exactly should we proceed? Could you give a visual of the expected result, maybe?
Yes. Flax as well.
The expected outcome is to have, for each model architecture considered, 3 parallel model implementations that do exactly the same thing in the 3 different frameworks, and that exemplify best practices in each framework. Then we time their average training step time and inference step time on CPU, GPU, and TPU.
The hardest part is the implementation of identical models in all 3 frameworks.
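As a rough sketch of the timing side, average step times for the Keras implementation could be measured along these lines (the helper names, warm-up, and step counts are arbitrary; equivalent loops would be written for the PyTorch and Flax versions):

```python
import time

def time_train_steps(model, dataset, num_steps=100, warmup=10):
    """Average wall-clock time per training step, in seconds.

    Warm-up steps are run first so tracing/compilation cost
    doesn't pollute the measurement.
    """
    it = iter(dataset.repeat())
    for _ in range(warmup):
        model.train_on_batch(*next(it))
    start = time.perf_counter()
    for _ in range(num_steps):
        model.train_on_batch(*next(it))
    return (time.perf_counter() - start) / num_steps

def time_inference_steps(model, dataset, num_steps=100, warmup=10):
    """Average wall-clock time per inference step, in seconds."""
    it = iter(dataset.repeat())
    for _ in range(warmup):
        model.predict_on_batch(next(it)[0])
    start = time.perf_counter()
    for _ in range(num_steps):
        model.predict_on_batch(next(it)[0])
    return (time.perf_counter() - start) / num_steps
```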
@fsx950223 generally speaking we should implement models from scratch rather than using pretrained models and fine-tuning them (e.g. no TF-Hub, just Keras layers). Also it's not very important to use a real dataset, we could start with randomly generated data at first.
Could I use the `tf-models-official` APIs?
I have updated the notebook.
https://colab.research.google.com/drive/1yRXv1lzHiPfNfu4KlkJt4pEjsQwzDgML?usp=sharing Sharing a notebook here. Will be updating this. Starting with EfficientNet
Hello @fchollet, which convnets are we targeting here? Anything specific, or will this be for a range of convnets?
I would be very curious to have @gfursin's opinion on this.
> Could I use the `tf-models-official` APIs?
For BERT specifically this seems OK. You can generally use any reusable building blocks from an official Keras/TF library (like KerasCV, KerasNLP, TF Models, etc.).
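For instance, a small BERT-style encoder can be assembled from reusable KerasNLP building blocks along these lines (sizes are arbitrary placeholders; exact layer signatures should be checked against the current KerasNLP docs):

```python
import keras_nlp
from tensorflow import keras

VOCAB_SIZE = 20_000   # placeholder vocabulary size
SEQ_LEN = 128         # placeholder sequence length

# Token IDs in, class probabilities out.
inputs = keras.Input(shape=(SEQ_LEN,), dtype="int32")
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=SEQ_LEN,
    embedding_dim=128,
)(inputs)
x = keras_nlp.layers.TransformerEncoder(intermediate_dim=512, num_heads=4)(x)
x = keras.layers.GlobalAveragePooling1D()(x)
outputs = keras.layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```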
Hello, seems like there’s already traction on this. I’d like to work on this as well + will catch up on what's already been done.
We are creating a new open workgroup in MLCommons to simplify the MLPerf inference benchmark and make it easier to plug in any real-world model, dataset, framework, compiler, and hardware. If it's of interest, please join us at https://github.com/mlcommons/ck/blob/master/docs/mlperf-education-workgroup.md - it's a community project and any feedback is very appreciated!
CC @bhack & @SamuelMarks
@fchollet Good to see your interest here. More than just benchmarking, my [proposed] approach puts in a central repository: solutions from different research papers, models built into different frameworks, and community 'zoos'.
My idea is to give each solution a proper setup.py and module hierarchy (as opposed to Jupyter notebooks and/or Python files full of hardcoded paths), then upload it to PyPI with new PyPI classifiers.
Coming from a strong compiler-technology focus, the idea is to strongly type all interfaces and expose them for use in search spaces / databases, for example to drive Neural Architecture Search, but generalized across the whole Keras ontology:
- [categorical] Optimizers
- [categorical] Loss functions
- [categorical] Metrics
- [categorical] Callbacks
- [categorical] "Applications" (transfer learning models)
- [categorical] Datasets
- …
- [continuous] Alpha
- [continuous] Beta
- [continuous] Gamma
- [continuous] Learning rate
- …
With everything strongly typed you can run all kinds of neat optimisers across both continuous and categorical variables, including wrapping the search around itself so that its output becomes input to the model (e.g., a CNN) that it is optimising (🐍).
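As a toy illustration (the class and field names below are hypothetical, not an existing API), a strongly typed search space mixing categorical and continuous variables might look like this:

```python
import math
import random
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SearchSpace:
    optimizers: Tuple[str, ...] = ("adam", "sgd", "rmsprop")        # categorical
    losses: Tuple[str, ...] = ("categorical_crossentropy", "mse")   # categorical
    learning_rate: Tuple[float, float] = (1e-5, 1e-1)               # continuous, log-uniform range

@dataclass(frozen=True)
class TrialConfig:
    optimizer: str
    loss: str
    learning_rate: float

def sample(space: SearchSpace) -> TrialConfig:
    """Draw one configuration; a real search algorithm would replace random sampling."""
    lo, hi = space.learning_rate
    return TrialConfig(
        optimizer=random.choice(space.optimizers),
        loss=random.choice(space.losses),
        learning_rate=10 ** random.uniform(math.log10(lo), math.log10(hi)),
    )

print(sample(SearchSpace()))
```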
You can imagine the types of experiments you can run:
- For the AUC-ROC metric, which Application scores highest across a given list of Datasets?
- For a new Optimizer (e.g., AdamNotAdam55), how does it compare to all other Optimizers across all Datasets for a given Metric?
…and then you go even further and create bots that automatically send PRs with a Python module hierarchy + setup.py + CI/CD to the repos of new research papers as they come out, so that their approaches become objectively comparable to others and you can determine whether there is an actual benefit.
This last point is especially important as the sheer amount of new research coming out makes it impossible to stay up-to-date.
BTW: My focus at Harvard Medical School is building new open-source medical devices and algorithms to facilitate mass screening for blinding eye diseases; glaucoma in particular. My algorithm is pretty darn good, and I keep up-to-date with Google's and Meta's innovations and the major conferences. But some random undergrad in China could come up with a better algorithm that reduces my False Positives (FP) and I would miss it. With such a system as I'm proposing [and building!] there would be an increased chance of the system automatically finding and including the new algorithm; the FPs go down, and fewer people go blind.
Now obviously this system can be used to run MobileNet in 10 different ML frameworks and compare speed, cache utilisation, temperature peaks, and whatever other metrics are of interest.
PS: Happy to talk more online or off (I've signed the Google NDA).
Thanks for your time.
Hello. So for the transformer, since the Keras code is already available, it only needs the PyTorch and JAX comparisons with best practices, right? Also, it does not have to be trained for many epochs, right?