Results 5 issues of leasunhy

The performance summary shows that my model spends ~50% of its time in the "kernel launch" step. I find the other items easy to understand, but I have no idea what "kernel launch"...

For benchmarking purposes (refs #363), I created a Python script that is thousands of lines long. The script looks like this (it can be run with CPython): ```python class A: def...

* Use `torch.optim.AdamW` as the fallback Adam implementation.
* Support selecting the fused versions of the optimizers (via `--use-fused-optimizer`).

Speed: custom_fused (only available for Adam) > fused > foreach
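As a rough illustration of how a flag like `--use-fused-optimizer` might map onto PyTorch's optimizer implementations, here is a minimal sketch. Only the flag name comes from the PR description; `build_parser` and `optimizer_impl_kwargs` are hypothetical helpers, not the project's actual code, and the custom fused Adam mentioned above is not covered. It relies on `torch.optim.AdamW` accepting `fused` and `foreach` keyword arguments, which should not both be set to `True`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI parser; only the flag name is taken from the PR text."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--use-fused-optimizer", action="store_true")
    return parser


def optimizer_impl_kwargs(use_fused: bool) -> dict:
    """Return keyword arguments that select an optimizer implementation.

    Assumption: the non-fused path falls back to the foreach implementation.
    """
    if use_fused:
        return {"fused": True}
    return {"foreach": True}


# Parse an example command line and derive the optimizer keyword arguments.
args = build_parser().parse_args(["--use-fused-optimizer"])
kwargs = optimizer_impl_kwargs(args.use_fused_optimizer)
print(kwargs)
```

With PyTorch installed, the resulting dictionary could be passed along as `torch.optim.AdamW(model.parameters(), lr=1e-3, **kwargs)`, where `model` and the learning rate are placeholders. Whether the fused path is available depends on the PyTorch version and on where the parameters live, so a real implementation would likely need a fallback.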

This PR replaces the custom EMA implementation with the one in PyTorch. Note that this PR breaks backward compatibility: it cannot load old-format checkpoints that were generated with EMA enabled.