Ross Wightman
@csarofeen yes we're normalizing across C in the NCHW tensor. Thanks for the insight. At a high level, it was hard for me to fathom how the end results could...
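For reference, a minimal sketch of what normalizing across C in an NCHW tensor looks like (the actual timm variants being tested differ in implementation details):
```
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dim of an NCHW tensor (minimal sketch)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # move C last so F.layer_norm normalizes over it, then move it back
        x = x.permute(0, 2, 3, 1)
        x = F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        return x.permute(0, 3, 1, 2)
```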
@ngimel @csarofeen so, a quick check of apex LN in the ResNet50 case (which is quite a bit worse in my measurements than many of the hybrid cnn-transformer models). It's...
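Roughly the kind of micro-benchmark I mean here (a sketch; sizes are just an example, the apex import is optional and assumes apex is installed):
```
import torch
from torch import nn

# example NHWC-shaped activation (already permuted), roughly ResNet50 stage 1
x = torch.randn(64, 56, 56, 256, device='cuda', dtype=torch.half)

def bench(mod, iters=100):
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        mod(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

print('native LN:', bench(nn.LayerNorm(256).cuda().half()))
try:
    from apex.normalization import FusedLayerNorm
    print('apex fused LN:', bench(FusedLayerNorm(256).cuda().half()))
except ImportError:
    print('apex not installed')
```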
FYI my resnet50 test case for ln was a quick hack in resnet.py
```
@register_model
def resnet50_ln(pretrained=False, **kwargs):
    from .layers.norm import LayerNormExp2d, LayerNormExpNg2d, LayerNorm2d  # different LN experiments
    model_args = ...
```
A difference this big isn't adding up. Also, Natalia's promising fusion codegen tests were done in float16, while all of mine were in AMP (since that's how I train). And looking...
> Yes, layer_norm is force-cast to fp32 by amp (tbh, I don't know if it's strictly necessary or is it out of abundance of caution, I've heard some stories where...
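A quick way to see that behavior (a sketch assuming a recent PyTorch with `torch.autocast`):
```
import torch
import torch.nn.functional as F

x = torch.randn(8, 196, 768, device='cuda', dtype=torch.float16)
w = torch.randn(768, 768, device='cuda', dtype=torch.float16)

with torch.autocast('cuda', dtype=torch.float16):
    y_mm = x @ w                    # matmul runs in fp16 under autocast
    y_ln = F.layer_norm(x, (768,))  # layer_norm is run in fp32 under autocast
print(y_mm.dtype, y_ln.dtype)       # torch.float16 torch.float32
```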
> Sorry, wouldn't `_cast_if_autocast_enabled` cast all inputs to fp32? I couldn't find this function.

@ngimel it casts the args to `get_autocast_gpu_dtype()`.
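Roughly, the helper does something like this (a sketch of the described behavior, not the exact apex code):
```
import torch

def _cast_if_autocast_enabled(*args):
    # sketch: when autocast is active, cast floating-point tensors to the
    # autocast GPU dtype (fp16 or bf16); otherwise pass args through untouched
    if not torch.is_autocast_enabled():
        return args
    dtype = torch.get_autocast_gpu_dtype()
    return tuple(
        a.to(dtype) if isinstance(a, torch.Tensor) and a.is_floating_point() else a
        for a in args
    )
```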
@wuye9036 that is a known issue. I don't run into it frequently because I rarely run LR plateau schedules long enough to care much about resume (usually...
By the hack I mean editing line 233 in main to `lr_scheduler.step(start_epoch, metric=-100)`, or the opposite if your metric scale is reversed.
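In context, the workaround looks roughly like this (a sketch against timm's PlateauLRScheduler-style `step(epoch, metric=...)` signature; -100 just acts as a sentinel that won't register as an improvement when higher is better):
```
# after resuming a checkpoint and recovering start_epoch
if lr_scheduler is not None and start_epoch > 0:
    # flip the sign (e.g. a large positive value) if lower is better for your metric
    lr_scheduler.step(start_epoch, metric=-100)
```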
@Epiphqny yes, video is going to become a focus soon. I'm working on collecting some datasets and will start building/experimenting with model architectures and data loading/augmentation pipelines soonish. I have...
@tmabraham thanks, might take you up on that. Currently thinking through the abstractions, trying to hide most of the cuda + distributed config vs xla + distributed config without making too...
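Very roughly, the kind of abstraction I'm thinking about (a sketch; names here are hypothetical, not a final API, and the XLA branch assumes torch_xla is installed):
```
from dataclasses import dataclass
import torch

@dataclass
class DeviceEnv:
    device: torch.device
    world_size: int = 1
    local_rank: int = 0
    distributed: bool = False

def initialize_device(xla: bool = False) -> DeviceEnv:
    if xla:
        import torch_xla.core.xla_model as xm
        ws = xm.xrt_world_size()
        return DeviceEnv(device=xm.xla_device(), world_size=ws,
                         local_rank=xm.get_ordinal(), distributed=ws > 1)
    # cuda / cpu fallback; torch.distributed init left out for brevity
    return DeviceEnv(device=torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
```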