chenyu
it probably needs to explicitly check the end condition for `next_mask`
@chaosagent do you still want this? I don't see perf impact on benchmark so prefer to remove this unless there's big perf gain
okay, let me run BEAM=2 resnet on master too
`HSA=1 DEFAULT_FLOAT=HALF WARMUP_EPOCHS=2 BS=768 GPUS=6 BENCHMARK=10 MODEL=resnet python3 examples/mlperf/model_train.py` uses ~87.4GB on master and this branch so no memory diff.
both this PR and master have a 430ms step time with the default float changed to HALF. i think it's safe to delete this.
maybe a version of `BEAM_MIN_PROGRESS` that relies on relative time can mitigate the slowdown issue.
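A rough sketch of what a relative-time stop rule could look like, next to an absolute one. This is not the actual tinygrad code: `should_stop`, `min_progress_us`, and `min_progress_rel` are hypothetical names, and it assumes the existing check compares the absolute improvement in kernel time (in microseconds) against a fixed threshold.

```python
def should_stop(prev_best_us, new_best_us, min_progress_us=0.0, min_progress_rel=0.0):
    """Return True when a beam step did not improve enough to keep searching.

    All names here are hypothetical. Times are in microseconds.
    """
    improvement = prev_best_us - new_best_us

    # Absolute rule (assumed to mirror the current behavior):
    # stop if the step saved fewer than `min_progress_us` microseconds.
    if improvement < min_progress_us:
        return True

    # Relative rule: stop if the step saved less than a given
    # fraction of the previous best time.
    if prev_best_us > 0 and improvement / prev_best_us < min_progress_rel:
        return True

    return False


# A 1000us kernel that only got 5us faster (0.5%) stops under a 1% relative rule.
print(should_stop(1000.0, 995.0, min_progress_rel=0.01))
# A 10us kernel that got 5us faster (50%) keeps going under the same rule.
print(should_stop(10.0, 5.0, min_progress_rel=0.01))
```

The idea is that a relative threshold scales with kernel size, so a long-running kernel does not keep searching for gains that are tiny compared to its runtime, while a small kernel is not cut off after a large fractional win. Whether that actually fixes the slowdown would need to be measured.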
can we get a version with only the `BEAM_MAX_TASKS_PER_CHILD` change and the uops MAX change first? i think these 2 are the least controversial
also the benchmark beam runs took 50%-100% longer
will measure resnet compile time again after this change
fyi you can add `DEBUG=4` to print the kernel source code, and `DEBUG=5` to print UOps