chenyu

Results 33 comments of chenyu

it probably needs to explicitly check the end condition for `next_mask`

@chaosagent do you still want this? I don't see a perf impact on the benchmark, so I'd prefer to remove this unless there's a big perf gain

okay let me run BEAM=2 resnet on master too

`HSA=1 DEFAULT_FLOAT=HALF WARMUP_EPOCHS=2 BS=768 GPUS=6 BENCHMARK=10 MODEL=resnet python3 examples/mlperf/model_train.py` uses ~87.4GB on master and this branch so no memory diff.

both this pr and master have 430ms step time with default changed to HALF. i think it's safe to delete this.

maybe a version of `BEAM_MIN_PROGRESS` that relies on relative time can mitigate the slowdown issue.
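A rough sketch of what a relative stopping rule could look like next to an absolute one. This is only an illustration: the function names, the microsecond unit of the absolute threshold, and the 1% relative threshold are all assumptions for the example, not the actual search code.

```python
def should_stop_absolute(prev_best_s: float, new_best_s: float, min_progress_us: float) -> bool:
    """Hypothetical absolute rule: stop when the best kernel time improved
    by less than `min_progress_us` microseconds in this search round."""
    return (prev_best_s - new_best_s) * 1e6 < min_progress_us


def should_stop_relative(prev_best_s: float, new_best_s: float, min_rel_progress: float) -> bool:
    """Hypothetical relative rule: stop when the best kernel time improved
    by less than `min_rel_progress` as a fraction of the previous best."""
    if prev_best_s <= 0:
        return True  # nothing meaningful to compare against
    return (prev_best_s - new_best_s) / prev_best_s < min_rel_progress


# Slow kernel: 10 ms -> 9.99 ms is a 10 us gain, but only 0.1% of the runtime.
slow_abs = should_stop_absolute(10e-3, 9.99e-3, min_progress_us=0.01)
slow_rel = should_stop_relative(10e-3, 9.99e-3, min_rel_progress=0.01)

# Fast kernel: 20 us -> 15 us is a 5 us gain, which is 25% of the runtime.
fast_rel = should_stop_relative(20e-6, 15e-6, min_rel_progress=0.01)

print(slow_abs, slow_rel, fast_rel)
```

With these made-up numbers the absolute rule keeps searching on the slow kernel for a gain that is negligible relative to its runtime, while the relative rule stops there and still keeps going on the fast kernel where the gain is proportionally large.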

can we get a version with only the `BEAM_MAX_TASKS_PER_CHILD` change and the uops MAX change first? i think these 2 are the least controversial

also the benchmark beam runs took 50%-100% longer

will measure resnet compile time again after this change

fyi you can add `DEBUG=4` to print the kernel source code, and `DEBUG=5` to print UOps