Dirk Groeneveld

200 comments by Dirk Groeneveld

This is totally untested at this point.

Ah, I did not try it on main. I just passed `--reset_optimizer_state`.

On Jan 18, 2024 at 14:38:47, Pete <***@***.***> wrote:
> @dirkgr does it crash on main? I remember...

@epwalsh This is done, right?

In block 0, `exp_avg_sq` for `attn_norm.weight.max` seems to spike on step 1581, earlier than all the other spikes.

`attn_out.weight.max` is even more pronounced.

Can you run this again with torch 2.1, so we get a cleaner profiler trace? With torch 2.0, it inserts all those "marker" blocks.

Also, can you change the script in two ways: 1. Create two streams up front, and then keep using the same streams (instead of creating a new stream every batch)...

That suggests a fix then?

Keep in mind that code changes over time, and needs maintenance to keep working. If we run this on all affected runs now, once, then we can throw away the...

One design constraint: Data transfer out of GCloud and AWS is expensive. All other data transfer is free. But the machine doing the data processing (so far) lives in GCloud.
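
To make that constraint concrete, here is a minimal sketch of the rule as stated: traffic leaving GCloud or AWS costs money, everything else is free, and the processing machine sits in GCloud. The function and constant names (`transfer_is_expensive`, `plan_costs`, `PROCESSING_LOCATION`) and the location labels are hypothetical, and the sketch assumes that a transfer staying inside one provider does not count as "out of" it.

```python
# Providers whose outbound traffic is billed, per the constraint above.
EXPENSIVE_EGRESS = {"gcloud", "aws"}

# Where the data processing machine lives (so far).
PROCESSING_LOCATION = "gcloud"


def transfer_is_expensive(source: str, destination: str) -> bool:
    """Return True if moving data from source to destination incurs egress cost.

    Assumption: a transfer that stays within one provider is not "out of" it.
    """
    source, destination = source.lower(), destination.lower()
    if source == destination:
        return False
    return source in EXPENSIVE_EGRESS


def plan_costs(data_location: str, output_location: str) -> dict:
    """Flag which legs of a processing job would pay for data transfer."""
    return {
        # Pulling the input to the processing machine.
        "read_input": transfer_is_expensive(data_location, PROCESSING_LOCATION),
        # Pushing the results from the processing machine to their destination.
        "write_output": transfer_is_expensive(PROCESSING_LOCATION, output_location),
    }
```

Under these assumptions, a job whose input already lives in GCloud and whose output stays there pays nothing, while input stored in AWS pays on the way in, and any output sent outside GCloud pays on the way out.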