
Adding another logbook (kinda)

Open boweiliu opened this issue 1 year ago • 2 comments

Have you read https://arxiv.org/pdf/2402.15627 already?

There are a lot of details in the later sections that deal with ML training in practice: garbage collection, auto-restarting, IB over Ethernet issues, etc.

boweiliu, May 15 '24 23:05

Thank you very much for the recommendation, @boweiliu

I have it on the list, but didn't have a chance to read it yet.

Your list sounds like a good fit for the content of this repo.

stas00, May 17 '24 03:05

The garbage collection issue outlined in this paper (section 6.3, "MFU decreasing") also matches the observation from the Imbue blog:

"MFU graph gradually sagged downward over the course of a run, but returned to 100% upon any restart"

yaolu, Jul 12 '24 07:07

I finally got the time to read the paper and added it here: https://github.com/stas00/ml-engineering/commit/185f7a59918e1782b43eca45b10cd0a37ed56559

Thank you for the recommendation, @boweiliu - a fantastic paper!

stas00, Oct 27 '24 02:10

@yaolu, indeed that was a great technical post. Let's add it as well: https://github.com/stas00/ml-engineering/commit/a7ef3a63f7a8ff32b88968509c0202244a2bad65

stas00, Oct 27 '24 02:10