Aleksa Gordić
Adding to what @chinthysl has said: we now also support ZeRO stage 1, where we shard the optimizer states, so each device updates only its own shard of the parameters...
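For anyone curious, here's a minimal sketch of the stage-1 idea, not the repo's actual code: all identifiers and hyperparameters below are made up for illustration, bias correction is omitted for brevity, and `num_params` is assumed divisible by `world_size`. Every rank keeps full params and full grads, owns the AdamW state only for its 1/N slice, updates just that slice, and then all-gathers the updated parameters.

```cuda
#include <cuda_runtime.h>
#include <nccl.h>

__global__ void adamw_shard(float* p, const float* g, float* m, float* v,
                            size_t n, float lr, float beta1, float beta2,
                            float eps, float wd) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];            // 1st moment
    v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];     // 2nd moment
    p[i] -= lr * (m[i] / (sqrtf(v[i]) + eps) + wd * p[i]);  // AdamW step
}

// one ZeRO-1 optimizer step: every rank holds full params and full grads,
// but owns m/v only for its 1/world_size slice and updates just that slice
void zero1_step(float* params, const float* grads,
                float* m_shard, float* v_shard, size_t num_params,
                int rank, int world_size, float lr,
                ncclComm_t comm, cudaStream_t stream) {
    size_t shard = num_params / world_size;
    size_t off = (size_t)rank * shard;
    int threads = 256;
    int blocks = (int)((shard + threads - 1) / threads);
    adamw_shard<<<blocks, threads, 0, stream>>>(
        params + off, grads + off, m_shard, v_shard, shard,
        lr, 0.9f, 0.95f, 1e-8f, 0.1f);
    // in-place all-gather: each rank contributes its updated slice so that
    // every rank ends up with the complete, identical parameter tensor
    ncclAllGather(params + off, params, shard, ncclFloat, comm, stream);
}
```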
Hey @akulchik, are you still having problems with this?
+1. Edit: I solved this by using Python 3.9; 3.10 was causing issues. Temporary workaround for me.
I assume the two calls are due to the fact that we don't want each thread in the kernel to do stochastic rounding with the same seed. At least that was...
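To make that concrete, here's a sketch of the idea, not the repo's actual kernel: the rounding noise is a pure function of (seed, element index), so different threads never share the same randomness. The hash is illustrative, and edge cases (inf/nan, exponent overflow) are ignored for brevity.

```cuda
#include <cuda_bf16.h>

__device__ unsigned int mix(unsigned int seed, unsigned int idx) {
    unsigned int h = seed ^ (idx * 0x9E3779B9u);  // fold the index into the seed
    h ^= h >> 16; h *= 0x85EBCA6Bu;
    h ^= h >> 13; h *= 0xC2B2AE35u;
    return h ^ (h >> 16);
}

__device__ __nv_bfloat16 stochastic_round(float x, unsigned int seed,
                                          unsigned int idx) {
    unsigned int bits = __float_as_uint(x);
    // add uniform noise over the 16 bits that bf16 drops, then truncate:
    // the carry makes the value round up with probability proportional to
    // how close it was to the next representable bf16 value
    bits += mix(seed, idx) & 0xFFFFu;
    bits &= 0xFFFF0000u;
    return __float2bfloat16_rn(__uint_as_float(bits));
}
```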
The PR is ready and will be merged into the LLaMA 3 fork.
Can you post some (eval) results against FineWeb-Edu?
@ngc92 tnx - added!
Eyeballing your cmdline, I'd say your batch size is too small and is causing an exception in the HellaSwag eval. This is a known issue and we have a patch...
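For context, the constraint is roughly this, a sketch under the assumption that HellaSwag's 4 candidate completions per example get packed along the batch dimension (the names here are made up, not the repo's actual identifiers):

```c
#include <assert.h>

#define ASSUMED_NUM_COMPLETIONS 4

void eval_loader_check(int B) {
    // the batch has to fit at least one full example, i.e. B >= 4;
    // a smaller batch leaves the eval with nothing valid to score
    int can_fit_examples = B / ASSUMED_NUM_COMPLETIONS;
    assert(can_fit_examples >= 1);
}
```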
You would just need to tokenize images, and everything else remains pretty much the same. We don't have multimodal plans for this repo in the near future.
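For a rough idea of what "tokenize images" could mean, here's a ViT-style patchify sketch; nothing in the repo, and all names, the channel-last layout, and the divisibility of H and W by P are assumptions:

```c
#include <stddef.h>

// split an H x W x C (channel-last) image into P x P patches and flatten
// each patch into one vector; a learned linear projection of each vector
// would then produce one "image token" embedding that can be concatenated
// with the text token embeddings before the transformer
void patchify(const float* img, int H, int W, int C, int P, float* patches) {
    int np_w = W / P;  // patches per row
    for (int py = 0; py < H / P; py++) {
        for (int px = 0; px < np_w; px++) {
            float* out = patches + (size_t)(py * np_w + px) * (P * P * C);
            for (int y = 0; y < P; y++)
                for (int x = 0; x < P; x++)
                    for (int c = 0; c < C; c++)
                        out[(y * P + x) * C + c] =
                            img[(size_t)((py * P + y) * W + (px * P + x)) * C + c];
        }
    }
}
```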