I've also found that for a 128K vocab, 8 chunks can be faster, at the cost of nearly...
**UPDATE:** Just trained three 370M models on 10B tokens of FineWeb-Edu with 8K context length and a 32K vocab. Below are the results. Surprisingly, the 8-chunk setting exhibits the best ppl. V/H = 32K/1K = 32...
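For reference, here is a minimal sketch of the kind of chunked linear + cross-entropy computation being compared above. The function name, shapes, and the choice to chunk along the flattened token dimension are my own illustration; the actual fused kernel is implemented differently:

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, weight, labels, num_chunks=8):
    """Compute the lm_head projection + cross-entropy chunk by chunk, so the
    full (num_tokens, vocab_size) logits matrix is never materialized at once.

    hidden: (N, H) final hidden states, weight: (V, H) lm_head weight,
    labels: (N,) target token ids.
    """
    total = hidden.new_zeros(())
    count = 0
    for h, y in zip(hidden.chunk(num_chunks), labels.chunk(num_chunks)):
        logits = h @ weight.t()  # (n, V) logits for this chunk only
        total = total + F.cross_entropy(logits, y, reduction="sum")
        count += y.numel()
    return total / count
```

The chunk count trades peak memory for per-chunk overhead, which is why the sweet spot shifts with vocab size.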
> The time it takes to resume depends on the expected maximum distance in this case, right? Do you know its relationship with $B$?

Hi, I created a histogram...
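For a buffer of size $B$ with the usual streaming shuffle (emit a uniformly random buffered item, then replace it with the next source item), each item's delay until emission is heuristically geometric with mean about $B$, so the maximum delay over $n$ emissions grows roughly like $B \log n$. A small simulation sketch (my own illustration, not the script used for the histogram above):

```python
import random

def shuffle_buffer_delays(n_examples, buffer_size, seed=42):
    """Simulate a streaming shuffle buffer and record, for each emitted
    example, how many source positions it sat in the buffer."""
    rng = random.Random(seed)
    buffer, delays = [], []
    for pos in range(n_examples):
        if len(buffer) < buffer_size:
            buffer.append(pos)  # fill phase: no emissions yet
            continue
        i = rng.randrange(buffer_size)
        delays.append(pos - buffer[i])  # how long this item waited
        buffer[i] = pos
    return delays

delays = shuffle_buffer_delays(1_000_000, buffer_size=1024)
print(max(delays), sum(delays) / len(delays))  # max vs. mean delay
```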
Maybe there's a middle ground between rebuilding the buffer from scratch and storing the entire buffer, but the logic is a bit complicated and takes time to implement. At least...
@lhoestq I'm not sure it's OK to use a progress bar with multiple workers. How about passing an arg `resumable=True` to `IterableDataset.shuffle` to control this behavior?
@lhoestq

> Loading from disk is a good option for this (although it's not always possible to serialize the content of the buffer, in that case the buffer would restart...
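A minimal sketch of what recovering the buffer from the source could look like, assuming the source can be re-iterated deterministically. The class and method names are hypothetical, not `datasets` API; the point is that the checkpoint stores only source positions and RNG state, never the examples themselves:

```python
import random
from itertools import islice

class RecoverFromSourceBuffer:
    def __init__(self, make_source, buffer_size, seed=0):
        self.make_source = make_source  # () -> a fresh, deterministic iterator
        self.buffer_size = buffer_size
        self.rng = random.Random(seed)
        self.positions = []   # source position of each buffered example
        self.items = []       # the buffered examples themselves
        self.next_pos = 0     # next source position to read

    def state_dict(self):
        # No examples serialized: positions + RNG state are always picklable.
        return {"positions": list(self.positions),
                "next_pos": self.next_pos,
                "rng": self.rng.getstate()}

    def load_state_dict(self, state):
        self.rng.setstate(state["rng"])
        self.positions = list(state["positions"])
        self.next_pos = state["next_pos"]
        wanted = set(self.positions)
        # Replay the source up to the checkpointed offset, keeping only
        # the positions that were sitting in the buffer.
        by_pos = {p: x
                  for p, x in enumerate(islice(self.make_source(), self.next_pos))
                  if p in wanted}
        self.items = [by_pos[p] for p in self.positions]
```

The cost of resuming is one pass over the first `next_pos` source examples, which ties back to the maximum-distance discussion above.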
@lhoestq

> Are you ok with adding `buffer_resuming_mode=` to `.shuffle()` to enable buffer recovering using your method with `buffer_resuming_mode="recover_from_source"`? (feel free to suggest other names for the parameter and...
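Under that proposal, usage would presumably look like the following (the `buffer_resuming_mode` parameter is only a suggestion from this thread, not an existing `datasets` argument; the dataset name is a placeholder):

```python
from datasets import load_dataset

ds = load_dataset("some/dataset", split="train", streaming=True)
ds = ds.shuffle(seed=42, buffer_size=1024,
                buffer_resuming_mode="recover_from_source")  # proposed, not released
```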
Hi @huyiwen, I think the smoothest migration path is to define HF-style models and use torchtitan for training with 4D parallelism. You may also be interested in https://github.com/fla-org/flame.git,...
Hi @kwen2501,

> Dumb q: would HF-style model definition enable composability with HF Trainer? Did HF document the style requirement somewhere?

HF-style models (e.g., `AutoModelForCausalLM`) also inherit from `nn.Module`, making...
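A minimal sketch of what "HF-style" means here, with a deliberately toy architecture and hypothetical names. Because `PreTrainedModel` is itself an `nn.Module`, the same class can be handed to HF `Trainer` or to a plain-PyTorch loop such as torchtitan's:

```python
import torch.nn as nn
from transformers import AutoConfig, AutoModelForCausalLM, PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutput

class TinyConfig(PretrainedConfig):
    model_type = "tiny-lm"  # hypothetical, for illustration only

    def __init__(self, vocab_size=32000, hidden_size=1024, **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        super().__init__(**kwargs)

class TinyForCausalLM(PreTrainedModel):  # PreTrainedModel subclasses nn.Module
    config_class = TinyConfig

    def __init__(self, config):
        super().__init__(config)
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids, labels=None, **kwargs):
        logits = self.lm_head(self.embed(input_ids))
        loss = None
        if labels is not None:  # HF Trainer passes labels and reads .loss
            loss = nn.functional.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1))
        return CausalLMOutput(loss=loss, logits=logits)

# Registration makes the toy model loadable via the Auto* classes.
AutoConfig.register("tiny-lm", TinyConfig)
AutoModelForCausalLM.register(TinyConfig, TinyForCausalLM)
```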
@tesla3 Hi, check out this PR; I'm working on it.