dumpmemory
> > I have faced hang issues after 1:30 hours of training time with ft and ZeRO-3
> > same question

You can try updating NCCL to 2.19.3.
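In case it helps, a quick way to confirm which NCCL version PyTorch is actually using (standard PyTorch API, nothing specific to this issue):

```python
# Prints the CUDA runtime and NCCL version of the installed PyTorch build.
import torch

print(torch.version.cuda)          # CUDA runtime PyTorch was built against
print(torch.cuda.nccl.version())   # e.g. (2, 19, 3) after upgrading NCCL
```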
I have similar experiments, but from NLP: I ran the OPT example with 1.3B GPT-Neo. With one node of 8 2080 Ti GPUs, it is Throughput: 3664.00 token/s, 3.74 TFLOP/s....
Did you check the _context.json version?
The same issue occurs with HuggingfaceTrainer: when using steps as the saving frequency, e.g. every 1000 steps, the first checkpoint is checkpoint 00000, not checkpoint 1000.
> How is this impacting workloads, aside from the Keras callback not saving the epoch? As far as I understand, the most important thing is that we have an incremental...
I haven't set `checkpoint_frequency` in `CheckpointConfig`.
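For reference, a minimal sketch of setting it explicitly, assuming Ray Train's `CheckpointConfig`/`RunConfig` API (module paths and supported fields vary across Ray versions, so treat this as illustrative):

```python
# Illustrative only: asking Ray Train to checkpoint every 1000 iterations.
from ray.train import CheckpointConfig, RunConfig

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,              # keep only the two most recent checkpoints
        checkpoint_frequency=1000,  # request a checkpoint every 1000 iterations
    ),
)
# run_config is then passed to the trainer, e.g.
# trainer = HuggingfaceTrainer(..., run_config=run_config)
```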
Thanks for your reply. I will try again.
How about checking this one: https://goombalab.github.io/blog/2024/hydra-part2-model/
According to the Mamba-2 paper, Section 6.3, the following function is my understanding:

```python
def ssd_flops(T, Q, P, N):
    # center blocks
    # print(T, Q, P, N)
    center_blocks_sma_compute = T * Q * N + T * Q * Q + T * P * N
    # print("center_blocks_sma_compute", center_blocks_sma_compute / 1e9,
    #       T * Q * N / 1e9, T * Q * Q / 1e9, T * P * N / 1e9)
    # low-rank blocks, right factors (B terms)
    b_compute...
```
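For what it's worth, here is how I read the center-block term, assuming T is the total sequence length, Q the chunk (block) length, P the head dimension, and N the state dimension (my labels, not necessarily the paper's notation). The three summands can be viewed as per-chunk costs multiplied by the T / Q chunks:

```python
# Rough sketch, my own interpretation: the center-block term above rewritten as
# (number of chunks) x (per-chunk cost). Equal to T*Q*N + T*Q*Q + T*P*N when Q divides T.
def center_block_flops(T, Q, P, N):
    num_chunks = T // Q
    per_chunk = Q * Q * N + Q * Q * Q + Q * P * N
    return num_chunks * per_chunk

# Example with plausible Mamba-2-style dimensions (illustrative only):
print(center_block_flops(T=4096, Q=256, P=64, N=128) / 1e9, "GFLOPs")
```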
> > > > Huh, there's no requirement d_state / head_dim % 8 == 0; there's d_model / head_dim % 8 == 0. You can try dimensions similar to...