venkata vishal kajjam
@afausti Thanks for the reply. I will try it out.
@afausti Setting `autodiscover=True` did not fix the above issue. Also, I noticed that you set the replicaCount to 1 (https://github.com/lsst-sqre/charts/blob/master/charts/kafka-aggregator/values.yaml#L3) for your worker. Have you deployed with a replicaCount greater than...
@bobh66 Thanks for the reply.
> How many partitions do you have on your topic? You need at minimum one partition per worker

I have one topic that has 6...
I am encountering the same issue when fine-tuning on a single A100 GPU (40 GiB).
Here are my notes from further investigating the issue. The RCA for the `micro_batch_per_gpu * gradient_acc_step * world_size 256 != 4 * 8 * 1` is that the deepspeed environment...
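For anyone who wants to sanity-check the numbers in that message, here is a minimal sketch of the arithmetic behind it. It assumes the check is simply that the train batch size equals `micro_batch_per_gpu * gradient_acc_step * world_size`, as the message suggests; the helper names are hypothetical and are not DeepSpeed functions.

```python
# Hypothetical helpers that mirror the arithmetic in the error message.
# They are NOT part of DeepSpeed; they only reproduce the equality being checked.

def expected_train_batch_size(micro_batch_per_gpu: int,
                              gradient_acc_step: int,
                              world_size: int) -> int:
    """Product that the configured train batch size is compared against."""
    return micro_batch_per_gpu * gradient_acc_step * world_size


def batch_config_is_consistent(train_batch_size: int,
                               micro_batch_per_gpu: int,
                               gradient_acc_step: int,
                               world_size: int) -> bool:
    """True when the configured train batch size matches the product."""
    return train_batch_size == expected_train_batch_size(
        micro_batch_per_gpu, gradient_acc_step, world_size
    )


if __name__ == "__main__":
    # Values taken from the message: 256 != 4 * 8 * 1
    print(expected_train_batch_size(4, 8, 1))        # product seen with world_size 1
    print(batch_config_is_consistent(256, 4, 8, 1))  # the failing combination
```

With a micro batch of 4 and 8 accumulation steps, the product only reaches 256 when `world_size` is 8, so a reported `world_size` of 1 cannot satisfy the check for a train batch size of 256.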