
Gradient reliability with sample-by-sample vs batch processing

Open zfflxx opened this issue 7 months ago • 2 comments

I'm examining the training implementation in src/art/unsloth/service.py and have a question about the gradient computation approach.

Currently, the code processes samples individually:

```python
for offset in range(0, packed_tensors["tokens"].shape[0]):
    # Process single sample: v[offset : offset + 1]
    # Each sample triggers a separate gradient computation and parameter update
    ...
```

This means:

  • Sample 1: θ₁ = θ₀ - lr * ∇L₁(θ₀)
  • Sample 2: θ₂ = θ₁ - lr * ∇L₂(θ₁) (based on updated θ₁)
  • Sample 3: θ₃ = θ₂ - lr * ∇L₃(θ₂) (based on updated θ₂)

Versus standard batch processing:

  • All samples: θ = θ₀ - lr * (∇L₁(θ₀) + ∇L₂(θ₀) + ∇L₃(θ₀))/batch_size
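The difference between the two update rules can be made concrete with a toy example. The sketch below (hypothetical names, not the actual service.py code) uses a simple quadratic loss Lᵢ(θ) = (θ − tᵢ)² per sample and shows that three sequential updates land at a different θ than one averaged batch update:

```python
# Toy comparison: sequential per-sample SGD updates vs. one averaged batch update.
# Per-sample loss: L_i(theta) = (theta - t_i)**2, so grad L_i(theta) = 2*(theta - t_i).

def grad(theta, target):
    return 2.0 * (theta - target)

targets = [1.0, 2.0, 3.0]
lr = 0.1

# Sequential: each gradient is evaluated at the parameters already updated
# by the previous samples (theta_1 = theta_0 - lr * grad_1(theta_0), etc.).
theta_seq = 0.0
for t in targets:
    theta_seq -= lr * grad(theta_seq, t)

# Batch: all gradients are evaluated at the same initial parameters,
# averaged, and applied in a single step.
theta0 = 0.0
avg_grad = sum(grad(theta0, t) for t in targets) / len(targets)
theta_batch = theta0 - lr * avg_grad

print(theta_seq, theta_batch)  # the two schemes produce different parameters
```

With these numbers the sequential scheme ends at θ ≈ 1.048 while the batch scheme ends at θ = 0.4, illustrating that the two are not equivalent even for identical data and learning rate.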

Question:

What's the reasoning behind this sequential gradient approach? Does it provide better gradient reliability or learning dynamics for your specific use case?

I'm particularly curious whether this design choice stems from:

  • Improved convergence properties
  • Better handling of gradient variance
  • Specific requirements for your training methodology

The downstream training code in train.py appears to support full batch processing, so I'm wondering if there are important gradient-related considerations I'm missing.
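For reference, if batch-equivalent behavior were wanted while keeping a per-sample loop, the standard pattern is gradient accumulation: evaluate every gradient at the same fixed parameters, average, and apply one step. This is a minimal illustrative sketch, not the actual train.py API:

```python
def accumulate_then_step(theta, samples, lr, grad_fn):
    """Accumulate gradients at a fixed theta, then apply one averaged update.

    Matches standard mini-batch SGD: theta - lr * mean_i(grad L_i(theta)).
    """
    total = 0.0
    for s in samples:
        total += grad_fn(theta, s)  # every gradient evaluated at the same theta
    return theta - lr * total / len(samples)

# Example with per-sample loss L_i(theta) = (theta - s)**2
theta = accumulate_then_step(0.0, [1.0, 2.0, 3.0], 0.1,
                             lambda th, s: 2.0 * (th - s))
print(theta)  # → 0.4
```

In deep-learning frameworks the same effect is typically achieved by summing losses (or gradients) across the micro-batches and calling the optimizer step once per accumulated batch.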

Thanks for any insights!

zfflxx avatar Aug 01 '25 03:08 zfflxx

In my experience, doing more gradient updates works better. Here's some recent work that finds the same thing.

bradhilton avatar Aug 02 '25 18:08 bradhilton

@zfflxx does that help answer your question?

bradhilton avatar Aug 13 '25 00:08 bradhilton