
Gradient reliability with sample-by-sample vs batch processing

Open zfflxx opened this issue 7 months ago • 2 comments

I'm examining the training implementation in src/art/unsloth/service.py and have a question about the gradient computation approach.

Currently, the code processes samples individually:

```python
for offset in range(0, packed_tensors["tokens"].shape[0]):
    # Process single sample: v[offset : offset + 1]
    # Each sample triggers a separate gradient computation and parameter update
    ...
```

This means:

  • Sample 1: θ₁ = θ₀ - lr * ∇L₁(θ₀)
  • Sample 2: θ₂ = θ₁ - lr * ∇L₂(θ₁) (based on updated θ₁)
  • Sample 3: θ₃ = θ₂ - lr * ∇L₃(θ₂) (based on updated θ₂)

Versus standard batch processing:

  • All samples: θ = θ₀ - lr * (∇L₁(θ₀) + ∇L₂(θ₀) + ∇L₃(θ₀))/batch_size
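The difference between the two update rules can be made concrete with a toy example. The sketch below (hypothetical names, not the actual service.py code) uses a simple quadratic loss Lᵢ(θ) = (θ − tᵢ)² per sample and shows that three sequential updates land at a different θ than one averaged batch update:

```python
# Toy comparison: sequential per-sample SGD updates vs. one averaged batch update.
# Per-sample loss: L_i(theta) = (theta - t_i)**2, so grad L_i(theta) = 2*(theta - t_i).

def grad(theta, target):
    return 2.0 * (theta - target)

targets = [1.0, 2.0, 3.0]
lr = 0.1

# Sequential: each gradient is evaluated at the parameters already updated
# by the previous samples (theta_1 = theta_0 - lr * grad_1(theta_0), etc.).
theta_seq = 0.0
for t in targets:
    theta_seq -= lr * grad(theta_seq, t)

# Batch: all gradients are evaluated at the same initial parameters,
# averaged, and applied in a single step.
theta0 = 0.0
avg_grad = sum(grad(theta0, t) for t in targets) / len(targets)
theta_batch = theta0 - lr * avg_grad

print(theta_seq, theta_batch)  # the two schemes produce different parameters
```

With these numbers the sequential scheme ends at θ ≈ 1.048 while the batch scheme ends at θ = 0.4, illustrating that the two are not equivalent even for identical data and learning rate.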

Question:

What's the reasoning behind this sequential gradient approach? Does it provide better gradient reliability or learning dynamics for your specific use case?

I'm particularly curious whether this design choice stems from:

  • Improved convergence properties
  • Better handling of gradient variance
  • Specific requirements for your training methodology

The downstream training code in train.py appears to support full batch processing, so I'm wondering if there are important gradient-related considerations I'm missing.
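For reference, if batch-equivalent behavior were wanted while keeping a per-sample loop, the standard pattern is gradient accumulation: evaluate every gradient at the same fixed parameters, average, and apply one step. This is a minimal illustrative sketch, not the actual train.py API:

```python
def accumulate_then_step(theta, samples, lr, grad_fn):
    """Accumulate gradients at a fixed theta, then apply one averaged update.

    Matches standard mini-batch SGD: theta - lr * mean_i(grad L_i(theta)).
    """
    total = 0.0
    for s in samples:
        total += grad_fn(theta, s)  # every gradient evaluated at the same theta
    return theta - lr * total / len(samples)

# Example with per-sample loss L_i(theta) = (theta - s)**2
theta = accumulate_then_step(0.0, [1.0, 2.0, 3.0], 0.1,
                             lambda th, s: 2.0 * (th - s))
print(theta)  # → 0.4
```

In deep-learning frameworks the same effect is typically achieved by summing losses (or gradients) across the micro-batches and calling the optimizer step once per accumulated batch.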

Thanks for any insights!

zfflxx avatar Aug 01 '25 03:08 zfflxx

In my experience, doing more gradient updates works better. Here's some recent work that finds the same thing.

bradhilton avatar Aug 02 '25 18:08 bradhilton

@zfflxx does that help answer your question?

bradhilton avatar Aug 13 '25 00:08 bradhilton