
Unable to reproduce benchmark results for TokenPacker-HD (7B, Scale=2, Patch=9)

Open · worapob841 opened this issue 5 months ago · 0 comments

Hi, thank you for your great work on TokenPacker!

I’m trying to reproduce the TokenPacker-HD (7B, scale factor 2, patch number 9) experiments, but I’m not getting results close to the paper or the released checkpoint.


Hardware Setup

  • 4 × H100 GPUs

Results Comparison

  • Row 1: Results reported in the paper
  • Row 2: Results from the released checkpoint
  • Row 3+: My experiments under different settings
| Method | TextVQA | OCRB | DocVQA | MMB | MMMU | MME | VQAv2 | VizWiz | POPE |
|---|---|---|---|---|---|---|---|---|---|
| Reported in paper | 68.0 | 452 | 60.2 | 67.4 | 35.4 | 1489/338 | 81.2 | 54.7 | 88.2 |
| Released checkpoint | 67.92 | 452 | 27 | 67.35 | 35.89 | 1489.02/337.5 | 81.17 | 54.63 | 88.15 |
| Exp 1 | 41.29 | 17 | 9 | 21.13 | 31.44 | 675.46/283.93 | 67.5 | 48.12 | 56.70 |
| Exp 2 | 36.53 | 14 | 8 | 20.79 | 28.89 | 653.94/248.57 | 67.12 | 48.12 | 55.6 |
| Exp 3 | 40.14 | 19 | 8 | 21.05 | 31.22 | 666.27/240.36 | 45.7 | 47.53 | 51.07 |
| Exp 4 | 40.37 | 17 | 8 | 21.21 | 30.67 | 720.37/273.21 | 45.25 | 47.92 | 58.94 |

Experiment Details

  1. Exp 1

    • Pretrain: LR = 1e-3, batch size = 256 (32 × 4 GPUs, grad_accum = 2)
    • Instruction FT: LR = 2e-4, batch size = 128 (16 × 4 GPUs, grad_accum = 2)
    • Results are far from the paper/released checkpoint.
  2. Exp 2 (following Issue #12)

    • From the trainer_state.json shared in https://github.com/CircleRadon/TokenPacker/issues/12#issuecomment-2328534898, I noticed the pretrain LR was 5e-4 with batch size 128.
    • Pretrain: LR = 5e-4, batch size = 128 (16 × 4 GPUs, grad_accum = 2)
    • Instruction FT: LR = 2e-4, batch size = 128 (16 × 4 GPUs, grad_accum = 2)
    • Results are still not close.
    • Notably, my pretraining loss drops only to ~1.6–1.7, while your trainer_state.json shows it dropping to ~1.2. My pretrain trainer_state.json and instruction trainer_state.json are attached; see the comparison sketch after this list.
  3. Exp 3

    • Same as Exp 2, but batch size = 64 (16 × 4 GPUs, grad_accum = 1)
    • Still far from expected results.
  4. Exp 4

    • Same as Exp 1, but with the DeepSpeed seed and dataset seed set to 2024
    • Still not close to paper/released checkpoint.
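For the loss comparison mentioned in Exp 2, here is a minimal sketch of how I compare the curves, assuming both files follow the standard HuggingFace Trainer trainer_state.json layout (a top-level "log_history" list whose training entries carry a "loss" key); the file paths are placeholders:

```python
import json

def load_losses(path):
    """Return (step, loss) pairs from a HuggingFace Trainer trainer_state.json."""
    with open(path) as f:
        state = json.load(f)
    # Training-loss entries in log_history carry a "loss" key;
    # eval entries and the final summary entry do not.
    return [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]

# Placeholder paths: my run vs. the trainer_state.json shared in issue #12.
mine = load_losses("my_pretrain_trainer_state.json")
released = load_losses("released_pretrain_trainer_state.json")

print("my final logged loss:      ", mine[-1][1])
print("released final logged loss:", released[-1][1])
print("my minimum loss:      ", min(loss for _, loss in mine))
print("released minimum loss:", min(loss for _, loss in released))
```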

Questions

  1. Could you clarify:
    • The exact learning rate schedule and batch size settings used in pretraining/finetuning?
    • Whether there are other important hyperparameters (e.g., warmup steps, optimizer settings, gradient clipping) not mentioned in the paper but necessary to reproduce results?
  2. Could you also provide the pretraining dataset JSON and instruction-tuning dataset JSON?
    • I noticed that in sunshine-lwt/TokenPacker-HD-7b-9patch-144token, the instruction-tuning trainer_state.json shows a global step of 11,627. With a batch size of 128, that implies 11,627 × 128 = 1,488,256 samples, but the Mini-Gemini instruction-tuning dataset actually contains 1,511,341 samples, leaving roughly 23k samples unaccounted for (see the check after this list).
    • Could you provide the exact JSON datasets used, so reproduction is faithful?
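For reference, a minimal sketch of the sample-count arithmetic behind the question above, assuming a single epoch and a constant effective batch size of 128:

```python
# Sample-count check for the instruction-tuning run (assumes one epoch and a
# constant effective batch size; the final step may be a partial batch).
global_step = 11_627            # from the released trainer_state.json
effective_batch_size = 128      # e.g., per-device 16 x 4 GPUs x grad_accum 2
samples_seen = global_step * effective_batch_size
mini_gemini_size = 1_511_341    # Mini-Gemini instruction-tuning dataset size

print(samples_seen)                     # 1488256
print(mini_gemini_size - samples_seen)  # 23085 samples unaccounted for
```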

worapob841 · Oct 10 '25