
Unable to reproduce benchmark results for TokenPacker-HD (7B, Scale=2, Patch=9)

Open · worapob841 opened this issue 5 months ago · 0 comments

Hi, thank you for your great work on TokenPacker!

I’m trying to reproduce the TokenPacker-HD (7B, scale factor 2, patch number 9) experiments, but I’m not getting results close to the paper or the released checkpoint.


Hardware Setup

  • 4 × H100 GPUs

Results Comparison

  • Row 1: Results reported in the paper
  • Row 2: Results from the released checkpoint
  • Row 3+: My experiments under different settings
| Method | TextVQA | OCRB | DocVQA | MMB | MMMU | MME | VQAv2 | VizWiz | POPE |
|---|---|---|---|---|---|---|---|---|---|
| Reported in paper | 68.0 | 452 | 60.2 | 67.4 | 35.4 | 1489/338 | 81.2 | 54.7 | 88.2 |
| Released checkpoint | 67.92 | 452 | 27 | 67.35 | 35.89 | 1489.02/337.5 | 81.17 | 54.63 | 88.15 |
| Exp 1 | 41.29 | 17 | 9 | 21.13 | 31.44 | 675.46/283.93 | 67.5 | 48.12 | 56.70 |
| Exp 2 | 36.53 | 14 | 8 | 20.79 | 28.89 | 653.94/248.57 | 67.12 | 48.12 | 55.6 |
| Exp 3 | 40.14 | 19 | 8 | 21.05 | 31.22 | 666.27/240.36 | 45.7 | 47.53 | 51.07 |
| Exp 4 | 40.37 | 17 | 8 | 21.21 | 30.67 | 720.37/273.21 | 45.25 | 47.92 | 58.94 |

Experiment Details

  1. Exp 1

    • Pretrain: LR = 1e-3, batch size = 256 (32 × 4 GPUs, grad_accum = 2)
    • Instruction FT: LR = 2e-4, batch size = 128 (16 × 4 GPUs, grad_accum = 2)
    • Results are far from the paper/released checkpoint.
  2. Exp 2 (following Issue #12)

    • From the trainer_state.json shared in https://github.com/CircleRadon/TokenPacker/issues/12#issuecomment-2328534898, I noticed the pretrain LR was 5e-4 with batch size 128.
    • Pretrain: LR = 5e-4, batch size = 128 (16 × 4 GPUs, grad_accum = 2)
    • Instruction FT: LR = 2e-4, batch size = 128 (16 × 4 GPUs, grad_accum = 2)
    • Results are still not close.
    • Notably, my pretraining loss drops only to ~1.6–1.7, while your trainer_state.json shows it dropping to ~1.2. My pretrain trainer_state.json and instruction trainer_state.json are attached; see the comparison sketch after this list.
  3. Exp 3

    • Same as Exp 2, but batch size = 64 (16 × 4 GPUs, grad_accum = 1)
    • Still far from expected results.
  4. Exp 4

    • Same as Exp 1, but with the DeepSpeed seed and dataset seed set to 2024
    • Still not close to paper/released checkpoint.
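For the loss comparison mentioned in Exp 2, here is a minimal sketch of how I compare the curves, assuming both files follow the standard HuggingFace Trainer trainer_state.json layout (a top-level "log_history" list whose training entries carry a "loss" key); the file paths are placeholders:

```python
import json

def load_losses(path):
    """Return (step, loss) pairs from a HuggingFace Trainer trainer_state.json."""
    with open(path) as f:
        state = json.load(f)
    # Training-loss entries in log_history carry a "loss" key;
    # eval entries and the final summary entry do not.
    return [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]

# Placeholder paths: my run vs. the trainer_state.json shared in issue #12.
mine = load_losses("my_pretrain_trainer_state.json")
released = load_losses("released_pretrain_trainer_state.json")

print("my final logged loss:      ", mine[-1][1])
print("released final logged loss:", released[-1][1])
print("my minimum loss:      ", min(loss for _, loss in mine))
print("released minimum loss:", min(loss for _, loss in released))
```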

Questions

  1. Could you clarify:
    • The exact learning rate schedule and batch size settings used in pretraining/finetuning?
    • Whether there are other important hyperparameters (e.g., warmup steps, optimizer settings, gradient clipping) not mentioned in the paper but necessary to reproduce results?
  2. Could you also provide the pretraining dataset JSON and instruction-tuning dataset JSON?
    • I noticed that in sunshine-lwt/TokenPacker-HD-7b-9patch-144token, the instruction-tuning trainer_state.json shows a global step of 11,627. With a batch size of 128, that implies 11,627 × 128 = 1,488,256 samples, but the Mini-Gemini instruction-tuning dataset actually contains 1,511,341 samples, leaving roughly 23k samples unaccounted for (see the check after this list).
    • Could you provide the exact JSON datasets used, so reproduction is faithful?
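For reference, a minimal sketch of the sample-count arithmetic behind the question above, assuming a single epoch and a constant effective batch size of 128:

```python
# Sample-count check for the instruction-tuning run (assumes one epoch and a
# constant effective batch size; the final step may be a partial batch).
global_step = 11_627            # from the released trainer_state.json
effective_batch_size = 128      # e.g., per-device 16 x 4 GPUs x grad_accum 2
samples_seen = global_step * effective_batch_size
mini_gemini_size = 1_511_341    # Mini-Gemini instruction-tuning dataset size

print(samples_seen)                     # 1488256
print(mini_gemini_size - samples_seen)  # 23085 samples unaccounted for
```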

worapob841 · Oct 10 '25