NanoCode012 comments

Results: 342 comments of NanoCode012

no pad_token or eos_token in wandb eval table "Eval - Predictions vs Ground Truth"

To add more info from discord discussion, the problem stems from the eval_table code, which was written quite some time ago and hasn't been actively maintained. At this point, I'm...

Triton 3.2.0 Doesn't Work with GRPO+vllm

> [@winglian](https://github.com/winglian) that is good to know. What about the triton 3.2.0 issue that throws the PY_SSIZE_T_CLEAN error? Do you have the stack trace for that?

AxolotlGRPOTrainer still shuffles combined datasets even with curriculum_sampling flag enabled

Hey! Thanks for the report. Let's see what upstream trl does first.

AxolotlGRPOTrainer still shuffles combined datasets even with curriculum_sampling flag enabled

Hey! Thanks for checking back. In this case, you could override those two dataloader functions to return your custom `RepeatSampler` class. I looked a bit more and `curriculum_sampling` seems to...
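As a rough illustration of the idea above, here is a minimal sketch of a sampler that keeps dataset order instead of shuffling. The class name `OrderedRepeatSampler` and its parameters are hypothetical and are not taken from trl or axolotl; the real `RepeatSampler` may take different arguments, and which trainer methods need overriding is not shown here.

```python
class OrderedRepeatSampler:
    """Hypothetical sampler: yields indices in dataset order, each one repeated.

    Assumption: the trainer only needs an iterable of indices with a length,
    which is how PyTorch-style samplers are consumed.
    """

    def __init__(self, num_samples, mini_repeat_count=1):
        self.num_samples = num_samples
        self.mini_repeat_count = mini_repeat_count

    def __iter__(self):
        # No shuffling: index 0 first, then 1, and so on (curriculum order).
        for index in range(self.num_samples):
            for _ in range(self.mini_repeat_count):
                yield index

    def __len__(self):
        return self.num_samples * self.mini_repeat_count


sampler = OrderedRepeatSampler(num_samples=3, mini_repeat_count=2)
indices = list(sampler)
```

A custom trainer subclass could return an object like this from its overridden dataloader functions so that combined datasets are visited in their original order.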

"RuntimeError: Invalid device argument : did you call init? "When setting CUDA_VISIBLE_DEVICES

Which GPUs are you using? I used `CUDA_VISIBLE_DEVICES` just yesterday and did not run into this issue.

"RuntimeError: Invalid device argument : did you call init? "When setting CUDA_VISIBLE_DEVICES

Hello, sorry I missed your earlier reply @zhanghanxing2022. I ran your config (changing base_model + dataset) on 2xH200 SXM GPUs on runpod using our docker cloud image with `CUDA_VISIBLE_DEVICES='0,1'...

"RuntimeError: Invalid device argument : did you call init? "When setting CUDA_VISIBLE_DEVICES

Closing as stale

[Feat] Log Flops to wandb using callback

Yeah, I think this can be a quick callback to add, though I haven't verified that `flos` refers to the FLOPS.

[Feat] Log Flops to wandb using callback

I went and checked that `total_flos` is the FLOPS count. However, the number may be off (there is a GH issue about miscounting for embed layers). Given that it may be incorrect, I'm...
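For reference, a minimal sketch of what such a callback could look like. The class name `FlosLoggerCallback`, the `log_fn` parameter, and the metric keys are assumptions for illustration, not axolotl code. It reads `total_flos` from the trainer state and forwards it to a logging function; passing `wandb.log` as `log_fn` would be one way to send it to wandb. The caveat above still applies: the reported number may be inaccurate.

```python
from types import SimpleNamespace

try:  # use the real base class when transformers is installed
    from transformers import TrainerCallback
except ImportError:  # fallback so this sketch still runs standalone
    TrainerCallback = object


class FlosLoggerCallback(TrainerCallback):
    """Hypothetical callback: forwards state.total_flos to a logging function."""

    def __init__(self, log_fn=print):
        # log_fn is an assumption; pass wandb.log to report the value to wandb.
        self.log_fn = log_fn

    def on_log(self, args, state, control, logs=None, **kwargs):
        # total_flos is the trainer's running estimate and may be miscounted.
        total_flos = getattr(state, "total_flos", None)
        if total_flos is not None:
            self.log_fn(
                {
                    "train/total_flos": float(total_flos),
                    "train/global_step": state.global_step,
                }
            )


# Standalone usage with a stand-in state object instead of a real Trainer.
records = []
callback = FlosLoggerCallback(log_fn=records.append)
fake_state = SimpleNamespace(total_flos=1.5e12, global_step=10)
callback.on_log(args=None, state=fake_state, control=None)
```

In a real run the callback would be passed to the trainer through its `callbacks` list rather than called by hand.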

FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP.

Forgot where this was thrown, but likely here https://github.com/axolotl-ai-cloud/axolotl/blob/80304c26a70e21ed8522fdbd53bcb290f9c6b7d3/src/axolotl/train.py#L246