Niklas issues

Results: 127 issues of Niklas

Optionally load trainer state

I may be missing some nuances of the checkpointing, but can we do something akin to this PR to avoid trying to load the trainer state when the file is...

Missing instruction for BGE?

Cool work! Are you using the instruction when embedding queries using BGE? You need to prepend `Represent this sentence for searching relevant passages: ` to every query when embedding with...
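The prefixing described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the instruction string is the one quoted in the issue, while the helper name `add_query_instruction` and the sample queries are hypothetical. The sketch assumes only queries receive the prefix and passages are embedded unchanged.

```python
# Instruction string quoted in the issue (note the trailing space).
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def add_query_instruction(queries, instruction=BGE_QUERY_INSTRUCTION):
    """Return a new list with the instruction prepended to every query.

    Hypothetical helper for illustration; passages are assumed to be
    embedded as-is and should not go through this function.
    """
    return [instruction + query for query in queries]


# Made-up example queries.
queries = ["what is a dropless MoE?", "how is throughput measured?"]
prefixed_queries = add_query_instruction(queries)
```

The prefixed strings are then what gets passed to the embedding model in place of the raw queries.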

samples_per_second_per_gpu or tokens_per_second_per_gpu?

I'm probably missing something, but isn't this tokens per second per GPU: https://github.com/mlfoundations/open_lm/blob/083fa31449c3456e889269e44913578acfced67a/open_lm/train.py#L282 ? `inputs.numel()` gives all tokens; for samples it would be `inputs.shape[0]`, no?
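The distinction raised here can be shown with plain arithmetic. The sketch below assumes a per-GPU input batch of shape `(batch_size, seq_len)`; the function name `per_gpu_rates` and the example numbers are hypothetical and not taken from open_lm.

```python
def per_gpu_rates(input_shape, step_time_s):
    """Compute both throughput metrics for one per-GPU batch.

    input_shape: (batch_size, seq_len), assumed layout of the input tensor.
    step_time_s: wall-clock seconds spent on the step.
    """
    batch_size, seq_len = input_shape
    num_samples = batch_size            # what inputs.shape[0] would return
    num_tokens = batch_size * seq_len   # what inputs.numel() would return
    return {
        "samples_per_second_per_gpu": num_samples / step_time_s,
        "tokens_per_second_per_gpu": num_tokens / step_time_s,
    }


# Made-up example: 8 sequences of 2048 tokens processed in 2 seconds.
rates = per_gpu_rates((8, 2048), step_time_s=2.0)
```

The two values differ by a factor of `seq_len`, so a metric named for samples but computed from the element count would be inflated by that factor.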

Add dMoE

Addresses https://github.com/databricks/megablocks/issues/57#issuecomment-2071247350 cc @afang-story

MoE Expert parallelism config

Why is https://github.com/mlfoundations/open_lm/blob/fc122608bcfd0678dead79c89ea0c0ac1739ea68/open_lm/train.py#L88C19-L88C47 forced to be True? cc @kernelmachine

Bad throughput with GLU

I'm training models with the specs below but seeing a major throughput drop when switching to GLU. Do you know why, or have ideas for what I could investigate? Thanks a lot!...

1-expert worse than dense model

I'm finding that training a 1-expert dMoE (brown) has worse training loss than an otherwise equivalent dense model (green). Is there some reason why this difference is expected or can...

OSError: Stale file handle with dMoE

I am getting the error below on the first step of multinode training with dMoE. Meanwhile, multinode MoE training and single-node dMoE training work fine. Any ideas what the problem...

Add CRAG

Part of [CRAG](https://arxiv.org/pdf/2406.14497) evaluates only the embedding/retrieval models, i.e. without the generative part. It would be great to integrate that (or CodeRAG)!

Labels: good first issue, new-dataset