Results: 127 issues by Niklas

I may be missing some nuances of the checkpointing, but can we do something akin to this PR to avoid trying to load the trainer state when the file is...

Cool work! Are you using the instruction when embedding queries using BGE? You need to prepend `Represent this sentence for searching relevant passages: ` to every query when embedding with...
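
The query-side instruction mentioned above can be sketched as a small helper. This is a minimal illustration only: `BGE_QUERY_INSTRUCTION` and `prepare_queries` are hypothetical names, and the embedding call itself is left out because it depends on whichever library is used to load the model.

```python
# Instruction string quoted in the issue; it is prepended to queries only.
BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def prepare_queries(queries):
    """Return a new list with the instruction prepended to every query.

    Queries that already start with the instruction are left unchanged,
    so calling this helper twice does not duplicate the prefix.
    """
    prepared = []
    for query in queries:
        if query.startswith(BGE_QUERY_INSTRUCTION):
            prepared.append(query)
        else:
            prepared.append(BGE_QUERY_INSTRUCTION + query)
    return prepared


queries = ["how do I resume training from a checkpoint?"]
prepared = prepare_queries(queries)
print(prepared[0])
```

The prepared strings would then be passed to the embedding model in place of the raw queries.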

I'm probably missing something, but isn't this tokens per second per GPU? https://github.com/mlfoundations/open_lm/blob/083fa31449c3456e889269e44913578acfced67a/open_lm/train.py#L282 `inputs.numel()` gives all tokens; for samples it would be `inputs.shape[0]`, no?
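
To make the distinction concrete, here is a rough sketch in plain Python, with no dependency on the open_lm code. The batch is modelled as a list of token-id sequences, so `count_tokens` plays the role of `inputs.numel()` and `count_samples` the role of `inputs.shape[0]`. The helper names, batch shape, and step time are made up for illustration.

```python
def count_samples(batch):
    """Number of sequences in the batch (analogous to inputs.shape[0])."""
    return len(batch)


def count_tokens(batch):
    """Total number of token ids in the batch (analogous to inputs.numel())."""
    return sum(len(sequence) for sequence in batch)


def per_second(count, step_seconds):
    """Convert a per-step count into a per-second rate."""
    return count / step_seconds


# Hypothetical per-GPU batch: 8 sequences of 2048 tokens, one step taking 2 s.
batch = [[0] * 2048 for _ in range(8)]
step_seconds = 2.0

samples_per_second = per_second(count_samples(batch), step_seconds)
tokens_per_second = per_second(count_tokens(batch), step_seconds)

print(f"samples/s/gpu: {samples_per_second}")
print(f"tokens/s/gpu:  {tokens_per_second}")
```

With these numbers the two rates differ by a factor of the sequence length, which is why the choice between the two counts matters for what the logged metric actually means.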

Addresses https://github.com/databricks/megablocks/issues/57#issuecomment-2071247350 cc @afang-story

Why is https://github.com/mlfoundations/open_lm/blob/fc122608bcfd0678dead79c89ea0c0ac1739ea68/open_lm/train.py#L88C19-L88C47 forced to be True? cc @kernelmachine

I'm training models with the specs below but seeing a major throughput drop when switching to GLU. Do you know why, or have ideas on what I could investigate? Thanks a lot!...

I'm finding that training a 1-expert dMoE (brown) has worse training loss than an otherwise equivalent dense model (green). Is there some reason why this difference is expected or can...

I am getting the error below on the first step of multinode training with dMoE. Meanwhile, multinode MoE training and single-node dMoE training work fine. Any ideas what the problem...

Part of [CRAG](https://arxiv.org/pdf/2406.14497) evaluates only embedding/retrieval models, i.e. without the generative part. It would be great to integrate that (or CodeRAG)!

good first issue
new-dataset