Luca Soldaini
Mmmh, I have not tried to process DE Wikipedia in a while, but when I did last year I did not run into the same issue. I've heard good things...
Hi Maksym! Thank you for this pull request. I fully support the reasoning behind adding these mappers, but I would prefer to avoid Mapper arguments that are callables, as they can...
That would be a great feature!
@dirkgr shall we merge?
(responded on ticket in Dolma repository)
Hi, choosing which vocabularies to add should be done as part of the UMLS installation step (step 1 in the readme). You can definitely just install SNOMED or RXNORM that...
Hi John! OLMo V1 was not explicitly trained on any instruction data. If any leak occurs, I suspect it is through the code subset we trained OLMo on....
Hey @WenJett! Your command looks correct, so it is strange that it is failing. Is the `data.json.gz` file something you could share?
I tried with your file locally on my machine:

```shell
dolma tokens \
  --documents ./data.json.gz \
  --destination ./ \
  --tokenizer.name_or_path allenai/dolma2-tokenizer \
  --tokenizer.eos_token_id 100257 \
  --tokenizer.pad_token_id 100277 \
  --dtype uint32
```

Checked the output as follows:...
For issues with OLMo code, please open an issue on [its repo](https://github.com/allenai/olmo), referencing this issue. Thank you!