Luca Soldaini

Results 43 comments of Luca Soldaini

mmmh I have not tried to process DE Wikipedia in a while, but when I did it last year I was not having the same issue. I've heard good things...

Hi Maksym! Thank you for this pull request. I fully support the reasoning behind adding these mappers, but I would prefer to avoid Mapper arguments that are callables, as they can...

@dirkgr shall we merge?

Hi, choosing which vocabularies to add should be done as part of the UMLS installation step (step 1 in the readme). You can definitely just install SNOMED or RXNORM that...

Hi John! OLMo v1 was not explicitly trained on any instruction data. If any leakage occurs, I suspect it is through the code subset we trained OLMo on....

Hey @WenJett! Your command looks correct, so it is strange that it is failing. Is the `data.json.gz` something you could share?

I tried with your file locally on my machine:

```shell
dolma tokens \
    --documents ./data.json.gz \
    --destination ./ \
    --tokenizer.name_or_path allenai/dolma2-tokenizer \
    --tokenizer.eos_token_id 100257 \
    --tokenizer.pad_token_id 100277 \
    --dtype uint32
```

Checked the output as follows:...

For issues with OLMo code, please open an issue on [its repo](https://github.com/allenai/olmo), referencing this issue. Thank you!