NotImplementedError: file size not implemented for 'https' files

Open zhuol opened this issue 1 year ago • 1 comments

❓ The question

Has anyone seen this error before?

I was running "torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml --wandb=null --save_overwrite" for a brand-new training run. I had replaced all the r2 paths in the config with the https:// paths provided for public download. However, file size calculation is not implemented for https files, so this error is thrown.

Is there a workaround, or does this need to be implemented?

zhuol avatar Feb 10 '24 13:02 zhuol

Hey @zhuol, at the moment you'd have to download the files and then change the paths in the config to local file paths.
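As a rough illustration of that workaround, here is a minimal sketch using only the Python standard library. The helper names `local_path_for` and `download` and the mirror-the-URL directory layout are assumptions made for illustration, not part of the OLMo codebase; any layout works as long as the paths in the config point at the downloaded files.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_path_for(url: str, root: str) -> Path:
    """Map an https URL to a path under `root` that mirrors the URL's host and path."""
    parsed = urlparse(url)
    return Path(root) / parsed.netloc / parsed.path.lstrip("/")


def download(url: str, root: str) -> Path:
    """Download `url` under `root` (hypothetical helper); skip files already present."""
    dest = local_path_for(url, root)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first so an interrupted download
        # is never mistaken for a complete file.
        tmp = dest.with_name(dest.name + ".part")
        with urllib.request.urlopen(url) as resp, open(tmp, "wb") as f:
            shutil.copyfileobj(resp, f)
        tmp.rename(dest)
    return dest
```

Calling `download(url, "/some/local/root")` for each https:// path in the config returns the local path to put in its place.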

We might be able to support streaming from the HTTPS URLs, but that depends on whether Cloudflare (R2) allows range requests. This is worth investigating.
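One way to investigate is to request a single byte and inspect the response. The sketch below is a hypothetical probe, not OLMo code: it sends a `Range: bytes=0-0` header and treats a `206 Partial Content` status as evidence that range requests are honored, in which case the `Content-Range` header also reports the total file size.

```python
import re
import urllib.request
from typing import Optional, Tuple


def total_from_content_range(header: Optional[str]) -> Optional[int]:
    """Parse the total size out of e.g. 'bytes 0-0/12345'; None if missing or unknown ('*')."""
    if not header:
        return None
    m = re.fullmatch(r"bytes\s+\d+-\d+/(\d+)", header.strip())
    return int(m.group(1)) if m else None


def probe_range_support(url: str, timeout: float = 30.0) -> Tuple[bool, Optional[int]]:
    """Return (supports_ranges, size_in_bytes_or_None) for `url` (hypothetical helper)."""
    req = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        if resp.status == 206:
            # The server honored the Range header.
            return True, total_from_content_range(resp.headers.get("Content-Range"))
        # A plain 200 means the Range header was ignored and the full body is being sent.
        length = resp.headers.get("Content-Length")
        return False, int(length) if length is not None else None
```

If the probe reports range support, the same header could in principle back both a file size lookup and chunked reads over HTTPS.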

epwalsh avatar Feb 11 '24 20:02 epwalsh

Could you please describe which datasets I need to download for pre-training? Where should I put these files, what directory structure should they follow, and how do I modify the paths in the config file? Thank you for the help.

juripapay avatar Mar 05 '24 11:03 juripapay

I apologize for our delay in response. In order to help surface current, unresolved issues, we are closing tickets prior to February 29. Please reopen your ticket if you are continuing to experience this issue. Thank you!

dumitrac avatar Apr 30 '24 18:04 dumitrac