About Training
Thanks for your work. I have some questions related to training. I tried to train the model with a small portion of the data, using an online dataset shard such as https://huggingface.co/datasets/imageomics/TreeOfLife-10M/blob/main/dataset/EOL/image_set_01.tar.gz, and also tried downloading the dataset locally:
python -m src.training.main \
--train-data 'https://huggingface.co/datasets/imageomics/TreeOfLife-10M/resolve/main/dataset/EOL/image_set_01.tar.gz' \
--val-data 'https://huggingface.co/datasets/imageomics/TreeOfLife-10M/resolve/main/dataset/EOL/image_set_01.tar.gz' \
--dataset-type 'webdataset' \
--pretrained 'openai' \
--text_type 'random' \
--warmup 100 \
--batch-size 1 \
--accum-freq 1 \
--epochs 10 \
--workers 1 \
--model ViT-B-16 \
--lr 1e-4 \
--log-every-n-steps 1 \
--dataset-resampled \
--local-loss \
--gather-with-grad \
--grad-checkpointing \
--logs '../storage/log/' \
--train-num-samples 98000
It always gets stuck at the following point:
2024-12-11,23:16:02 | INFO | wandb_notes:
2024-12-11,23:16:02 | INFO | wandb_project_name: open-clip
2024-12-11,23:16:02 | INFO | warmup: 100
2024-12-11,23:16:02 | INFO | wd: 0.2
2024-12-11,23:16:02 | INFO | workers: 1
2024-12-11,23:16:02 | INFO | world_size: 1
2024-12-11,23:16:02 | INFO | zeroshot_frequency: 2
2024-12-11,23:16:02 | INFO | Finish counting shard total size: 98000.
2024-12-11,23:16:02 | INFO | Finish counting shard total size: 0.
2024-12-11,23:16:02 | INFO | Start epoch 0
<webdataset.compat.WebLoader object at 0x719706e3a170>
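To narrow this down, my plan is to first confirm that the shard can be iterated outside the training script at all. A rough sketch of what I have in mind (the local path is just an illustration of wherever the shard is saved; the sample keys are whatever the archive actually contains):

```python
# Sketch: iterate one downloaded shard with webdataset to confirm it yields
# samples at all, independent of the open_clip training loop.
import webdataset as wds

# Illustrative local path; adjust to wherever the shard was downloaded.
shard = "./data/image_set_01.tar.gz"

dataset = wds.WebDataset(shard)
for i, sample in enumerate(dataset):
    print(sample.keys())  # inspect which extensions each sample carries
    if i >= 4:            # only look at the first few samples
        break
```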
In addition, I found that the "data/resolved.jsonl" file is missing when creating the data:
python scripts/evobio10m/make_metadata.py --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite
and the ToL-EDA HF repo mentioned in the README has disappeared.
Could you help me solve these problems, or point me to where I can find details about training?
Thank you very much
Hi, I think you should download the data locally and pass the local path to --train-data or --val-data.
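For example, one way to fetch a single shard is with huggingface_hub and then point --train-data at the downloaded file. This is just a sketch (the destination directory "./data" is illustrative, not from the repo):

```python
# Sketch: download one TreeOfLife-10M shard locally with huggingface_hub,
# then pass the resulting path to --train-data / --val-data.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="imageomics/TreeOfLife-10M",
    repo_type="dataset",
    filename="dataset/EOL/image_set_01.tar.gz",
    local_dir="./data",  # illustrative destination directory
)
print(local_path)  # e.g. ./data/dataset/EOL/image_set_01.tar.gz
```

Then replace the URL in your command with the printed local path, e.g. --train-data './data/dataset/EOL/image_set_01.tar.gz'.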