About Training

Open Forainest789 opened this issue 1 year ago • 1 comments

Thanks for your work. I have some questions about training. I tried to train the model on a small portion of the data, both using the dataset online (for example https://huggingface.co/datasets/imageomics/TreeOfLife-10M/blob/main/dataset/EOL/image_set_01.tar.gz) and after downloading the dataset locally, with this command:

python -m src.training.main \
  --train-data 'https://huggingface.co/datasets/imageomics/TreeOfLife-10M/resolve/main/dataset/EOL/image_set_01.tar.gz' \
  --val-data 'https://huggingface.co/datasets/imageomics/TreeOfLife-10M/resolve/main/dataset/EOL/image_set_01.tar.gz' \
  --dataset-type 'webdataset' \
  --pretrained 'openai' \
  --text_type 'random' \
  --warmup 100 \
  --batch-size 1 \
  --accum-freq 1 \
  --epochs 10 \
  --workers 1 \
  --model ViT-B-16 \
  --lr 1e-4 \
  --log-every-n-steps 1 \
  --dataset-resampled \
  --local-loss \
  --gather-with-grad \
  --grad-checkpointing \
  --logs '../storage/log/' \
  --train-num-samples 98000

Training always gets stuck at the following point:

2024-12-11,23:16:02 | INFO | wandb_notes:
2024-12-11,23:16:02 | INFO | wandb_project_name: open-clip
2024-12-11,23:16:02 | INFO | warmup: 100
2024-12-11,23:16:02 | INFO | wd: 0.2
2024-12-11,23:16:02 | INFO | workers: 1 
2024-12-11,23:16:02 | INFO | world_size: 1 
2024-12-11,23:16:02 | INFO | zeroshot_frequency: 2 
2024-12-11,23:16:02 | INFO | Finish counting shard total size: 98000. 
2024-12-11,23:16:02 | INFO | Finish counting shard total size: 0. 
2024-12-11,23:16:02 | INFO | Start epoch 0 
<webdataset.compat.WebLoader object at 0x719706e3a170>

In addition, I found that the file "data/resolved.jsonl" is missing when creating the data with:

python scripts/evobio10m/make_metadata.py --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite

Also, the ToL-EDA Hugging Face repo mentioned in the README is no longer available.

Could you help me solve these problems, or point me to where I can find details about training?

Thank you very much.

Forainest789, Dec 12 '24 00:12

Hi, I think you should download the data to your local machine and pass the local path to the --train-data or --val-data parameter.
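A minimal sketch of that suggestion, using only the Python standard library: it downloads the shard from the URL in the question once, then builds the training command with the local path. The helper names (`local_shard_path`, `download_shard`, `training_command`) and the `data/EOL` directory are hypothetical, chosen for illustration; only flags that already appear in the command above are used, and the remaining flags would be appended as before.

```python
import shlex
import urllib.request
from pathlib import Path

# Shard URL taken from the command in the question.
SHARD_URL = (
    "https://huggingface.co/datasets/imageomics/TreeOfLife-10M/"
    "resolve/main/dataset/EOL/image_set_01.tar.gz"
)


def local_shard_path(url: str, data_dir: str) -> Path:
    """Map a shard URL to a path inside data_dir, keeping the file name."""
    return Path(data_dir) / url.rsplit("/", 1)[-1]


def download_shard(url: str, data_dir: str) -> Path:
    """Download the shard once; skip the download if the file already exists."""
    target = local_shard_path(url, data_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def training_command(train_shard: Path, val_shard: Path) -> list[str]:
    """Build the training command with local paths instead of URLs."""
    return [
        "python", "-m", "src.training.main",
        "--train-data", str(train_shard),
        "--val-data", str(val_shard),
        "--dataset-type", "webdataset",
    ]


# Example use (hypothetical local directory; performs a network download):
#   shard = download_shard(SHARD_URL, "data/EOL")
#   print(shlex.join(training_command(shard, shard)))
```

Downloading first also makes it easy to confirm that the file is complete before training starts, which is harder to tell when the loader streams it from a URL.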

work4cs, Jan 21 '25 19:01