Some issues with batch size when using DeepSpeed
Hi,
Thanks for your excellent work.
Following this repo, I am trying to add DeepSpeed support to glide-finetune (https://github.com/afiaka87/glide-finetune).
I found that `self.local_files = file_paths[shard:][::num_shards]` causes some issues.
In my setup, I train glide-finetune on 8 GPUs with `train_micro_batch_size_per_gpu = 8`. My dataset has 6,921,916 image-text pairs, so each epoch should take 108,154 iterations (number of samples / number of GPUs / `train_micro_batch_size_per_gpu`, i.e. 6921916 / 8 / 8). However, my training log shows only 13,519 iterations per epoch (number of samples / number of GPUs / `train_micro_batch_size_per_gpu` / number of GPUs).
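
For reference, the arithmetic I used (just a plain-Python sanity check, not code from the repo):

```python
# Sanity check: expected vs. observed iterations per epoch
num_pairs = 6_921_916
num_gpus = 8
micro_batch_per_gpu = 8  # train_micro_batch_size_per_gpu

expected = num_pairs // (num_gpus * micro_batch_per_gpu)             # 108154
observed = num_pairs // (num_gpus * micro_batch_per_gpu * num_gpus)  # 13519
print(expected, observed)
```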
After reviewing the code, I found that `file_paths[shard:][::num_shards]` already splits the dataset into 8 parts manually when the dataset is created; when that dataset is then passed to `distr_backend.distribute`, it gets split across the 8 ranks again.
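
A toy illustration of what I mean (this is my understanding, assuming `num_shards` equals the world size and that `distribute` wraps the dataset in a distributed sampler):

```python
# Toy example with 64 "files" and 8 ranks (hypothetical numbers)
file_paths = list(range(64))
num_shards, shard = 8, 0  # world size and this rank's index

# 1) Manual sharding inside the dataset constructor:
local_files = file_paths[shard:][::num_shards]
print(len(local_files))  # 8 -> rank 0 already sees only 1/8 of the files

# 2) When this dataset is then handed to distr_backend.distribute as
#    training_data, the dataloader it builds samples only 1/8 of *that*
#    subset per rank, i.e. 1/64 of the data, which matches the 13519
#    iterations I observe instead of 108154.
```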
```python
(glide_model, optimizer, dataloader, _) = distr_backend.distribute(
    args=args,
    model=glide_model,
    optimizer=optimizer,
    model_parameters=[x for x in glide_model.parameters() if x.requires_grad],
    training_data=None if args.use_webdataset else dataset,
    lr_scheduler=None,  # TODO: allow for pytorch scheduler
    config_params=deepspeed_config,
)
```
My issues are as follows:
[ ] Is `self.local_files = file_paths[shard:][::num_shards]` necessary?
[ ] When we use `distr_backend.distribute`, will the dataset be divided again? (See the sketch after this list.)
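
For context, here is a minimal sketch of the change I am considering (purely hypothetical on my side; the class and argument names are simplified, not the actual ones in `image_text_datasets.py`): keep the full file list in the dataset and let the distributed sampler created by `distr_backend.distribute` do the sharding.

```python
import torch

class ImageTextFileDataset(torch.utils.data.Dataset):  # hypothetical, simplified dataset
    def __init__(self, file_paths, shard=0, num_shards=1):
        # Current behavior: shard manually, which then gets sharded again
        # by the sampler that distribute() builds.
        # self.local_files = file_paths[shard:][::num_shards]

        # Possible alternative: keep every file and rely on the sampler only.
        self.local_files = file_paths

    def __len__(self):
        return len(self.local_files)

    def __getitem__(self, idx):
        return self.local_files[idx]  # placeholder; real code loads the image/text pair
```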
Thank you very much!
https://github.com/afiaka87/latent-diffusion-deepspeed/blob/5ede46dcc9e217ef56a37f97bd5e3913b9b19435/latent_diffusion_deepspeed/image_text_datasets.py#L81