[Bug] Training with image masks not working (at least not for Qwen)

Open StableLlama opened this issue 1 month ago • 0 comments
Using the latest release (3.1.4) to train Qwen with multiple resolutions, masks, and regularization, I get these errors:
[RANK 0] 2025-11-30 14:17:36,803 [INFO] (id=clothing-512-image) Completed processing 291 captions.
[RANK 0] 2025-11-30 14:17:36,804 [INFO] (id=clothing-512-image) Creating VAE latent cache: vae_cache_dir='/root/SimpleTuner/modal_output/cache/vae/512'
[RANK 0] 2025-11-30 14:17:36,812 [INFO] Directory created: /root/SimpleTuner/modal_output/cache/vae/512
[RANK 0] 2025-11-30 14:17:36,819 [INFO] (id=clothing-512-image) Discovering cache objects..
[RANK 0] 2025-11-30 14:17:36,829 [INFO] VAECache has 97 unprocessed files.
Processing bucket 1.0:  77%|██████████████████████████████████████████████████▎              | 75/97 [00:05<00:01, 11.26it/s]
(id=clothing-512-image) Bucket 1.0 caching results: {'not_local': 0, 'already_cached': 0, 'cached': 1, 'total': 97}
[RANK 0] 2025-11-30 14:17:48,709 [INFO] Configured backend: {'id': 'clothing-512-image', 'config': {'repeats': 2, 'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'disable_validation': False, 'conditioning_data': 'clothing-512-mask', 'resolution': 0.262144, 'resolution_type': 'area', 'caption_strategy': 'textfile', 'instance_data_dir': '/root/traindata/corset2shirt_512_250816/image', 'maximum_image_size': None, 'target_downsample_size': None, 'dataset_type': 'image', 'hash_filenames': True}, 'dataset_type': 'image', 'instance_data_dir': '/root/traindata/corset2shirt_512_250816/image'}
[RANK 0] 2025-11-30 14:17:48,714 [INFO] Configuring data backend: clothing-512-mask: {'id': 'clothing-512-mask', 'dataset_type': 'conditioning', 'conditioning_type': 'mask', 'type': 'local', 'instance_data_dir': '/root/traindata/corset2shirt_512_250816/mask', 'image_embeds': 'image-embed-storage', 'crop': False, 'resolution_type': 'pixel_area', 'metadata_backend': 'discovery', 'caption_strategy': 'filename', 'preserve_data_backend_cache': False, 'resolution': 512, 'minimum_image_size': 384}
[RANK 0] 2025-11-30 14:17:48,724 [INFO] (id=clothing-512-mask) Loading bucket manager.
[RANK 0] 2025-11-30 14:17:48,726 [WARNING] No cache file found, creating new one.
[RANK 0] 2025-11-30 14:17:48,732 [INFO] Configured backend: {'id': 'clothing-512-mask', 'config': {'crop': False, 'crop_aspect': 'square', 'crop_style': 'random', 'disable_validation': False, 'resolution': 0.262144, 'resolution_type': 'area', 'caption_strategy': 'filename', 'instance_data_dir': '/root/traindata/corset2shirt_512_250816/mask', 'maximum_image_size': None, 'target_downsample_size': None, 'dataset_type': 'conditioning'}, 'dataset_type': 'conditioning', 'instance_data_dir': '/root/traindata/corset2shirt_512_250816/mask'}
(id=clothing-512-mask) Dataset produced no usable samples. This typically happens when:
  - batch_size * num_gpus * gradient_accumulation_steps is too large for the dataset size
  - repeats is too low
  - samples were filtered out due to resolution/aspect ratio constraints

Suggestions:
  - Reduce batch_size or gradient_accumulation_steps
  - Increase repeats
  - Use fewer GPUs
  - Add more samples to the dataset
Dataset 'clothing-512-mask' produced no usable samples.
dataset_type: conditioning
instance_data_dir: /root/traindata/corset2shirt_512_250816/mask
constraints: minimum_image_size=0.147456, resolution_type=area, effective_batch_size=4, bucket_strategy=aspect_ratio, train_batch_size=1, repeats=0
post_split: 0 (bucket_count=0)
sampler_batches: 0
Traceback (most recent call last):
  File "/root/SimpleTuner/simpletuner/train.py", line 56, in <module>
    trainer.init_data_backend()
  File "/root/SimpleTuner/simpletuner/helpers/training/trainer.py", line 2185, in init_data_backend
    raise e
  File "/root/SimpleTuner/simpletuner/helpers/training/trainer.py", line 2136, in init_data_backend
    configure_multi_databackend(
  File "/root/SimpleTuner/simpletuner/helpers/data_backend/factory.py", line 3754, in configure_multi_databackend
    return configure_multi_databackend_new(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/SimpleTuner/simpletuner/helpers/data_backend/factory.py", line 3512, in configure_multi_databackend_new
    factory.configure_data_backends(data_backend_config)
  File "/root/SimpleTuner/simpletuner/helpers/data_backend/factory.py", line 2047, in configure_data_backends
    self._configure_single_data_backend(
  File "/root/SimpleTuner/simpletuner/helpers/data_backend/factory.py", line 3230, in _configure_single_data_backend
    self._handle_bucket_operations(backend, init_backend, conditioning_type)
  File "/root/SimpleTuner/simpletuner/helpers/data_backend/factory.py", line 2401, in _handle_bucket_operations
    self._handle_config_versioning(backend, init_backend)
  File "/root/SimpleTuner/simpletuner/helpers/data_backend/factory.py", line 2497, in _handle_config_versioning
    raise ValueError(
ValueError: (id=clothing-512-mask) Dataset produced no usable samples. This typically happens when:
  - batch_size * num_gpus * gradient_accumulation_steps is too large for the dataset size
  - repeats is too low
  - samples were filtered out due to resolution/aspect ratio constraints

Suggestions:
  - Reduce batch_size or gradient_accumulation_steps
  - Increase repeats
  - Use fewer GPUs
  - Add more samples to the dataset
Dataset 'clothing-512-mask' produced no usable samples.
dataset_type: conditioning
instance_data_dir: /root/traindata/corset2shirt_512_250816/mask
constraints: minimum_image_size=0.147456, resolution_type=area, effective_batch_size=4, bucket_strategy=aspect_ratio, train_batch_size=1, repeats=0
post_split: 0 (bucket_count=0)
sampler_batches: 0
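
A note on the numbers in the log: the mask backend is configured with `'resolution': 512`, `'minimum_image_size': 384` and `'resolution_type': 'pixel_area'`, but the configured backend and the error report show `resolution: 0.262144`, `minimum_image_size=0.147456` and `resolution_type=area`. These values look like the pixel edge lengths squared and expressed in megapixels. The sketch below only checks that arithmetic. The helper name `pixel_edge_to_megapixels` is made up for illustration and is not a SimpleTuner function, and I am assuming (not asserting) that this is how the conversion is done internally.

```python
def pixel_edge_to_megapixels(edge_px: int) -> float:
    """Area of a square with the given edge length, in megapixels.

    Hypothetical helper for illustration only; not part of SimpleTuner.
    """
    return (edge_px * edge_px) / 1_000_000


# Values taken from the mask backend config in the log above.
resolution_mp = pixel_edge_to_megapixels(512)
minimum_mp = pixel_edge_to_megapixels(384)

print(f"resolution: {resolution_mp}")
print(f"minimum_image_size: {minimum_mp}")
```

If that assumption holds, the reported constraints are consistent with the config, so the conversion itself does not obviously explain why `bucket_count=0`.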

[corset2shirt_251130_qwen_R1_01.local.json](https://github.com/user-attachments/files/23840288/corset2shirt_251130_qwen_R1_01.local.json)
Nov 30 '25 15:11 StableLlama