
KeyError: 'image' in Dataset

pathikg opened this issue · 2 comments

Error

While training on my custom dataset, I am getting the following error:

KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/anaconda/envs/azureml_py38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/pibit-ml/code/donut-poc/donut/donut/util.py", line 111, in __getitem__
    input_tensor = self.donut_model.encoder.prepare_input(sample["image"], random_padding=self.split == "train")
KeyError: 'image'

I previously used Google Colab to run the same code with the same dataset structure, but I did not see this error.

Dataset logs:

Using custom data configuration dataset-1000-7363c1598f36d3b2
Reusing dataset json (/home/azureuser/.cache/huggingface/datasets/json/dataset-1000-7363c1598f36d3b2/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5)
Dataset({
    features: ['file_name', 'ground_truth'],
    num_rows: 952
})
Using custom data configuration dataset-1000-7363c1598f36d3b2
Reusing dataset json (/home/azureuser/.cache/huggingface/datasets/json/dataset-1000-7363c1598f36d3b2/0.0.0/da492aad5680612e4028e7f6ddc04b1dfcec4b64db470ed7cc5f2bb265b9b6b5)
Dataset({
    features: ['file_name', 'ground_truth'],
    num_rows: 25
})

structure of the dataset:

dataset-1000
├── test
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│             .
│             .
├── train
│   ├── metadata.jsonl
│   ├── {image_path0}
│   ├── {image_path1}
│             .
│             .
└── validation
    ├── metadata.jsonl
    ├── {image_path0}
    ├── {image_path1}
              .
              .
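
For reference, each metadata.jsonl holds one JSON object per line, with a file_name pointing at an image in the same folder and a ground_truth string; the line below is purely illustrative (the fields inside gt_parse are made up, not from my dataset):

{"file_name": "image_0.jpg", "ground_truth": "{\"gt_parse\": {\"field\": \"value\"}}"}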

command:

$ python train.py --config config/train_cord.yaml \
    --pretrained_model_name_or_path "naver-clova-ix/donut-base" \
    --dataset_name_or_paths '["dataset-1000"]' \
    --exp_version "test_experiment_1000"

config yaml:

resume_from_checkpoint_path: null # only used for resume_from_checkpoint option in PL
result_path: "./result"
pretrained_model_name_or_path: "naver-clova-ix/donut-base" # loading a pre-trained model (from model hub or path)
dataset_name_or_paths: "naver-clova-ix/donut-base"  # loading datasets (from model hub or path)
sort_json_key: False # cord dataset is preprocessed, and publicly available at https://huggingface.co/datasets/naver-clova-ix/cord-v2
train_batch_sizes: [1]
val_batch_sizes: [1]
input_size: [1280, 960] # when the input resolution differs from the pre-training setting, some weights will be newly initialized (but the model training would be okay)
max_length: 768
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-5
warmup_steps: 100 
num_training_samples_per_epoch: 900
max_epochs: 50
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 3
gradient_clip_val: 1.0
verbose: True

System Info

OS: Ubuntu 20.04.6 LTS x86_64 (Azure VM)
Python: 3.8.10
donut: 1.0.9

pathikg · Apr 02 '23

Here's the output from the same repo when run on a CPU VM:

Dataset imagefolder downloaded and prepared to /home/azureuser/.cache/huggingface/datasets/imagefolder/dataset-1000-e58f8c769205d01c/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f. Subsequent calls will reuse this data.
Dataset({
    features: ['image', 'ground_truth'],
    num_rows: 952
})
Resolving data files: 100%|██████████| 953/953 [00:00<00:00, 41663.24it/s]
Resolving data files: 100%|██████████| 27/27 [00:00<00:00, 15170.29it/s]
Resolving data files: 100%|██████████| 26/26 [00:00<00:00, 5582.96it/s]
Found cached dataset imagefolder (/home/azureuser/.cache/huggingface/datasets/imagefolder/dataset-1000-e58f8c769205d01c/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f)
Dataset({
    features: ['image', 'ground_truth'],
    num_rows: 25
})
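
For comparison, here is a minimal sketch of the difference between the two builders (assuming the Hugging Face datasets library; the path is the local folder above). The json builder only parses metadata.jsonl, so the images stay as file names, while the imagefolder builder also decodes the image files into an 'image' column, which is what DonutDataset indexes:

from datasets import load_dataset

data_dir = "dataset-1000"  # local path matching the layout above

# json builder: reads metadata.jsonl only, so images stay as file names
ds_json = load_dataset("json", data_files=f"{data_dir}/train/metadata.jsonl", split="train")
print(ds_json.features)  # file_name, ground_truth

# imagefolder builder: pairs metadata.jsonl with the image files and decodes
# them into an 'image' column
ds_img = load_dataset("imagefolder", data_dir=data_dir, split="train")
print(ds_img.features)   # image, ground_truth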

pathikg · Apr 03 '23

I ran into the same problem when running the code on my local dataset. My fix is to add two extra lines that reload the local files with the imagefolder builder, in donut/util.py just below line 64:

if 'image' not in self.dataset.features:
    # fall back to the imagefolder builder so each sample gets an 'image' column
    self.dataset = load_dataset("imagefolder", data_dir=dataset_name_or_path, split=self.split)

And it worked for me.
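
As a self-contained sketch of the same workaround (the helper name load_donut_split is mine, not from the repo; in the actual fix the two lines above sit inside DonutDataset.__init__ right after the dataset is loaded):

from datasets import load_dataset

def load_donut_split(dataset_name_or_path: str, split: str):
    # Let datasets auto-detect the builder for the local folder, as donut does.
    dataset = load_dataset(dataset_name_or_path, split=split)
    # If the detected builder did not decode the images (no 'image' column),
    # reload the folder with the imagefolder builder so each sample carries
    # a decoded image under the 'image' key.
    if "image" not in dataset.features:
        dataset = load_dataset("imagefolder", data_dir=dataset_name_or_path, split=split)
    return dataset

# e.g. train_set = load_donut_split("dataset-1000", "train")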

YY-OhioU · Oct 18 '23