
How to use HuggingFace Data?

Open liming-ai opened this issue 1 year ago • 11 comments

Hi, @xujz18 @Xiao9905

Thanks for this nice contribution. I noticed that we can load the ImageReward data with: datasets.load_dataset("THUDM/ImageRewardDB", "8k")

However, the loaded data does not seem to match the existing code, and I am not sure how to proceed (I downloaded the HuggingFace data and saved it to disk, so I use load_from_disk to load it):

train_dataset = load_from_disk("data/RLHF/ImageRewardDB_8k/train")
valid_dataset = load_from_disk("data/RLHF/ImageRewardDB_8k/validation")
test_dataset  = load_from_disk("data/RLHF/ImageRewardDB_8k/test")

When I print train_dataset[0].keys(), it shows the same keys as described in the HuggingFace dataset card:

dict_keys(['image', 'prompt_id', 'prompt', 'classification', 'image_amount_in_total', 'rank', 'overall_rating', 'image_text_alignment_rating', 'fidelity_rating'])

When I run python src/make_dataset.py, following the instructions in the README, this error occurs:

making dataset:   0%|                                                         | 0/10000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/tiger/code/ImageReward/train/src/make_dataset.py", line 12, in <module>
    train_dataset = rank_pair_dataset("train")
  File "/home/tiger/code/ImageReward/train/src/rank_pair_dataset.py", line 59, in __init__
    self.data = self.make_data()
  File "/home/tiger/code/ImageReward/train/src/rank_pair_dataset.py", line 80, in make_data
    for generations in item["generations"]:
KeyError: 'generations'

Unfortunately, it is not compatible with the existing dataset code: https://github.com/THUDM/ImageReward/blob/1beb4e4de0932acbe7fc090c51208048b6269b58/train/src/rank_pair_dataset.py#L47

Does this mean we have to rewrite the code if we want to use the dataset downloaded from HuggingFace?

liming-ai avatar Jun 21 '23 02:06 liming-ai

Yes. To make the dataset more readable and scalable, we have reorganised it, so it no longer directly matches the training code. We will update the training code to match the HuggingFace dataset, which is organised into single images, pairs, and groups (e.g. 8k, 8k_pair, 8k_group), so you can choose whichever is most convenient for you.

xujz18 avatar Jun 21 '23 02:06 xujz18

@xujz18

Huge thanks for the quick reply. May I ask when it will be released? If it is convenient, can you give me some suggestions so that I can solve this problem more quickly?

liming-ai avatar Jun 21 '23 03:06 liming-ai

We apologize that the training-code update may be a week or two away, as the collaborator responsible for assembling that part of the training code has been busy with some important matters of his own recently. For a quicker workaround: the data format expected by make_dataset.py is the same as https://github.com/THUDM/ImageReward/blob/main/data/test.json. You can use the ImageRewardDB "8k_group" subset (the preview on HuggingFace gives a visual impression of it) and convert its format to that of test.json. Or, more directly, you can modify rank_pair_dataset.py to work with the HuggingFace ImageRewardDB.

xujz18 avatar Jun 21 '23 03:06 xujz18
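The conversion suggested above can be sketched as follows. The exact schema of test.json is not shown in this thread, so apart from the keys listed earlier (prompt_id, prompt, rank, overall_rating) and the "generations" key from the traceback, the field layout here is an assumption:

```python
from collections import defaultdict

def regroup_by_prompt(rows):
    """Regroup flat ImageRewardDB "8k"-style rows (one image per row) into
    one record per prompt with a "generations" list, roughly the shape that
    make_dataset.py iterates over. Field names beyond those shown in this
    thread are assumptions, not the confirmed test.json schema."""
    groups = defaultdict(lambda: {"generations": []})
    for row in rows:
        rec = groups[row["prompt_id"]]
        rec["prompt_id"] = row["prompt_id"]
        rec["prompt"] = row["prompt"]
        rec["generations"].append({
            "image": row["image"],
            "rank": row["rank"],
            "overall_rating": row["overall_rating"],
        })
    # Insertion order of prompts is preserved (Python 3.7+ dicts)
    return list(groups.values())
```

The same regrouping is unnecessary if you start from the "8k_group" subset, which is already one record per prompt.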

Thanks @xujz18

Thanks a lot! One last question: how can I use ImageRewardDB "8k_group" as you just mentioned? Can I load the 8k_group or 8k_pair subset directly with datasets.load_dataset?

liming-ai avatar Jun 21 '23 03:06 liming-ai

Yes. Like this: load_dataset("THUDM/ImageRewardDB", "8k_group")

xujz18 avatar Jun 21 '23 04:06 xujz18

Thanks a lot!

liming-ai avatar Jun 21 '23 06:06 liming-ai

Yes. Like this: load_dataset("THUDM/ImageRewardDB", "8k_group")

Hi, @xujz18

I tried to download the 8k_group subset, following your instruction, with the following code:

from datasets import load_dataset

dataset = load_dataset("THUDM/ImageRewardDB", "8k_group", num_proc=8)
dataset.save_to_disk("data/ImageRewardDB_8k_group")

However, it raises an error:

Found cached dataset image_reward_db (/Users/bytedance/.cache/huggingface/datasets/THUDM___image_reward_db/8k_group/1.0.0/33d18fdde6cd866eeeab2de1471592b802627df4ade050865b4e88c500ee63b7)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 165.72it/s]
Traceback (most recent call last):
  File "/Users/bytedance/Code/download.py", line 4, in <module>
    dataset.save_to_disk("data/ImageRewardDB_8k_group")
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/dataset_dict.py", line 1225, in save_to_disk
    dataset.save_to_disk(
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/arrow_dataset.py", line 1421, in save_to_disk
    for job_id, done, content in Dataset._save_to_disk_single(**kwargs):
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/arrow_dataset.py", line 1458, in _save_to_disk_single
    writer.write_table(pa_table)
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/arrow_writer.py", line 570, in write_table
    pa_table = embed_table_storage(pa_table)
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 2290, in embed_table_storage
    arrays = [
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 2291, in <listcomp>
    embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 1837, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 1837, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 2190, in embed_array_storage
    casted_values = _e(array.values, feature.feature)
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 1839, in wrapper
    return func(array, *args, **kwargs)
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/table.py", line 2164, in embed_array_storage
    return feature.embed_storage(array)
  File "/Users/bytedance/Library/Python/3.9/lib/python/site-packages/datasets/features/image.py", line 263, in embed_storage
    storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
  File "pyarrow/array.pxi", line 2788, in pyarrow.lib.StructArray.from_arrays
  File "pyarrow/array.pxi", line 3243, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean

Besides, this error happens for both 8k_group and 8k_pair.

liming-ai avatar Jun 21 '23 13:06 liming-ai

Any news? Has the training code been finished?

LinB203 avatar Jul 08 '23 08:07 LinB203

I tried both of the suggested approaches, but hit the same error:

cannot identify image file 'cache_dir_\\downloads\\extracted\\ba65aabab9974598536781d2df59b59457a2473f4a5341d7c0e8d0dc1988830f\\deb066d4-54aa-4562-8d30-2c67a6badb98.webp'
cannot identify image file 'cache_dir_\\downloads\\extracted\\ba65aabab9974598536781d2df59b59457a2473f4a5341d7c0e8d0dc1988830f\\832ff14c-14cd-4d35-965d-bd2c1616d598.webp'
cannot identify image file 'cache_dir_\\downloads\\extracted\\ba65aabab9974598536781d2df59b59457a2473f4a5341d7c0e8d0dc1988830f\\99dddbdd-a5d3-41af-98f7-a2f8927405fe.webp'

It seems there are a few invalid images on HF.

LinB203 avatar Jul 08 '23 11:07 LinB203
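A small sketch for locating unreadable images like the ones reported above. The loader is injected as a parameter to keep the helper self-contained; in practice you would pass PIL.Image.open (Pillow's verify() performs an integrity check on the opened file):

```python
def find_unreadable(paths, open_image):
    """Collect the paths that a given image loader cannot open.
    `open_image` is any callable that raises on a bad file,
    e.g. PIL.Image.open."""
    bad = []
    for p in paths:
        try:
            img = open_image(p)
            # Pillow opens lazily; verify() forces a consistency check.
            if hasattr(img, "verify"):
                img.verify()
        except Exception:
            bad.append(p)
    return bad
```

Running this over the extracted cache directory would identify which .webp files need to be re-downloaded or skipped.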

Yes. Like this: load_dataset("THUDM/ImageRewardDB", "8k_group")

If I run load_dataset("THUDM/ImageRewardDB", "8k_pair"), I get an error:

Traceback (most recent call last):
  File "/miniconda3/envs/torch1.13.0/lib/python3.8/site-packages/datasets/builder.py", line 1637, in _prepare_split_single
    num_examples, num_bytes = writer.finalize()
  File "/miniconda3/envs/torch1.13.0/lib/python3.8/site-packages/datasets/arrow_writer.py", line 579, in finalize
    self.check_duplicate_keys()
  File "/miniconda3/envs/torch1.13.0/lib/python3.8/site-packages/datasets/arrow_writer.py", line 501, in check_duplicate_keys
    raise DuplicatedKeysError(key, duplicate_key_indices)
datasets.keyhash.DuplicatedKeysError: Found multiple examples generated with the same key
The examples at index 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 have the key 000904-0035

LinB203 avatar Jul 12 '23 02:07 LinB203
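The usual fix for a DuplicatedKeysError is to make the keys yielded by the dataset script's _generate_examples unique, e.g. by appending a running index per base key; this requires editing a local copy of the ImageRewardDB loading script. A sketch of the idea (using prompt_id as the duplicated base key is an assumption based on the '000904-0035' key in the traceback):

```python
def unique_keys(examples):
    """Yield (key, example) pairs with a per-key counter appended, so a key
    like '000904-0035' that repeats across pair examples becomes unique:
    '000904-0035-0', '000904-0035-1', ... Mirrors the common fix for
    DuplicatedKeysError inside a dataset script's _generate_examples."""
    seen = {}
    for ex in examples:
        base = ex["prompt_id"]  # assumed source of the duplicated key
        i = seen.get(base, 0)
        seen[base] = i + 1
        yield f"{base}-{i}", ex
```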

Hello, I am also working on reproducing the training results, but the train.json file on HuggingFace does not seem directly usable with make_dataset.py. Could you share the processed train.json file? Many thanks!

muse1998 avatar Nov 01 '23 11:11 muse1998

If I run load_dataset("THUDM/ImageRewardDB", "8k_pair"), I get a DuplicatedKeysError.

Have you solved it yet?

psycho-ygq avatar Oct 20 '24 12:10 psycho-ygq