Feature request: download all features but only load part of the DGS Corpus at a time?
When attempting to load the DGS Corpus's default configuration on either my own workstation or in Colab, I run out of memory and crash.
Here are some screenshots
This notebook, for example, will crash given enough time: https://colab.research.google.com/drive/1_vWFvWo0ZMg5_6AFU6Ln2LPHwm9TW_Rz?usp=sharing
Is there a way to download all the features (video, pose, and gloss) but load only a portion into memory at a time?
Possibly something like
dgs_corpus = tfds.load('dgs_corpus', split=["train:2%"])
would work
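For reference, the documented TFDS sub-split syntax uses brackets, e.g. train[:2%]; what I don't know is whether that avoids preparing the full dataset first. A minimal sketch, assuming the sign_language_datasets import registers the builder as in the repo README:

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers 'dgs_corpus' with tfds

# Documented slicing syntax; unclear (to me) whether preparation still happens for everything.
dgs_corpus = tfds.load('dgs_corpus', split='train[:2%]')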
If that works, I wonder if it would be good to:
a. warn the user of the projected memory usage somehow when they run the .load command?
b. change the default load to not load the entire dataset into memory?
you can do it, DGS Corpus! I believe in you!
sigh
Giving it a try on my personal workstation
...nope, still "Killed"
All of these crash on my workstation, using up all 33 GB:
# dgs_corpus = tfds.load('dgs_corpus') # Killed
# dgs_corpus = tfds.load('dgs_corpus', split=["train:2%"]) # Killed
# dgs_corpus = tfds.load('dgs_corpus', split=["train:100"]) # Killed
# dgs_corpus = tfds.load('dgs_corpus', split=["train:10"]) # still Killed
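One thing I haven't actually tried: if I'm reading the TFDS docs right, tfds.download.DownloadConfig has a max_examples_per_split option intended for debugging, which might at least keep split generation small. A sketch, untested here:

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers 'dgs_corpus'

# Only generate a handful of examples per split (a debugging option per the TFDS docs).
download_config = tfds.download.DownloadConfig(max_examples_per_split=10)
dgs_corpus = tfds.load(
    'dgs_corpus',
    download_and_prepare_kwargs=dict(download_config=download_config),
)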
Maybe one of these tricks can work?
https://www.tensorflow.org/guide/data_performance#reducing_memory_footprint
Maybe a custom data generator? https://medium.com/analytics-vidhya/write-your-own-custom-data-generator-for-tensorflow-keras-1252b64e41c3
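Roughly what I have in mind if I end up hand-rolling it (everything here is hypothetical: the paths, and the assumption that the raw videos are already on disk):

import tensorflow as tf

def example_generator(video_paths):
    # Decode/process one video at a time here, so only one example
    # ever needs to fit in memory.
    for path in video_paths:
        yield path

# Assumed location of already-downloaded videos.
video_paths = tf.io.gfile.glob('/path/to/downloaded/videos/*.mp4')
dataset = tf.data.Dataset.from_generator(
    lambda: example_generator(video_paths),
    output_signature=tf.TensorSpec(shape=(), dtype=tf.string),
)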
https://www.tensorflow.org/datasets/performances#large_datasets
Maybe something from here? https://github.com/tensorflow/tfjs/issues/7801
Oh hey, this looks relevant, and I see a familiar name: https://github.com/huggingface/datasets/issues/741. It's the Hugging Face datasets library though, not tfds.
Reading https://www.tensorflow.org/datasets/api_docs/python/tfds/load, maybe we can do as_dataset separately?
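i.e. something like this, splitting tfds.load into its builder steps (just a sketch, I haven't verified it changes anything for this dataset):

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers 'dgs_corpus'

builder = tfds.builder('dgs_corpus')
builder.download_and_prepare()               # the step I suspect is blowing up
dataset = builder.as_dataset(split='train')  # reads back from the prepared files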
https://www.reddit.com/r/deeplearning/comments/z8otan/if_the_dataset_is_too_big_to_fit_into_your_ram/ some more possibilities
OK, so tf.data.Dataset does support streaming (https://stackoverflow.com/questions/63140320/how-to-use-sequence-generator-on-tf-data-dataset-object-to-fit-partial-data-into), so is the issue coming from the split generation?
Using the "manually add print statements to the site-packages in my conda env" method I kept following it all the way down and it gets killed in here:
https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/dgs_corpus/dgs_corpus.py#L330,
and makes it to here, https://github.com/tensorflow/datasets/blob/v4.9.3/tensorflow_datasets/core/dataset_builder.py#L1584
and makes it to here
https://github.com/tensorflow/datasets/blob/v4.9.3/tensorflow_datasets/core/split_builder.py#L415
and gets killed around there somewhere
https://github.com/gruns/icecream might be helpful, note to self
Or, you know, I could look at one of these: https://stackify.com/top-5-python-memory-profilers/
Or https://www.tensorflow.org/guide/profiler#memory_profile_tool
https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.07-Timing-and-Profiling.ipynb could also be of use
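Or, simplest of all, stdlib tracemalloc around the suspect step; it only sees Python-level allocations, so it may miss whatever ffmpeg/numpy do natively, but it's a start (sketch):

import tracemalloc
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers 'dgs_corpus'

tracemalloc.start()
builder = tfds.builder('dgs_corpus')
builder.download_and_prepare()  # the step under suspicion
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 1e9:.2f} GB, peak={peak / 1e9:.2f} GB")
tracemalloc.stop()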
I've done a lot of searching, but as far as I can tell tfds just doesn't have a way to stream part of a large dataset.
So I still can't figure out how to (1) download only a portion, or (2) assuming everything has been downloaded successfully, load only a portion into memory without the split generation using all available memory.
Lots of comments... in the future it would be helpful if you kept editing the same comment, or a small handful of them.
When you:
tfds.load('dgs_corpus', split=["train:2%"])
What happens is that the entire dataset is prepared first, and only then is 2% of it loaded. So you will need exactly the same amount of disk space.
Now since there are two processes here:
- preparing the entire dataset
- loading a part of the dataset
Can you tell where the memory consumption is too high? My suspicion is number 1, but I don't know.
Right, sorry, I forget that I'm not the only one getting spammed by all these, apologies.
I'm also suspecting 1, based on the fact that I can sprinkle print statements all the way until https://github.com/tensorflow/datasets/blob/v4.9.3/tensorflow_datasets/core/split_builder.py#L415.
Edit: my big issue is that testing this currently requires running it until it crashes, which, on Google Colab, means that any modifications I've made to the code are then gone. I've got a workstation I can test on locally, but I can't access it as conveniently.
Edit again:
Certainly the download_and_prepare is using a lot of RAM, though this particular instance of Colab has not yet crashed:
Edit 3: ... and it crashed after using all available RAM. So that step does seem to use a lot of memory... but the downloaded files were not actually deleted. OK, I can work with that, perhaps.
Edit 4: OK, trying it in a "high-memory" notebook in Colab Pro, I get this:
Edit 5: full stacktrace on the high-memory notebook:
ValueError Traceback (most recent call last)
<ipython-input-5-6bad8ee20d7b> in <cell line: 1>()
----> 1 dgs_corpus = tfds.load('dgs_corpus', )
10 frames
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/logging/__init__.py in __call__(self, function, instance, args, kwargs)
166 metadata = self._start_call()
167 try:
--> 168 return function(*args, **kwargs)
169 except Exception:
170 metadata.mark_error()
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/load.py in load(name, split, data_dir, batch_size, shuffle_files, download, as_supervised, decoders, read_config, with_info, builder_kwargs, download_and_prepare_kwargs, as_dataset_kwargs, try_gcs)
647 try_gcs,
648 )
--> 649 _download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
650
651 if as_dataset_kwargs is None:
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/load.py in _download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
506 if download:
507 download_and_prepare_kwargs = download_and_prepare_kwargs or {}
--> 508 dbuilder.download_and_prepare(**download_and_prepare_kwargs)
509
510
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/logging/__init__.py in __call__(self, function, instance, args, kwargs)
166 metadata = self._start_call()
167 try:
--> 168 return function(*args, **kwargs)
169 except Exception:
170 metadata.mark_error()
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py in download_and_prepare(self, download_dir, download_config, file_format)
697 self.info.read_from_directory(self.data_dir)
698 else:
--> 699 self._download_and_prepare(
700 dl_manager=dl_manager,
701 download_config=download_config,
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py in _download_and_prepare(self, dl_manager, download_config)
1666 return
1667
-> 1668 split_infos = self._generate_splits(dl_manager, download_config)
1669
1670 # Update the info object with the splits.
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py in _generate_splits(self, dl_manager, download_config)
1641 ):
1642 filename_template = self._get_filename_template(split_name=split_name)
-> 1643 future = split_builder.submit_split_generation(
1644 split_name=split_name,
1645 generator=generator,
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/split_builder.py in submit_split_generation(self, split_name, generator, filename_template, disable_shuffling)
329 # `_build_from_xyz` method.
330 if isinstance(generator, collections.abc.Iterable):
--> 331 return self._build_from_generator(**build_kwargs)
332 else: # Otherwise, beam required
333 unknown_generator_type = TypeError(
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/split_builder.py in _build_from_generator(self, split_name, generator, filename_template, disable_shuffling)
400 except Exception as e: # pylint: disable=broad-except
401 utils.reraise(e, prefix=f'Failed to encode example:\n{example}\n')
--> 402 writer.write(key, example)
403 shard_lengths, total_size = writer.finalize()
404
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/writer.py in write(self, key, example)
225 example: the Example to write to the shard.
226 """
--> 227 serialized_example = self._serializer.serialize_example(example=example)
228 self._shuffler.add(key, serialized_example)
229 self._num_examples += 1
/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/example_serializer.py in serialize_example(self, example)
96 serialize_proto: `str`, the serialized `tf.train.Example` proto
97 """
---> 98 return self.get_tf_example(example).SerializeToString()
99
100
Edit: and here's how many resources were used:
Update: OK, it seems that just these three files being encoded is enough to use many gigabytes.
Well I am thoroughly stumped. I've narrowed it down to where in tfds the massive memory allocations are happening, but I still don't know why.
I just don't understand why it needs nearly 30 GiB to "encode" and then "serialize" the videos
Here's the memray report. memray_output_file.tar.gz
For some reason reading in the frames results in over 50k allocations? many GiB worth? This line right here, officer: https://github.com/google/etils/blob/main/etils/epath/abstract_path.py#L149 Called by this one https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/features/video_feature.py#L152
The serialize_example calls use huge amounts of memory as well.
I don't know what to do or how to fix it. I know the UCF101 dataset doesn't have this issue. If anyone has thoughts, let me know.
This is all before we even get to the protobuf max size error
Update: well, I inspected the tmp folder that gets created, and there are indeed nearly 13k frames extracted from just one of the videos:
which ends up being nearly 5GB legitimately:
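Doing the back-of-envelope math on that (the frame count is from the tmp folder above; the resolution is just my guess, I haven't checked the actual videos):

# Rough arithmetic, not a measurement: ~13,000 decoded RGB uint8 frames,
# at an assumed 640x360 resolution, all held in memory at once.
frames = 13_000
height, width, channels = 360, 640, 3  # assumed resolution
total_bytes = frames * height * width * channels
print(f"{total_bytes / 1e9:.1f} GB uncompressed")  # roughly 9 GB for a single video

So one long video alone could plausibly account for several GiB in RAM before serialization even starts.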
Note: perhaps something in here is relevant?
- https://github.com/sign-language-processing/datasets/blob/master/sign_language_datasets/datasets/config.py#L61
- and https://www.tensorflow.org/datasets/api_docs/python/tfds/features/Video#example
Like maybe there's a setting in there to not load every file but just a list of paths?
Ok, so to me it seems like you are not using the appropriate config.
> Like maybe there's a setting in there to not load every file but just a list of paths?
The "correct" config here would be:
config = DgsCorpusConfig(name="only-annotations", version="1.0.0", include_video=False, include_pose=None)
dgs_corpus = tfds.load('dgs_corpus', builder_kwargs=dict(config=config))
Which loads only the annotations.
You want to download the videos but load them as paths? include_video=True, process_video=False
You want to load poses? include_pose="holistic" or include_pose="openpose"
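For example, combining those options, something along these lines (a sketch; the config name is arbitrary and the import path may vary depending on your install):

import tensorflow_datasets as tfds
import sign_language_datasets.datasets
# Import path assumed; adjust if DgsCorpusConfig is exposed elsewhere.
from sign_language_datasets.datasets.dgs_corpus import DgsCorpusConfig

config = DgsCorpusConfig(
    name="videos-as-paths",   # arbitrary name for this prepared variant
    version="1.0.0",
    include_video=True,       # download the videos
    process_video=False,      # but return them as file paths, not decoded frames
    include_pose="holistic",  # also load holistic poses
)
dgs_corpus = tfds.load('dgs_corpus', builder_kwargs=dict(config=config))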