Custom video dataset encoding/serialize uses all memory, process killed. How to fix?
What I need help with / What I was wondering
I want to load a dataset containing these videos without the process being OOM-killed (Colab notebook for replicating).
...How can I edit my dataset loader to use less memory when encoding videos?
Background:
I am trying to load a custom dataset with a Video feature.
When I try to tfds.load() it, or even just download_and_prepare, RAM usage goes up very high and then the process gets killed.
For example this notebook will crash if allowed to run, though with a High-RAM instance it may not.
It seems it is using over 30GB of memory to encode one or two 10 MB videos.
I would like to know how to edit/update this custom dataset so that it will not use so much memory.
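For reference, the trigger is essentially just this (a sketch of the notebook code, not a verbatim copy; the import that registers the builder follows the sign-language-datasets repo's README):

```python
# Rough sketch of the repro, not the exact notebook code.
# Assumes importing sign_language_datasets.datasets registers the dgs_corpus
# builder with TFDS, as the linked repo documents.
import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (registers dgs_corpus)

# RAM climbs past 30 GB while the videos are encoded/serialized, then the
# process is killed; the same happens with builder.download_and_prepare().
ds = tfds.load("dgs_corpus")
```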
What I've tried so far
I did a bunch of debugging and tracing of the problem with memray, etc. See this notebook and this issue for detailed analysis including a copy of the memray report.
Tried various different ideas in the notebook, including loading just a slice, editing the buffer size, and switching from .load() to download_and_prepare().
Finally I traced the problem to the serialization and encoding steps (see this comment for the details), which were allocating many GiB of memory to encode even a single 10MB video.
I discovered that even one 10MB video gets extracted to over 13k video frames, taking up nearly 5GiB of space. Serializing then takes 14-15 GiB, encoding takes another 14-15 GiB, and so the process gets killed.
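For anyone who wants to reproduce the profile, something along these lines captures it (a sketch; the output filename is a placeholder, and the memray CLI's `run`/`flamegraph` commands work equivalently):

```python
# Sketch: capture a memory profile of download_and_prepare with memray's
# Python API, then render it with `memray flamegraph memray_output_file.bin`.
from memray import Tracker

import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # noqa: F401  (registers dgs_corpus)

builder = tfds.builder("dgs_corpus")
with Tracker("memray_output_file.bin"):
    builder.download_and_prepare()
```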
Relevant items:
- The data loader in question, dgs_corpus.py
- The full memray report: memray_output_file.tar.gz
- Encoding path: The dataset also uses a custom VideoFeature, defined here. The memray report shows that encode_example here ends up allocating 14.5 GiB.
- Serialization: The memray report shows that the other memory-heavy path is serialization: split_builder.py here, which calls writer.py's serialization.
It would be nice if...
- ...there were more examples of how to efficiently load video datasets, and explanations of why they are more efficient.
- ...there were a way to do this in some sort of streaming fashion that used less memory, e.g. loading in a batch of frames, using a sliding window, etc.
- ...there were some way to set a memory limit, and just have it process more slowly within that limit.
- ...there were a way to separate the download and prepare steps, e.g. a download-only option like --download_only in the CLI.
- ...there were a warning that the dataset was using a lot of memory during processing, before the OS kills the process.
- ...for saving disk space, a way to encode and serialize videos without extracting thousands of individual frames, ballooning the size from 10MB to multiple GiB. Maybe there is and I just don't know.
- ...it was possible to download only part of a dataset. It's possible to load a slice, but only after download_and_prepare does its whole thing.
- ...more explanation of what serialization and encoding are for, maybe? What are they?
Environment information: I've tested on Colab and a few other Ubuntu workstations. High-RAM Colab instances seem to have enough memory to get past this.
Hey,
Thanks for your question. Those are some cool datasets! I'm very sorry to hear that you're running into these problems.
We brainstormed a bit and came up with a couple of ideas:
- 14-15GB for 13k frames means that each frame takes up ~1MB. IIUC ffmpeg extracts frames as PNG files. Switching to JPG could maybe bring ~5x savings. However, you'd still end up with ~3GB for a 10MB video. Not great.
- Store the encoded video in the dataset. This means that the video will stay 10MB, but that the decoding needs to happen when you use the data. I'm not sure if using ffmpeg to decode at training time would be a good solution (i.e. running a separate tool that writes 14-15 GB to disk, then reading those 14-15 GB back from disk). Alternatively, there seem to be Python libraries that can read videos, e.g. OpenCV (a rough sketch follows below).
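To illustrate the second idea: if the dataset stored the raw MP4 bytes, decoding them lazily at use time with OpenCV could look roughly like this (a sketch, not tested against your data):

```python
# Sketch: decode frames lazily from stored MP4 bytes with OpenCV.
# The temp-file step is only because cv2.VideoCapture wants a path/URL.
import tempfile
import cv2

def iter_frames(mp4_bytes):
    """Yield decoded frames one at a time instead of materializing all ~13k."""
    with tempfile.NamedTemporaryFile(suffix=".mp4") as f:
        f.write(mp4_bytes)
        f.flush()
        cap = cv2.VideoCapture(f.name)
        try:
            while True:
                ok, frame = cap.read()  # one (H, W, 3) BGR frame per call
                if not ok:
                    break
                yield frame
        finally:
            cap.release()
```

Peak memory would then be roughly one decoded frame plus the 10MB of encoded bytes, instead of the full decoded tensor.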
Even if we make storing encoded videos work, I'm worried that the problem would just be moved to when the dataset is used. Namely, reading a single example would still require 14-15 GB of memory.
After the dataset has been prepared, how are you expecting that it will be used? Would it make sense to lower the FPS (it's 50 now right)? Will users only use chunks of the video? If so, perhaps you can store the chunks instead of the entire video.
Kind regards, Tom
Tom,
Thank you very much for your reply, and those ideas!
How will they be used:
I'm just getting into Sign Language Processing research, so I'm still not quite sure how I want to use these, but potentially for training translation models from signed-language videos to spoken-language text, or for pretraining a vision transformer, or a bunch of other things. A few use cases follow:
test out models on real data
I figured I'd start learning by at least running some inference pipelines with already-trained models, and got stuck on this step. I expected running a model to take significant memory, but didn't expect that loading the video would be the issue. I guess I'm successfully learning things! Specifically I'd like to load in some videos and run this demo of segmentation+recognition pipeline.
replicate other research on github
I went looking for examples of people using these, and it seems that not many use the video option, perhaps for this very reason, that loading them is too cumbersome.
- This project on sign language translation loads actual videos in a number of places including for prediction here and here and here. And for training in this script.
replicate WMT results, or at least re-run their models
One thing I wanted to do was replicate results from the WMT Sign Language Translation contests, which provide data in a number of formats including video.
- WMT 22 data
- WMT 23 data
According to the "Findings" papers that came out of these contests, a good number of the submissions did take videos as inputs instead of poses; I'd like to be able to tinker with those pipelines.
At least load the videos and then run pose estimation on them
Another thing I wanted to do was load the videos, run a pose estimator on them, and then use the resulting keypoints, potentially to improve that part of the pipeline. A number of sign language translation models take pose keypoints as inputs, and I'd like to try those out.
At the very least I'd like to be able to do this! And then the pose-based methods may take less compute from there.
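Concretely, the kind of loop I'd like to be able to run, once I can get frames out without blowing up memory, is roughly this (just a sketch; MediaPipe is one example pose estimator, and I haven't actually run this yet):

```python
# Sketch: run a pose estimator frame-by-frame over a video, never holding
# more than one decoded frame in memory at a time.
import cv2
import mediapipe as mp

def poses_for_video(path):
    keypoints = []
    cap = cv2.VideoCapture(path)
    with mp.solutions.pose.Pose() as pose:
        while True:
            ok, frame_bgr = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB; OpenCV reads BGR.
            results = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
            keypoints.append(results.pose_landmarks)
    cap.release()
    return keypoints
```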
Regarding the suggestions:
- (Switching frame extraction from PNG to JPG) seems pretty easy to test, worth a shot!
- I admit I'm pretty ignorant about this, what is the encoding/decoding even doing exactly? What would it mean to store the encoded video, decode later, etc.? I read about it a bit, and I think I understand that encoding is to compress the frames to a video format, and decode is to expand out to the frames...? If so, then is there a way to load in only some limited number of the frames at a time? And why does the dataset need to encode when it's already encoded as a .mp4?
I guess what I'd like to be able to do, and I don't know if any of this is feasible, is:
- If I have plenty of time but not memory or hard drive space, have a way to just slowly decode as needed.
- If I have plenty of time AND hard drive space, expand it out to frames on the hard drive, but then only load into memory what I need when I need it.
- If I have memory enough to load half the video, only load half, and stream the rest in like a buffer.
- and so forth, but basically have it do its best with the available resources but not crash.
Did some further Googling, and I found a few things:
- Memory issues when loading videos into frames: the suggestion there is to use the pims library, which lets you index/slice videos and only decodes frames when they are accessed (sketch below).
- How to read part of a video and load into RAM without loading the entire video on RAM?: suggests using ffmpeg's "trim" method.
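From the pims docs, the lazy indexing would look roughly like this (I haven't tried it on the DGS videos yet; the filename is a placeholder):

```python
# Sketch based on the pims docs: frames are only decoded when indexed,
# so slicing does not load the whole video into RAM.
import pims

video = pims.Video("some_dgs_video.mp4")  # placeholder filename
print(len(video), video.frame_shape)      # metadata only, no frames decoded yet

frame = video[5000]                       # decodes just this one frame
window = video[1000:1050]                 # lazy slice; frames decode on access
```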
FPS lowering: that's another good idea. I think there might be a parameter in there to set that already; maybe tweaking it would reduce memory usage, I can try.
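If I read the tfds docs right, the stock tfds.features.Video takes an encoding_format and extra ffmpeg args, so the two tweaks might look something like this in the feature definition (I still need to check whether the custom VideoFeature in dgs_corpus passes these through):

```python
# Sketch of the PNG->JPEG and FPS-lowering ideas applied to the feature
# definition; I have not confirmed how the custom VideoFeature exposes these.
import tensorflow_datasets as tfds

video_feature = tfds.features.Video(
    shape=(None, 480, 640, 3),       # placeholder resolution
    encoding_format="jpeg",          # store frames as JPEG instead of PNG (~5x smaller?)
    ffmpeg_extra_args=("-r", "25"),  # if I understand -r right, halves the 50 fps rate
)
```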
Here is a summary of the issue opened at tensorflow/datasets#5499:
Issue: Custom video dataset encoding/serialize uses all memory, process killed. How to fix?
Summary of the problem:
- When trying to load a custom dataset containing videos with TensorFlow Datasets, the process consumes far too much memory (over 30GB for files of only 10MB).
- This happens during the video serialization and encoding stages, where thousands of frames are extracted and the process is killed for lack of RAM.
- The user has already tried several approaches, such as changing the buffer size, loading only a slice, and debugging with memray, without success.
Relevant references and resources:
- dgs_corpus.py (data loader)
- The custom VideoFeature
- Notebook for reproducing the problem
- Detailed report and analysis with memray
What the user is asking for:
- Examples of how to load videos efficiently and reduce memory usage.
- Options for streaming or sliding-window processing.
- The ability to limit the memory used by TFDS.
- Better documentation of how serialization and encoding work in TFDS.
- Options to download and prepare datasets separately, or to download only part of a dataset.
Images and reference material in the issue:
- Several images showing the high RAM usage and the processing flow.
Recommended actions
- Review the VideoFeature implementation and consider processing video in a streaming fashion, avoiding loading all frames into memory.
- Investigate the use of generators or batch processing in the TFDS pipeline (a rough sketch follows below).
- Consult the official TFDS documentation on loading video and look for examples of large video datasets.
- Ask in the same issue or open a discussion in the repo to see whether there are recommended best practices.
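As a starting point for the streaming/batching idea, a _generate_examples that yields fixed-size chunks of frames instead of whole videos could look roughly like this (purely illustrative; the field names and chunk size are made up, and the real dgs_corpus loader has many more fields):

```python
# Illustrative sketch only: yield chunks of frames per example so that no
# single example ever holds ~13k decoded frames at once.
import cv2

CHUNK_SIZE = 250  # frames per example (~5 s at 50 fps)

def _generate_examples(video_paths):
    for path in video_paths:
        cap = cv2.VideoCapture(path)
        chunk, chunk_idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            chunk.append(frame)
            if len(chunk) == CHUNK_SIZE:
                yield f"{path}-{chunk_idx}", {"frames": chunk, "source": path}
                chunk, chunk_idx = [], chunk_idx + 1
        if chunk:  # final partial chunk
            yield f"{path}-{chunk_idx}", {"frames": chunk, "source": path}
        cap.release()
```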