Daft feat: Add new VideoFile FileType

Is your feature request related to a problem?

Given the discussion in https://github.com/Eventual-Inc/Daft/discussions/5054, there is a near-term opportunity to add valuable video processing capabilities by extending the daft.File DataType.

In conjunction with a daft VideoFile there would also be a daft AudioFile.

As mentioned in the discussion a video type would need to support methods for:

Reading Metadata (Including width, height, fps, frame_count, and time_base)
Extracting Keyframes
Reading image frames (image frames + seeking)
Reading audio frames (fixed duration + seeking)

This is intended to support several of the use cases outlined in the discussion and to streamline image, audio, and video AI preprocessing for inference/training.

A few extra notes. The standard representation of an Image DataType in Daft is materialized as a numpy array in a UDF. This appears to be the standard format for performing inference on open-source audio and video models as well, while closed-source/proprietary inference providers prefer HTTP or base64 data URLs. File references or tempfile URI references are also sufficient for many transcription use cases, but from what I've seen most workloads would appreciate intelligent numpy conversion.

Finally, the AudioFile and VideoFile types help set the stage for Daft to more natively support audio and video AI workloads. Working with files in this manner will also help feed development in the wide world of documents like PDFs, HTML, Docx, PPT, and so on.

Describe the solution you'd like

Use PyAV to extract metadata, keyframes, image frames, and audio frames separately. These should materialize to metadata-enriched numpy arrays, or should at least be accompanied by metadata and keyframe info for downstream inference/training.
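A minimal sketch of what the PyAV side could look like, assuming the first video stream is the one of interest. The function names (`summarize_stream`, `read_metadata`, `iter_keyframes`) are hypothetical and are not part of Daft or of the linked PR. PyAV is imported lazily so the pure helper can be used without it.

```python
from fractions import Fraction
from typing import Any, Dict, Iterator, Optional


def summarize_stream(
    width: Optional[int],
    height: Optional[int],
    average_rate: Optional[Fraction],
    frames: Optional[int],
    time_base: Optional[Fraction],
    duration_ticks: Optional[int],
) -> Dict[str, Any]:
    """Turn raw stream fields into plain Python metadata (hypothetical helper).

    Unknown values (None or 0) are reported as None.
    """
    duration = None
    if duration_ticks is not None and time_base:
        # Stream durations are expressed in time_base units.
        duration = float(duration_ticks * time_base)
    return {
        "width": width or None,
        "height": height or None,
        "fps": float(average_rate) if average_rate else None,
        "duration": duration,
        "frame_count": frames or None,
        "time_base": float(time_base) if time_base else None,
    }


def read_metadata(path: str) -> Dict[str, Any]:
    """Read metadata for the first video stream without decoding frames."""
    import av  # PyAV, imported lazily

    with av.open(path) as container:
        stream = container.streams.video[0]
        ctx = stream.codec_context
        return summarize_stream(
            ctx.width,
            ctx.height,
            stream.average_rate,
            stream.frames,
            stream.time_base,
            stream.duration,
        )


def iter_keyframes(path: str) -> Iterator[Dict[str, Any]]:
    """Yield keyframes as RGB numpy arrays, each with its timestamp."""
    import av  # PyAV, imported lazily

    with av.open(path) as container:
        stream = container.streams.video[0]
        # Ask the decoder to skip everything except keyframes.
        stream.codec_context.skip_frame = "NONKEY"
        for frame in container.decode(stream):
            yield {
                "pts_seconds": frame.time,
                "array": frame.to_ndarray(format="rgb24"),
            }
```

Yielding a small dict per frame keeps the timestamp next to the array, which is one way to satisfy the "accompanied by metadata" requirement; a real implementation might prefer a struct column instead.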

Use Soundfile for extracting audio from audio files with resampling and seek support.
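As a rough illustration, reading a fixed-duration clip with seeking could look like the sketch below. The helper names are hypothetical. Note that soundfile itself reads and writes audio but does not resample, so this sketch swaps in plain linear interpolation with numpy as a stand-in; a production pipeline would likely want a proper band-limited resampler.

```python
from typing import Optional, Tuple


def seconds_to_frames(seconds: float, samplerate: int) -> int:
    """Convert a time offset in seconds to a frame index."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return int(round(seconds * samplerate))


def resampled_length(n_frames: int, source_rate: int, target_rate: int) -> int:
    """Number of frames after resampling, rounded to the nearest integer."""
    return (n_frames * target_rate + source_rate // 2) // source_rate


def read_clip(
    path: str,
    start_s: float,
    duration_s: float,
    target_rate: Optional[int] = None,
) -> Tuple["object", int]:
    """Seek to start_s, read duration_s of audio, optionally resample.

    Returns (array of shape (frames, channels), samplerate).
    """
    import numpy as np  # imported lazily, optional dependencies
    import soundfile as sf

    with sf.SoundFile(path) as f:
        rate = f.samplerate
        f.seek(seconds_to_frames(start_s, rate))
        data = f.read(
            seconds_to_frames(duration_s, rate), dtype="float32", always_2d=True
        )

    if target_rate is None or target_rate == rate or len(data) == 0:
        return data, rate

    # Stand-in resampler: per-channel linear interpolation (not band-limited).
    n_out = resampled_length(len(data), rate, target_rate)
    src_t = np.arange(len(data)) / rate
    dst_t = np.arange(n_out) / target_rate
    channels = [np.interp(dst_t, src_t, data[:, c]) for c in range(data.shape[1])]
    return np.stack(channels, axis=1).astype("float32"), target_rate


def write_clip(path: str, data, samplerate: int) -> None:
    """Write audio back to a new file; the format is inferred from the extension."""
    import soundfile as sf

    sf.write(path, data, samplerate)
```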

One last note on audio: we should be able to write audio back to a new audio file.

Describe alternatives you've considered

A native DataType implementation has been considered, but since working with audio and video files is a fundamentally different workload (data tends to be processed in a streaming fashion), a file-based approach makes more sense.

Additional Context

@universalmind303 @stayrascal @malcolmgreaves @jaychia

Would you like to implement a fix?

No

Oct 23 '25 17:10 everettVT

There are also a few related feature requests for read_video_frames in other issues, in case you end up referencing that: https://github.com/Eventual-Inc/Daft/issues/5174 https://github.com/Eventual-Inc/Daft/issues/5173

Oct 23 '25 18:10 everettVT

Hi @everettVT, may I ask: once Daft natively supports Video & Audio DataTypes, what are the benefits compared to using third-party libraries?

Will the daft community start to implement it?

Oct 27 '25 02:10 caican00

Basically we are taking 3rd party implementations and implementing them in UDFs to start. Daft is operating as the data engine and vectorizing that task.

My goal is to continue to discover and implement common preprocessing functions that steadily make it trivial to work with modalities traditionally stuck in files (audio, video, docs). That way it's clear how to make that data available for inference.

Inside the UDF we're running Python, but since Daft manages memory and parallelization both locally and distributed, you don't have to worry about scaling, so you can build more pipelines instead of architecting data infrastructure.
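A rough sketch of that pattern, not the implementation discussed in this issue: a third-party library (here PyAV) does the per-file work, and Daft applies it across a column of paths. It assumes the batch-style `@daft.udf` decorator; the helper names and the `duration_s` column name are made up for illustration.

```python
from typing import Callable, Iterable, List, Optional


def probe_duration_seconds(path: str) -> float:
    """Read a video's duration from its container header using PyAV."""
    import av  # PyAV, imported lazily

    with av.open(path) as container:
        if container.duration is None:
            raise ValueError(f"no duration reported for {path}")
        # Container durations are expressed in av.time_base units (microseconds).
        return container.duration / av.time_base


def durations_for(
    paths: Iterable[str],
    probe: Callable[[str], float] = probe_duration_seconds,
) -> List[Optional[float]]:
    """Apply probe to every path; unreadable files become None."""
    out: List[Optional[float]] = []
    for path in paths:
        try:
            out.append(float(probe(path)))
        except Exception:
            # Deliberately broad: one bad file should not fail the whole batch.
            out.append(None)
    return out


def add_duration_column(paths: Iterable[str]):
    """Build a DataFrame of paths with a duration_s column computed in a UDF."""
    import daft  # imported lazily
    from daft import DataType

    @daft.udf(return_dtype=DataType.float64())
    def video_duration(path_col):
        # The UDF receives a batch (a daft Series) and returns one value per row.
        return durations_for(path_col.to_pylist())

    df = daft.from_pydict({"path": list(paths)})
    return df.with_column("duration_s", video_duration(df["path"]))
```

The probing logic stays an ordinary Python function that can be tested on its own, while Daft decides how batches are scheduled.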

Eventually most expressions/functions will move to Rust, which, for the VideoFile type, @universalmind303 has already given us a strong head start on.

Let me know if that answers your question.

Concerning contributions, absolutely! Feel free to submit a PR, create an issue, or start a discussion.

Also happy to answer questions here.

Oct 27 '25 02:10 everettVT

@everettVT thanks for your reply. It would be clearer if there was a demo. Thank you!

I'm very curious whether this will have advantages compared to Ray Data. Will there be an improvement in performance if a third-party library is used?

Nov 04 '25 03:11 caican00

Implemented in https://github.com/Eventual-Inc/Daft/pull/5346.

One advantage of the video file type is the ability to get information and metadata about a video in a simple and efficient way, e.g.

from typing import TypedDict

# Every field may be None when the container does not report it.
class VideoMetadata(TypedDict):
    width: int | None
    height: int | None
    fps: float | None
    duration: float | None
    frame_count: int | None
    time_base: float | None

In addition, we will be able to apply optimizations such as pushdowns. For example, if a query only needs metadata, we may not need to download the whole video into memory.
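To get a feel for how much of a file a metadata-only read actually touches, one option is to wrap the file in a reader that counts bytes. This is only an illustrative sketch: `CountingReader` and `fraction_read_for_metadata` are hypothetical names, and the result depends on the container layout (an MP4 whose index sits at the end of the file requires a seek there first).

```python
import io
import os


class CountingReader(io.RawIOBase):
    """File-like wrapper that records how many bytes were actually read."""

    def __init__(self, raw):
        super().__init__()
        self._raw = raw
        self.bytes_read = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        data = self._raw.read(len(b))
        b[: len(data)] = data
        self.bytes_read += len(data)
        return len(data)

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        return self._raw.seek(offset, whence)

    def tell(self) -> int:
        return self._raw.tell()


def fraction_read_for_metadata(path: str) -> float:
    """Open a video, touch only stream metadata, and report bytes read / file size."""
    import av  # PyAV, imported lazily; it accepts file-like objects

    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        reader = CountingReader(fh)
        with av.open(reader, mode="r") as container:
            stream = container.streams.video[0]
            ctx = stream.codec_context
            # Access metadata only; no frames are decoded here.
            _ = (ctx.width, ctx.height, stream.average_rate, stream.frames)
        return reader.bytes_read / size if size else 0.0
```

The same wrapper could sit in front of a ranged remote reader, which is where skipping the full download would matter most.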

Nov 26 '25 21:11 colin-ho