
API discussion

kmsquire opened this issue 11 years ago • 19 comments

As discussed in https://github.com/ihnorton/libAV.jl/pull/2, it would be good to come up with (or copy) a nice front-end API. (Backend to be discussed elsewhere.)

Since I want to use this immediately for a project, I added an OpenCV-like AVCapture interface already (see https://github.com/kmsquire/AV.jl/blob/master/src/av_capture.jl, http://docs.opencv.org/modules/highgui/doc/reading_and_writing_images_and_video.html#videocapture). It's based on tutorial01.jl, is currently video-only, and only works with files. It should be easy to support cameras (libav does), but it will need some changes to work with audio.

Other API interface ideas/options include:

  • an iterator API
  • a callback API, possibly passing callbacks to libav
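
To make the iterator option concrete, here is a minimal sketch of what a timestamped frame iterator might look like (all names here are hypothetical, and plain arrays stand in for decoded frames):

```julia
# Hypothetical sketch of an iterator-style API. Each iteration yields a
# frame together with its presentation timestamp. Not an existing
# interface -- just an illustration of the shape of the API.
struct FrameIterator
    frames::Vector{Matrix{UInt8}}   # stand-in for decoded frames
    fps::Float64
end

function Base.iterate(it::FrameIterator, i=1)
    i > length(it.frames) && return nothing
    timestamp = (i - 1) / it.fps    # presentation time of frame i
    return ((frame=it.frames[i], time=timestamp), i + 1)
end

Base.length(it::FrameIterator) = length(it.frames)

# Usage: iterate frames together with their timestamps
it = FrameIterator([zeros(UInt8, 4, 4) for _ in 1:3], 25.0)
times = [f.time for f in it]        # [0.0, 0.04, 0.08] at 25 fps
```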

Other questions include to what extent this should depend on/interact with Images.jl. I included a couple of read functions for Images in the current AVCapture interface, but only if Images is loaded first. I'm inclined to go all in with Images, but would love some feedback.

cc: @timholy @lucasb-eyer @ihnorton

kmsquire avatar Jun 05 '14 19:06 kmsquire

I guess one big question is how low-level we want to go. Last I looked at it, I found the OpenCV video API extremely lacking. (Except for the webcam part.) I really enjoyed AVBin's API, which is what I'd call a mid-level API with a possibility of easily going low-level.

Ideally, I'd like to end up having a very high-level API à la OpenCV, but including timestamps and some random-access/seek functions, and a mid/low-level API such as AVBin's.

  1. For simple video-analysis, I'd guess an iterator-style is a nice high-level API, though the iterator should also contain a timestamp or a time delta. I didn't see any handling of time in the tutorial01, so it plays back as quickly as it can?
  2. For some video analysis, I once wanted to iterate through the full keyframes only. I was amazed at how uncommon that wish is! I also don't know how feasible it is with libav*
  3. I'm not a fan of callback-style. It makes small toys and example code look much cleaner, but for anything more substantial, one usually wants to have control over the main-loop.
  4. I'd try to not depend on Images.jl, but integrate with it when it's present, as you did. See also JuliaLang/julia#2025. It should be possible to use AV.jl for audio-analysis without having to install any image-libs. I could imagine games using AV.jl for reading but something like an SDL/OpenGL wrapper for displaying. etc.

lucasb-eyer avatar Jun 05 '14 20:06 lucasb-eyer

Thanks for your feedback, Lucas, here and at #2.

  1. For simple video-analysis, I'd guess an iterator-style is a nice high-level API, though the iterator should also contain a timestamp or a time delta.

That's doable. It's also possible to make different iterators for different common cases.

I didn't see any handling of time in the tutorial01, so it plays back as quickly as it can?

Yes, that's right.

  1. I'd try to not depend on Images.jl, but integrate with it when it's present, as you did. See also JuliaLang/julia#2025. It should be possible to use AV.jl for audio-analysis without having to install any image-libs.

That's also quite reasonable.

kmsquire avatar Jun 05 '14 21:06 kmsquire

Sorry to take so long to get to this (I've been traveling).

An iterable interface is definitely more attractive. When I was planning on implementing AV support in Images (which started in the very early days of Images, but never became enough of a priority to finish), I was thinking there should be a StreamingImage type. It would be easy to define an iterator interface with such an object, and it could maintain the current timestamp in the object. I also often want to index movies by frame number; not sure if StreamingImage should have both as fields, or whether we should convert between the two (conversion is sometimes a bit problematic because I have noticed video files where the framerate is not coded correctly).

In addition to StreamingImage, there should also be a seekable type. I'm not sure whether it needs to have anything other than an AbstractArray interface (asking for img["t", 17] gives you the 17th frame), but obviously the implementation needs to differ so it would have to be a different type. SeekableImage? Doesn't exactly roll off the tongue.
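
As a rough sketch of the seekable idea (names and layout are hypothetical, with plain arrays standing in for a decoded stream):

```julia
# Illustrative sketch of a SeekableImage-like type: indexing by frame
# number through an AbstractArray-style interface, plus frame <-> time
# conversion. None of these names exist yet.
struct SeekableVideo
    frames::Vector{Matrix{Float64}}  # stand-in for a demuxed video stream
    framerate::Float64
end

# img["t", 17] gives the 17th frame, as suggested above
function Base.getindex(v::SeekableVideo, dim::String, i::Int)
    dim == "t" || throw(ArgumentError("only temporal indexing sketched"))
    return v.frames[i]
end

# frame number -> timestamp (problematic when the encoded framerate is
# wrong, as noted above)
frametime(v::SeekableVideo, i::Int) = (i - 1) / v.framerate

v = SeekableVideo([fill(float(i), 2, 2) for i in 1:30], 25.0)
v["t", 17]            # the 17th frame
frametime(v, 26)      # 1.0 second in, at 25 fps
```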

Lucas' points are, as always, right on target. (l also like keyframes.)

I am happy to help write some/all of this code. (Of course, I have no objections if someone else prefers to do it themselves.) It may take another couple of days before I can get to it, but this is a very exciting direction!

timholy avatar Jun 07 '14 12:06 timholy

No worries, Tim--I really appreciate your input and collaboration, and I know what it's like to be busy.

I think both the streaming and seekable interfaces sound fine. The media types that I'm working with have video and audio, so it might be nice to generalize the interface. AVBin actually does this nicely--for each media type you want, you "open" a separate stream. Presumably the underlying stream is demuxed efficiently behind the scenes.

Anyway, for now, I'll focus on the backend and providing an interface (through AVBin or directly) which makes the front end easy/possible.

kmsquire avatar Jun 07 '14 19:06 kmsquire

One thing I should have remembered earlier: Images already has an infrastructure for on-demand loading of specific file formats. We could simply have it triggered whenever someone says imread("file.avi") or *.mov or *.whateverotherextensionsareused. The only problem I see is that there are quite a lot of such different extensions.

You don't have to go that way, but it's already in place and also happens to solve the does-it-matter-which-order-I-load-these-packages-in? problem.

timholy avatar Jun 07 '14 19:06 timholy

I guess that page doesn't make it explicit, but the code needed to parse the format only gets loaded when a user actually needs it. So this approach would not make AV.jl a requirement for using Images.

timholy avatar Jun 07 '14 19:06 timholy

That sounds like a good idea, so we'd have an imread(..) API for loading the whole video into an (x,y,t) Image ignoring audio, then maybe an avread(..) for an AVBin-like API.

I just thought about subtitles, I have no clue about them and whether we want to support them. AVBin doesn't.

lucasb-eyer avatar Jun 07 '14 20:06 lucasb-eyer

Being able to imread() a video (or a chunk of video) into an (x,y,t) Image does sound quite useful, as does avread(). Just in case it's not obvious, libav automatically handles parsing of a ton of video (and audio and subtitle) types already (which is the main reason to wrap it), so we don't really need to tell it the stream type.

kmsquire avatar Jun 07 '14 20:06 kmsquire

Does Images.jl support, or is it easy to add support for, image types like yuv420p, which is a planar format in which the u and v planes are 1/4 the size of the y plane?

kmsquire avatar Jun 07 '14 21:06 kmsquire

There is no "real" support for yuv420p, although if you create an Image from an array and set the "colorspace" property to "YUV420p" then at least you'll have specified its meaning.

In terms of implementing some code that actually does something with such images, I suspect the nastiest part will be dealing with the size. (EDIT: because each frame is a 2d array, but the number of actual pixels along the second coordinate is size(A, 2)*(2//3).) Images already has sdims, which is like ndims except specifically for the spatial coordinates. An RGB image encoded as img[c,x,y,t] would yield sdims(img) == 2. We could add ssize(img, d) to return the number of pixels along the dth spatial dimension. The harder part would be making sure ssize, rather than size, gets used in the appropriate places.
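
A minimal sketch of what ssize could report for a yuv420p buffer stored as a single 2d array (illustrative only, not existing Images.jl API; assumes the three planes are stacked along the second coordinate):

```julia
# Hypothetical ssize behavior for yuv420p: the raw array is X by Y, but
# the physical image is X by 2Y/3, because the Y' plane plus the two
# quarter-size chroma planes occupy 3/2 of the image height.
spatial_height_yuv420p(raw_cols::Int) = raw_cols * 2 ÷ 3

ssize_sketch(A::AbstractMatrix) = (size(A, 1), spatial_height_yuv420p(size(A, 2)))

# A 640x720 raw buffer holds a 640x480 image:
ssize_sketch(zeros(640, 720))   # (640, 480)
```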

Presumably the next step would be to add conversion to RGB and probably a direct uint32color for display (which is essentially a conversion to RGB24 or ARGB32). Those should not be hard at all. I'm not sure it's worth supporting a bunch of algorithms that work directly on yuv420p, as I think its main value is as a compression format, but let me know if you disagree.

timholy avatar Jun 07 '14 21:06 timholy

FYI I will work on adding this.

timholy avatar Jun 07 '14 22:06 timholy

Thanks Tim. libswscale (which is part of libav) does handle this conversion (and many others) now.

The main question is whether someone might not want to do the conversion up front and work with the yuv420p image directly. For example, in http://roxlu.com/2014/039/decoding-h264-and-yuv420p-playback, the yuv420p image is passed directly to OpenGL without conversion, and is converted/rendered on the GPU.

Since we can get the conversion to RGB (and other color spaces) from libswscale, I would say this doesn't need to be high priority.

kmsquire avatar Jun 08 '14 14:06 kmsquire

Nice to know. But this kind of stuff is essentially exactly what Images was designed to do: make it possible to work with whatever format is most convenient or highest performance. Definitely "direct handoff" between libav and the rendering library was something I envisioned from the start, but haven't yet implemented. If OpenGL does the conversion/rendering, this would in fact be almost trivial (aside from all the OpenGL stuff needed to open the rendering window, etc.)

timholy avatar Jun 08 '14 15:06 timholy

I definitely think it would be useful for Images.jl to be able to (at least minimally) manipulate this format. The main point of my original question (which could have been clearer) was whether Images.jl could handle planar modes with different-sized planes.

One question is, to what extent should Images.jl handle other video layouts? I picked out yuv420p because it's common in the videos I'm working with (and probably one of the most common these days). But libswscale [handles a number of others](https://github.com/libav/libav/blob/master/libavutil/pixfmt.h#L215-L257). It's actually a smaller list than I first thought, but it would still take a bit of work to support all of them. Thoughts?

EDIT: because each frame is a 2d array, but the number of actual pixels along the second coordinate is size(A, 2)*(2//3)

I missed this the first time (reading via email). I might be misunderstanding you, but for yuv420p, I think the u and v frames have half the number of coordinates along each dimension, not 2/3. Or am I mistaken?

kmsquire avatar Jun 08 '14 17:06 kmsquire

The main point of my original question (which could have been clearer) was whether Images.jl could handle planar modes with different sized planes.

Right, I got that, and that's what I meant about introducing ssize. I think it's best to have Image be a thin wrapper around an array, so I plan to (1) represent a YUV image as a 2d array in its native layout, (2) have size report the raw array's size information (it already does), but (3) introduce ssize as a more spatially-aware form of size. Alternatively, we could introduce an AbstractArray type with custom indexing that decodes the format.
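
A sketch of that alternative, an AbstractArray with custom indexing that decodes the planar layout on access (all names hypothetical; chroma is upsampled by nearest neighbor, and the three planes are assumed to be separate arrays):

```julia
# Illustrative planar yuv420p container. Indexing img[i, j] returns the
# (Y, U, V) triple for that pixel, mapping each 2x2 block of luma
# positions onto one chroma sample.
struct YUV420P{T} <: AbstractArray{NTuple{3,T},2}
    y::Matrix{T}
    u::Matrix{T}   # half-size chroma planes
    v::Matrix{T}
end

Base.size(img::YUV420P) = size(img.y)

function Base.getindex(img::YUV420P, i::Int, j::Int)
    ci, cj = (i + 1) ÷ 2, (j + 1) ÷ 2   # nearest-neighbor chroma index
    return (img.y[i, j], img.u[ci, cj], img.v[ci, cj])
end
```

The nice property of this approach is that size (and everything built on it) reports the true spatial extent, so no ssize is needed; the cost is that each access pays for the index arithmetic.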

One question is, to what extent should Images.jl handle other video layouts?

I don't yet know enough to answer that question properly.

I missed this the first time (reading via email). I might be misunderstanding you, but for yuv420p, I think the u and v frames have half the number coordinates along each dimension, not 2/3. Or am I mistaken?

I should have been clearer. IIUC, u and v each have 1/4 the number of pixels of the y channel. So if the actual image is x by y pixels (OK, it's confusing, but now by y I mean the height of the image), the full YUV-encoded array has size x by y + y/4 + y/4 = 3y/2. Thus if the encoded array is of size X by Y, the physical size is X by 2Y/3. That's where I got the 2//3 from.
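
A quick sanity check of that arithmetic (each of u and v has half the width and half the height of y, hence a quarter of the pixels):

```julia
# For a physical x-by-y image in yuv420p, count the total samples and
# lay them out x samples wide; the resulting buffer is 3y/2 rows tall.
function yuv420p_buffer_height(x::Int, y::Int)
    total = x * y + 2 * (x ÷ 2) * (y ÷ 2)  # Y plane + two quarter-size planes
    return total ÷ x                        # rows when laid out x wide
end

yuv420p_buffer_height(640, 480)   # 720, i.e. 3*480/2
```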

timholy avatar Jun 08 '14 19:06 timholy

I should have been clearer. IIUC, u and v each have 1/4 the number of pixels of the y channel. So if the actual image is x by y pixels (OK, it's confusing, but now by y I mean the height of the image), the full YUV-encoded array has size x by y + y/4 + y/4 = 3y/2. Thus if the encoded array is of size X by Y, the physical size is X by 2Y/3. That's where I got the 2//3 from.

Okay. That seems like an interesting direction to get that number from, since I don't think we would see X and Y. But I got the same thing from adding up the plane sizes. I believe libav gives back three pointers, one for each plane. I don't know if they're guaranteed to be contiguous.

kmsquire avatar Jun 09 '14 02:06 kmsquire

Didn't realize there was any risk they might not be contiguous. In that case, the right strategy is a new AbstractArray container type. I was literally going from this image on the YUV Wikipedia page.

timholy avatar Jun 09 '14 16:06 timholy

Maybe there's not--it might just be that there are 3 pointers for convenience. I'll try to check later.

kmsquire avatar Jun 09 '14 18:06 kmsquire

Hi guys, sorry to resurrect this from the dead, but I'm trying to build some kind of a video explorer. The main idea is not to have a full-fledged video player, but more something like a tool I can use to step x frames forwards and backwards, scroll through the video, and export click locations, timestamps, and simple stuff like that. In short, very similar to ImageView, but for videos. It sounds like you were on the road to building something like that here...

Any updates?

yakir12 avatar Jun 02 '20 19:06 yakir12