
Support yuv444p/yuvj444p in `to_ndarray`/`from_ndarray`

Open NPN opened this issue 3 years ago • 16 comments

NPN avatar May 21 '21 08:05 NPN

How are you getting that error? I generated the test video and used this code, which reads in the video, converts each frame to a yuv444p ndarray, then back to a yuv444p frame, and then writes out the frame:

import av

with av.open("testsrc.mp4", "r") as vin, av.open("out.mp4", "w") as vout:
    sout = vout.add_stream("h264", rate=25)
    sout.width = 320
    sout.height = 240
    sout.pix_fmt = "yuv444p"

    for frame in vin.decode(video=0):
        assert frame.format.name == "yuv444p"
        array = frame.to_ndarray()
        assert array.shape == (240, 320, 3)
        new_frame = av.VideoFrame.from_ndarray(array, "yuv444p")
        for packet in sout.encode(new_frame):
            vout.mux(packet)

    for packet in sout.encode():
        vout.mux(packet)

For me, the output out.mp4 looks the same as testsrc.mp4.


As for figuring out the pixel format, I think I searched and read until I knew enough to understand the terminology around pixel formats. Then I could reference FFmpeg's pixfmt.h header, which has some comments about each pixel format. For yuv420p10le, it says:

Notice that each 9/10 bits sample is stored in 16 bits with extra padding. ... planar YUV 4:2:0, 15bpp, (1 Cr & Cb sample per 2x2 Y samples), little-endian

So we know that this format is:

  • planar: Pixels are stored as YYY...UUU...VVV..., as opposed to packed/interleaved, which is YUVYUV...

  • YUV: as opposed to RGB

  • 4:2:0 chroma subsampling (Wikipedia has a section on the chroma subsampling notation)

  • 15bpp (bits per pixel): 4 real pixels are stored as 4 luma (Y) pixels and 2 chroma (Cr, Cb) pixels. Each format pixel is 10 bits, so that's (6 format pixels * 10 bits) / 4 real pixels = 15 bits per pixel

  • little-endian: This means that each 10-bit sample is packed into 16 bits as follows:

                   MSB      LSB
                   v        v
    10-bit pixel:  0123456789
    
                   Byte 1      Byte 2
                   v           v
    16-bit pixel:  23456789    ......01 
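
To make the packing concrete, here is a small numpy/struct sketch (not PyAV code; the sample value is made up) showing that a 10-bit sample stored little-endian in 16 bits puts the eight least significant bits in the first byte and the two most significant bits, plus padding, in the second:

```python
import struct

import numpy as np

sample = 0b1100110011            # an arbitrary 10-bit sample (819)
raw = struct.pack("<H", sample)  # its little-endian 16-bit container

# the low byte comes first in memory; the 6 padding bits sit in the high byte
assert raw[0] == (sample & 0xFF)
assert raw[1] == (sample >> 8)

# numpy reads such samples directly with an explicit little-endian uint16 dtype
decoded = np.frombuffer(raw, dtype="<u2")[0]
assert decoded == sample
```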
    

If you understand all the terminology, the description for each format should be enough. Unfortunately, I don't know of a resource that describes everything in one place, but as long as you search around, look at existing code, etc., you should be good.

NPN avatar Aug 26 '21 08:08 NPN

This is ultimately an API design decision, but I am not 100% sure that pyAV should convert yuv444p to channel-last internally. The p indicates a planar pixel format, so I would consequently have expected an axis order of [channel, height, width], not [height, width, channel]. The latter is what I would have expected from yuyv444 (which is a packed yuv444).
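
For reference, a numpy-only sketch (synthetic data, no PyAV involved) of the difference between the two memory layouts:

```python
import numpy as np

h, w = 2, 2
# planar layout: all Y samples first, then all U, then all V
planar = np.arange(3 * h * w, dtype=np.uint8).reshape(3, h, w)
assert list(planar.ravel()[: h * w]) == [0, 1, 2, 3]  # the whole Y plane comes first

# a channel-last *copy* interleaves the samples pixel by pixel: Y U V  Y U V ...
packed = np.moveaxis(planar, 0, -1).copy()
assert list(packed.ravel()[:3]) == [0, 4, 8]  # first pixel's Y, U, V
```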

Doing this internal conversion may result in interesting behavior. The final np.moveaxis in this PR creates a view into a planar buffer (not a copy!), so depending on which application consumes the data later on, you may see artifacts due to misaligned channels (@WyattBlue's comment). Here is a snippet to demonstrate the "problem":

>>> import av
>>> import numpy as np
>>> foo = av.open("testsrc.mp4")      
>>> frame = next(foo.decode(video=0))
>>> np_frame = np.stack([np.frombuffer(frame.planes[idx], dtype=np.uint8) for idx in range(3)]).reshape(3, frame.height, frame.width)
>>> np_frame2 = np.moveaxis(np_frame, 0, -1)
>>> np_frame2.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Another point is that np.stack will definitely create a copy of the frame's data, which could be avoided by directly using frame._buffer instead of splitting the data into planes and reassembling it later. This suggestion does come with a grain of salt, though, because I don't know whether FFmpeg guarantees contiguity of planes (though I see no obvious reason why it couldn't), and because I don't know the memory management model that pyAV uses for frames.
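
Both behaviors can be demonstrated with plain numpy on synthetic data (frame._buffer itself is not used here): np.moveaxis returns a non-contiguous view, while np.stack always copies its inputs:

```python
import numpy as np

# a toy planar buffer: (channel, height, width)
planar = np.arange(3 * 2 * 4, dtype=np.uint8).reshape(3, 2, 4)
view = np.moveaxis(planar, 0, -1)
assert not view.flags["C_CONTIGUOUS"]    # a strided view, not a copy
fixed = np.ascontiguousarray(view)       # materializes a contiguous channel-last copy
assert fixed.flags["C_CONTIGUOUS"]

# np.stack copies: mutating the result leaves the input untouched
a = np.zeros(4, dtype=np.uint8)
s = np.stack([a, a])
s[0, 0] = 7
assert a[0] == 0
```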


@WyattBlue You can find a good overview/breakdown of the various names used by ffmpeg here: https://ffmpeg.org/doxygen/trunk/pixfmt_8h_source.html (it's C code, but the comments are the important part).

yuv420p10le is a bit more involved than yuv444p, because it is subsampled and not 8-bit. If you want to get it into a similar format as discussed here ([height, width, channels]), you have to upsample the UV channels. (Fortunately, the data comes byte-aligned instead of being a bitstream)

>>> bar = av.open("testsrc.mp4")
>>> frame = next(bar.decode(video=0))
>>> # cast the planes to np.uint16 (they are 10-bit each in little endian, but have 2-byte alignment)
>>> y_plane, u_plane, v_plane = tuple(np.frombuffer(frame.planes[idx], dtype="<u2").reshape(frame.planes[idx].height, frame.planes[idx].width) for idx in range(3))
>>> # upsample U and V to match Y plane (this essentially creates a YUV444p image)
>>> np_frame = np.stack([y_plane, np.repeat(np.repeat(u_plane, 2, axis=0), 2, axis=1), np.repeat(np.repeat(v_plane, 2, axis=0), 2, axis=1)])
>>> # as per the above, if you want channel-last contiguous data you need to copy or use axis=-1 when stacking
>>> np_frame2 = np.moveaxis(np_frame, 0, -1).copy()

The above is a manual example of doing the extraction. You can alternatively use frame.reformat("yuyv444") and have FFmpeg do this conversion for you. While you're at it, why not reformat straight to RGB, or whichever your target format is?

FirefoxMetzger avatar Feb 02 '22 11:02 FirefoxMetzger

@FirefoxMetzger Thanks for the suggestion--it does make more sense to put channels first. I think I was blindly following the order of the RGB/BGR/etc formats without realizing that they were packed.

NPN avatar Feb 08 '22 07:02 NPN

@NPN Your last commit solved the problem I had before. This pull request is ready to be merged.

WyattBlue avatar Feb 08 '22 22:02 WyattBlue

@jlaine When you do your next round of issue reporting, could you let us know pyAV's policy on tests and coverage? Is it mandatory, and if so to what extent?

Side question: How does pyAV manage ffmpeg's frame object? Is it ref-counted and decommissioned by FFmpeg (afaik default), or are you handling the frame yourself? I.e., is it sane to do something like gray_image = np.frombuffer(frame.planes[0], dtype=np.uint8).reshape(...) in python land or do I have to worry that FFmpeg will garbage collect that buffer, because it doesn't know about numpy holding a pointer/reference to it? I'm asking because this has performance implications in this PR, but also in other places.

FirefoxMetzger avatar Feb 09 '22 06:02 FirefoxMetzger

Hi @FirefoxMetzger, yes, we do need unit tests for these new formats. I unfortunately haven't been able to wire up an automatic check for this, as I'm not sure how coverage works on Cython code. Could we also have this code rebased on top of main, as there have been some changes in VideoFrame?

I'm not sure I understand the question about memory ownership? As long as you hold a reference to the frame, I don't expect its buffer to disappear under your feet?

EDIT: I've rebased on top of main, but we do need tests

jlaine avatar Feb 25 '22 14:02 jlaine

I unfortunately haven't been able to wire up an automatic check for this, I'm not sure how coverage works on Cython code.

I guess you are aware of coverage.py support for cython (see here)? I know that it exists, but never actually set it up myself.

After you get the coverage xml, it should behave like any other project, i.e., upload to CodeCov (or another provider) and have it hook into CI to pass/fail the test suite based on coverage. This part I've set up for ImageIO, so I'm happy to help if you decide to go for it.

I'm not sure I understand the question about memory ownership? As long a you hold a reference to the frame I don't expect its buffer to disappear under your feet?

>>> import av
>>> import numpy as np
>>> frame = av.VideoFrame(16, 16, "rgb24")
>>> no_copy_array = np.frombuffer(frame.planes[0], dtype=np.uint8).reshape((16, 16, 3))
>>> del frame  # will this be dangerous?

Here, we created the image buffer via lib.av_image_alloc during the frame's construction, and we free it via av_freep during the frame's destruction. However, Python land still holds a reference to that buffer (no_copy_array) which FFmpeg doesn't know about, so it will happily free the buffer. At least I think that's what will happen, but I am not sure. I wonder if this could lead to undefined behavior, especially if we do the above on a decoded frame instead.
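
On the Python side, the concern can be sidestepped by copying; a numpy-only sketch, using a bytearray as a stand-in for a frame plane:

```python
import numpy as np

buf = bytearray(b"\x01" * 12)              # stand-in for a plane's memory
view = np.frombuffer(buf, dtype=np.uint8)
assert not view.flags["OWNDATA"]           # the array borrows buf's memory

safe = view.copy()                         # decouples the data from buf's lifetime
assert safe.flags["OWNDATA"]
del buf                                    # `safe` stays valid regardless
assert int(safe.sum()) == 12
```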

FirefoxMetzger avatar Feb 25 '22 15:02 FirefoxMetzger

Feel free to open a new discussion on the memory ownership question, I'd rather keep discussion here narrowly focused on the PR.

jlaine avatar Feb 25 '22 15:02 jlaine

I've been trying to put together a unit test for this, and the test would look like this:

    def test_ndarray_yuv444p(self):
        array = numpy.random.randint(0, 256, size=(3, 480, 640), dtype=numpy.uint8)
        frame = VideoFrame.from_ndarray(array, format="yuv444p")
        self.assertEqual(frame.width, 640)
        self.assertEqual(frame.height, 480)
        self.assertEqual(frame.format.name, "yuv444p")
        self.assertNdarraysEqual(frame.to_ndarray(), array)

=> AFAICT this would be the only format for which our ndarray has the channel along the first axis. For all the other formats it looks like (height, width, channel)?

jlaine avatar Feb 25 '22 22:02 jlaine

AFAICT this would be the only format for which our ndarray has the channel along the first axis. For all the other formats it looks like (height, width, channel)?

There are also yuv420p and yuvj420p, which are supported and technically channel-first; for these formats, however, it seems like we just get an array of lines:

>>> import av
>>> foo = av.VideoFrame(5*16, 5*16, "yuv420p")
>>> bar = foo.to_ndarray()
>>> bar.shape
(120, 80)

So I guess the answer is no you can get other layouts than (height, width, channel).
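
The (120, 80) shape is just the three planes stacked as rows of width 80; a quick arithmetic check (assuming even frame dimensions):

```python
h, w = 80, 80
y_samples = h * w                 # full-resolution luma plane
u_samples = (h // 2) * (w // 2)   # 4:2:0 chroma: half resolution in both axes
v_samples = u_samples
total = y_samples + u_samples + v_samples

# laid out as lines of width w, that is h * 3 / 2 rows
assert total == (h * 3 // 2) * w
assert (h * 3 // 2, w) == (120, 80)
```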

I wonder if subsampled formats like this should raise an error, promote to yuv444p, or return ~macro-block~ macro-pixel strided arrays instead of an array of lines, but that's a different discussion. Is it worth having, @jlaine? If so, I'll open a new issue.

FirefoxMetzger avatar Feb 26 '22 06:02 FirefoxMetzger

I have added tests based on @jlaine's comment above. Let me know if anything else is needed.

NPN avatar Mar 29 '22 23:03 NPN

As stated I'm still not comfortable with the array dimensions, this creates an element of surprise for the user: the other formats do (height, width, channels). Either convince me or change it :)

jlaine avatar Mar 30 '22 06:03 jlaine

As stated I'm still not comfortable with the array dimensions, this creates an element of surprise for the user: the other formats do (height, width, channels). Either convince me or change it :)

@jlaine What about yuv420p and yuvj420p? Both are supported and don't return (height, width, channels). (See https://github.com/PyAV-Org/PyAV/pull/788#issuecomment-1051648691 for an example)

FirefoxMetzger avatar Mar 30 '22 07:03 FirefoxMetzger

As stated I'm still not comfortable with the array dimensions, this creates an element of surprise for the user: the other formats do (height, width, channels). Either convince me or change it :)

@jlaine What about yuv420p and yuvj420p? Both are supported and don't return (height, width, channels). (See #788 (comment) for an example)

Those formats decimate the chrominance, so there is no way they can have a "normal" array, regardless of the order of dimensions :)

jlaine avatar Mar 30 '22 08:03 jlaine

Those formats decimate the chrominance, so there is no way they can have a "normal" array, regardless of the order of dimensions :)

@jlaine What would be the difference between copying pixels here (channel-first to channel-last) and copying pixels for yuv420p? (we do need a copy here, because https://github.com/PyAV-Org/PyAV/pull/788#pullrequestreview-732456980)

yuv420p shares a chroma pixel amongst multiple luma pixels, so making a new copy that duplicates values and serves the familiar [height, width, channels] format seems about as surprising to me as yuv444p (which is planar/channel-first) serving [height, width, channels] (which is channel-last). Personally, I would not copy in either case, because the result is really surprising and unexpected (to me). I ask for something that is channel-first (yuv444p), but get something that is channel-last instead; worse, yuv444 is - afaik - not even a valid format in FFmpeg. If we have to make a copy, I would at least like it to be consistent across YUV formats in that they all make the copy to [height, width, channels].
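
For illustration, a numpy-only sketch of what such a duplicating copy would do for 4:2:0 chroma (synthetic values; nearest-neighbour upsampling as in the earlier yuv420p10le snippet):

```python
import numpy as np

# a 2x2 chroma plane from a 4:2:0 frame; each value is shared by a 2x2 luma block
u = np.array([[10, 20],
              [30, 40]], dtype=np.uint8)

# nearest-neighbour upsample to the 4x4 luma resolution, duplicating every value
up = np.repeat(np.repeat(u, 2, axis=0), 2, axis=1)
assert up.shape == (4, 4)
assert (up[:2, :2] == 10).all() and (up[2:, 2:] == 40).all()
```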

FirefoxMetzger avatar Mar 30 '22 09:03 FirefoxMetzger

For channels first, the benefit is faithfulness to the underlying pixel format. So, code which operates on the raw ndarray data and expects a planar ordering will work correctly (e.g. WyattBlue's example above).

For channels last, the benefit is consistency with the other non-chroma subsampled formats. So, if you just want to get/set pixels using ndarray indexing, then you don't have to worry about whether the underlying format is planar or packed. Also, Pillow seems to always use channels last (though their YCbCr is based on JPEG, anyway).

~~Honestly, I'm not sure which is better. Unless someone else has an opinion, we can go with channels last, if that's what you prefer. I can just go back to the old code (after changing it to make a copy instead of a view, since np.moveaxis seems to cause problems?)~~

Edit: Somehow I missed FirefoxMetzger's comment above when writing this response. Those arguments seem pretty good, so I suppose it's up to what jlaine thinks.

NPN avatar Mar 30 '22 09:03 NPN

@NPN @FirefoxMetzger @jlaine I have merged these changes as-is in my fork.

WyattBlue avatar Oct 08 '23 05:10 WyattBlue

Nice!

@WyattBlue Are you planning on forking PyAV and maintaining it, or is this more of a "one-off" thing?

FirefoxMetzger avatar Oct 12 '23 18:10 FirefoxMetzger

Nice!

@WyattBlue Are you planning on forking PyAV and maintaining it, or is this more of a "one-off" thing?

I definitely plan on maintaining it. I'm already using the fork for my project, auto-editor.

WyattBlue avatar Oct 14 '23 20:10 WyattBlue

Thanks @NPN , it's merged!

jlaine avatar Oct 30 '23 15:10 jlaine