mpv
GL_OUT_OF_MEMORY [vo/gpu-next] Failed creating LUT texture
- mpv version: latest git master build with gpu-next
- Linux Distribution and Version: Fedora 36
- Source of the mpv binary: git master with gpu-next
- If known which version of mpv introduced the problem:
- Window Manager and version: Wayland
- GPU driver and version: TigerLake-LP GT2 [Iris Xe Graphics], Kernel driver in use: i915
Reproduction steps
use 512x512x512 3D lut
--icc-3dlut-size=
Expected behavior
Happens with a 512x512x512 cube LUT; works fine at 256. I had been using 512x512x512 for the last couple of months before upgrading to git master and gpu-next. There is lots of unused RAM.
Actual behavior
Black screen; mpv hangs.
Log file
"[vo/gpu-next] gl_tex_create: texture: OpenGL error: GL_OUT_OF_MEMORY [vo/gpu-next] Failed creating LUT texture"
The major difference between `gpu` and `gpu-next` is that the former generates 16-bit integer 3DLUTs while the latter generates 32-bit float 3DLUTs. This doubles the VRAM requirement from 1GB to 2GB. This is not an intentional change, and mostly a result of internal abstractions making that more convenient. But in testing such extreme sizes I also noticed that the `gpu-next` generation is almost an order of magnitude slower as a consequence, and also not cached. I'll try and fix both those issues.
It's also worth pointing out that `gpu-next` can generate two 3DLUTs, one for the file and one for the display. So it might be worth keeping that in mind: your VRAM usage may unexpectedly double a second time if you're playing something with an embedded ICC profile and have a display ICC profile simultaneously.
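As a rough check of the 1GB and 2GB figures, the arithmetic can be sketched as follows. This assumes four channels per texel (RGBA) and ignores any driver padding or alignment; `lut_bytes` is a hypothetical helper for illustration, not an mpv or libplacebo function.

```python
GIB = 1024 ** 3


def lut_bytes(size, channels=4, bytes_per_channel=2):
    """Raw bytes for a size x size x size 3D texture (no padding assumed)."""
    return size ** 3 * channels * bytes_per_channel


int16_lut = lut_bytes(512, bytes_per_channel=2)    # 16-bit integer channels
float32_lut = lut_bytes(512, bytes_per_channel=4)  # 32-bit float channels

print(int16_lut / GIB)        # 1.0
print(float32_lut / GIB)      # 2.0
print(2 * float32_lut / GIB)  # 4.0 (one LUT for the file, one for the display)
```

The last line corresponds to the case where both an embedded profile and a display profile are in use at the same time.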
Okay, some more poignant notes:
- I submitted this MR changing the 3DLUTs from `rgba32f` to `rgba16`, thus halving the VRAM usage as expected.
- This does not, however, substantially affect 3DLUT generation times.
- The only reason the `vo=gpu` generation times were so fast was because `vo=gpu` was incorrectly generating large 3DLUTs, resulting in an effective precision closer to 49x49x49. So your 512x512x512 3DLUT never did anything to begin with, at least not compared to a smaller and more reasonable LUT size. I have submitted this PR to fix this issue, which (as a natural consequence) makes the 512x512x512 generation take 30 seconds with just `vo=gpu`, too.
- `libplacebo` actually has internal logic for picking the best 3DLUT size according to the type of ICC profile (e.g. picking smaller sizes for easily characterized ones, and larger ones for LUT-based display profiles). However, this heuristic is not active in mpv because of the way the `--icc-3dlut-size` setting is designed: it always overrides the auto-detected 3DLUT size. I'm considering changing the way this is handled, anyway.
So here's my conclusion:
- It was failing for you because `gpu-next` uses double the precision, unnecessarily. This is an anti-feature which I agree should be fixed. (Wasting VRAM for no reason is never justified.)
- Your LUT size is way too large, to the point of being completely useless, and even with these bugs fixed, you will not want to be waiting 30 seconds for the ~1GB 3DLUT to be generated on playback. (Though admittedly this is only an issue with `vo=gpu-next` currently, as a result of ICC 3DLUT caching not being implemented yet.)
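To put the 30-second figure in perspective, here is a back-of-the-envelope sketch. It assumes generation time scales linearly with the number of grid points, which nobody in this thread has measured and which may not hold in practice; `estimated_seconds` is a hypothetical helper, and the only measured input is the 30 seconds quoted for 512x512x512.

```python
def estimated_seconds(size, ref_size=512, ref_seconds=30.0):
    """Scale the reference time by the ratio of grid points (assumed linear)."""
    return ref_seconds * size ** 3 / ref_size ** 3


for size in (49, 65, 256, 512):
    points = size ** 3
    print(f"{size}x{size}x{size}: {points:,} points, "
          f"~{estimated_seconds(size):.2f} s")
```

Under that assumption, sizes near 49x49x49 or 65x65x65 would take a small fraction of a second, while 256x256x256 would take about an eighth of the 512x512x512 time.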
> you will not want to be waiting 30 seconds for the ~1GB 3DLUT to be generated on playback
Does gpu without -next use rgba16? Anyway, is this 30 seconds on gpu-next without caching and on gpu? Can Little CMS really not be optimised further? Also, this could be generated with CUDA, which would be faster. Doom9 mentioned that CUDA is a possibility for generating this. Of course this would have to be implemented in Little CMS, IMHO.
Oh, and also, what is the point of generating a 512^3 cube if you present on an 8-bit display, for example? Does this really make it better? Shouldn't you calculate the optimal 3DLUT for the bit depth of the display?
> Does gpu without -next use rgba16?
Yes
> Anyway, is this 30 seconds on gpu-next without caching and on gpu?
Yes, no
> Little CMS really cannot be optimised further?
Don't ask me
> Also, this thing can be generated with CUDA, it will be faster.
Patches welcome
> Oh, and also what is the point of generating 512^3 [cube]
None
> Should not you calculate optimal 3DLUT for bitness of display?
No, you should calculate the optimal 3DLUT for the internal resolution of the ICC profile (typically 49x49x49 or 65x65x65)
The actual bugs here seem resolved so closing.