mpv icon indicating copy to clipboard operation
mpv copied to clipboard

screenshot: long filename handling improvements

Open rtldg opened this issue 3 years ago • 8 comments

For screenshot filenames, it was possible for the basename to be longer than what filesystems generally support.

On Linux, this is 255 bytes. On Windows, this is 255 wchar_t units. Thus basenames are truncated to under 255 bytes so that the basename + extension are <= 255 with truncate_long_base_filename. It also makes sure not to produce an invalid UTF-8 codepoint in the filename.

For testing, filling screenshot-template= with 3-byte or 4-byte UTF-8 codepoints is best. Such as "ウ" (3-byte) or "🌂" (4-byte). Example: 84 * strlen("ウ") + strlen(".jpg") == 256 The last "ウ" is removed and the basename string will be filled with 83 "ウ" characters and ".jpg" totalling 253 bytes.

I only tested on Windows 10 21H2 x64 and also here's some copy & paste screenshot-templates

screenshot-template="ウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウ"
screenshot-template="aaa🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂"
screenshot-template="aa🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂"
screenshot-template="a🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂"

rtldg avatar Apr 04 '22 15:04 rtldg

~~I'll have to correct the trim_invalid_utf8 function as I realized I wrote it for something else that had different input string guarantees and it only trims the last codepoint if it's missing the last byte of the sequence.~~

~~edit: Backtracking the string some and using bstr_parse_utf8_code_length to see if it goes past the NUL terminator is probably what I'll do for it.~~

That's done

rtldg avatar Apr 04 '22 18:04 rtldg

I'd take the UTF-8 fix without the win32-specific changes. There's #12119 for that now.

sfan5 avatar Aug 20 '23 18:08 sfan5

Forced pushed to remove the Win32 changes and to make a couple of the screenshot file writing functions use mp_fopen to benefit from #12119 since they were using fopen previously.

rtldg avatar Aug 21 '23 05:08 rtldg

After an IRC discussion it turned out that choosing the right amount to truncate at is hard because it's not straightforward bytes on every OS. The maximum is 255 bytes on Linux and 255 UTF-16 (or UCS-2?) codepoints on Windows and macOS. However truncating on 255*3 can also be too late on Windows since 300 UTF-8 bytes can contain more than 255 UTF-16 codepoints...

I see three options:

  1. drop this PR and continue to consider it user error if he configures a too long template
  2. truncate of 255*3 on win32 and 255 elsewhere (and hope for the best)
  3. truncate to 255 (and restrict win32 more than necessary)

sfan5 avatar Aug 22 '23 20:08 sfan5

drop this PR and continue to consider it user error if he configures a too long template

Subtitle text via %{sub-text} can leave that out of the user's hands at times.

truncate of 255*3 on win32 and 255 elsewhere (and hope for the best)

Converting the basename with mp_from_utf8, truncating the new wchar_t*, and then back with mp_to_utf8 can be done for a more precise truncation to 255 wchar_ts. I had whipped up some code for this (in the details/spoiler below) but haven't given it enough testing since I think 3. [truncating to 255 UTF-8 bytes] is best for interoperability between operating systems (and network file shares). I even think it'd be worthwhile to use the same ILLEGAL_FILENAME_CHARS instead of for WIN32 and not.

edit: doesn't append the basename back to the path or anything so would need to be edited to do that

#ifdef _WIN32
static char *truncate_long_base_filename(char *s, const size_t ext_len)
{
    const size_t max_units = 255 - (ext_len + 1); // +1 for '.'
    char *basename = mp_basename(s);
    wchar_t *w = mp_from_utf8(NULL, basename);
    size_t len = wcslen(w);

    if (len <= max_units) {
        talloc_free(w);
        return s;
    }

    talloc_free(s);

    if (IS_HIGH_SURROGATE(w[max_units]))
        w[max_units-1] = 0; // chop off entire surrogate pair
    else
        w[max_units] = 0; // either not a pair or we're chopping off a 'low'

    char *res = mp_to_utf8(NULL, w);
    talloc_free(w);
    return res;
}
#else
/* "regular" truncate_long_base_filename here... */
#endif

rtldg avatar Aug 22 '23 21:08 rtldg

Subtitle text via %{sub-text} can leave that out of the user's hands at times.

You have to admit that this is a niche usecase. Users may very well write a script to correctly take screenshots named after subtitle text if they want to do that.

Converting the basename with mp_from_utf8, truncating the new wchar_t*, and then back with mp_to_utf8 can be done [...]

Sure, but this is a good example for platform-specific complicated support code that I'd like to avoid.

I even think it'd be worthwhile to use the same ILLEGAL_FILENAME_CHARS instead of for WIN32 and not.

Terrible idea IMO.

sfan5 avatar Aug 23 '23 14:08 sfan5

Subtitle text via %{sub-text} can leave that out of the user's hands at times.

You have to admit that this is a niche usecase. Users may very well write a script to correctly take screenshots named after subtitle text if they want to do that.

That's much more work than just throwing %{sub-text} into screenshot-template.

Converting the basename with mp_from_utf8, truncating the new wchar_t*, and then back with mp_to_utf8 can be done [...]

Sure, but this is a good example for platform-specific complicated support code that I'd like to avoid.

I'd like to avoid it too especially since it could cause file access issues if you were to have a filename on NTFS that'd be longer than 255 UTF-8 bytes in Linux.

I even think it'd be worthwhile to use the same ILLEGAL_FILENAME_CHARS instead of for WIN32 and not.

Terrible idea IMO.

Networked file shares and also file access issues from Linux -> Windows again.

rtldg avatar Aug 23 '23 15:08 rtldg

Honestly, I think long/unsupported filenames should be rejected with an error for user to act on. Truncating it implicitly doesn't really help anyone.

kasper93 avatar May 07 '24 09:05 kasper93