
[BUG] Encoding issues

Open Sp3EdeR opened this issue 3 years ago • 7 comments

Describe the bug

When assigning a directory whitelist in the configuration options, only ASCII paths work reliably. Videos inside a non-ASCII path are not recognized for scrobbling. It looks like the wrong string encoding (UTF-8 vs. cp1250) is chosen when the path is decoded.

Desktop (please complete the following information):

  • OS and Version: Windows 11 22000.856
  • Python Version: 3.10.2
  • Player and Version: MPC-HC 1.9.21.1 (slightly modified personal fork, but generally the same)

To Reproduce

  • CMD is used with codepage cp-1250.
  • Actual directory path: E:\Letöltések
  • config.yaml contains fileinfo/whitelist: "E:\\Let\xF6lt\xE9sek" (added with the config tool; other, ASCII-only paths are added single-quoted and without escaped characters).

  1. Open a video file from within E:\Letöltések
  2. See the error in the log file

Additional note: I also manually tried to add the "E:\\Let\xc3\x83\xc2\xb6lt\xc3\x83\xc2\xa9sek" and 'E:\Letöltések' whitelist strings, but that did not help.

Additional note 2: The charmap error is the same even without using a whitelist at all.

Log file relevant parts

2022-08-26 23:17:16,698 - DEBUG - mpc-hc - utils - System encoding scheme: 'cp1250'
2022-08-26 23:17:16,699 - DEBUG - mpc-hc - utils - UTF8: b'E:\\Let\xc3\x83\xc2\xb6lt\xc3\x83\xc2\xa9sek\\<video_file_name>'
2022-08-26 23:17:16,699 - DEBUG - mpc-hc - utils - Error while logging 'charmap' codec can't encode character '\xc3' in position 6: character maps to <undefined>
2022-08-26 23:17:16,699 - INFO - mpc-hc - file_info - File path not in whitelist.

Sp3EdeR avatar Aug 26 '22 22:08 Sp3EdeR

Through further testing, it seems that the whitelisted-path check is case sensitive on Windows. When I add the paths with both an upper-case and a lower-case drive letter, the whitelist tester passes. The actual scrobbling still does not work correctly, though.

Sp3EdeR avatar Aug 28 '22 10:08 Sp3EdeR

whitelisted path checking is case sensitive on Windows

Yep, that is definitely true. We literally just do a path.startswith(directory) check.
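
For illustration only (this is not the project's code and the helper name is made up), a normalized, case-insensitive prefix check on Windows could look like this:

import os.path

def path_in_whitelist(path: str, whitelist: list[str]) -> bool:
    # os.path.normcase lower-cases and normalizes separators on Windows,
    # so "e:/letöltések" and "E:\Letöltések" compare equal.
    norm = os.path.normcase(os.path.normpath(path))
    return any(
        norm.startswith(os.path.normcase(os.path.normpath(folder)))
        for folder in whitelist
    )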

The main issue at hand is that we fail to properly decode the file name we get from the player. On Windows, the following code runs every time we see a file path (before it goes through the whitelist logic): https://github.com/iamkroot/trakt-scrobbler/blob/b2b4d440d592b0f0c0d796aa0883874f8efe82c3/trakt_scrobbler/utils.py#L105-L119

We get the preferred encoding (cp1250 in your case) and try to convert the given raw path into UTF-8 (the bare decode on line 109 defaults to UTF-8). For some reason, the raw string given by the player does not conform to the cp1250 encoding, so we fail and land in the exception handler, where we just start logging things. There we fail again at the "System" log line, since the cp1250 decoding raised an exception. As you can see, the raw file path has a c3 byte in it, which definitely shouldn't be there. Maybe check the MPC source code for this?
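
For what it's worth, that byte pattern looks like classic double encoding: UTF-8 bytes that were decoded as Latin-1 and then encoded as UTF-8 again. A quick, purely illustrative check reproduces the exact sequence from the log:

# Encode the real path as UTF-8, mis-decode it as Latin-1 (ISO-8859-1),
# then encode it as UTF-8 again ("mojibake" / double encoding).
raw = "E:\\Letöltések".encode("utf-8").decode("latin-1").encode("utf-8")
print(raw)  # b'E:\\Let\xc3\x83\xc2\xb6lt\xc3\x83\xc2\xa9sek' - matches the log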

I'm pretty sure most of this logic is wrong, but I'm not well versed enough with encodings to get this fixed once and for all. :/

iamkroot avatar Aug 28 '22 15:08 iamkroot

The MPC-HC 1.9.21.1 web interface / variables page returns the following HTML document in UTF-8 encoding:

<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8">
        <title>MPC-HC WebServer - Variables</title>
        <link rel="stylesheet" href="default.css">
        <link rel="icon" href="favicon.ico">
    </head>
    <body class="page-variables">
        <p id="file">Példa Videó Név Ékezetekkel.avi</p>
        <p id="filepatharg">E:%5cLet%c3%b6lt%c3%a9sek%5cP%c3%a9lda%20Vide%c3%b3%20N%c3%a9v%20%c3%89kezetekkel.avi</p>
        <p id="filepath">E:\Letöltések\Példa Videó Név Ékezetekkel.avi</p>
        <p id="filedirarg">E:%5cLet%c3%b6lt%c3%a9sek</p>
        <p id="filedir">E:\Letöltések</p>
        <p id="state">1</p>
        <p id="statestring">Szünet</p>
        <p id="position">442</p>
        <p id="positionstring">00:00:00</p>
        <p id="duration">7786520</p>
        <p id="durationstring">02:09:47</p>
        <p id="volumelevel">100</p>
        <p id="muted">0</p>
        <p id="playbackrate">1</p>
        <p id="size">695 MB</p>
        <p id="reloadtime">0</p>
        <p id="version">1.9.21.1</p>

    </body>
</html>

Since the head of the document declares utf-8, the player's output seems correct to me.

The system itself does use the cp1250 encoding by default (e.g. in the terminal, but that can be changed with chcp 65001 for example).

Sp3EdeR avatar Aug 28 '22 16:08 Sp3EdeR

I think the solution here might be that cleanup_encoding should only be executed on strings sourced from the terminal. Strings sourced from HTTP should not be decoded using the system encoding, but with the page's declared encoding, falling back to encoding detection (UTF-8 and UTF-16/UTF-32 LE/BE are nearly fail-safe to detect), and only then to the default system encoding. A rough sketch of that order follows below.
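
Purely illustrative (the function name, helpers and fallback order are mine, not the project's code):

import codecs
import locale

# Proposed decode order for HTTP-sourced bytes:
# declared charset -> BOM detection -> strict UTF-8 -> system default.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"), (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"), (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
]

def decode_http_body(data: bytes, declared: str | None = None) -> str:
    if declared:                         # charset from Content-Type or <meta>
        return data.decode(declared)
    for bom, enc in _BOMS:               # check the UTF-32 BOMs before UTF-16
        if data.startswith(bom):
            return data.decode(enc)
    try:
        return data.decode("utf-8")      # strict UTF-8 rarely false-positives
    except UnicodeDecodeError:
        return data.decode(locale.getpreferredencoding(False), errors="replace")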

I don't have MPC-HC 1.7.13 for testing, so I can't tell whether that ancient version has any web interface bugs that might affect the solution.

Sp3EdeR avatar Aug 28 '22 16:08 Sp3EdeR

Generic encoding information that may or may not be helpful - just in case...

  • cp1250 is a Windows-specific single-byte encoding that uses ASCII for the first 128 characters and Central European characters for the remaining 128. Since it is Windows-specific, it is practically never used on the web. Windows does, however, configure its terminal shell to use such single-byte encodings, which means stdin, stdout and stderr may be encoded that way.
  • HTTP parsers can only assume ASCII text by default, but the protocol provides the Content-Type header, which specifies the MIME type of the body and, for textual content types, may carry an optional charset parameter naming the encoding. The default is ISO-8859-1 (Latin-1), and it can be overridden like Content-Type: text/html; charset=utf-8. The MPC-HC interface does not set a charset here.
  • HTML documents (transmitted as the bodies of HTTP messages) can declare the document's encoding with a meta tag's charset attribute. MPC-HC uses this tag and specifies <meta charset="utf-8">. Note that if the HTTP Content-Type header also declares a charset, the header takes precedence over this tag.

UTF-8, ISO-8859-* and the Windows code pages are all ASCII-based, so as long as a text contains only ASCII characters, it encodes to exactly the same bytes in all of them. Since HTTP headers and HTML markup are ASCII-only, it is possible to start parsing the document as, for example, UTF-8 and look for either a Content-Type charset or an HTML meta charset. If such a declaration is found and it is not utf-8, the decoder can be switched to the declared charset at that point and the remainder of the document will decode successfully.
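
A minimal sketch of that idea (the function name and regex are mine, purely illustrative):

import re

_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)

def decode_html(body: bytes, header_charset: str | None = None) -> str:
    # The markup itself is ASCII, so the charset declaration can be found
    # in the raw bytes before committing to a decoder.
    match = _META_CHARSET.search(body)
    charset = header_charset or (match.group(1).decode("ascii") if match else "utf-8")
    return body.decode(charset, errors="replace")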

UTF-16 and UTF-32 are rarely used in either the Windows terminal or on the web. In UTF-16, each character is 2 or 4 bytes long; autodetection is usually easy because an ASCII character always has a null byte next to it. UTF-32 encodes every character into 4 bytes and is easily detected by the null-byte triplets throughout the data. UTF-8 is the most commonly used encoding: ASCII characters are encoded exactly as in ASCII, everything else becomes multiple bytes, and the rarer the character, the more bytes it takes.
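
A rough null-byte heuristic for the BOM-less UTF-16/UTF-32 case (illustrative only; the thresholds are arbitrary):

def sniff_wide_encoding(sample: bytes) -> str | None:
    # Mostly-ASCII text in UTF-16/UTF-32 has NUL bytes in predictable places.
    if len(sample) < 4:
        return None
    if sample.count(0) > len(sample) // 2:        # NUL triplets -> UTF-32
        return "utf-32-le" if sample[0] != 0 else "utf-32-be"
    half = len(sample) // 2
    if sample[1::2].count(0) > 0.7 * half:        # 'A\x00B\x00...' -> UTF-16-LE
        return "utf-16-le"
    if sample[0::2].count(0) > 0.7 * half:        # '\x00A\x00B...' -> UTF-16-BE
        return "utf-16-be"
    return None                                   # likely UTF-8 or an 8-bit codepage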

Sp3EdeR avatar Aug 28 '22 16:08 Sp3EdeR

Thing is, Python's bare encode/decode calls default to UTF-8. So in this case, when we fail to decode E:\\Let\xc3\x83\xc2\xb6lt\xc3\x83\xc2\xa9sek\\ as cp1250, we do fall back to UTF-8 (by virtue of not doing anything). The problem is that this isn't a valid UTF-8 sequence either. We need to figure out who set this value: the player, the web server, or our parser. Maybe using response.text is incorrect: https://github.com/iamkroot/trakt-scrobbler/blob/abf827e59f0be5403b1da219c8ae2baccc22c4c9/trakt_scrobbler/player_monitors/mpc.py#L37-L39
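
A quick way to see what requests decided here (illustrative; the default MPC-HC web interface port is assumed to be 13579):

import requests

# Variables page of the MPC-HC web interface (port is the player's default).
resp = requests.get("http://127.0.0.1:13579/variables.html")
print(resp.headers.get("Content-Type"))  # no charset parameter from MPC-HC
print(resp.encoding)                     # what requests will use for resp.text
print(resp.apparent_encoding)            # what charset detection would guess
print(resp.content[:80])                 # the raw, undecoded bytes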

iamkroot avatar Sep 06 '22 02:09 iamkroot

The issue is that MPCMon's response parsing handles only the encoding declared in the HTTP headers here.

The requests Session object knows only the HTTP protocol, so it looks for a charset in the Content-Type header. If it finds one, great (in theory the HTML body could also declare a charset, but requests never looks at the body). Otherwise it defaults to ISO-8859-1, which is quite possibly wrong, and as a result response.text is decoded incorrectly.

In this case, the HTML document contains <meta charset="utf-8">, so instead of response.text, response.content.decode('utf-8') should be used. (response.text should still be usable for finding the meta tag in the HTML document, however, since the markup itself is ASCII.)
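
Roughly like this (illustrative sketch, not a tested patch; sess and url are placeholders):

# Decode the body explicitly instead of relying on requests' ISO-8859-1 fallback.
resp = sess.get(url)
html = resp.content.decode("utf-8")  # MPC-HC declares utf-8 via <meta charset>

# Alternatively, override the guessed encoding so resp.text decodes correctly:
resp.encoding = "utf-8"
parsed_text = resp.text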

I managed to fire up debugging for this code, so I will provide a fix for these issues soon...

Sp3EdeR avatar Sep 06 '22 09:09 Sp3EdeR