Core dump due to Invalid UTF-8 error
Information:
System: Arch Linux
Install method: AUR sioyek-git
Sioyek version: 2.0.0.r1041.g280fdb6-1
Libmupdf version: 1.26.1-1
Error description: On attempting to open this file, Sioyek core dumps due to an "Invalid UTF-8" error. I have not been able to find other files that produce this error; they open just fine. The problem file opens just fine using other PDF viewers.
Error message:
default_config_path: /etc/sioyek/prefs.config
default_keys_path: /etc/sioyek/keys.config
user_config_path: [ 0 ] /etc/xdg/sioyek/prefs_user.config
user_config_path: [ 1 ] /home/shadi/.config/sioyek/prefs_user.config
user_keys_path: [ 0 ] /etc/xdg/sioyek/keys_user.config
user_keys_path: [ 1 ] /home/shadi/.config/sioyek/keys_user.config
database_file_path: /home/shadi/.local/share/sioyek/test.db
local_database_file_path: /home/shadi/.local/share/sioyek/local.db
global_database_file_path: /home/shadi/.local/share/sioyek/shared.db
tutorial_path: /usr/share/sioyek/tutorial.pdf
last_opened_file_address_path: /home/shadi/.local/share/sioyek/last_document_path.txt
shader_path: /usr/share/sioyek/shaders
SIOYEK
Warning: key defined in /etc/sioyek/keys.config:288 overwritten by /home/shadi/.config/sioyek/keys_user.config:3. Overriding command: q: replacing quit with close_window
Warning: key defined in /etc/sioyek/keys.config:255 overwritten by /home/shadi/.config/sioyek/keys_user.config:5. Overriding command: s: replacing external_search with fit_to_page_width
Warning: key defined in /etc/sioyek/keys.config:227 overwritten by /home/shadi/.config/sioyek/keys_user.config:13. Overriding command: <tab>: replacing goto_portal with goto_toc
terminate called after throwing an instance of 'utf8::invalid_utf8'
what(): Invalid UTF-8
zsh: IOT instruction (core dumped) sioyek hobsbawm-eric-the-invention-of-tradition.pdf
Steps taken: I could not find any similar issues on GitHub. I could not attempt a fix myself because I don't understand this error.
Can't reproduce the issue. I assume this is because of incompatible mupdf versions (as is tradition); the development branch uses mupdf 1.25.
I had the same thing happen recently on the dev branch and get the same error with your file. I think the problem is technically with your pdf rather than sioyek (or mupdf) per se. Running mutool show <your pdf> outline shows lots of null bytes in the toc, which I'm pretty sure shouldn't be there. Same for the pdf I had problems with. Both files cause Document::create_toc_tree / Document::convert_toc_tree to fail when calling utf8_decode. Zathura opens both files but displays weird hash-looking values instead of a proper toc. Removing the toc or replacing it with a non-broken one stops the crashing.
@ahrm: might just need an extra (try)/catch block somewhere appropriate so bad utf8 or whatever is wrong with the existing toc doesn't crash the whole app?
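A minimal sketch of the kind of try/catch guard suggested here, not sioyek's actual code. Everything in it is hypothetical: decode_utf8_strict is a self-contained stand-in for a throwing decoder such as utf8_decode, InvalidUtf8 stands in for utf8::invalid_utf8, and toc_title_or_placeholder is an invented helper showing where a catch could go so that one bad toc entry does not terminate the process.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Stand-in for utf8::invalid_utf8 (hypothetical, for illustration only).
struct InvalidUtf8 : std::runtime_error {
    InvalidUtf8() : std::runtime_error("Invalid UTF-8") {}
};

// Hypothetical strict decoder: returns code points, throws on malformed input.
std::u32string decode_utf8_strict(const std::string& in) {
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        if (b < 0x80) { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else throw InvalidUtf8();                 // stray continuation or bad lead byte
        if (i + len > in.size()) throw InvalidUtf8();  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) throw InvalidUtf8();
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates and out-of-range values.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw InvalidUtf8();
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical caller: decode one toc title, falling back to a placeholder
// instead of letting the exception escape and abort the program.
std::u32string toc_title_or_placeholder(const std::string& raw) {
    try {
        return decode_utf8_strict(raw);
    } catch (const InvalidUtf8&) {
        return U"[unreadable title]";
    }
}
```

With a guard like this, a well-formed title is returned unchanged and a malformed one becomes the placeholder text; the rest of the toc can still be built.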
I can't reproduce the issue so I don't know where the appropriate place would be.
> Can't reproduce the issue. I assume this is because of incompatible mupdf versions (as is tradition); the development branch uses mupdf 1.25.
I would have assumed a libmupdf incompatibility issue if Sioyek wouldn't open any files at all, and I would have assumed file corruption if this file was not openable using any PDF viewer. But neither is the case, so I assumed it's a bug.
I agree with @aw-cloud that something should be done to avoid a wholesale crash.
@ShadiZade Does @aw-cloud's PR fix your issue here: https://github.com/ahrm/sioyek/pull/1412?
@ahrm I don't know how to build from an unmerged pull request, so I'll assume it works. I'll report back here after it's merged.
I'm not going to merge it until I know it actually fixes the problem, that's why I am asking.
Alright, give me a minute to figure it out
You can clone my fork directly with git clone https://github.com/aw-cloud/sioyek.git -b development and then build as normal. Or one of the suggestions on this Stack Overflow post should work.
I did test that it works on my machine of course but I don't blame @ahrm for being cautious.
doing that now
./build_linux.sh is giving me the following error: make: *** No targets specified and no makefile found. Stop.
Sorry, I'm a bit of a beginner.
EDIT: per #730, passed --recursive and currently building with no issues
EDIT: Had to set QMAKE="/usr/bin/qmake6" right after the nested if in build_linux.sh to avoid the Project ERROR: Unknown module(s) in QT: texttospeech. Build successful.
EDIT: For future reference, to build from the dev branch, the following commands should work:
git clone --recursive https://github.com/ahrm/sioyek.git -b development
cd sioyek
QMAKE=qmake6 ./build_linux.sh
The final build is in ./build/sioyek.
Can confirm PR solves the issue. Thank you @aw-cloud and @ahrm and sorry for the trouble.
This issue has reappeared with the exact same error message, this time with this file. The issue persists even with the toc removed, and also when individual pages are opened after pdftk burst. This is not a mupdf compatibility issue because sioyek opens other files just fine, and it is not a corruption issue because zathura opens this file just fine.
sioyek version: (latest AUR sioyek-git) 2.0.0.r1064.g143e5d51-1
libmupdf version: 1.26.10-1
Pinging @aw-cloud
Your file has invalid utf8, I think in the metadata title field. On opening the document, detect_paper_name is called, which calls utf8_decode; neither checks whether the input is actually valid utf8, so you get an exception and a crash. As before, utf8_decode can throw, so the function calling it should do error handling.
Can't utf8_decode be made to be able to safely deal with invalid utf8?
Yes but it makes more sense to let utf8_decode fail when it's given invalid input. If utf8_decode handled errors you'd have to decide on something sensible to return in those cases. What's sensible (if anything at all) probably depends most on what the caller is using the return value for, so the caller might as well be the one to handle the error.
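For comparison, here is a sketch of what a decoder that never fails would have to look like. It is an assumption-laden illustration, not anything in sioyek or mupdf: decode_utf8_lossy is a hypothetical name, and the policy it picks (emit U+FFFD for each offending byte, then resynchronize on the next byte) is only one of several possible choices, which is exactly the decision described above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical lenient decoder: never throws. Each byte that cannot start or
// complete a valid sequence is replaced by U+FFFD (the replacement character).
std::u32string decode_utf8_lossy(const std::string& in) {
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        // Sequence length implied by the lead byte; 0 means "not a lead byte".
        std::size_t len = b < 0x80 ? 1
                        : (b & 0xE0) == 0xC0 ? 2
                        : (b & 0xF0) == 0xE0 ? 3
                        : (b & 0xF8) == 0xF0 ? 4 : 0;
        char32_t cp = len == 1 ? b
                    : len == 2 ? (b & 0x1F)
                    : len == 3 ? (b & 0x0F) : (b & 0x07);
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Overlong encodings, surrogates and out-of-range values are invalid too.
        if (ok && (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(0xFFFD); i += 1; }  // policy choice: replace and resync
    }
    return out;
}
```

A replacement character is tolerable in a displayed toc entry but may be a poor choice for something like a detected paper name that is used for lookup, which is why leaving the decision to the caller can be the better design.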
In that case, every single utf8_decode call should be checked for pre-call error handling. Is that doable?
Not sure what you mean by pre-call error handling, but that's not necessarily true. You can catch exceptions at any convenient place in the stack, and you can cover multiple calls to utf8_decode in one try block. For some calls you might reasonably know that you won't have invalid utf8, and sometimes you might just want to give up and crash if there's no sensible way to recover. Covering all the calls with good error handling might take some time, but of course it's doable.
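The point about catching higher up the stack can be sketched as follows. All names are hypothetical (DocInfo, decode_info, try_load_info, and the stand-in decode_utf8_strict); the catch clause assumes the decoder's exception derives from std::exception, which is the case for the std::runtime_error used here.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical strict decoder standing in for utf8_decode: throws on bad input.
std::u32string decode_utf8_strict(const std::string& in) {
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = b < 0x80 ? 1 : (b & 0xE0) == 0xC0 ? 2
                        : (b & 0xF0) == 0xE0 ? 3 : (b & 0xF8) == 0xF0 ? 4 : 0;
        if (len == 0 || i + len > in.size()) throw std::runtime_error("Invalid UTF-8");
        char32_t cp = len == 1 ? b : len == 2 ? (b & 0x1F)
                    : len == 3 ? (b & 0x0F) : (b & 0x07);
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) throw std::runtime_error("Invalid UTF-8");
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("Invalid UTF-8");
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical metadata record with two decoded fields.
struct DocInfo {
    std::u32string title;
    std::u32string author;
};

// Lower-level function: no error handling here, either call may throw.
DocInfo decode_info(const std::string& raw_title, const std::string& raw_author) {
    return DocInfo{decode_utf8_strict(raw_title), decode_utf8_strict(raw_author)};
}

// Higher-level caller: a single try block covers every decode call beneath it.
// On failure the metadata is left empty and the caller is told via the result.
bool try_load_info(const std::string& raw_title, const std::string& raw_author,
                   DocInfo& out) {
    try {
        out = decode_info(raw_title, raw_author);
        return true;
    } catch (const std::exception&) {
        out = DocInfo{};
        return false;
    }
}
```

Here the document could still be opened with blank metadata when try_load_info returns false; the individual decode calls stay free of error-handling code.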