sioyek Core dump due to Invalid UTF-8 error

Information:
System: Arch Linux
Install method: AUR sioyek-git
Sioyek version: 2.0.0.r1041.g280fdb6-1
Libmupdf version: 1.26.1-1

Error description: On attempting to open this file, Sioyek core dumps due to an "Invalid UTF-8" error. I have not been able to find other files that produce this error; they open just fine. The problem file opens just fine using other PDF viewers.

Error message:

default_config_path: /etc/sioyek/prefs.config
default_keys_path: /etc/sioyek/keys.config
user_config_path: [ 0 ] /etc/xdg/sioyek/prefs_user.config
user_config_path: [ 1 ] /home/shadi/.config/sioyek/prefs_user.config
user_keys_path: [ 0 ] /etc/xdg/sioyek/keys_user.config
user_keys_path: [ 1 ] /home/shadi/.config/sioyek/keys_user.config
database_file_path: /home/shadi/.local/share/sioyek/test.db
local_database_file_path: /home/shadi/.local/share/sioyek/local.db
global_database_file_path: /home/shadi/.local/share/sioyek/shared.db
tutorial_path: /usr/share/sioyek/tutorial.pdf
last_opened_file_address_path: /home/shadi/.local/share/sioyek/last_document_path.txt
shader_path: /usr/share/sioyek/shaders
SIOYEK
Warning: key defined in /etc/sioyek/keys.config:288 overwritten by /home/shadi/.config/sioyek/keys_user.config:3. Overriding command: q: replacing quit with close_window
Warning: key defined in /etc/sioyek/keys.config:255 overwritten by /home/shadi/.config/sioyek/keys_user.config:5. Overriding command: s: replacing external_search with fit_to_page_width
Warning: key defined in /etc/sioyek/keys.config:227 overwritten by /home/shadi/.config/sioyek/keys_user.config:13. Overriding command: <tab>: replacing goto_portal with goto_toc
terminate called after throwing an instance of 'utf8::invalid_utf8'
  what():  Invalid UTF-8
zsh: IOT instruction (core dumped)  sioyek hobsbawm-eric-the-invention-of-tradition.pdf

Steps taken: I could not find any similar issues on GitHub. I could not attempt a fix myself because I don't understand this error.

May 28 '25 06:05 ShadiZade

Can't reproduce the issue. I assume this is because of incompatible mupdf versions (as is tradition); the development branch uses mupdf 1.25.

May 28 '25 07:05 ahrm

I had the same thing happen recently on the dev branch, and I get the same error with your file. I think the problem is technically with your pdf rather than sioyek (or mupdf) per se. Running mutool show <your pdf> outline shows lots of null bytes in the toc, which I'm pretty sure shouldn't be there. The same goes for the pdf I had problems with. Both files cause Document::create_toc_tree / Document::convert_toc_tree to fail when calling utf8_decode. Zathura opens both files but displays weird hash-looking values instead of a proper toc. Removing the toc or replacing it with a non-broken one stops the crashing.

@ahrm: this might just need an extra try/catch block somewhere appropriate, so that bad utf8 (or whatever is wrong with the existing toc) doesn't crash the whole app?
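
A minimal sketch of the kind of guard suggested here, not necessarily what an actual fix in sioyek looks like. It assumes the toc titles arrive as raw byte strings and that the decoder throws an error derived from std::exception on bad input (as utf8::invalid_utf8 does in the log above). The names Decoder and decode_toc_titles and the placeholder text are hypothetical; the decoder is passed in as a parameter so the sketch does not depend on the real signature of utf8_decode.

```cpp
#include <exception>
#include <functional>
#include <string>
#include <vector>

// Hypothetical alias: any function that decodes UTF-8 bytes and throws on bad input.
using Decoder = std::function<std::wstring(const std::string&)>;

// Decode every toc title, catching failures per entry so that one
// malformed title becomes a placeholder instead of terminating the program.
std::vector<std::wstring> decode_toc_titles(const std::vector<std::string>& raw_titles,
                                            const Decoder& decode) {
    std::vector<std::wstring> titles;
    titles.reserve(raw_titles.size());
    for (const std::string& raw : raw_titles) {
        try {
            titles.push_back(decode(raw));
        } catch (const std::exception&) {
            titles.push_back(L"[unreadable title]");  // assumed placeholder text
        }
    }
    return titles;
}
```

Catching inside the loop keeps the readable entries; catching around the whole loop would instead drop the entire toc when any single title is bad.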

May 28 '25 07:05 aw-cloud

I can't reproduce the issue so I don't know where the appropriate place would be.

May 28 '25 07:05 ahrm

> Can't reproduce the issue. I assume this is because of incompatible mupdf versions (as is tradition) the development branch uses mupdf 1.25.

I would have assumed a libmupdf incompatibility issue if Sioyek wouldn't open any files at all, and I would have assumed file corruption if this file was not openable using any PDF viewer. But neither is the case, so I assumed it's a bug.

I agree with @aw-cloud that something should be done to avoid a wholesale crash.

May 28 '25 07:05 ShadiZade

@ShadiZade Does @aw-cloud 's pr fix your issue here: https://github.com/ahrm/sioyek/pull/1412?

May 28 '25 09:05 ahrm

@ahrm I don't know how to build from an unmerged pull request, so I'll assume it works. I'll report back here after it's merged.

May 28 '25 10:05 ShadiZade

I'm not going to merge it until I know it actually fixes the problem, that's why I am asking.

May 28 '25 10:05 ahrm

Alright, give me a minute to figure it out

May 28 '25 10:05 ShadiZade

You can clone my fork directly with git clone https://github.com/aw-cloud/sioyek.git -b development and then build as normal. Alternatively, one of the suggestions on this Stack Overflow post should work.

I did test that it works on my machine of course but I don't blame @ahrm for being cautious.

May 28 '25 11:05 aw-cloud

doing that now

May 28 '25 12:05 ShadiZade

./build_linux.sh is giving me the following error: make: *** No targets specified and no makefile found. Stop.

Sorry, I'm a bit of a beginner.

EDIT: Per #730, I passed --recursive and it is now building with no issues.

EDIT: I had to set QMAKE="/usr/bin/qmake6" right after the nested if in build_linux.sh to get rid of the error Project ERROR: Unknown module(s) in QT: texttospeech. Build successful.

EDIT: For future reference, to build from the dev branch, the following commands should work:

git clone --recursive https://github.com/ahrm/sioyek.git -b development
cd sioyek
QMAKE=qmake6 ./build_linux.sh

The final build is in ./build/sioyek.

May 28 '25 12:05 ShadiZade

Can confirm PR solves the issue. Thank you @aw-cloud and @ahrm and sorry for the trouble.

May 28 '25 13:05 ShadiZade

This issue has reappeared with the exact same error message, this time with this file. It persists even with the toc removed, and also when the individual pages produced by pdftk burst are opened. This is not a mupdf compatibility issue, because sioyek opens other files just fine, and it is not a corruption issue, because zathura opens this file just fine.

sioyek version: (latest AUR sioyek-git) 2.0.0.r1064.g143e5d51-1
libmupdf version: 1.26.10-1

Oct 26 '25 07:10 ShadiZade

Pinging @aw-cloud

Oct 26 '25 07:10 ShadiZade

Your file has invalid utf8, I think in the metadata title field. On opening the document, detect_paper_name is called, which calls utf8_decode; neither checks whether the input is actually valid utf8, so you get an exception and a crash. As before, utf8_decode can throw, so the function calling it should do the error handling.
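
For reference, the kind of validity check that is missing could be sketched as below. This is a hand-written validator for illustration, not code from sioyek; the name is_valid_utf8 is hypothetical. The utfcpp library, which is where the utf8::invalid_utf8 exception in the log comes from, also provides utf8::is_valid for the same purpose, so a real fix would more likely use that.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects stray or missing
// continuation bytes, overlong encodings, surrogates, and code points
// above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead < 0x80) { ++i; continue; }  // single-byte (ASCII) character

        std::size_t len = 0;
        unsigned long cp = 0;      // code point being assembled
        unsigned long min_cp = 0;  // smallest code point allowed for this length
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (i + len > n) return false;  // sequence is cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}
```

A caller could test the title bytes with this before decoding and skip the paper-name detection when it returns false.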

Oct 26 '25 22:10 aw-cloud

Can't utf8_decode be made to be able to safely deal with invalid utf8?

Oct 27 '25 00:10 ShadiZade

Yes, but it makes more sense to let utf8_decode fail when it's given invalid input. If utf8_decode handled errors, you'd have to decide on something sensible to return in those cases. What's sensible (if anything at all) probably depends most on what the caller is using the return value for, so the caller might as well be the one to handle the error.
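
To illustrate the point, here is a sketch of two callers that recover from the same failure in different ways. Both function names and both fallback values are hypothetical, and the decoder is passed in as a parameter rather than assuming the real signature of utf8_decode.

```cpp
#include <exception>
#include <functional>
#include <string>

// Hypothetical alias: any function that decodes UTF-8 bytes and throws on bad input.
using Decoder = std::function<std::wstring(const std::string&)>;

// A caller that displays the text: a visible placeholder keeps the entry usable.
std::wstring toc_title_for_display(const std::string& raw, const Decoder& decode) {
    try {
        return decode(raw);
    } catch (const std::exception&) {
        return L"[unreadable title]";
    }
}

// A caller that uses the text as a lookup key: an empty result means
// "no name detected", and the lookup is simply skipped.
std::wstring paper_name_for_lookup(const std::string& raw, const Decoder& decode) {
    try {
        return decode(raw);
    } catch (const std::exception&) {
        return std::wstring();
    }
}
```

Neither fallback would suit the other caller, which is the argument for leaving the decision out of the decoder itself.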

Oct 27 '25 02:10 aw-cloud

In that case, every single utf8_decode call should be checked for pre-call error handling. Is that doable?

Oct 27 '25 02:10 ShadiZade

Not sure what you mean by pre-call error handling, but that's not necessarily true. You can catch exceptions at any convenient place in the stack, and you can cover multiple calls to utf8_decode in one try block. For some calls you might reasonably know that you won't have invalid utf8, and sometimes you might just want to give up and crash if there's no sensible way to recover. Covering all the calls with good error handling might take some time, but of course it's doable.
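
A rough sketch of what catching higher in the stack can look like, with one try block covering two decode calls made by a helper that does no error handling of its own. DocumentInfo, read_info, and try_read_info are hypothetical names, and the decoder is again passed in as a parameter instead of assuming the real utf8_decode signature.

```cpp
#include <exception>
#include <functional>
#include <string>

// Hypothetical alias: any function that decodes UTF-8 bytes and throws on bad input.
using Decoder = std::function<std::wstring(const std::string&)>;

struct DocumentInfo {
    std::wstring title;
    std::wstring author;
};

// Lower-level helper: no error handling here, exceptions simply propagate.
DocumentInfo read_info(const std::string& raw_title, const std::string& raw_author,
                       const Decoder& decode) {
    DocumentInfo info;
    info.title = decode(raw_title);
    info.author = decode(raw_author);
    return info;
}

// Higher in the stack: a single try block covers both decode calls.
// Returns false and leaves `out` untouched if either field is undecodable.
bool try_read_info(const std::string& raw_title, const std::string& raw_author,
                   const Decoder& decode, DocumentInfo& out) {
    try {
        out = read_info(raw_title, raw_author, decode);
        return true;
    } catch (const std::exception&) {
        return false;
    }
}
```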

Oct 27 '25 03:10 aw-cloud