koreader-base Bump mupdf, kopt, lept and tess

mupdf 1.17, tracking https://github.com/ezdiy/mupdf
kopt 2.53, tracking https://github.com/ezdiy/libk2pdfopt
lept 1.79, frozen
tess 4.1, frozen

Due to complexity and heavy interdependencies, these have been frozen in time for the past 2-3 years, and are starting to show their age.

This is a large undertaking so it will take a while, any tips are welcome. Currently "mostly works".

tbd:

android build
mupdf falling on its face with glyphs that are missing in freefont

Sep 29 '20 12:09 ezdiy

It's even more annoying when you add k2pdfopt into the mix, because it mostly expects a frozen set of dependencies for those, so you'll probably want to bump that, too.

(Although, last I checked, you should have some breathing room on that front with current k2pdfopt versions).

Hopefully, the way I went about it the last time should make it slightly less arcane to work out what's k2pdfopt and what's our own patchset on top of it...

See https://github.com/koreader/koreader-base/pull/762 for a few comments about this and MµPDF API changes (and the last time we bumped k2pdfopt: https://github.com/koreader/libk2pdfopt/pull/32).

Sep 29 '20 12:09 NiLuJe

Also, I never bothered with it because, ahahaha, but the ZIP encryption thingy relies on an AES lib from minizip, and the minizip build is also frozen to a god-awfully old version.

I don't rightly recall what happened when I tried to bump it, but it was.... not good :D.

Sep 29 '20 12:09 NiLuJe

@NiLuJe Minizip is nuked, and moved to mupdf. Yeah, the hardcoded dependencies of kopt are nasty (the frozen versions instead of tracking are pretty much because of that). Luckily we don't care about lept/tess all that much as long as kopt is happy. With mupdf, on the other hand, it pays off to stay on the edge for as long as possible, as both our and kopt's use of it is more or less non-invasive, so it's just about hoping they won't mess with the API all that much in the future.

https://github.com/koreader/koreader-base/pull/762#issuecomment-490415883

From what I've seen this will be an issue: the API is completely gone, replaced with something entirely new. The rest seems to map reasonably well.

Sep 29 '20 13:09 ezdiy

This is a large undertaking so it will take a while, any tips are welcome.

I assume you know your way around git blame (whether locally or via the GH web GUI) to figure out what some cryptic line is supposed to do, but besides that nothing in particular. Of course you should take a look at https://github.com/koreader/koreader-base/pull/762, possibly https://github.com/koreader/koreader-base/pull/577, as well as https://github.com/koreader/libk2pdfopt/pull/32 but I imagine you either already have or will when you need to.

Thanks so much for looking into this! The amount of time required is a real obstacle. :+1:

Sep 30 '20 13:09 Frenzie

Yep, and feel free to poke our brains, I still vaguely remember what I did the last time ^^.

Sep 30 '20 13:09 NiLuJe

@Frenzie It's kind of long distance telepathy as I have no other way to test it, but better than nothing :)

Oct 05 '20 06:10 ezdiy

Same, that's why I added it as soon as I noticed GH made it available. It kept silently breaking. ^_^

Oct 05 '20 06:10 Frenzie

Regarding your addition of FT_Get_Sfnt_Name() : do you plan on handling https://github.com/koreader/koreader/issues/6763 in a standalone way (so I don't have to get xtext involved in that, https://github.com/koreader/koreader/issues/6763#issuecomment-705192275) ? Or do you need that for something else?

And in a few words before we see your code: how do you plan on having harfbuzz help you with MuPDF ?

Oct 09 '20 09:10 poire-z

Regarding your addition of FT_Get_Sfnt_Name() : do you plan on handling koreader/koreader#6763 in a standalone way (so I don't have to get xtext involved in that, koreader/koreader#6763 (comment)) ?

You were right, FT kinda sucks for this, as HB can handle most of the encoding issues for us. I think the way to go is just rewriting it 1:1 as a utility function in harfbuzz.lua, so as to not pollute xtext with it. Sketch:

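local ffi = require("ffi")
local hb = require("ffi/harfbuzz")
-- uiface is the FT_Face of the font; lang is an hb_language_t
local hbface = hb.hb_ft_face_create_referenced(uiface)
local buf = ffi.new("char[256]")
-- text_size is in/out: buffer capacity in, bytes written out
local len = ffi.new("unsigned int[1]", 256)
hb.hb_ot_name_get_utf8(hbface, hb.HB_OT_NAME_ID_FULL_NAME, lang, len, buf)
hb.hb_face_destroy(hbface)
return ffi.string(buf, len[0])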

Some minor gotchas still apply, like having to check all of FONT_FAMILY, NAME and FULLNAME, as only one of them may have something useful.
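A minimal sketch of that "check all the fields" gotcha, written in Python purely for illustration. The `lookup` callable is hypothetical (it stands in for whatever wraps the harfbuzz name query), and the order in which the fields are tried is an assumption, not something settled in this thread:

```python
from typing import Callable, Iterable, Optional

# Field labels as used in the comment above; the order is an assumption.
DEFAULT_FIELDS = ("FULLNAME", "FONT_FAMILY", "NAME")


def first_useful_name(lookup: Callable[[str], Optional[str]],
                      fields: Iterable[str] = DEFAULT_FIELDS) -> Optional[str]:
    """Return the first non-blank name among the given fields, else None."""
    for field in fields:
        value = lookup(field)
        # Skip fields that are missing or contain only whitespace.
        if value and value.strip():
            return value.strip()
    return None


# Hypothetical font whose full name is blank but whose family name is set.
names = {"FULLNAME": "  ", "FONT_FAMILY": "Noto Sans"}
print(first_useful_name(names.get))
```

Passing the lookup in as a callable keeps the fallback logic independent of how the name is actually fetched.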

I'll bring some cherrypicked PR for this as it's not really dependent on what I'm doing with mupdf anymore, the hb parts are already in master.

Or do you need that for something else?

And in a few words before we see your code: how do you plan on having harfbuzz help you with MuPDF ?

This whole ritual dance is essentially for the sake of ghetto fontconfig to get fonts working in mupdf beyond the extent of hardcoded builtins (which aren't even there to begin with now :). fz_install_load_system_font_funcs() gets you script, language and typeface name in a callback, and we should spit some FT_Face instance out.

To do this with device/user-supplied fonts, there's no getting around coverage and some modest heuristics to pick something, as there's no luxury of restarting layout whenever we step on a missing glyph. When we respond with a face, we commit to coverage of the requested lang and/or script. This can also help avoid erratic substitutions in a mixed-script layout block whenever the available font choice is looser than a single main+builtins list.
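To make the "coverage plus modest heuristics" idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the font table maps hypothetical font names to the code points they cover (real code would read that from each font's cmap), and the ranking (name match first, then broadest coverage) is just one plausible heuristic, not what this PR implements:

```python
from typing import Dict, Optional, Set


def pick_face(fonts: Dict[str, Set[int]], wanted_name: str,
              required: Set[int]) -> Optional[str]:
    """Pick a font that covers every required code point.

    Prefers a font whose name contains the requested typeface name,
    then the broadest coverage. Returns None if nothing covers `required`.
    """
    # Only fonts covering the whole sample qualify: once we answer with a
    # face we are committed to it, there is no second layout pass.
    candidates = [name for name, cps in fonts.items() if required <= cps]
    if not candidates:
        return None
    wanted = wanted_name.lower()
    return max(candidates,
               key=lambda name: (wanted in name.lower(), len(fonts[name])))


# Hypothetical coverage table, for illustration only.
fonts = {
    "FreeSerif": set(range(0x20, 0x250)),
    "Noto Sans CJK": set(range(0x20, 0x7F)) | set(range(0x4E00, 0x4E80)),
    "Droid Sans": set(range(0x20, 0x180)),
}
czech = {ord(c) for c in "ěščřž"}
print(pick_face(fonts, "Droid", czech))
print(pick_face(fonts, "Helvetica", czech))
```

Requiring full coverage of the sample up front reflects the constraint above that a missing glyph cannot be fixed after the fact.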

All of this stuff is meant to live more or less parallel to cre, so as to not step on its toes, though some font sharing (on the blob/ftmemory level) should be implemented between the two in the future, so as to avoid spamming too many copies of the same face.

As for the SFNT name fields in FT, they were meant to give a pristine view in the cached info for font selection should it be needed, but it turns out all the fields there are useless, and the one that is helpful for the UI is better asked for via harfbuzz.

Oct 09 '20 11:10 ezdiy

I think the way to go is just rewriting it 1:1 as a utility function in harfbuzz.lua, so as to not pollute xtext with it

Fine with me :+1:

Thanks for the other info. Just a question, as I'm not really familiar with PDF. I thought PDF has all the layout fixed, i.e. lines of text are already line-broken and each word has some kind of absolute x/y positioning. Which I feel would require the exact font to be known/used so the text looks nice. By using arbitrary external fonts, isn't there the risk that with a font whose line height and glyph widths are different from the original font used by the author (but not embedded), the text will look bad? Or is it that really nice documents should ship embedded fonts, and what we do here is for plain simple PDF documents where we have to make do with that risk of bad layout? Or is PDF supposed to do reflow/line breaking in that case, and handle more stuff than I thought it does?

Oct 09 '20 11:10 poire-z

For PDF, it's about those that actually don't embed fonts. CJK ones quite often don't (too much bloat), musical notes, papers that are just .ps in a PDF wrapper... As it stands you get only hardcoded builtins for those, and even that ostensibly fails unless you manually butcher the base14 builtins to point at freefont/noto/droid (which is what the old patch did). This at least avoids dropping glyphs for "exotic" Latin scripts such as Czech, at the cost of barring font choice from the document entirely.

As for line height/glyph width, that is an issue indeed. It's generally the reason why we should try real hard to supply the font that is asked for, or at least something close to it. PDFs like that are not really fixed-size: the PostScript path commands do leave wiggle room for the layout engine to make decisions. The text itself in a PDF is "reflowable", but the text boxes in the document itself tend to be laid out too haphazardly to be of any practical use for whole-page reflow. In HTML terms, imagine a PDF as a document with everything laid out using <table>, with text generally at paragraph granularity.

Finally, there's HTML itself. While CRE is a vastly superior renderer, mupdf does offer some interesting options. SVG, for one, which at least sometimes gave me better results than tinysvg. One can also get very fast pagination (think instant document opens and em size changes). Though admittedly the pdfdocument frontend needs to be made a lil bit smarter to make reasonable use of that (forgo kopt and the tile cache when not needed on reflowable documents etc).

Oct 09 '20 11:10 ezdiy

@ezdiy : long time no news :/ Are you ok and fine? Just busy ? Offline ?

Should somebody else (mhh, not me :) continue this PR?

Feb 04 '21 14:02 poire-z

I'll close this one in favor of #1750 and related PRs, but thanks so much for the effort!

May 03 '24 16:05 Frenzie
