go-noto-universal

Why does GoNotoCurrent not render Korean glyphs whereas GoNotoCJKCore does?

Open xplip opened this issue 2 years ago • 20 comments

Thank you for providing this great library! I am currently trying to render text in various languages with the pygame library and it seems that when I am using GoNotoCurrent, I can render Japanese and Chinese glyphs just fine, but Korean glyphs are only rendered as empty boxes. When I am using GoNotoCJKCore, Korean is rendered properly as well, so I am wondering what the main difference between the two is. I can get around the issue by rendering my texts with the Pillow library and a libraqm layout engine which builds on harfbuzz, but this is horribly slow, so I'd prefer to keep using pygame and get it to work with GoNotoCurrent. Do you have an idea why rendering Korean might not work in my setup?

xplip avatar May 09 '22 13:05 xplip

Hi Phillip, thanks for the bug report.

The reason is that GoNotoCurrent does not include the "Hangul Syllables" Unicode block (U+AC00 to U+D7AF), whereas GoNotoCJKCore does. This block contains 11,172 assigned codepoints and needs at least as many glyphs. However, GoNotoCurrent is already at ~61,000 glyphs, and the maximum is 65,535 (a limit imposed by the font spec). Hence there is not enough "glyph space" to include all of the Hangul syllables, so there is not much that can be done.
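As a rough sanity check of the numbers above, the arithmetic can be sketched in Python. The figure of 61,000 existing glyphs is the approximate count quoted in this comment, one glyph per syllable is the minimum needed, and the variable names are purely illustrative.

```python
# Rough glyph-budget check; 61_000 is the approximate count quoted above.
MAX_GLYPHS = 0xFFFF            # numGlyphs is a 16-bit field, so at most 65,535
current_glyphs = 61_000        # approximate size of GoNotoCurrent (assumption)

# Precomposed Hangul syllables occupy U+AC00..U+D7A3 within the block.
hangul_syllables = 0xD7A3 - 0xAC00 + 1

headroom = MAX_GLYPHS - current_glyphs
shortfall = hangul_syllables - headroom

print(f"syllables needed: {hangul_syllables}")
print(f"glyph slots left: {headroom}")
print(f"shortfall:        {shortfall}")
```

Under these assumptions the block needs several thousand more glyph slots than remain, which is why only a subset could fit.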

One option is to find a smaller subset (say ~2500 glyphs) of the 11K codepoints and include them in GoNotoCurrent, still honouring the 64K limit. Obviously this leaves out a large chunk of the Korean repertoire, so it is of little practical utility.

satbyy avatar May 09 '22 20:05 satbyy

Many precomposed syllables are not actually used in Korean. You could use KS X 1001’s list of 2,350 common Hangul syllables.

dscorbett avatar May 09 '22 20:05 dscorbett

@dscorbett I gave it a try on my local machine (KSX1001 subset), but now we hit the cmap format 4 table limit of 65535. Such subsetting fragments the "Hangul Syllables" block (U+AC00 to U+D7AF) -- the subset TTF's cmap format 4 table has a length of about 13000, whereas GoNotoCurrent is already at 64706, so the total of 77666 exceeds 65535.
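To illustrate why fragmentation inflates the table, here is a minimal sketch that counts contiguous codepoint runs and estimates the size of a format 4 subtable. It assumes the simplest encoding (one segment per run, consecutive glyph IDs within each run, empty glyphIdArray); real font compilers may encode things differently, so treat the result as an estimate rather than what pyftsubset or pyftmerge actually emits. The function names and the "every other syllable" subset are hypothetical.

```python
def contiguous_runs(codepoints):
    """Group codepoints into maximal runs of consecutive values."""
    runs = []
    for cp in sorted(set(codepoints)):
        if runs and cp == runs[-1][1] + 1:
            runs[-1][1] = cp          # extend the current run
        else:
            runs.append([cp, cp])     # start a new run
    return [tuple(r) for r in runs]


def format4_estimated_bytes(codepoints):
    """Estimate a cmap format 4 subtable's size with one segment per run.

    16 bytes of fixed fields plus 8 bytes per segment (endCode, startCode,
    idDelta, idRangeOffset), counting the mandatory final 0xFFFF segment.
    """
    seg_count = len(contiguous_runs(codepoints)) + 1
    return 16 + 8 * seg_count


full_block = range(0xAC00, 0xD7A4)        # one unbroken run
scattered = range(0xAC00, 0xD7A4, 2)      # hypothetical subset: every other syllable

print(format4_estimated_bytes(full_block))
print(format4_estimated_bytes(scattered))
```

The full block is a single run and costs almost nothing, while a scattered subset of half as many codepoints needs thousands of segments. That is the effect that pushes the combined table past the 16-bit length limit.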

satbyy avatar May 09 '22 21:05 satbyy

Or maybe I'm looking at this the wrong way. Attached below is the rendering of Korean wikipedia homepage, using GoNotoCurrent.ttf. It seems that the initial + final components are not combined/stacked correctly. Am I dropping some tables unknowingly?

korean-wiki

satbyy avatar May 09 '22 21:05 satbyy

I forgot about 'cmap' fragmentation. I guess that idea won’t work.

The syllables are exploded because the lookups that join them together are not applied when the language system is Korean. I’m not sure why.

dscorbett avatar May 09 '22 23:05 dscorbett

Thanks a lot for the explanations and taking a stab at it already! I wasn’t aware Korean relied so heavily on the precomposed syllables. If the glyph limit is reached then I suppose there is not so much that can be done.

I think for my personal use case, having Korean in the font is more important than the Math, Music, and Symbol Fonts, though. I quickly tried to rebuild the GoNotoCurrent font without those four (NotoSansSymbols-Regular.ttf, NotoSansSymbols2-Regular.ttf, NotoSansMath-Regular.ttf, NotoMusic-Regular.ttf) in the categories.sh and with this file https://raw.githubusercontent.com/sozysozbot/korean_hanja_sound/master/KSX1001.txt passed to pyftsubset via the --unicodes-file flag in create_korean_hangul_subset().

Out came a font file that seems to render my Korean sample texts fine. The command otfinfo -g GoNotoCurrent.ttf | wc -l returns 65251, so it looks like it didn't go over the glyph limit. I'm not really confident any of this was the correct approach, though, so I would appreciate it a lot if you could double-check this :)
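The glyph count can also be cross-checked without extra tools. A minimal sketch, assuming a single-font .ttf/.otf file (not a .ttc collection): read numGlyphs straight from the maxp table with Python's struct module. The value should match what otfinfo reports, but that is an expectation, not something verified here. The function names read_num_glyphs and glyph_headroom are made up for this illustration.

```python
import struct


def read_num_glyphs(path):
    """Return maxp.numGlyphs from a single-font TrueType/OpenType file."""
    with open(path, "rb") as f:
        data = f.read()
    # Offset table: sfntVersion (4 bytes), numTables (2), then 6 bytes of search hints.
    num_tables = struct.unpack_from(">H", data, 4)[0]
    for i in range(num_tables):
        # Each 16-byte table record: tag, checksum, offset, length.
        tag, _checksum, offset, _length = struct.unpack_from(">4sIII", data, 12 + 16 * i)
        if tag == b"maxp":
            # maxp starts with a 4-byte version, followed by uint16 numGlyphs.
            return struct.unpack_from(">H", data, offset + 4)[0]
    raise ValueError("no 'maxp' table found")


def glyph_headroom(num_glyphs, limit=0xFFFF):
    """Glyph slots still free below the 16-bit limit."""
    return limit - num_glyphs


# With the count reported above, only a few hundred slots remain.
print(glyph_headroom(65251))
```

Calling read_num_glyphs("GoNotoCurrent.ttf") on the rebuilt file and passing the result to glyph_headroom shows how close the font is to the limit.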

xplip avatar May 10 '22 00:05 xplip

@xplip Yes, that is a good approach and that's all there is to it. Enjoy your new font!

satbyy avatar May 10 '22 06:05 satbyy

Hey there, sorry for bringing this topic up again, I originally thought I could just follow the steps proposed by @xplip and generate a GoNotoCurrent file with increased support for Korean Hangul syllables, but when trying to run the temporal_fonts.sh after both editing categories.sh (to remove the symbols, math and music fonts) and injecting the KSX1001.txt via the unicodes file flag in helper.sh at line 254, the process just randomly crashes.

The stacktrace is as follows:

Traceback (most recent call last):
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/bin/pyftmerge", line 8, in <module>
    sys.exit(main())
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/misc/loggingTools.py", line 372, in wrapper
    return func(*args, **kwds)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/merge/__init__.py", line 201, in main
    font.save(outfile)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 185, in save
    writer_reordersTables = self._save(tmp)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 225, in _save
    self._writeTable(tag, writer, done, tableCache)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 654, in _writeTable
    self._writeTable(masterTable, writer, done, tableCache)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 654, in _writeTable
    self._writeTable(masterTable, writer, done, tableCache)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 658, in _writeTable
    tabledata = self.getTableData(tag)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/ttFont.py", line 680, in getTableData
    return self.tables[tag].compile(self)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/tables/_g_l_y_f.py", line 132, in compile
    glyphData = glyph.compile(self, recalcBBoxes)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/tables/_g_l_y_f.py", line 673, in compile
    data = data + self.compileComponents(glyfTable)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/tables/_g_l_y_f.py", line 903, in compileComponents
    data = data + compo.compile(more, haveInstructions, glyfTable)
  File "/mnt/c/Users/oskar/Storage/code/projects/hydra/go-noto-universal/venv_fonty/lib/python3.10/site-packages/fontTools/ttLib/tables/_g_l_y_f.py", line 1469, in compile
    return struct.pack(">HH", flags, glyphID) + data
struct.error: 'H' format requires 0 <= number <= 65535

From what I can tell this exception gets thrown whilst trying to merge the base font files into the big single font file.

Since I am unfortunately pretty new to this field I am quite clueless on what to do in order to fix this issue. The last logs before this exception happens are always different, so there's nothing that would help debugging it. The first issue I was thinking of was that maybe there might somehow be too many glyphs to fit into the font file. Confusingly enough this exception occurred even after removing more fonts from the categories.sh file.

I am running the temporal_fonts.sh file on WSL2 22.04, and I think it could potentially be related to that, since the crashes appear so inconsistently.

Any help or hint on how to get this working would be greatly appreciated! Thanks so much for the awesome work :)

rxsto avatar Feb 26 '23 16:02 rxsto

I am facing the same issue as @rxsto. I am running the script on macOS Ventura, and after following the steps proposed by @xplip, I am getting the exact same stacktrace. Were you able to fix it, @rxsto?

rubiomiguel06 avatar Apr 14 '23 22:04 rubiomiguel06

For the record:

I have managed to fix the issue I was facing. Basically, there were more glyphs than what the spec allows (64K). Thus, the error struct.error: 'H' format requires 0 <= number <= 65535.
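For anyone wondering how a glyph count turns into that particular exception, here is a small sketch that reproduces it in isolation. It mirrors the struct.pack(">HH", flags, glyphID) call from the traceback; the helper names are hypothetical and this is not fontTools code.

```python
import struct


def pack_component(flags, glyph_id):
    """Pack two unsigned 16-bit big-endian fields, like the failing call in the traceback."""
    return struct.pack(">HH", flags, glyph_id)


def fits_in_uint16(glyph_id):
    """True if the glyph ID can be written into a 16-bit field."""
    try:
        pack_component(0, glyph_id)
        return True
    except struct.error:
        return False


print(fits_in_uint16(65535))   # highest value a 16-bit field can hold
print(fits_in_uint16(65536))   # one past it, triggers struct.error internally
```

So as soon as the merged font contains a composite glyph that references a glyph ID above 65535, saving the font fails in exactly this way.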

@xplip's explanation is good, but to make it clearer and easier, I would change the following line:

... and with this file https://raw.githubusercontent.com/sozysozbot/korean_hanja_sound/master/KSX1001.txt passed to pyftsubset via the --unicodes-file flag in create_korean_hangul_subset().

for:

In helper.sh, inside the function create_korean_hangul_subset(), add the following codepoints:

codepoints+="U+AC00-D7A3," # Hangul syllables

That way all the Hangul syllables are added to the korean subset font and the glyph count limit is respected.
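To double-check what that range covers, here is a minimal sketch that expands a range string of this shape into individual codepoints. It is an illustrative parser written for this comment, not the code that helper.sh or pyftsubset uses, and the function name is made up.

```python
def parse_unicode_ranges(spec):
    """Expand a string such as 'U+AC00-D7A3,' into a sorted list of codepoints."""
    codepoints = set()
    for item in spec.replace(",", " ").split():
        item = item.upper()
        if item.startswith("U+"):
            item = item[2:]
        start, _, end = item.partition("-")
        if end.startswith("U+"):
            end = end[2:]
        first = int(start, 16)
        last = int(end, 16) if end else first
        codepoints.update(range(first, last + 1))
    return sorted(codepoints)


hangul = parse_unicode_ranges("U+AC00-D7A3,")
print(len(hangul), hex(hangul[0]), hex(hangul[-1]))
```

The range ends at U+D7A3 rather than U+D7AF because the last few positions of the block are unassigned.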

I hope I am not skipping any important glyphs for Korean. All my tests were successful, so I don't think so.

rubiomiguel06 avatar Apr 18 '23 14:04 rubiomiguel06

AFAIK, open-source font projects, especially those with very large glyph counts, usually ship their fonts as two files.

Take "Hanazono fonts" as an example: https://osdn.net/projects/hanazono-font/

They release their font Hanazono in 2 files: HanaMinA.ttf HanaMinB.ttf

HanaMinA.ttf contains the more commonly used CJK glyphs, and HanaMinB.ttf contains the less commonly used ones.

Most systems nowadays - Windows, *nix, Android - can be set to use them as a pair.

Two files of up to 65535 glyphs each should be enough for daily use.
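As a toy model of what "using them as a pair" means, the sketch below splits a string into runs according to which of two fonts covers each character. Real systems do this inside the OS or toolkit font-fallback machinery; the coverage sets and the function name here are hypothetical and only illustrate the idea.

```python
def split_by_font(text, coverage_a, coverage_b):
    """Split text into (font_name, substring) runs, preferring font A.

    coverage_a and coverage_b are sets of codepoints each font can render;
    characters covered by neither stay with font A (which would show .notdef).
    """
    runs = []
    for ch in text:
        cp = ord(ch)
        font = "A" if cp in coverage_a or cp not in coverage_b else "B"
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)   # extend the current run
        else:
            runs.append((font, ch))               # start a new run
    return runs


# Hypothetical coverage: font A has Basic Latin, font B has Hangul syllables.
font_a = set(range(0x20, 0x7F))
font_b = set(range(0xAC00, 0xD7A4))

print(split_by_font("Hi 한글!", font_a, font_b))
```

Each run would then be drawn with its own font file, which is why the pair approach only works where the renderer supports per-run font selection.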

stephen0z avatar May 28 '23 17:05 stephen0z

@stephen1864 Thanks, that is a good idea to create two "A" and "B" fonts, one with Korean glyphs and one without them. I could work on it in the coming days or weeks.

satbyy avatar May 28 '23 19:05 satbyy

I also ran into the issue of the missing Korean glyphs. But as I'm not very experienced with font creation (what must be included and what must not), I could not follow all the discussions here.

I think the workaround from xplip is the one I need (I can easily skip the Math, Music, and Symbol fonts, but I need Korean), but currently I have no idea how to create the font correctly. Would it be an idea to provide that font too (or to provide it to me in some way)? This would help me very much.

On the other hand, the separation into GoNotoCurrent A and B fonts may help, if the A font is similar to GoNotoCurrent with Korean.

As I would like to use the font for embedding in a PDF, I think using it as a pair may not work. I need to use one TTF font.

user6905 avatar Jul 19 '23 20:07 user6905

@user6905 For embedding in a PDF, it is best to include only what is needed, otherwise the PDF will grow extremely large. If you don't need the Math, Music, and Symbol fonts but do need complete Korean, you may go directly to Noto Fonts, which is the source of this project, and choose a suitable one:

https://fonts.google.com/noto/fonts?noto.lang=ko_Kore&noto.continent=Asia&noto.script=Kore

stephen0z avatar Jul 20 '23 21:07 stephen0z

@Stephen: That does not help in my case. I must be as universal as possible because of international use, although the content is limited to technical conversation. For this reason I can skip the Math, Music, and Symbol fonts. But Korean, I know, is used, so a single Noto font makes no sense. I am searching for a better replacement for UniFont, so GoNotoCurrent would be perfect (much better than UniFont) if it supported Korean. Generally, the PDF size is not that bad, because I can subset the font in the PDF, and because of some pictures the font is not the only reason the PDF gets a bit bigger. So I really need a universal font like GoNotoCurrent with Korean glyphs.

user6905 avatar Jul 20 '23 22:07 user6905

@user6905 here's the font I've created back when I participated in this thread. Feel free to use it and test it in your specific scenario. GoNotoCJKCore.zip

I don't remember the details of what IS and what IS NOT included. But you can check by yourself.

rubiomiguel06 avatar Jul 27 '23 13:07 rubiomiguel06

Thank you Miguel. Meanwhile I installed Ubuntu and was able to use your fix.

  • by adding codepoints+="U+AC00-D7A3," # Hangul syllables
  • and removing the symbols, math and music fonts

from GoNotoCurrent, I built my own font based on that.

@satbyy: Would you consider including that font in your collection? I think it could be helpful for others too.

@satbyy: BTW - Is GoNoto... a correct name for the fonts? According to the OFL license I thought you must not use Reserved Font Names (RFNs). And Noto is a trademark of Google.

user6905 avatar Jul 27 '23 13:07 user6905

@user6905 and all, can you please download the font from the CI pipeline? Now there are two variants:

  • Go Noto Kurrent (with a K, for Korean) with full Hangul syllables but removes symbols/emoji/math.
  • Go Noto Current (existing as it is) with poor Korean support but includes symbols/emoji/math.

If you are satisfied, I will close this issue and make a new release.

satbyy avatar Jul 31 '23 03:07 satbyy

Generally the script works well on Ubuntu and the created font includes the Korean glyphs - I can confirm that. Thanks a lot. I don't have a full test suite, but everything looks fine for several Asian languages.

I only wonder why GoNotoCurrent-Regular.ttf from your zip file is only 14,669,722 bytes. Mine is 15,485,612 bytes and has 64623 glyphs. I have not tested the font from the zip so far.

user6905 avatar Jul 31 '23 20:07 user6905

Amazing, thank you @satbyy and @xplip!

We are using this recipe, and specifically the Kurrent font, within the @globaleaks project together with the FPDF2 library.

This makes it possible for us to produce PDFs that can render text from any international user!

evilaliv3 avatar May 24 '24 10:05 evilaliv3