Glyph/CID difference across SHSerif versions

Open NightFurySL2001 opened this issue 2 years ago • 5 comments

Is there a list of the glyphs and CIDs that were added, changed, or removed between Source Han Serif v1.0 and v2.0? Such a list would help font developers plan version migration ahead of time.

The Source Han Sans ReadMe at least included a descriptive record:

As a result of removing approximately 1,750 glyphs in order to make room for approximately 1,750 new glyphs, the CID assignments of the glyphs necessarily—and drastically—changed. The CID assignments of exactly 200 glyphs are unchanged from Version 1.004: 0–107, 2570–2633, 47223–47232, 47262–47272, 47281–47286, and 65484.

NightFurySL2001 avatar Oct 26 '21 02:10 NightFurySL2001

You will be able to do this yourself quite easily. To prepare for this, grab the ordering file, AI0-SourceHanSans, for Version 1.004, remove all of the lines except for the range for Extension A and URO (they are contiguous), then extract the fourth column, which contains the glyph names. Name that file v1004.txt. Once Version 2.000 is released, do the same, but name the file v2000.txt. The following command line will provide what you requested:

% diff v1004.txt v2000.txt | grep "^<" | grep JP
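
The quoted steps can also be sketched in Python instead of `diff` and `grep`. This is only a rough equivalent, not what was actually run: it uses a set difference rather than a line-based diff, and it selects Extension A and URO glyphs by parsing `uniXXXX` names (U+3400 to U+4DBF and U+4E00 to U+9FFF) instead of trimming the file by hand. The function names and the sample rows are made up for illustration.

```python
import re

# Assumption: ideograph glyph names start with "uni" plus four hex digits,
# e.g. "uni4E00-JP". Names of any other shape are simply not matched.
_UNI_NAME = re.compile(r"^uni([0-9A-F]{4})")


def is_ext_a_or_uro(name):
    """True if the name's code point lies in Extension A or the URO block."""
    m = _UNI_NAME.match(name)
    if not m:
        return False
    cp = int(m.group(1), 16)
    return 0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF


def glyph_names(lines, keep=is_ext_a_or_uro):
    """Collect the fourth-column glyph names that pass the `keep` filter."""
    names = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 4 and keep(fields[3]):
            names.append(fields[3])
    return names


def only_in_old(old_names, new_names, region="JP"):
    """Names in the old list but not the new one that mention `region`
    (roughly what `grep "^<" | grep JP` picks out of the diff)."""
    new_set = set(new_names)
    return [n for n in old_names if n not in new_set and region in n]


# Made-up sample rows: CID, two placeholder columns, glyph name.
old_rows = [
    "100 X X uni4E00-JP",
    "101 X X uni4E01-JP",
    "102 X X uni4E02-CN",
    "103 X X uni0041",
]
new_rows = [
    "200 X X uni4E00-JP",
    "201 X X uni4E02-CN",
]

print(only_in_old(glyph_names(old_rows), glyph_names(new_rows)))
```

With real data, an open file object can be passed in place of the sample lists, for example `glyph_names(open('v1004-ordering-file'))`, where the path is whatever the ordering file is called locally.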

In the meantime, I remembered that command line from @kenlunde (who is obviously no longer in charge of this project) for checking the glyph differences between v1 and v2, so I compiled lists for Source Han Serif, covering all ideographs rather than just Extension A and the URO. These files should be at least somewhat useful, though any glyph names that changed between v1 and v2 will make the lists less accurate.

  • Serif Removed JP Glyphs.txt
  • Serif Removed KR Glyphs.txt
  • Serif Removed CN Glyphs.txt
  • Serif Removed TW Glyphs.txt
  • Serif Removed HK Glyphs.txt

Marcus98T avatar Oct 26 '21 09:10 Marcus98T

I would expect Adobe to have a record of which glyphs were removed and which were added (mostly HK) in 2.0, since an update this large should be documented, and it is nearly impossible to manually check all 65,535 glyphs across versions given the CID changes.

NightFurySL2001 avatar Nov 02 '21 12:11 NightFurySL2001

Well, I guess that if Adobe hasn't replied for this long, they probably don't have such a list and were unable to check which glyphs were changed and removed. But we will keep this issue open in case someone, at Adobe or in the community, has the time (a lot of free time) to check for themselves. Maybe someone can add what else was changed or removed here, bit by bit.

Marcus98T avatar Nov 14 '21 08:11 Marcus98T

Fixing the TW/HK mapping issues in 2.000 is a much higher priority which is why I haven't responded to this yet.

There are many lists for many things, but there is no single list that tells you absolutely everything. Most of this information is very easy to get with some basic scripting. In the uncommon situation where a glyph is renamed it won't be obvious without a note, but it's very easy to see which glyphs were added or removed. uni8D17-TW was renamed to uni8D17-CN and is noted in the ReadMe along with other glyph corrections. I realize I did miss noting the other renames in the ReadMe. These were renamed to match the Source Han Sans names:

uni9686-CN --> uniF9DC-JP
uni52E4-CN --> uniFA34-JP
uni58A8-CN --> uniFA3A-JP
uni7891-CN --> uniFA4B-JP
uni5DFD-CN --> u2F884-JP
uni5448-TW --> uni5448uE0101-JP
uni5E3D-CN --> uni5E3DuE0101-JP
uni655E-CN --> uni655EuE0101-JP
uni6F74-CN --> uni6F74uE0101-JP
uni8422-CN --> uni8422uE0101-JP
uni8842-CN --> uni8842uE0101-JP
uni8CCA-CN --> uni8CCAuE0102-JP
uni8D05-CN --> uni8D05uE0101-JP
uni8DDA-CN --> uni8DDAuE0101-JP
uni976D-CN --> uni976DuE0102-JP

In this case it doesn't really matter which name is used if the mapping is correct, but it makes it much easier to compare the Sans and Serif glyph sets if names are a little more consistent.

CID changes don't make it easier or harder in any way to compare. The information is all right there in the layout file. For basic comparison I wrote this Python script in about 10 seconds. You just need the path to the 1.001 layout file and the 2.000 layout file (or whatever versions). Along with the list of renamed glyphs you can see what was added, removed, or had the CID changed.

#!/usr/bin/env python3

old_layout = '/path/to/old/layout/AI0-SourceHanSerif'
new_layout = '/path/to/new/layout/AI0-SourceHanSerif'


def load_layout(path):
    """Map each glyph name (fourth column) to its CID (first column)."""
    name2cid = {}
    with open(path, 'r') as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            cid, _, _, name = fields
            name2cid[name] = int(cid)
    return name2cid


name2cid_old = load_layout(old_layout)
name2cid_new = load_layout(new_layout)

print('Removed glyphs:')
for name, old_cid in name2cid_old.items():
    if name not in name2cid_new:
        print(f'{name} removed')

print('Added glyphs:')
for name in name2cid_new:
    if name not in name2cid_old:
        print(f'{name} added')

print('CID changed:')
for name, new_cid in name2cid_new.items():
    if name in name2cid_old:
        old_cid = name2cid_old[name]
        if new_cid != old_cid:
            print(f'{name}: {old_cid} --> {new_cid}')
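
As a sketch of how the rename list could be combined with the script above, the following treats a listed old/new pair as a rename rather than as a removal plus an addition. The `RENAMES` table holds only a few of the pairs listed earlier, and `split_changes` and the toy CID values are hypothetical.

```python
# A few of the renames listed in this comment (old name -> new name).
RENAMES = {
    "uni9686-CN": "uniF9DC-JP",
    "uni52E4-CN": "uniFA34-JP",
    "uni5DFD-CN": "u2F884-JP",
}


def split_changes(name2cid_old, name2cid_new, renames=RENAMES):
    """Return (removed, added, renamed), keeping renames out of the first two."""
    # Count a pair as a rename only if the old name disappeared and the
    # new name appeared between the two layouts.
    renamed = {
        old: new
        for old, new in renames.items()
        if old in name2cid_old and old not in name2cid_new
        and new in name2cid_new and new not in name2cid_old
    }
    targets = set(renamed.values())
    removed = [n for n in name2cid_old
               if n not in name2cid_new and n not in renamed]
    added = [n for n in name2cid_new
             if n not in name2cid_old and n not in targets]
    return removed, added, renamed


# Toy data; the CIDs are invented for the example.
old = {"uni9686-CN": 10, "uni4E00-JP": 11, "uni4E01-JP": 12}
new = {"uniF9DC-JP": 20, "uni4E00-JP": 21, "uni4E02-HK": 22}

removed, added, renamed = split_changes(old, new)
print("removed:", removed)
print("added:", added)
print("renamed:", renamed)
```

Returning lists instead of printing directly makes it easier to write the results to files or to compare the Sans and Serif glyph sets afterwards.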

punchcutter avatar Nov 15 '21 06:11 punchcutter

There appear to be 3822 glyphs removed from v1 (plus 514 reserved CIDs), most of them JP glyphs; 4336 glyphs were added in v2.001 for HK support. This file lists all of the removed and added CIDs and their names, along with a difference PDF.

  • Note that the list may be incomplete, as there might be swapped glyphs that retain the same name.

P.S. The rename uni8D05-CN --> uni8D05uE0101-JP is incorrect: uni8D05uE0101-JP already existed in v1, and uni8D05-CN was simply removed in v2.
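
The reasoning in this note can be written down as a small check: a claimed rename is only plausible if the old name exists solely in the old version and the new name solely in the new one. This is a minimal sketch; `classify_rename` and the two toy name sets are hypothetical and only mirror the situation described here, not the actual layout files.

```python
def classify_rename(old_name, new_name, old_names, new_names):
    """Describe whether a claimed rename fits the two glyph name sets."""
    if old_name not in old_names:
        return "source not in old version"
    if old_name in new_names:
        return "source still present in new version"
    if new_name not in new_names:
        return "target missing from new version"
    if new_name in old_names:
        return "target already existed; source was removed"
    return "consistent with a rename"


# Toy name sets reflecting the case described in this note.
v1_names = {"uni8D05-CN", "uni8D05uE0101-JP", "uni9686-CN"}
v2_names = {"uni8D05uE0101-JP", "uniF9DC-JP"}

print(classify_rename("uni8D05-CN", "uni8D05uE0101-JP", v1_names, v2_names))
print(classify_rename("uni9686-CN", "uniF9DC-JP", v1_names, v2_names))
```

With real data, the two sets would be the keys of `name2cid_old` and `name2cid_new` from the script earlier in the thread.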

NightFurySL2001 avatar Feb 17 '23 13:02 NightFurySL2001