vscode_rainbow_csv
Alignment weird when encountering Chinese characters
I find that the alignment of commas is not correct when processing csv files containing Chinese characters, as shown in the figure below:
What I expect is something like the figure below (I aligned it manually, though it cannot be aligned perfectly):
The following is the original text in case you need it.
编号 ,配件 ,型号 ,数量 ,渠道 ,¥ ,状态
1 ,主板 ,微星B450m mortar max ,1 ,二手 ,284.5 ,已买
2 ,CPU ,AMD 3600(带原装散热) ,1 ,二手 ,674.5 ,已买
3 ,显卡 ,AMD HD7870 ,1 ,二手 ,200 ,-
4 ,内存 ,Crucial 8Gx2 3200 MHz ,1 ,京东 ,399 ,已买
5 ,固态 ,SN550 1T ,1 ,淘宝 ,719 ,已买
6 ,电源 ,酷冷GX450 ,1 ,京东 ,269 ,已买
7 ,风扇 ,先马12cm黑框红叶 ,3 ,淘宝 ,27 ,已买
8 ,机箱 ,酷冷大师Q300L ,1 ,二手 ,135 ,已买
Hi @github-young, thank you for bringing up this interesting problem! I haven't thoroughly analyzed this issue yet, but it looks like the default VSCode font ("Consolas, 'Courier New', monospace") allocates twice as much space for Chinese characters. So it could be possible to adjust the alignment logic based on the character values. But this wouldn't guarantee proper alignment for users who use non-default fonts in VSCode. I am not an expert in fonts, so any suggestions on how we can fix this in a reliable way without making the problem worse would be much appreciated.
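To make the "adjust the alignment logic based on the character values" idea concrete, here is a minimal sketch in TypeScript. It is not the extension's actual code: the names `WIDE_RANGES`, `isWide` and `displayWidth` are hypothetical, the range table is only an approximate subset of the East Asian Wide/Fullwidth ranges, and it assumes the font renders those characters exactly two cells wide, which (as noted above) is not guaranteed for every font.

```typescript
// Approximate subset of East Asian Wide/Fullwidth ranges (assumption: this is
// not the complete Unicode table, and emoji are deliberately left out).
const WIDE_RANGES: Array<[number, number]> = [
  [0x1100, 0x115f], // Hangul Jamo initial consonants
  [0x2e80, 0x303e], // CJK radicals, CJK symbols and punctuation
  [0x3041, 0x33ff], // Hiragana, Katakana, CJK compatibility
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xac00, 0xd7a3], // Hangul syllables
  [0xf900, 0xfaff], // CJK compatibility ideographs
  [0xff01, 0xff60], // Fullwidth forms
  [0xffe0, 0xffe6], // Fullwidth signs
  [0x20000, 0x3fffd], // Supplementary ideographic planes
];

// True if the code point falls in one of the ranges above.
function isWide(codePoint: number): boolean {
  for (const [start, end] of WIDE_RANGES) {
    if (codePoint >= start && codePoint <= end) {
      return true;
    }
  }
  return false;
}

// Width of a string in monospace cells: 2 per wide character, 1 otherwise.
function displayWidth(text: string): number {
  let width = 0;
  for (let i = 0; i < text.length; i++) {
    let codePoint = text.charCodeAt(i);
    // Combine a UTF-16 surrogate pair into a single code point.
    if (codePoint >= 0xd800 && codePoint <= 0xdbff && i + 1 < text.length) {
      const low = text.charCodeAt(i + 1);
      if (low >= 0xdc00 && low <= 0xdfff) {
        codePoint = (codePoint - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
        i++;
      }
    }
    width += isWide(codePoint) ? 2 : 1;
  }
  return width;
}
```

With this sketch, `displayWidth("主板")` is 4 while `"主板".length` is 2, which is the kind of gap that throws the current padding off.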
Adding a relevant link as a starting point for the discussion: https://denisbider.blogspot.com/2015/09/when-monospace-fonts-arent-unicode.html
Just some references:
- CJKV glyphs are all two spaces wide.
- some useful article there. stackoverflow
- new native regex filter. stackoverflow
- CJKV
- Most systems have a built-in monospaced CJKV font, called Proportional Font; see the list below
- 新細明體 PMingLiU
- 細明體
- 中易宋體
- 中易黑體
- さざなみ Sazanami
- MS明朝
If I have time I will try to fix this.
Hi @mechatroner,
I am trying to fix this issue. However, handling Unicode grapheme clusters is too difficult for me, in particular emoji that use a "zero width joiner".
So I used the npm library "graphemesplit".
https://github.com/ja-jp-utf8/vscode_rainbow_csv/tree/support_align_with_cjkv
It aligns Japanese text correctly.
If you like this fix, I will create a pull request.
Hi @ja-jp-utf8, that's really cool! I see you did a lot of research to implement this, thank you!
Here are some of my thoughts on this:
- I usually prefer not to add additional dependencies when it is not absolutely necessary, but I feel kinda OK about "graphemesplit" - the dependency tree is not very deep, it looks like it has only a few transitive dependencies, and all of them seem to be professional projects that require deep Unicode knowledge. So this is not an issue - just a comment on your selection of graphemesplit.
- We need an integration test. "rainbow_csv" already has a couple of them, so we need another one to have at least some minimal assurance that the new alignment algorithm works correctly - I need a test CSV file, or a couple of files, with lots of wide characters and some emojis since, as you say, they represent a significant challenge for aligners.
- This is a question - when you test your algorithm on some random files, do the vertical alignment lines look completely vertical, or is there still some minor wobbling?
- Do I understand correctly that the main reason we need to add the "graphemesplit" library is emoji alignment, and otherwise you can write a standalone alignment module? If this is the case, I would argue that this is kinda not a big deal. Properly aligning "normal" wide logograms is already a huge improvement over the current situation. So if you can write a standalone aligner that would properly align normal "wide" characters but would work incorrectly in some corner cases such as emojis, I would strongly prefer to use this home-made solution. But we would still need a CSV integration test file with these emojis just to show where the new aligner does a non-ideal job and where it works reasonably well (see item 2).
- What about performance? I need some metrics to compare how much longer it now takes to align a 300K-line file - this is about the biggest file that VSCode can support. BTW this will also slow down alignment for files that have only ASCII characters in them, and I would argue that this is a significant proportion of CSV files. If the time difference is significant, we can try to come up with some optimizations - e.g. start with optimistic aligning, but if at some point we encounter a wide character (which should be fairly early in the majority of CSV files that have wide characters), restart the whole alignment process using the different algorithm.
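The "optimistic aligning with a restart" idea from the last item could look roughly like the sketch below. Everything here is an assumption for illustration: `optimisticColumnWidths`, `measureColumns` and `wideAwareWidth` are hypothetical names, rows are assumed to be already split into fields, and the slow-path width function only knows about the basic CJK Unified Ideographs block to keep the example short.

```typescript
type Measure = (field: string) => number;

// Matches any character outside the ASCII range.
const NON_ASCII = /[^\x00-\x7f]/;

// Simplified slow-path measure (assumption: only U+4E00..U+9FFF count as wide).
const wideAwareWidth: Measure = (field) => {
  let width = 0;
  for (let i = 0; i < field.length; i++) {
    const code = field.charCodeAt(i);
    width += code >= 0x4e00 && code <= 0x9fff ? 2 : 1;
  }
  return width;
};

// Maximum width of every column under the given measure.
function measureColumns(rows: string[][], measure: Measure): number[] {
  const widths: number[] = [];
  for (const row of rows) {
    for (let col = 0; col < row.length; col++) {
      widths[col] = Math.max(widths[col] || 0, measure(row[col]));
    }
  }
  return widths;
}

// Start with the cheap ASCII assumption (width == length); on the first
// non-ASCII field, discard the partial result and redo the whole pass
// with the wide-aware measure.
function optimisticColumnWidths(
  rows: string[][]
): { widths: number[]; usedSlowPath: boolean } {
  const widths: number[] = [];
  for (const row of rows) {
    for (let col = 0; col < row.length; col++) {
      const field = row[col];
      if (NON_ASCII.test(field)) {
        return { widths: measureColumns(rows, wideAwareWidth), usedSlowPath: true };
      }
      widths[col] = Math.max(widths[col] || 0, field.length);
    }
  }
  return { widths, usedSlowPath: false };
}
```

For ASCII-only files the only extra cost in this sketch is one regex test per field; whether that is negligible on a 300K-line file would still have to be measured.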
So, as you see, I just can't merge the new code without a new integration test and some performance measurements from you. And if you are saying that we can do without the library in cases when we don't have emojis - I say let's just ignore this case for now and roll out our own custom aligner that will work in the other 95% of cases with "normal" wide chars.
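As a rough illustration of what such a custom aligner might do once a width function is available, here is a small TypeScript sketch. It is not the code from the linked branch: `alignRows` is a hypothetical name, the width function is passed in as a parameter rather than fixed, rows are assumed to be already split into fields, and padding is placed before the comma as in the sample data above.

```typescript
// Pads every field except the last one in each row with trailing spaces so
// that the commas line up, using `measure` to decide how many cells a field
// occupies on screen.
function alignRows(
  rows: string[][],
  measure: (field: string) => number
): string[] {
  // First pass: widest field per column.
  const widths: number[] = [];
  for (const row of rows) {
    for (let col = 0; col < row.length; col++) {
      widths[col] = Math.max(widths[col] || 0, measure(row[col]));
    }
  }

  // Second pass: pad by the difference in display width, not in string length.
  const aligned: string[] = [];
  for (const row of rows) {
    const cells: string[] = [];
    for (let col = 0; col < row.length; col++) {
      const isLast = col === row.length - 1;
      const pad = isLast ? 0 : widths[col] - measure(row[col]);
      // Array(n + 1).join(" ") builds n spaces without String.prototype.repeat.
      cells.push(row[col] + Array(pad + 1).join(" "));
    }
    aligned.push(cells.join(","));
  }
  return aligned;
}
```

Passing `(field) => field.length` reproduces plain ASCII alignment, while passing a measure that counts wide characters as two cells gives the CJK-aware variant; emoji sequences would still be mismeasured by either.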