defcon
What is tools/ for?
What is https://github.com/typesupply/defcon/tree/master/tools ?
It doesn't appear to be used in the build at all.
Is it used to generate https://github.com/typesupply/defcon/blob/master/Lib/defcon/tools/unicodeTools.py ? If so, shouldn't that be part of the build instead of including generated files in source?
I'm asking because I'm working with @medicalwei on packaging this for Debian and the Unicode data is technically under a different license.
The file in unicodeTools.py seems to be from here: ftp://www.unicode.org/Public/9.0.0/ucd/Scripts.txt
So this should be partially attributed to Unicode Consortium. According to this file it seems to be DFSG free as well: http://www.unicode.org/copyright.html#License
It is used to regenerate unicodeTools.py when a new Unicode version comes out; it does not need to be packaged.
If so, shouldn't that be part of the build instead of including generated files in source?
I guess you could do that indeed.
I did a rewrite of the loading part of the file: unicodeTools.py https://paste.debian.net/992778/
Please check if that works as intended.
Note that this is for using the files externally. Feel free to backport it if you want.
/usr/share/unicode/Scripts.txt, etc. are not going to work for everyone.
The problem is that in Debian we need to strip duplicated files and prefer the ones provided in the repository. This does not need to be in upstream (which is why I didn't file a pull request).
However, if it is possible, could you separate the embedded texts from Unicode into some text files? In this way we can replace the files and symlink them to be provided by another package.
How do you guarantee that the file provided by another package is the expected version of Unicode?
Typically we use package dependency to guarantee that.
However, if upstream code expects the specific Unicode version we have to do extra work to upload another version of unicode-data.
The "UnicodeData.txt" file in tools/ folder is used with the script tools/openClosedUniGenerator.py to generate not the whole unicodeTools.py module but only part of it, namely a multi-line string _openClosePairText. However, a comment also says that string has been "tweaked by hand to handle special exceptions".
This is the diff between the _openClosePairText as generated from the tools/openClosedUniGenerator.py script and the data in tools/UnicodeData.txt, and the text which is currently in the Lib/defcon/tools/unicodeTools.py:
https://gist.github.com/anthrotype/3413bb4d92b12494b68b8b14fdc6c531
I don't know why it had to be tweaked, maybe @typesupply knows.
Let me know if I'm understanding this issue correctly.
There's a UnicodeData.txt file in tools; is the problem the fact that the file is there unused, or is it that it doesn't come with an appropriate license file? What do you mean by "DFSG free"? I'm not familiar with these things so any help is welcome.
The unicodeTools.py module embeds the content of the "ftp://www.unicode.org/Public/9.0.0/ucd/Scripts.txt" file from the Unicode Consortium. You would prefer it to be a separate data file, because the Debian repository already provides one as a separate package and you would like to avoid duplicating it, correct?
btw, this reminded me that there's a pending PR which updates it to Unicode 10 which I forgot to review https://github.com/typesupply/defcon/pull/124
I did a rewrite of the loading part of the file: unicodeTools.py
@medicalwei maybe you could send a pull request?
I don't know why it had to be tweaked, maybe @typesupply knows.
Because some open characters have closed partner characters that aren't defined in UnicodeData.txt. For example, 201D;RIGHT DOUBLE QUOTATION MARK;Pf is the closed partner to:
201C;LEFT DOUBLE QUOTATION MARK;Pi
201F;DOUBLE HIGH-REVERSED-9 QUOTATION MARK;Pi
201E;DOUBLE LOW-9 QUOTATION MARK;Ps
In UnicodeData.txt, 201D;RIGHT DOUBLE QUOTATION MARK;Pf only appears as a partner to 201C;LEFT DOUBLE QUOTATION MARK;Pi so I had to manually define the other relationships.
I'm open to moving the exceptions to the generator to make this more clear.
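To illustrate what moving the exceptions next to the generated data could look like, here is a minimal sketch. All names (GENERATED_PAIRS, MANUAL_EXCEPTIONS and the two helper functions) are hypothetical and are not defcon's actual API; the generated mapping is reduced to the single pair mentioned above.

```python
# Hypothetical sketch: these names are illustrative, not defcon's actual API.

# Pairs as a generator might derive them from UnicodeData.txt (open -> closed).
GENERATED_PAIRS = {
    0x201C: 0x201D,  # LEFT DOUBLE QUOTATION MARK -> RIGHT DOUBLE QUOTATION MARK
}

# Hand-maintained exceptions: additional open characters that share
# RIGHT DOUBLE QUOTATION MARK as their closed partner.
MANUAL_EXCEPTIONS = {
    0x201F: 0x201D,  # DOUBLE HIGH-REVERSED-9 QUOTATION MARK
    0x201E: 0x201D,  # DOUBLE LOW-9 QUOTATION MARK
}


def build_open_to_closed(generated, exceptions):
    """Merge generated pairs with manual exceptions; exceptions win on conflict."""
    merged = dict(generated)
    merged.update(exceptions)
    return merged


def build_closed_to_open(open_to_closed):
    """Invert the mapping; one closed character may have several open partners."""
    closed_to_open = {}
    for open_cp, closed_cp in sorted(open_to_closed.items()):
        closed_to_open.setdefault(closed_cp, []).append(open_cp)
    return closed_to_open


OPEN_TO_CLOSED = build_open_to_closed(GENERATED_PAIRS, MANUAL_EXCEPTIONS)
CLOSED_TO_OPEN = build_closed_to_open(OPEN_TO_CLOSED)
```

Keeping the exceptions in their own table would make it obvious which relationships come from UnicodeData.txt and which were added by hand.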
The original issue here is that the Unicode data has its own license which wasn't clearly marked here.
DFSG is the Debian Free Software Guidelines. @medicalwei's comment was that code or content licensed under the Unicode license is acceptable for inclusion in Debian.
Debian has a policy that the same piece of code not be duplicated in Debian if possible. Now, I believe the Unicode data isn't "code" but I thought it was worth asking whether the duplication was necessary here.
Debian updated its version of unicode-data to 10.0.0 very quickly after it was released in June.
The original issue here is that the Unicode data has its own license which wasn't clearly marked here.
Would it be enough to include the text of http://www.unicode.org/copyright.html#License in a file called "LICENSE" next to the Unicode data files?
whether the duplication was necessary
I don't know. That data file is only used once a year, and I wouldn't like to complicate the setup too much.
Would it be ok if we placed the Scripts.txt and Blocks.txt outside the unicodeTools.py module, and put them as separate text files like the UnicodeData.txt, and then from unicodeTools.py we would have a global variable e.g. UNICODE_DATA_PATH with the path to the embedded package data files; and then you could apply a patch that replaces it with the path to Debian's /usr/share/unicode?
Would it be ok if we placed the Scripts.txt and Blocks.txt outside the unicodeTools.py module, and put them as separate text files like the UnicodeData.txt, and then from unicodeTools.py we would have a global variable e.g. UNICODE_DATA_PATH with the path to the embedded package data files; and then you could apply a patch that replaces it with the path to Debian's /usr/share/unicode?
This is fine with me. As long as the module continues to work as is, I have no opinion on where the source data is located.
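For reference, a rough sketch of how the proposal quoted above might look. UNICODE_DATA_PATH is the name suggested in the proposal; its default value here is only a placeholder, and parse_script_ranges and load_script_ranges are hypothetical helpers, not existing defcon functions. The parser assumes the usual Scripts.txt layout of `RANGE ; Script # comment`.

```python
import os

# Placeholder default (assumption): wherever the bundled data files end up.
# A distribution could patch this to e.g. "/usr/share/unicode".
UNICODE_DATA_PATH = os.path.join("path", "to", "bundled", "data")


def parse_script_ranges(lines):
    """Parse Scripts.txt-style lines into (start, end, script) tuples."""
    ranges = []
    for line in lines:
        # Drop trailing comments and skip blank or comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code_range, script = [part.strip() for part in line.split(";")]
        if ".." in code_range:
            start, end = code_range.split("..")
        else:
            start = end = code_range
        ranges.append((int(start, 16), int(end, 16), script))
    return ranges


def load_script_ranges(data_dir=None):
    """Read Scripts.txt from data_dir, defaulting to UNICODE_DATA_PATH."""
    data_dir = UNICODE_DATA_PATH if data_dir is None else data_dir
    with open(os.path.join(data_dir, "Scripts.txt"), encoding="utf-8") as f:
        return parse_script_ranges(f)
```

Because the file is opened by path, the same code would also work if a packager replaced the bundled file with a symbolic link instead of patching the variable.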
Would it be ok if we placed the Scripts.txt and Blocks.txt outside the unicodeTools.py module, and put them as separate text files like the UnicodeData.txt, and then from unicodeTools.py we would have a global variable e.g. UNICODE_DATA_PATH with the path to the embedded package data files; and then you could apply a patch that replaces it with the path to Debian's /usr/share/unicode?
I think upstream can simply move them to dedicated text files, and packagers can replace the files with symbolic links. No need for a global variable. (@jbicha correct me if the policy doesn't allow this.)
But as you stated there are differences for the open-close data from the generated script. I propose doing this with a patch (diff -Naur) from the generated file. We can trigger the generation at build time, and apply the patch right after the generation.
With the new fontTools.unicodedata module in fonttools 3.20.0, I think defcon should simply use that instead of doing its own parsing of UCD data files. Everything needed should be in there, except perhaps for those open/close exceptions Tal mentioned, which can be hard-coded somewhere in unicodeTools.
As long as we have backwards compatibility with the functions in defcon I'd be very happy to ditch the UCD data parsing.
About the issue of the built-in unicodedata.category not being in sync with Unicode 10 noted by @andyclymer in #124, the right thing to do instead of parsing data files is to add unicodedata2 (https://github.com/mikekap/unicodedata2) as an install requirement to defcon.
There are pre-compiled binaries installable via pip for all python versions and platforms.
https://pypi.python.org/pypi/unicodedata2/10.0.0.post2
When unicodedata2 is importable, fontTools.unicodedata will use that for category and all the other public functions.
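The fallback behaviour described above can be sketched roughly like this. It is a simplified illustration of the pattern, not fontTools' actual source, and general_category is a hypothetical wrapper name.

```python
try:
    # Optional backport that tracks newer Unicode releases, if installed.
    import unicodedata2 as unicodedata
except ImportError:
    # Standard library fallback; its Unicode version depends on the Python build.
    import unicodedata


def general_category(char):
    """Return the Unicode general category of a single character."""
    return unicodedata.category(char)


# Which Unicode version is actually in use depends on which import succeeded.
UNICODE_VERSION = unicodedata.unidata_version
```

For example, `general_category(u"\u201C")` gives `"Pi"` and `general_category(u"\u201D")` gives `"Pf"` with either module.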
Given the changes to where Unicode data is being pulled from, this needs to be looked at again so that the exceptions are retained, but perhaps /tools can be removed completely? I'm not 100% sure what openClosedUniGenerator.py is still used or needed for. It seems we could hardcode the exceptions in unicodeTools.py and remove the outdated UnicodeData.txt and the generator. @anthrotype?