license-expression
Compress data files to save space
This PR was motivated by a discussion about PEP 639 which might recommend using this package in build tools. In that context, package size is a big concern.
The package is about 1.2 MB installed, and the majority of that is due to scancode-licensedb-index.json. I just gzipped the data file and modified the code accordingly to save space: the JSON compresses to <10% of its original size, and the tests all pass.
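For readers who want to see the general shape of this kind of change, here is a minimal sketch of writing and reading a gzipped JSON data file with the standard library. The helper names and the sample record are made up for illustration and are not taken from the PR's actual diff.

```python
import gzip
import json
import tempfile
from pathlib import Path


def dump_gzipped_json(data, path):
    """Write data as compact JSON, compressed with gzip (hypothetical helper)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(data, f, separators=(",", ":"))


def load_gzipped_json(path):
    """Read a gzipped JSON file back into Python objects (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Illustrative record only; the real index has many more entries and fields.
sample = [{"license_key": "mit", "spdx_license_key": "MIT", "is_exception": False}]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "index.json.gz"
    dump_gzipped_json(sample, target)
    roundtrip = load_gzipped_json(target)
```

Since gzip is in the standard library, this approach adds no install-time dependency; the cost is a small decompression step when the data is first loaded.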
That all sounds reasonable, but I don't have time at the moment (this version was super easy 😅). I can try to make those changes next week, or someone else can take over.
Actually, in the context of https://discuss.python.org/t/pep-639-round-3-improving-license-clarity-with-better-package-metadata/53020/1 I think we can do better.
We can build a minimal license-expression-mini wheel that would contain a subset of the license data ... say, just the essential license keys stored as a list of tuples, without the repeated field names.
$ wget https://raw.githubusercontent.com/nexB/license-expression/c20b3f605daefc7cd9e4dc7b34e95280f206def3/src/license_expression/data/scancode-licensedb-index.json
$ ll
total 868
drwxrwxr-x 2 foobar foobar 4096 May 10 17:56 ./
drwxrwxrwx 84 foobar foobar 12288 May 10 17:56 ../
-rw-rw-r-- 1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json
$ python
Python 3.10.13 (main, Jan 6 2024, 18:44:10) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> j=json.load(open("scancode-licensedb-index.json"))
>>> mini=[]
>>> for l in j:
...     l.pop("json")
...     l.pop("yaml")
...     l.pop("html")
...     l.pop("license")
...     mini.append(list(l.values()))
...
>>> with open("mini.json", "w") as o:
...     o.write(json.dumps(mini, separators=(',', ':')))
...
>>>
$ ll
total 1056
drwxrwxr-x 2 foobar foobar 4096 May 10 18:00 ./
drwxrwxrwx 84 foobar foobar 12288 May 10 17:56 ../
-rw-rw-r-- 1 foobar foobar 191708 May 10 18:00 mini.json
-rw-rw-r-- 1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json
$ xz -z -k -9 mini.json
$ ll
total 1080
drwxrwxr-x 2 foobar foobar 4096 May 10 18:01 ./
drwxrwxrwx 84 foobar foobar 12288 May 10 17:56 ../
-rw-rw-r-- 1 foobar foobar 191708 May 10 18:00 mini.json
-rw-rw-r-- 1 foobar foobar 23704 May 10 18:00 mini.json.xz
-rw-rw-r-- 1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json
It would be down to 23K of compressed data :) I still would want to use flot to generate multiple wheels from the same repo and keep the current wheel as-is.
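As a rough sketch of how such a key-less, xz-compressed list could be read back at runtime, the snippet below packs records into rows and restores them to dicts with the standard-library lzma module. The FIELDS tuple is an assumption for illustration: the real field order would have to match whatever order the values were written in, and the helper names are hypothetical.

```python
import json
import lzma

# Assumed field order for illustration only; it must match the order
# in which the values were written when the mini file was generated.
FIELDS = ("license_key", "spdx_license_key", "is_exception")


def pack(records):
    """Drop the dict keys, serialize compactly, and compress with xz."""
    rows = [[record[field] for field in FIELDS] for record in records]
    payload = json.dumps(rows, separators=(",", ":")).encode("utf-8")
    return lzma.compress(payload, preset=9)


def unpack(blob):
    """Decompress the rows and re-attach the field names."""
    rows = json.loads(lzma.decompress(blob))
    return [dict(zip(FIELDS, row)) for row in rows]


# Illustrative records, not real index content.
sample = [
    {"license_key": "mit", "spdx_license_key": "MIT", "is_exception": False},
    {"license_key": "apache-2.0", "spdx_license_key": "Apache-2.0", "is_exception": False},
]
restored = unpack(pack(sample))
```

The trade-off is that the field order becomes an implicit contract between the build step and the loader, so both would need to share one definition of it.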