
Compress data files to save space

Open jamestwebber opened this issue 9 months ago • 2 comments

This PR was motivated by a discussion about PEP 639 which might recommend using this package in build tools. In that context, package size is a big concern.

The package is about 1.2 MB installed, and the majority of that is due to scancode-licensedb-index.json. I just gzipped the data file and modified the code appropriately to save space. The JSON compresses to <10% of its original size, and all the tests pass.
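
For reference, reading a gzipped JSON data file takes only a few lines of standard-library Python. The following is a minimal sketch of the general technique, not the actual change made in this PR; `load_index`, `save_index`, the file name `index.json.gz`, and the sample record are all hypothetical.

```python
import gzip
import json
import os
import tempfile


def load_index(path):
    """Load a gzip-compressed JSON file and return the decoded object."""
    # "rt" opens the compressed stream in text mode so json.load can read it.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def save_index(data, path):
    """Write `data` as compact JSON, gzip-compressed."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(data, f, separators=(",", ":"))


# Round-trip a tiny made-up sample (not real license data).
sample = [{"license_key": "mit", "spdx_license_key": "MIT"}]
tmp_path = os.path.join(tempfile.mkdtemp(), "index.json.gz")
save_index(sample, tmp_path)
restored = load_index(tmp_path)
print(restored == sample)
```

Decompression happens once at load time, so the trade-off is a small startup cost in exchange for a much smaller installed footprint.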

jamestwebber, May 10 '24 15:05

That all sounds reasonable, but I don't have time at the moment (this version was super easy 😅). I can try to make those changes next week, or someone else can take over.

jamestwebber, May 10 '24 15:05

Actually, in the context of https://discuss.python.org/t/pep-639-round-3-improving-license-clarity-with-better-package-metadata/53020/1, I think we can do better.

We can build a minimal license-expression-mini wheel that would contain a subset of the license data ... say, just the essential license fields, stored as a list of tuples without field names.

$ wget https://raw.githubusercontent.com/nexB/license-expression/c20b3f605daefc7cd9e4dc7b34e95280f206def3/src/license_expression/data/scancode-licensedb-index.json
$ ll
total 868
drwxrwxr-x  2 foobar foobar   4096 May 10 17:56 ./
drwxrwxrwx 84 foobar foobar  12288 May 10 17:56 ../
-rw-rw-r--  1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json
$ python
Python 3.10.13 (main, Jan  6 2024, 18:44:10) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> with open("scancode-licensedb-index.json") as f:
...     j = json.load(f)
...
>>> mini = []
>>> for l in j:
...     # Drop the four fields not needed in the mini index; keep the remaining values.
...     for key in ("json", "yaml", "html", "license"):
...         del l[key]
...     mini.append(list(l.values()))
...
>>> with open("mini.json", "w") as o:
...     json.dump(mini, o, separators=(",", ":"))
...
>>>
$ ll
total 1056
drwxrwxr-x  2 foobar foobar   4096 May 10 18:00 ./
drwxrwxrwx 84 foobar foobar  12288 May 10 17:56 ../
-rw-rw-r--  1 foobar foobar 191708 May 10 18:00 mini.json
-rw-rw-r--  1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json
$ xz -z -k -9 mini.json 
$ ll
total 1080
drwxrwxr-x  2 foobar foobar   4096 May 10 18:01 ./
drwxrwxrwx 84 foobar foobar  12288 May 10 17:56 ../
-rw-rw-r--  1 foobar foobar 191708 May 10 18:00 mini.json
-rw-rw-r--  1 foobar foobar  23704 May 10 18:00 mini.json.xz
-rw-rw-r--  1 foobar foobar 866178 May 10 17:56 scancode-licensedb-index.json

It would be down to 23K of compressed data :) I still would want to use flot to generate multiple wheels from the same repo and keep the current wheel as-is.
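
A consumer of such a mini wheel would then need to decompress the file and map each position back to a field name. Here is a rough sketch of that step, assuming a fixed column order. `FIELDS`, `load_mini`, and the sample rows are hypothetical; the real column order would follow the key order in the source index.

```python
import json
import lzma
import os
import tempfile

# Assumed column order, for illustration only.
FIELDS = ("license_key", "spdx_license_key", "is_exception")


def load_mini(path, fields=FIELDS):
    """Read an xz-compressed JSON list of rows and rebuild keyed records."""
    with lzma.open(path, "rt", encoding="utf-8") as f:
        rows = json.load(f)
    # Pair each positional value with its assumed field name.
    return [dict(zip(fields, row)) for row in rows]


# Build a tiny made-up mini.json.xz to demonstrate the round trip.
rows = [["mit", "MIT", False], ["gpl-2.0", "GPL-2.0-only", False]]
tmp_path = os.path.join(tempfile.mkdtemp(), "mini.json.xz")
with lzma.open(tmp_path, "wt", encoding="utf-8") as f:
    json.dump(rows, f, separators=(",", ":"))

licenses = load_mini(tmp_path)
print(licenses[0]["spdx_license_key"])
```

Keeping the field names in code rather than in the data is what makes the rows so compact, at the cost of having to keep `FIELDS` in sync with whatever generates the file.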

pombredanne, May 10 '24 16:05