Optimize the encoding efficiency of `encode()`
For the built-in Leaf objects in superduperdb, reduce the amount of encoded information by using special references.
For example, this is the current behavior:
from superduperdb.components.datatype import pickle_serializer
from superduperdb import Document
Document({'id': 123, 'x': pickle_serializer('This is a test')}).encode()
We get
{'id': 123,
 'x': '?866cf8526595d3620d6045172fb16d1efefac4b1',
 '_builds': {'pickle': {'_path': 'superduperdb/components/datatype/get_serializer',
   'method': 'pickle',
   'encodable': 'artifact',
   'type_id': 'datatype',
   'version': None,
   'uuid': '6b928f3c-ccfa-43eb-96ee-ae38bd8430e3'},
  '866cf8526595d3620d6045172fb16d1efefac4b1': {'_path': 'superduperdb/components/datatype/Artifact',
   'uuid': 'b28469b8-cb63-4df1-972c-b17d11eb5abd',
   'datatype': '?pickle',
   'uri': None,
   'blob': '&:blob:866cf8526595d3620d6045172fb16d1efefac4b1'}},
 '_files': {},
 '_blobs': {'866cf8526595d3620d6045172fb16d1efefac4b1': b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94.'}}
The goal is to reduce this to:
{'id': 123,
 'x': '?866cf8526595d3620d6045172fb16d1efefac4b1',
 '_builds': {'866cf8526595d3620d6045172fb16d1efefac4b1': {'_path': 'superduperdb/components/datatype/Artifact',
   'uuid': 'b28469b8-cb63-4df1-972c-b17d11eb5abd',
   'datatype': '&:superduperdb:datatype:pickle',
   'uri': None,
   'blob': '&:blob:866cf8526595d3620d6045172fb16d1efefac4b1'}},
 '_files': {},
 '_blobs': {'866cf8526595d3620d6045172fb16d1efefac4b1': b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94.'}}
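The reduction above can be sketched as a post-processing step over the encoded dictionary. This is only an illustration, not superduperdb's implementation: the function name `compact_builtin_datatypes` is hypothetical, and it assumes that a build counts as "built-in" exactly when its `_path` is `superduperdb/components/datatype/get_serializer`, as in the example output.

```python
from typing import Any, Dict

# Assumption: builds created through this path are built-in serializers.
BUILTIN_SERIALIZER_PATH = 'superduperdb/components/datatype/get_serializer'


def compact_builtin_datatypes(encoded: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `encoded` in which built-in datatype builds are
    dropped and references to them become '&:superduperdb:datatype:<method>'."""
    builds = encoded.get('_builds', {})

    # Map each '?<key>' reference to its special reference string.
    replacements = {
        f'?{key}': f"&:superduperdb:datatype:{build['method']}"
        for key, build in builds.items()
        if build.get('_path') == BUILTIN_SERIALIZER_PATH
    }

    def rewrite(value: Any) -> Any:
        # Recursively swap '?<key>' strings for their special references.
        if isinstance(value, dict):
            return {k: rewrite(v) for k, v in value.items()}
        if isinstance(value, list):
            return [rewrite(v) for v in value]
        if isinstance(value, str) and value in replacements:
            return replacements[value]
        return value

    result: Dict[str, Any] = {}
    for key, value in encoded.items():
        if key == '_builds':
            # Keep only the builds that are not built-in serializers.
            result[key] = {
                name: rewrite(build)
                for name, build in builds.items()
                if f'?{name}' not in replacements
            }
        else:
            result[key] = rewrite(value)
    return result
```

Applied to the first output shown above, this removes the `pickle` entry from `_builds` and rewrites `'datatype': '?pickle'` to `'datatype': '&:superduperdb:datatype:pickle'`, leaving `_files` and `_blobs` untouched.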
Furthermore, we can even remove the `_builds` entry for 866cf8526595d3620d6045172fb16d1efefac4b1, because everything it refers to is built-in. With a better protocol, the encoding could eventually become something like:
{'id': 123,
 'x': '&:protocol:{Artifact(datatype=&datatype/pickle, blob=&:blob:866cf8526595d3620d6045172fb16d1efefac4b1)}',
 '_blobs': {'866cf8526595d3620d6045172fb16d1efefac4b1': b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94.'}}
Ultimately, this protocol should have the following characteristics:
- Improve information compression rate by utilizing the following mechanisms:
  - db.metadata, such as &:component:
  - db.artifact, such as &:blob: / &:file:
  - superduperdb’s codebase, such as &:new_type:
  - ...
- The encoded information should be readable and meaningful.
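To make the reference mechanisms listed above concrete, here is a minimal sketch of how `&:<kind>:<identifier>` strings might be parsed and dispatched. The names `parse_reference` and `resolve`, and the idea of a resolver registry keyed by kind, are assumptions for illustration and not part of superduperdb; the exact grammar of the protocol is still open in this proposal.

```python
import pickle
from typing import Any, Callable, Dict, Tuple

REFERENCE_PREFIX = '&:'


def parse_reference(ref: str) -> Tuple[str, str]:
    """Split '&:<kind>:<identifier>' into (kind, identifier).

    Only the first ':' after the prefix separates kind from identifier,
    so '&:superduperdb:datatype:pickle' yields
    ('superduperdb', 'datatype:pickle').
    """
    if not ref.startswith(REFERENCE_PREFIX):
        raise ValueError(f'not a special reference: {ref!r}')
    kind, sep, identifier = ref[len(REFERENCE_PREFIX):].partition(':')
    if not sep or not kind or not identifier:
        raise ValueError(f'malformed reference: {ref!r}')
    return kind, identifier


def resolve(ref: str, resolvers: Dict[str, Callable[[str], Any]]) -> Any:
    """Look up the resolver registered for the reference kind and call it."""
    kind, identifier = parse_reference(ref)
    if kind not in resolvers:
        raise ValueError(f'no resolver registered for kind {kind!r}')
    return resolvers[kind](identifier)


# Hypothetical usage: blobs come from the '_blobs' section of an encoded
# document; a real system would read them from db.artifact instead.
blobs = {'866cf8526595d3620d6045172fb16d1efefac4b1': pickle.dumps('This is a test')}
resolvers = {'blob': blobs.__getitem__}
raw = resolve('&:blob:866cf8526595d3620d6045172fb16d1efefac4b1', resolvers)
value = pickle.loads(raw)
```

Keeping the kind as a short, human-readable token (`blob`, `file`, `component`) is one way to satisfy the second requirement: the encoded string stays meaningful to a reader even before it is resolved.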