
Optimize the encoding efficiency of `encode()`

Open jieguangzhou opened this issue 1 year ago • 0 comments

For the built-in Leaf objects in superduperdb, reduce the amount of encoded information by using special references.

For example, currently:

from superduperdb.components.datatype import pickle_serializer
from superduperdb import Document
Document({'id': 123, 'x': pickle_serializer('This is a test')}).encode()

We get

{'id': 123,
 'x': '?866cf8526595d3620d6045172fb16d1efefac4b1',
 '_builds': {'pickle': {'_path': 'superduperdb/components/datatype/get_serializer',
   'method': 'pickle',
   'encodable': 'artifact',
   'type_id': 'datatype',
   'version': None,
   'uuid': '6b928f3c-ccfa-43eb-96ee-ae38bd8430e3'},
  '866cf8526595d3620d6045172fb16d1efefac4b1': {'_path': 'superduperdb/components/datatype/Artifact',
   'uuid': 'b28469b8-cb63-4df1-972c-b17d11eb5abd',
   'datatype': '?pickle',
   'uri': None,
   'blob': '&:blob:866cf8526595d3620d6045172fb16d1efefac4b1'}},
 '_files': {},
 '_blobs': {'866cf8526595d3620d6045172fb16d1efefac4b1': b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94.'}}

We would like to reduce this to:


{'id': 123,
 'x': '?866cf8526595d3620d6045172fb16d1efefac4b1',
 '_builds': {'866cf8526595d3620d6045172fb16d1efefac4b1': {'_path': 'superduperdb/components/datatype/Artifact',
   'uuid': 'b28469b8-cb63-4df1-972c-b17d11eb5abd',
   'datatype': '&:superduperdb:datatype:pickle',
   'uri': None,
   'blob': '&:blob:866cf8526595d3620d6045172fb16d1efefac4b1'}},
 '_files': {},
 '_blobs': {'866cf8526595d3620d6045172fb16d1efefac4b1': b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94.'}}
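
The reduction above could be sketched as a post-processing step over the encoded dictionary. This is only an illustration under assumptions: `BUILTIN_DATATYPES` and `compact_builds` are hypothetical names, not part of superduperdb, and a build is treated as built-in when its `type_id` is `datatype` and its key is in a known set.

```python
import copy

# Hypothetical: identifiers of datatypes assumed to ship with superduperdb.
BUILTIN_DATATYPES = {"pickle"}


def compact_builds(encoded):
    """Return a copy of `encoded` in which built-in datatype builds are
    dropped and '?<name>' references to them become
    '&:superduperdb:datatype:<name>'."""
    out = copy.deepcopy(encoded)
    builds = out.get("_builds", {})
    builtin_keys = {
        key
        for key, build in builds.items()
        if build.get("type_id") == "datatype" and key in BUILTIN_DATATYPES
    }
    # Remove the full definitions of built-in datatypes.
    for key in builtin_keys:
        del builds[key]
    # Rewrite local '?<name>' references into built-in references.
    for build in builds.values():
        for field, value in list(build.items()):
            if isinstance(value, str) and value.startswith("?") and value[1:] in builtin_keys:
                build[field] = "&:superduperdb:datatype:" + value[1:]
    return out


# The encoded document from the example above.
blob_id = "866cf8526595d3620d6045172fb16d1efefac4b1"
encoded = {
    "id": 123,
    "x": "?" + blob_id,
    "_builds": {
        "pickle": {
            "_path": "superduperdb/components/datatype/get_serializer",
            "method": "pickle",
            "encodable": "artifact",
            "type_id": "datatype",
            "version": None,
            "uuid": "6b928f3c-ccfa-43eb-96ee-ae38bd8430e3",
        },
        blob_id: {
            "_path": "superduperdb/components/datatype/Artifact",
            "uuid": "b28469b8-cb63-4df1-972c-b17d11eb5abd",
            "datatype": "?pickle",
            "uri": None,
            "blob": "&:blob:" + blob_id,
        },
    },
    "_files": {},
    "_blobs": {blob_id: b"\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94."},
}
compacted = compact_builds(encoded)
```

Working on a deep copy keeps the original encoding intact, so the two forms can be compared while the protocol is still being designed.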

Furthermore, we could even remove `_builds:866cf8526595d3620d6045172fb16d1efefac4b1`, because everything it refers to is built in. With a better protocol, the encoded output could eventually look something like this:

{'id': 123,
 'x': '&:protocol:{Artifact(datatype=&datatype/pickle, blob=&:blob:866cf8526595d3620d6045172fb16d1efefac4b1)}',
 '_blobs': {'866cf8526595d3620d6045172fb16d1efefac4b1': b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94.'}}

Ultimately, this protocol should have the following characteristics:

  1. Improve the information compression rate by referencing what is already available in:

    1. db.metadata, such as &:component:
    2. db.artifact, such as &:blob: / &:file:
    3. superduperdb’s codebase, such as &:new_type:
    4. ...
  2. The encoded information should be readable and meaningful.
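
To make the reference prefixes listed above concrete, here is a minimal sketch of how a decoder might recognise them. It assumes a reference always has the shape `&:<kind>:<identifier>`; `KNOWN_KINDS`, `parse_reference` and `resolve_blob` are hypothetical helpers for illustration, not existing superduperdb functions.

```python
# Hypothetical: the reference kinds named in the list above.
KNOWN_KINDS = {"component", "blob", "file"}


def parse_reference(value):
    """Split '&:<kind>:<identifier>' into (kind, identifier).

    Returns None when `value` is not a string reference of a known kind.
    """
    if not isinstance(value, str) or not value.startswith("&:"):
        return None
    kind, sep, identifier = value[2:].partition(":")
    if not sep or kind not in KNOWN_KINDS:
        return None
    return kind, identifier


def resolve_blob(value, blobs):
    """Look up the bytes behind a '&:blob:<id>' reference in a `_blobs` mapping."""
    parsed = parse_reference(value)
    if parsed is None or parsed[0] != "blob":
        raise ValueError(f"not a blob reference: {value!r}")
    return blobs[parsed[1]]


blobs = {"866cf8526595d3620d6045172fb16d1efefac4b1": b"example-bytes"}
kind_and_id = parse_reference("&:blob:866cf8526595d3620d6045172fb16d1efefac4b1")
payload = resolve_blob("&:blob:866cf8526595d3620d6045172fb16d1efefac4b1", blobs)
```

Keeping the kind as a plain word between the colons is one way to satisfy the readability goal: the reference still tells a reader where the value lives without decoding anything.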

jieguangzhou, Jun 14 '24 07:06