
pickle-like behavior?

Open allefeld opened this issue 4 years ago • 7 comments

I like the approach taken by ASDF, representing as much data as possible in human-readable YAML, plus binary blobs for the rest.

What stops me from adopting it right away is the complicated steps necessary to write custom objects: Write an extension which involves a schema, an additional class, new tags, new methods etc. etc.

I understand that ASDF is targeted towards standardized data formats, hence the schemas, versions, organizations etc. However, the advantages of ASDF would also be useful in more ad-hoc contexts, where I have a data structure consisting of basic Python types as well as objects of custom classes, and I want to write that data structure to a file and read it back. Basically pickle, just without that strange file format.

In such an approach, all that would be necessary would be for the classes to implement the to_tree and from_tree methods. Or maybe not even that: Use the interface which is used by pickle (I think __reduce__?) which represents an object state as a dict, i.e. a tree. I'm guessing that would involve creating tags automatically, something like !object/somemodule.SomeClass.
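(For reference, pickle's default reduction of an ordinary instance really is just its class plus its __dict__ — a minimal sketch of what __reduce_ex__ returns, using a throwaway Point class:)

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1.5, 2.5)

# For a plain class, the default reduce tuple has the shape
# (reconstructor, args, state, ...); the state at index 2 is just
# the instance __dict__ -- i.e. a tree of basic types.
state = p.__reduce_ex__(2)[2]
print(state)  # -> {'x': 1.5, 'y': 2.5}
```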

If you feel that this is beyond the scope of the ASDF project, a question: Would it be possible to achieve such a behavior by writing an extension, which defines a generic class ObjectType(asdf.CustomType) etc.?

allefeld avatar Jan 28 '21 14:01 allefeld

Have you considered using pyyaml directly for this? It already includes support for serializing any pickleable object. For example:

# some_module.py
class SomeClass:
    def __init__(self, var1, var2):
        self.var1 = var1
        self.var2 = var2

import yaml

from some_module import SomeClass

instance = SomeClass("foo", 42)

with open("test.yaml", "w") as f:
    f.write(yaml.dump(instance))

with open("test.yaml") as f:
    # Specifying Loader is necessary to avoid warning about the security
    # implications of reading a file that contains !!python objects:
    read_instance = yaml.load(f.read(), Loader=yaml.Loader)
    assert read_instance.var1 == "foo"
    assert read_instance.var2 == 42

The YAML looks like this:

!!python/object:some_module.SomeClass
var1: foo
var2: 42

eslavich avatar Jan 28 '21 17:01 eslavich

Yes, I thought about that. The drawback of PyYAML is that it lacks the binary-blob feature of ASDF, which means that NumPy arrays lead to lengthy base64-encoded sections. It's both less efficient and less human-readable – obviously for the array itself, but also for the other elements, which end up separated by possibly thousands of lines of gibberish. And even the array metadata is absurdly complicated:

p: !!python/object/apply:numpy.core.multiarray._reconstruct
  args:
  - &id001 !!python/name:numpy.ndarray ''
  - !!python/tuple
    - 0
  - !!binary |
    Yg==
  state: !!python/tuple
  - 1
  - !!python/tuple
    - 20
    - 100
  - &id002 !!python/object/apply:numpy.dtype
    args:
    - f8
    - false
    - true
    state: !!python/tuple
    - 3
    - <
    - null
    - null
    - null
    - -1
    - -1
    - 0
  - false
  - !!binary |
    NMSqcZR7zj88dJhdt3zPPwAALo2xE9A+AACA6U5bnD4a5xkIn4fqP/Btxf/iEbA/AOs+VwNlYj+g
    Z0XPv9WdPzLAUEXkMu8/AAD2+Y9j+D7odbM7PbjWP7gk+7IaJMg/bMi8GiDRwT8szxgiZW3vPwAA
    SCHub7A+1FAZVfiL1T+gxRZ6MiqiP7Lzbl+wLtc/AAzE5fAycj9w7KoedRiyP/Ai4mHta80/LecX
    b0im4D8s9NBlHNDUPwAAgNxTCnI+AAC/TJX55z6gmX2+zqqVP4DKqHVuXqg/4P4xbmuf2z9IYL4V
    aJfoP+LNUBEv8O8/AIhZLg2VQT+vE4G3Ep/tP2zdQ9SVl9M/AD0XVf8O7T/oHYM80rvJPwAAMKRN
    K7U+zwWiWl4W6j83lY/PuaXrPzSGmbtfpd0/QMYmjfXLsT8AAHxh9ajYPgAQxlE0OSA/AADgemeP
    oD4AAOK48i/RPt5Q8AKSnOM/gIHV8cyXnz+6OGSYqG7WP6a+Z/ROD+s/Z+VJMbFh5D8kbrp4DI3i
    PwAAAACy+NE9yH2BejKL5j+SuvdabTDkP2p16h/zZN8/AFM8rnRYaz82MIf9j0bePzVSIbfmT+k/
    AABgzWqAkj5u6dtUeoLkP6kQmkZSHOs/DMffPsxG4z/WaRMX9KLqP+/u1bKVTeE/OnKCI3DH7z8A
    ALgs+TmxPlACauZjWek/AADAua3PgD7uaDE657viP1y+U5pXW9o/Drio/k5M6D+KA3JUC2HbPySC
    Y/5ixsk/XcilmYvo7T/ug6rt5LDsP9TD6+AOtcs/M5Vcps7p6j8AAAAiIipUPhKPOc09r9M/Tgar
    4/bH2z/v1moK78voP2vc06AE9uk/EI/gKmhF1T/qa8VTy1nmPzD3qXPAXrg/Iq8bcON55z/Aq91o
    Pd6FPwAAKY2Y/Qw/AAAgBrNNrz49j5o9ytvpP9AU1eO7G9w/QpiJ8mRX4D9YodYhPL3EP8Dyyew/
...

Btw., since you support ndarrays out of the box, why not also pandas.DataFrames?

allefeld avatar Jan 28 '21 18:01 allefeld

Since this library already uses pyyaml, it's probably straightforward to add an option that allows pyyaml to serialize otherwise unrecognized Python objects. Right now we deliberately avoid that by using yaml.SafeDumper.
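The difference is easy to demonstrate with plain PyYAML (a sketch, using a throwaway Mystery class):

```python
import yaml

class Mystery:
    def __init__(self):
        self.value = 42

obj = Mystery()

# SafeDumper only knows the standard YAML types and refuses anything else:
try:
    yaml.dump(obj, Dumper=yaml.SafeDumper)
except yaml.representer.RepresenterError as err:
    print("SafeDumper refused:", err)

# The full Dumper falls back to !!python/object tags for unknown objects:
print(yaml.dump(obj, Dumper=yaml.Dumper))
```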

eslavich avatar Jan 28 '21 18:01 eslavich

As for pandas DataFrames, we are trying to keep the format language-neutral, and putting something like that in the main standard works against that. That's not to say it can't be added as a well-supported extension, though.

perrygreenfield avatar Jan 28 '21 19:01 perrygreenfield

Right now we deliberately avoid that by using yaml.SafeDumper.

Sounds good! So is there a dumper other than SafeDumper that would do this? Or can you point me to the right place to figure out how to create such an unsafe dumper?

allefeld avatar Jan 29 '21 05:01 allefeld

Sounds good! So is there a dumper other than SafeDumper that would do this?

Yes, yaml.Dumper is the one. Our custom AsdfDumper subclasses SafeDumper:

https://github.com/asdf-format/asdf/blob/master/asdf/yamlutil.py#L36

Now that I think about this a little more, changing the dumper may not give satisfying results. The ASDF serializer only descends into objects it understands, which means, for example, it wouldn't be able to handle an np.ndarray in an attribute of a mystery object:

import numpy as np
from asdf import AsdfFile

class SomeUnrecognizedClass:
    def __init__(self):
        self.array = np.arange(10)

with AsdfFile({"unrecognized": SomeUnrecognizedClass()}) as af:
    af.write_to("test.asdf")

The resulting YAML would still include the base64-encoded version of that nested array.

To handle that case, we'd need something more like what you originally suggested, which is a special converter that receives objects not handled by other converters.
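In plain PyYAML terms, that idea corresponds to a catch-all multi-representer registered for object (a sketch only, not asdf's converter API; FallbackDumper and the !object/... tag scheme are made up for illustration):

```python
import yaml

class FallbackDumper(yaml.SafeDumper):
    pass

def fallback_representer(dumper, obj):
    # Any object with no more specific representer is reduced to a
    # mapping of its __dict__, tagged with its module and class name.
    tag = '!object/%s.%s' % (type(obj).__module__, type(obj).__qualname__)
    return dumper.represent_mapping(tag, vars(obj))

# Multi-representers match by isinstance, so registering for `object`
# catches everything not handled by an exact-type representer.
FallbackDumper.add_multi_representer(object, fallback_representer)

class Mystery:
    def __init__(self):
        self.answer = 42

print(yaml.dump({'m': Mystery()}, Dumper=FallbackDumper))
```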

eslavich avatar Jan 30 '21 22:01 eslavich

Thanks!

Because at the moment the arrays I'm working with are relatively small, I've resorted to using PyYAML directly, with a custom representer which writes arrays neither as binary nor as base64, but as (possibly nested) YAML sequences:

def ndarray_representer(dumper, array):
    """PyYAML representer for ndarray"""
    return dumper.represent_sequence(
        'tag:yaml.org,2002:python/object/apply:numpy.array',
        [array.tolist(), str(array.dtype)])

This creates entries of the form

par: !!python/object/apply:numpy.array
- - [0.006907504562964102, 0.01672667054576321, 0.026650675402412397]
  - [0.03659748142210052, 0.046552737238023036, 0.05651202128919573]
- float64

for a 2 × 3 array.

No custom constructor is necessary; if the file is loaded by plain PyYAML, the array is recreated by a call to numpy.array.
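For completeness, a self-contained sketch of registering that representer and round-tripping a small array (assuming the unsafe yaml.Loader, as in the earlier example):

```python
import numpy as np
import yaml

def ndarray_representer(dumper, array):
    """PyYAML representer for ndarray"""
    return dumper.represent_sequence(
        'tag:yaml.org,2002:python/object/apply:numpy.array',
        [array.tolist(), str(array.dtype)])

# Register on the default (full) Dumper used by yaml.dump.
yaml.add_representer(np.ndarray, ndarray_representer)

a = np.arange(6, dtype=np.float64).reshape(2, 3)
text = yaml.dump({'par': a})
print(text)

# Loading with the unsafe Loader re-runs numpy.array(data, dtype):
b = yaml.load(text, Loader=yaml.Loader)['par']
assert (a == b).all() and b.dtype == a.dtype
```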

But I might come back to ASDF and the custom Dumper approach when I have to handle larger arrays.

allefeld avatar Jan 31 '21 22:01 allefeld