Store byte object as field

Open martinjuhasz opened this issue 8 years ago • 8 comments

Is there a way to store a Python bytes object as a field using umongo without relying on GridFS? In my case I want to store fairly small binary objects that rarely change. In pymongo the recommended way seems to be a BSON binary field, but I cannot find a related field in umongo's fields.py.

martinjuhasz avatar Aug 09 '16 15:08 martinjuhasz

That's a good idea, and it doesn't seem hard to add: just create a new ByteField that mimics marshmallow's String, but checks for bytes in _deserialize instead of basestring.

PR welcomed :+1:

touilleMan avatar Aug 11 '16 15:08 touilleMan

Okay, good. But what would the default serialization method do? Byte data isn't necessarily convertible into a string, right? How would someone want a byte Field to be serialized?

In my special case the byte field represents some pickled state of an object that I don't even want to be serialized and sent over my API (is there a way to exclude fields on serialization?).

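from marshmallow import fields as ma_fields
from umongo.fields import BaseField

class ByteField(BaseField, ma_fields.String):
    def _deserialize(self, value, attr, data):
        # Pass bytes through untouched; anything else goes to String's check
        if isinstance(value, bytes):
            return value
        return super()._deserialize(value, attr, data)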

This works fine for storing, and if the stored bytes are valid UTF-8 they get converted into a string on serialization.

martinjuhasz avatar Aug 12 '16 07:08 martinjuhasz

Bytes is a valid BSON type (named "Binary data" in the MongoDB types).

Besides, it seems pymongo does the conversion bytes <=> Binary data by itself:

>>> hello = 'héllo'
>>> doc_id = db.test.insert({'str': hello, 'bytes': hello.encode()})
>>> db.test.find_one(doc_id)
{'bytes': b'h\xc3\xa9llo', 'str': 'héllo', '_id': ObjectId('57ad9b0713adf23b7095fcee')}

So I think the _deserialize method should check that the incoming data is bytes, and that's it! pymongo will gladly take care of those bytes for us ;-)
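As a rough sketch of that bytes-only check, here it is as a plain function rather than a real umongo field. The name `check_bytes` and the use of `TypeError` are assumptions made for illustration; an actual field would report the problem through marshmallow's validation error mechanism instead.

```python
def check_bytes(value):
    """Return value unchanged if it is bytes, otherwise raise TypeError.

    Hypothetical helper: no encoding or decoding is attempted here.
    """
    if not isinstance(value, bytes):
        raise TypeError('Not a valid byte sequence.')
    return value


stored = check_bytes('héllo'.encode())  # bytes pass through untouched

try:
    check_bytes('héllo')  # a str is rejected, not silently encoded
    rejected = False
except TypeError:
    rejected = True
```

The point of keeping the check this strict is that the caller, not the field, decides how text becomes bytes.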

In my special case the byte field represents some pickled state of an object that I don't even want to be serialized and sent over my API (is there a way to exclude fields on serialization?).

Yes, there is! You should use the load_only attribute for your field. This way it will never be serialized. I guess you should also use the dump_only attribute so that your API does not accept incoming data for this field during deserialization.

@instance.register
class MyDoc(Document):
    # load_only keeps the field out of dump(); dump_only rejects it on load
    pickled_stuff = fields.BytesField(load_only=True, dump_only=True)
    public_name = fields.StrField()

# inside your POST API
payload = get_payload_from_request()
my_doc = MyDoc(**payload)
# the line above raises ValidationError if a 'pickled_stuff' field is present
assert my_doc.pickled_stuff is None
my_doc.pickled_stuff = pickle_my_stuff()  # must return bytes
my_doc.commit()
return 200, 'Ok'

# inside your GET API
my_doc = MyDoc.find_one({'id': my_id})  # find_one returns a document, find a cursor
print(my_doc)
# <... {'pickled_stuff': b'<pickled data>', 'public_name': 'test' }...>
my_doc.dump()
# {'public_name': 'test'}
return 200, json.dumps(my_doc.dump())

You should also have a look at the flask example, which shows how to use umongo inside an API with a custom loading/dumping schema.

touilleMan avatar Aug 12 '16 10:08 touilleMan

Thanks for sharing this, great stuff! So you think BytesField should try to serialize using ensure_text_type (as it does when inheriting from BaseField, ma_fields.String)? It will fail on binary data that's not UTF-8 encoded, but I guess that's fine, because if you want binary data to be serialized you would have thought about encoding before storing it.

My pickled data would fail on serialization:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
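For anyone who wants to reproduce this without umongo, a minimal stdlib-only sketch follows. The dictionary being pickled is a made-up placeholder; the only assumption that matters is that pickle protocol 2 and later start their output with the byte 0x80, which is not a valid UTF-8 start byte.

```python
import pickle

# Placeholder state; any object pickled with protocol 2+ behaves the same way.
state = pickle.dumps({'answer': 42})

try:
    state.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    # The leading 0x80 protocol marker cannot begin a UTF-8 sequence.
    decoded_ok = False

# The raw bytes round-trip fine as long as nothing tries to treat them as text.
restored = pickle.loads(state)
```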

martinjuhasz avatar Aug 12 '16 10:08 martinjuhasz

I think we should not try to do any string encoding/decoding inside umongo. It should be the user's responsibility to provide bytes, because otherwise we would have to make too many assumptions about their workflow (which encoding does the user want? what to do with bytes that can't be decoded? etc.). Besides, this makes the implementation simpler and more straightforward, so why bother ;-)

touilleMan avatar Aug 12 '16 10:08 touilleMan

Is this enhancement still needed? If so, I'd like to give it a try.

chenjr0719 avatar Mar 11 '19 06:03 chenjr0719

@touilleMan @chenjr0719 @lafrech @martinjuhasz Guys, this issue is very old, but it is where search engines lead you when you look for ways to add binary fields to a uMongo document. In my opinion, it is a shame that Marshmallow does not provide a binary field. They do this on purpose, but there are really no good reasons for it:

  • BSON spec has Binary and Mongo supports it.
  • A binary field is needed for numerous applications (an avatar, a hash, or a small blob). And this is where Mongo is strong in terms of efficiency.
  • People try to store bytes either as a UTF-8 encoded string, which will one day fail outright (for example, b'\xd5\xce\xe1\x86\xcf'), or as a base64-encoded value, which is more reliable but introduces inconveniences (no obvious way to check length, slice, ... without decoding first) and computational overhead.
  • Others try to store blobs in GridFS. Stack Overflow is full of such recommendations. Of course, this is not the use case GridFS was originally made for.
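To make the string-based workarounds from the list concrete, here is a small stdlib-only sketch. It only illustrates the UTF-8 failure and the base64 inconvenience; the variable names are arbitrary and nothing here touches Mongo or uMongo.

```python
import base64

raw = b'\xd5\xce\xe1\x86\xcf'  # the example bytes from the list above

try:
    raw.decode('utf-8')
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False  # so storing this value as a UTF-8 string fails

encoded = base64.b64encode(raw)

# base64 always works, but the stored value no longer has the real length,
# and slicing the payload requires decoding it first.
raw_len, encoded_len = len(raw), len(encoded)
first_two = base64.b64decode(encoded)[:2]
```

Here five raw bytes become eight base64 characters, and every length check or slice pays for a decode, which is the overhead a native binary field avoids.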

Conclusion from the above: uMongo needs BinaryField. If Marshmallow guys refuse to add support for it – f*ck them, let's do it in uMongo

Unfortunately, I'm not a uMongo developer and haven't dug deep into how everything works. Here is an example of a BinaryField I came up with:

import bson
from marshmallow import compat as ma_compat, fields as ma_fields
from umongo import fields


class BinaryField(fields.BaseField, ma_fields.Field):
    default_error_messages = {
        'invalid': 'Not a valid byte sequence.'
    }

    def _serialize(self, value, attr, data):
        return ma_compat.binary_type(value)

    def _deserialize(self, value, attr, data):
        if not isinstance(value, ma_compat.binary_type):
            self.fail('invalid')
        return value

    def _serialize_to_mongo(self, obj):
        return bson.binary.Binary(obj)

    def _deserialize_from_mongo(self, value):
        return bytes(value)

Maybe there are some obscure caveats, maybe not. This is the code I currently have in my project, and it seems to work like a charm. (I'm using Motor.)

It would be nice if someone familiar with the internals of uMongo could take a look.

thodnev avatar Nov 08 '19 15:11 thodnev

@thodnev Thanks for the code, I'm going to use it as I need the ability to store binary data (in this case, a password salt created from os.urandom). If this code works, I can't imagine it would be too hard to add to a PR (if you haven't already).

kevinbosak avatar Jan 10 '21 17:01 kevinbosak