marshmallow icon indicating copy to clipboard operation
marshmallow copied to clipboard

Empty strings as None (revisited)

Open csingley opened this issue 7 years ago • 32 comments

I'd like to revisit this suggestion - allowing empty strings to be interpreted as None:

https://github.com/marshmallow-code/marshmallow/issues/543

I understand the sentiment of the response: "it's a bad idea to support that". However, that's not sufficient for some use cases.

For example, right now I'm trying to use marshmallow to convert an XML format where null values are sent as empty strings. I don't get to control the XML format.

I think it's entirely reasonable for a Field to interpret a null datestring (or a null decimal string) as None, and control via allow_none whether to pass that. I don't think it's reasonable to fall down and say "the NULL value defined by your format is invalid, and we're not going to allow you to express that". This just forces a lot of ugly preprocessing workarounds onto library users, who need to clutter their code with superfluous, bug-prone value = value or None type declarations... thereby mixing format definition code in the middle of application code, which makes it hard to use marshmallow to write good code for these use cases.

I feel that the right place to define a format is, well, in the Schema class definition. A patch to accomplish this is simple:

  • add an attribute to fields.Field, e.g. null_values = {None}
  • change fields.Field.deserialize() line 263 as follows: if getattr(self, 'allow_none', False) is True and value in self.null_values: return None

We can stop there, and allow users to support their nasty unhygienic formats by overriding null_values in a custom subclass... or even take it a step further and allow first-class support for text-only formats by directly setting null_values = {None, ''} for fields.Decimal, fields.DateTime, fields.Date, fields.Time, etc. By Zeus, maybe even fields.Integer could treat empty strings as None!!

I'm happy to submit a PR to that effect, but I'd like to see where the developers' heads are at vis-a-vis using marshmallow for non-JSON formats before wasting anybody's time.

Thanks, and happy new year!

csingley avatar Jan 02 '18 17:01 csingley

This would be really nice for parsing e.g. dates from HTML forms, where an empty <input type=date> (or any empty <input>) results in the empty string.

keysmashes avatar Mar 07 '18 10:03 keysmashes

Seconded. This would be pretty handy for parsing empty <input> fields.

mgd722 avatar Apr 02 '18 22:04 mgd722

I think it makes sense for marshmallow to support this use case. We should explore marshmallow's existing functionality before we add a feature to the core.

Your use case:

  • Deserializing XML
  • Deserializing an empty string to None
  • Enabling special None handling per Field instance

The general solution for customizing deserialization behavior should be a custom field.

http://marshmallow.readthedocs.io/en/latest/custom_fields.html

from marshmallow import fields

class CustomField(fields.String):
    def _deserialize(self, value, attr, obj):
        if value == '':
            return None
        return super(CustomField, self)._deserialize(value, attr, obj)

Is there anything about this that doesn't work for your use case?

deckar01 avatar Apr 03 '18 04:04 deckar01

Writing custom field classes would work, of course... but if I have to write a custom Field subclass for every single data type, I'll just go ahead and roll my own type conversion/validation module and save myself the import. Indeed, that's what I ended up doing for the project where I was evaluating marshmallow.

It's a bit of a shame, because marshmallow has somef features that make it play nice with ElementTree (by accident I suppose). This is a strength that could be capitalized upon.

Taking a step back - the meaning of an empty string is defined globally by the protocol and/or parsing library.... "emtpy string is NULL" is a common choice. To map that protocol into Python data types, you want to write that equivalence once, globally... not write the same boilerplate ten times, once per validator. That's most inelegant.

Setting my use case aside, you've got users saying "hey wait a minute, HTML forms also use 'empty string is NULL' for lots of different data types". To me that sounds like a fairly key use case for marshmallow, and not handling it in the library seems like a bit of an oversight.

Just a friendly $0.02, because I see a lot to like in marshmallow. I probably would have used it for my project if this feature existed.

csingley avatar Apr 03 '18 12:04 csingley

I don't think it is unreasonable to create a wrapper class for each of the fields you need to customize. It should be possible to create a mixin to make this clean and maintainable.

from marshmallow import fields

class NoneMixin(object):
    def _deserialize(self, value, attr, obj):
        if value == '':
            return None
        return super(NoneMixin, self)._deserialize(value, attr, obj)

class CustomStringField(NoneMixin, fields.String):
    pass

class CustomDecimalField(NoneMixin, fields.Decimal):
    pass

class CustomDateField(NoneMixin, fields.Date):
    pass

...

Would this work for the empty HTML input use case?

deckar01 avatar Apr 03 '18 13:04 deckar01

The custom field described by @deckar01 only needs to be created once. Then, in the app, it can be used in place of the String field.

Agreed, this is far from perfect.

  • The user needs to realize the feature he expected from marshmallow isn't there and a custom field is needed.
  • The user needs to write this custom field himself, probably by copying it from a GitHub issue comment.
  • In the schemas, the user can't import all fields from marshmallow.fields, he needs to also import from his own custom fields module. Cosmetic.
  • This may have side-effects if a third-part library relies on the type of the field and can't rely on inheritance (for instance to create an OpenAPI doc, the field type needs to be associated with a parameter type, and Email field should not be treated as String, so I end up registering all my custom fields manually), but this should be a corner case.
  • If the user wants to achieve this globally, he must subclass all the fields he's using (edit: NoneMixin makes this easier, but fields must be subclassed anyway).

I think @csingley's proposal makes sense.

Wouldn't it be better, however, to treat empty strings as missing rather than None. I'm thinking of the HTML form use case, for instance.

  def deserialize(self, value, attr=None, data=None)
+     if value in self.null_values:
+         value = missing
        self._validate_missing(value)
    ...

Should this be used only on deserialization or on serialization as well? See this discussion about removing nullish values from output: https://github.com/marshmallow-code/marshmallow/issues/229.

lafrech avatar Apr 03 '18 13:04 lafrech

I've got no skin in the game at this point, but from a user's point of view I think the mixin proposal is pretty reasonable (although not surprisingly I prefer my own approach... who doesn't?) Maybe a documentation enhancement outlining the NoneMixin pattern would be a good place to start?

@lafrech - re: missing vs None - that would work fine for my specific case, but these two things can sometimes require different handling. If the data protocol changes, and a new field is added, you might want to have your code do one thing if the field is missing (old protocol) vs. blank (new protocol, default value). Obviously this doesn't really apply to web forms, but I'd imagine in general you'd want to discriminate between the case where the field is empty and the case where it's simply not sent in the data stream.

Anyway, that's enough back seat driving from me! Cheers, and thanks for your work.

csingley avatar Apr 03 '18 14:04 csingley

@anowlcalledjosh Did you come up with a solution for your use case?

@mgd722 Would a custom field be an acceptable solution for your use case? https://github.com/marshmallow-code/marshmallow/issues/713#issuecomment-378125776 or https://github.com/marshmallow-code/marshmallow/issues/713#issuecomment-378247878

deckar01 avatar Apr 03 '18 14:04 deckar01

@deckar01 I ended up using the following on my schemas:

@pre_load
def allow_null_dates(self, item):
    item["date"] = item["date"] or None
    item["enddate"] = item["enddate"] or None

Having to copy-paste subclasses seems a bit boilerplate-y, and makes integration with libraries like Marshmallow-SQLAlchemy (which automatically generates Marshmallow schemas) harder. Maybe for strings that's acceptable, but since the empty string isn't a valid date, I think dates should certainly turn "" into None.

keysmashes avatar Apr 04 '18 22:04 keysmashes

Yes, a custom field could work for me but I do agree that it's rather inelegant. I ended up going with something like @anowlcalledjosh above to solve my issue. Since an empty string will never be a valid date, I'll second that (at least for dates) "" should end up as None.

mgd722 avatar Apr 09 '18 09:04 mgd722

A general solution could be to add a parameter to Field(available on all field types) to customize what input values are to be considered "null" or "missing".

DrPyser avatar Dec 19 '18 22:12 DrPyser

Before finding this issue... I just created a fields.Integer subclass that has an attribute blank_as_none. It simple simple to implement but annoying as it needs to be done for all Numeric and Date types.

I can work on patch if there is interest on that..

schettino72 avatar Feb 26 '19 00:02 schettino72

Please see my proposed solution in the PR #1165. The implementation was pretty trivial and I guess it solves a pain point for many of us...

schettino72 avatar Mar 02 '19 19:03 schettino72

A demo on the proposed custom field solution.

from marshmallow import Schema, fields

# ---------------------Baseline---------------------


class SimpleSchema(Schema):
    s = fields.String()


simple_schema = SimpleSchema()

print(simple_schema.load({'s': None}))
# > UnmarshalResult(data={}, errors={'s': ['Field may not be null.']})

print(simple_schema.load({'s': ''}))
# > UnmarshalResult(data={'s': ''}, errors={})


# ---------------------empty as None, reject None---------------------

# Now trying to deserialize empty string as None, and don't accept None


class NotEmptyString(fields.String):
    """A String field type where empty string is deserialized to None"""

    def _deserialize(self, value, attr, data):
        if value == '':
            return None
        return super()._deserialize(value, attr, data)


class NotEmptySchema(Schema):
    s = NotEmptyString()


not_empty_schema = NotEmptySchema()

print(not_empty_schema.load({'s': None}))
# UnmarshalResult(data={}, errors={'s': ['Field may not be null.']})

print(not_empty_schema.load({'s': ''}))
# > UnmarshalResult(data={'s': None}, errors={})
# Value is translated to None, but it somehow breaks the process and result in a valid None

# ---------------------empty as None, accept None---------------------

# Now trying to deserialize empty string as None, but accept None


class NotEmptyButNoneOkSchema(Schema):
    s = NotEmptyString(allow_null=True)


not_empty_but_none_ok_schema = NotEmptyButNoneOkSchema()


print(not_empty_but_none_ok_schema.load({'s': None}))
# UnmarshalResult(data={}, errors={'s': ['Field may not be null.']})
# Here we even have regression. The custom field even overrides allow_null
# option. Not what I want.

print(not_empty_but_none_ok_schema.load({'s': ''}))
# > UnmarshalResult(data={'s': None}, errors={})
# Here is what I wanted

It seems that the allow_null option is applied before the deserialization. Making it impossible to work after a custom field that deserializes as None in some cases.

antoine-gallix avatar May 21 '19 13:05 antoine-gallix

Basically, it's seems impossible to obtain the following behaviors:

  • Interpret empty strings as None, and returns error because null/None is not allowed, with the same error message as if the input was directly None.

  • Interpret empty strings as None, and accepts them because of the allow_null option in the field arguments.

antoine-gallix avatar May 21 '19 13:05 antoine-gallix

This has been a huge source of frustration for me.

For instance, I'm trying to have an email field that allows None and "" (empty string). I believe an API should be liberal in accepting its input. If a field is optional, then a missing field, a null value or an empty string value should be considered legit.

So I'm trying to have this behavior throughout my codebase. In particular with the email fields, which by default does not allow it. This is what I had to do to allow empty emails:

class Email(fields.Email):
    def __init__(self, *args, **kwargs):
        kwargs.setdefault("allow_none", True)
        super().__init__(self, *args, **kwargs)

    def _validate(self, value):
        if not value:
            return
        super()._validate(value)

    def _validated(self, value):
        # This actually does nothing
        if not value:
            return value
        return super()._validated(value.strip().lower())

    def _deserialize(self, value, attr, obj):
        if not value:  # nocov
            return value
        # Strip and lowercase the email
        # NOTE: theoretically, the username part of email is not case
        # sensitive, but in practice, nobody cares about it (Gmail for
        # instance)
        return value.strip().lower()

(not mentioning the fact that overriding private methods feels a hack, but that's the pattern chosen by the library - IMO serialize and _serialize should have been inverted)

It would be great to have a better story around empty (missing, None or "") values for fields. This is very difficult when validators are involved.

def deserialize(self, value, attr=None, data=None, **kwargs):
        # Validate required fields, deserialize, then validate
        # deserialized value
        self._validate_missing(value)
        if value is missing_:
            _miss = self.missing
            return _miss() if callable(_miss) else _miss
        if getattr(self, "allow_none", False) is True and value is None:
            return None
        output = self._deserialize(value, attr, data, **kwargs)
        self._validate(output)
        return output

The problem in deserialize is that we don't have a way to bail out of validation and redefine value (like serialize does with its get_value).

charlax avatar Jul 04 '19 17:07 charlax

(not mentioning the fact that overriding private methods feels a hack, but that's the pattern chosen by the library - IMO serialize and _serialize should have been inverted)

serialize is meant to be called by another class, so it is "public".

_serialize is underscored because it is only meant to be called by its own class. It can be called and even overridden by subclasses. Hence the underscore meaning "protected".

"private" variables are double-underscored.

https://radek.io/2011/07/21/private-protected-and-public-in-python/ https://stackoverflow.com/questions/1641219/does-python-have-private-variables-in-classes

lafrech avatar Jul 05 '19 07:07 lafrech

Let's revisit this issue post-3.0 release. I do think we want to have a story for validating empty strings.

sloria avatar Jul 08 '19 13:07 sloria

See also discussion in Django docs per https://docs.djangoproject.com/en/2.2/ref/models/fields/#field-options.

It would feel weird to deal with this in e.g. SQLAlchemy, though, so some sort of unification of dealing with empty/null logic in strings seems reasonable in Marshmallow.

taion avatar Jul 08 '19 15:07 taion

I think it would be odd to treat empty strings equivalent to None; there are use cases where None means "not provided", as distinct from "".

I think we should control None and empty/blank values separately, as in Django. We could add an allow_empty parameter to Field and have empty_values set on the field class.

class Field(...):
    empty_values = {""}
    default_error_messages = {
       ...
        "empty": "Field may not be empty.",
        "null": "Field may not be null.",
    }


class List(...):
    empty_values = ([], ())

class Dict(...):
    empty_values = ({}, )

Thoughts?

sloria avatar Sep 07 '19 21:09 sloria

You can achieve the above today with

from marshmallow import Schema, fields, validate

not_empty = validate.Length(min=1, error="Field may not be empty.")


class ArtistSchema(Schema):
    name = fields.Str(validate=not_empty)
    album_names = fields.List(fields.Str(), validate=not_empty)


ArtistSchema().load({"name": "", "album_names": []})
# marshmallow.exceptions.ValidationError: {'album_names': ['Field may not be empty.'], 'name': ['Field may not be empty.']}

So on one hand, the proposed API is redundant. On the other hand, it's a common enough use case that it may warrant first-class support.

sloria avatar Sep 07 '19 21:09 sloria

Actually, I think I was misunderstanding the OP. IIUC, the goal is to interpret "" as missing and deserialize to None.

I think a more direct and flexible way to achieve this would be to allow specifying the values that should be interpreted as missing (as suggested by @lafrech in a previous comment).

# proposed API

class ArtistSchema(Schema):
    # if input is None, "", or is missing from the input, deserialize to None
    name = fields.Str(missing=None, missing_values={None, ""})

print(ArtistSchema().load({"name": ""}))  # {'name': None}

If instead your backend models "" as missing, you just change the value of missing.

class ArtistSchema(Schema):
    name = fields.Str(missing="", missing_values={None, ""})

This would allow any field to define what it means to be required. For example:

class ArtistSchema(Schema):
    album_names = fields.List(fields.Str(), required=True, missing_values=([], ()))


ArtistSchema().load({"album_names": []})
#  ValidationError: {'album_names': ['Missing data for required field.']}

What do you think?

sloria avatar Sep 07 '19 23:09 sloria

I think a more direct and flexible way to achieve this would be to allow specifying the values that should be interpreted as missing

Indeed this approach would be more flexible. The problem is that it is very verbose. As OP work with XML or working with HTML forms, it is very common for blank string to used as missing value... In this solution you would need to add missing="", missing_values={None, ""} to all numeric and date fields, thats a lot of noise.

I have not even seen another use-case where people want other custom missing values, so maybe just dealing with the most common case would be better.

Another approach I would consider would be to specify missing_values at a global or schema level. This way it would be flexible as you wish and also reduce verbosity.

schettino72 avatar Sep 08 '19 03:09 schettino72

As OP work with XML or working with HTML forms, it is very common for blank string to used as missing value

Yes, it is common to model "" as missing, but there are also cases where None is used for missing (IIUC, this is what the OP wants).

See the relevant Django docs:

... it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL. One exception is when a CharField has both unique=True and blank=True set. In this situation, null=True is required to avoid unique constraint violations when saving multiple objects with blank values.

The proposed API cleanly supports both cases.


Another approach I would consider would be to specify missing_values at a global or schema level. This way it would be flexible as you wish and also reduce verbosity.

We could set defaults at the class level, so you could do

# proposed API

fields.Field.default_missing_values = ("", )


class ArtistSchema(Schema):
    name = fields.String(missing=None)
    dob = fields.DateTime(missing=None)


print(ArtistSchema().load({"name": "", "dob": ""}))
# => {'name': None, 'dob': None}

Edit: Use tuple instead of set for default_missing_values to avoid TypeErrors when checking against unhashable types.

sloria avatar Sep 08 '19 13:09 sloria

Thanks for picking this up

Actually, I think I was misunderstanding the OP. IIUC, the goal is to interpret "" as missing and deserialize to None.

Yes, this is just it. There are plenty of data formats where the keys are fixed, and the values are all strings; in such a scheme, the only way to represent "no data" is an empty string.

Having the "no data" be configurable on the class is a win. Cheers.

csingley avatar Sep 09 '19 17:09 csingley

@schettino72 What do you think about the default_missing_values solution above?

sloria avatar Sep 14 '19 18:09 sloria

@sloria The problem I see with that solution (if I'm understanding it correctly) is that it's incompatible with required=True. What if I want the following behavior?

schema.load({'some_field': None}) => {'some_field': None}
schema.load({'some_field': ''}) => {'some_field': None}
schema.load({}) => ValidationError: {'some_field': ['Missing data for required field.']}

The use case is differentiating between intentionally setting a field's value to null versus say accidentally botching a PUT request by leaving out a required field.

MarredCheese avatar Sep 20 '19 19:09 MarredCheese

It's seems reasonable and common to represent "no data" in different ways ("", None, [], absence), but it seems less common (and perhaps unidiomatic) to have multiple ways to represent None. None and '' are not equivalent. If you really need '' to deserialize to None, perhaps you should use a custom field.

sloria avatar Sep 21 '19 14:09 sloria

@sloria At the database level, it is common to have string fields where you allow either NULLs or zero-length strings, but not both (to have a single, consistent way of representing "no data"). With that said, I would ask, if someone is going to bother using Marshmallow to convert some kind of input to a database-ready format, why wouldn't they consider having it automatically convert ZLS to NULL or vice versa? Why is that unidiomatic?

You correctly point out that None and '' are not always equivalent, but absence and None are also not always equivalent. Absence could mean "don't change the value," while None could mean "set the value to NULL" (e.g. a merge-style PATCH request). And as I stated above, absence could also be invalid and None could be intentional. Plus, Marshmallow's existing API already implies that absent and None are not always the same by allowing the argument combination required=True, allow_none=True.

I guess what I'm getting at is that I don't see the harm in accepting this PR (assuming it works) for times when Marshmallow users don't want to conflate absent and None.

In the meantime, if anyone reading this cares to see another idea, I tweaked this code above and came up with this:

@pre_load
def replace_empty_strings_with_nones(self, data, **kwargs):
    keys = ['project', 'model', 'inServiceDate']
    for key in keys:
        if key in data.keys():
            data[key] = data[key] or None
    return data

MarredCheese avatar Sep 21 '19 19:09 MarredCheese

I've been reading through this, as I've been having this issue myself, so there's some vested interest.

The argument was brought up that None and "" need to mean two different things. For something like String, or a similar type where you would expect "", this is understandable. However, we do need to also consider things like Integer, and other similar types. Even using the Number input type with standard HTML and AJAX in accordance with https://www.w3schools.com/html/tryit.asp?filename=tryhtml_input_number , if you submit nothing in the input box, the parameter will have a value of "". This creates a "Not a valid integer." validation error message in Marshmallow, when one would expect this to be interpreted as if it were None.

For the time being, I have a workaround on String of using a custom type, but it's something that would be nice to include as an option.

BTW, as a note to the last comment: You really want to be careful about doing things that way, especially with integers, since 0 and None should ALWAYS be two different things.

wschmrdr avatar Apr 12 '20 13:04 wschmrdr