cleancat Provide CleanCat schema with a context

There should be a way to provide a CleanCat schema not only with the data we want to validate, but also the context in which we want to validate it.

Currently, you have to rely on global variables (like Flask's g and request) to determine the context you're working in, which is bad for readability, testing, separation of concerns, etc.

Aug 24 '18 13:08 wojcikstefan

What would be an interface for the context? I'm afraid we might have to introduce a breaking change in the library's API.

Current interface for field validation code is passing value in parameters. If we add another parameter like def clean(self, value, context=None):, this has to pass the context down to both super() calls and into nested fields like list item's field definition. If any of those don't accept the new param for context yet (consider if someone has a custom Field in their private code), it will fail with something like TypeError: a() takes exactly 1 argument (2 given).

And we can't put context into an attribute on the Field object since it's instantiated once during schema definition and never changes.

Oct 23 '19 16:10 tsx

Stepping back for a second, what is the intended use of CleanCat schemas?

If CleanCat schemas belong to the application layer and only perform the "shape" validation (making sure the right data types and required values are supplied, validating values against choices, filling in static default values, etc.), then they shouldn't really need a context (or am I missing a use-case?).
However, if we also want to use CleanCat schemas for "essence"/"content" validation as part of a domain layer (validating supplied values against specific user/organization permissions, picking a default value based on a specific membership setting, etc.), then we'll have to pass the context in.

Just want to make sure that we want to use CleanCat schemas for both (1) and (2)* before we discuss the exact interface.

(...) add another parameter like def clean(self, value, context=None):

Yep, this SGTM.

I'm afraid we might have to introduce a breaking change in the library's API.

If we bump the package version up to 1.0.0 and document the breaking change in the release notes, I don't have a problem with that.

Cc @thomasst @jkemp101 @mpessas.

* Also, if we want to use CleanCat schemas for both application-level and domain-level validation, then should we always use a separate schema class for each?

Oct 28 '19 07:10 wojcikstefan

Passing a context so you can validate references and making it v1.0.0 SGTM. Would like to see some examples on how this will effect reference fields. Also, we should consider making it clear that the convention to override functions should include additional args like **kwargs, so we can add more args in the future without breaking existing code.

Oct 28 '19 08:10 thomasst

Can we have an example of the problem we are trying to solve here?

Oct 28 '19 08:10 mpessas

I think the general problem is that you don't want to query the database from within a cleancat schema, e.g. validating reference fields.

Oct 28 '19 08:10 thomasst

I agree this is mixing up shape validation and content validation, but it's so tempting to do so.

Imagine this "before" state:

class MyObjectSchema(Schema):
    user_id = Integer()
    another_user_id = Integer()
    my_field = String()


@view
def create_object():
    database = ...
    try:
        my_object = MyObjectSchema(request.json()).full_clean()
    except ValidationError as e:
        raise BadRequest(e.args[0])

    if not database.user_exists(my_object['user_id']):
        raise BadRequest({'field-errors': {'user_id': 'User does not exist'}})

    if not database.user_exists(my_object['another_user_id']):
        raise BadRequest({'field-errors': {'another_user_id': 'User does not exist'}})

    save(my_object)

CleanCat gives us very structured error reporting which API clients can correlate to original fields. I would love to (a) also have this structure generated for me and (b) more declarative validation. For example, when I define a schema, I don't care that user ID is an int or a string or a more complex object. I just want to say "user".

This is how it could look like (notice how we don't re-create the error structure by hand anymore):

class UserField(Integer):
    def validate(self, value, context):
        value = super().validate(value, context)
        if not context['database'].user_exists(value)
            raise ValidationError('User does not exist')


class MyObjectSchema(Schema):
    user_id = UserField()
    another_user_id = UserField()
    my_field = String()


@view
def create_object():
    database = ...
    try:
        my_object = MyObjectSchema(request.json(), context={'database': database}).full_clean()
    except ValidationError as e:
        raise BadRequest(e.args[0])

    save(my_object)

Oct 28 '19 08:10 tsx

What cleancat is missing is support for clean_<field> methods on the Schema class. See e.g. what DRF and Django Forms do.

This allows one to provide more complex field-level validation: the Field itself only validates the "shape" of the attribute (is it a uuid? an int? etc) and the validate_<field> can do extra processing. It is very rare in Django to define a custom (ie non-reusable) form field for this type of things, such as the UserField above.

If cleancat did have support for this, one would write:

class MySchema(cleancat.Schema):

    def __init__(self, db):
        self._db = db

    def clean_user_id(self):
         if not db.user_exists(self.user_id):
             raise ValidationError('User does not exist')

This is how it could look like (notice how we don't re-create the error structure by hand anymore):

But that approach already has a potential problem: it probably fetches the user twice, once in the field validation and once (presumably) in the rest of the code — this would be quite easy to miss.

Moreover, (regarding your example) it does not allow for grouping database queries to the User table. Your first snippet could be altered to fetch all users with an IN clause and do the check if the user exists in the python land.

About the error structure, is there a reason we cannot raise a ValidationError(field_name, message) and have a generic handler set the response to a bad request with the appropriate body?

However, if we also want to use CleanCat schemas for "essence" validation as part of a domain layer

Sometimes schemas are indeed useful to validate a dictionary of fields and values (no matter if they come from an HTTP request or not). But having support for validate_<field> should help with the problem we are discussing here.

So ideally I would want to see both: cleancat schemas supporting validate_<field> methods and the ValidationError class supporting a field and an error_message argument when you have to do validation manually/without resorting to cleancat but still supporting the same error format.

Nov 01 '19 07:11 mpessas

validate_<field>

I actually didn't prioritize implementing this in cleancat because this isn't meant to do generic validation (e.g. something common like validate references -- Django doesn't do that). It's meant to handle specific validation cases that only apply to a given field. Maybe you could argue that in our cases every reference should have its explicit validation function, which makes the code less DRY, but I'd rather see if we can handle this through the field class.

Nov 01 '19 09:11 thomasst

Well, you can always have def validate_user(self) method inherited by a form and not have to deal with individual field validation.

The problem with the latter is that context means nothing per se: if I look at the method definition and see context, how do I even know what is in there?

The "context" belongs to the form, not the field.

Nov 01 '19 09:11 mpessas

Stepping back for a second, what is the intended use of CleanCat schemas?

The distinction @wojcikstefan made earlier I think is pretty important, and is why I'm against adding a context to cleancat schemas and validation.

Consider these questions:

What does an email address look like?
Does this email address belong to one of our users?
Is the currently authenticated user allowed to interact with that user?

IMO cleancat is good at answering the first, but answering the second and third should be out of scope. I'd actually prefer the "before" state in @tsx's above example because there's a clear distinction between the static "is my data the right shape" and dynamic "is the user allowed to see this other user" validation paths.

Providing an opaque context within the schema would serve to further conflate these concerns.

Jan 13 '21 14:01 AlecRosenbaum

It's fine if cleancat doesn't do that, but there should be some way to generate an error dict that contains all the validation errors for all of your 3 questions, and the nice thing about having it all in the schema is that you would just have the schema that takes care of all the validation. How do you suggest validation scenarios that require database queries and other information should be handled?

Jan 13 '21 16:01 thomasst

Ok so thinking about this a bit more and talking about it offline, here's what I've arrived at:

Basically let's copy the validators and converters pattern from attrs, but use it here to distinguish between shape validation and data validation on cleancat schemas.

So our normal clean and full_clean methods will stick around but be whittled down to doing only stateless shape validation. We'll also add a validators and converters arguments to each field that accept lists of functions. Each validator is a function that accepts a value and context, then optionally raises a ValidationError and returns None. Each converter is a similar looking function, but returns the value and converts it in the process.

The context would be a fully defined object type, not just an unstructured dictionary. Validation contexts can probably do things like share a single database session, since they shouldn't be mutating any data. Since the context is shared between validation and conversion, it would be a good place to implement request-specific caching (and prevent double lookups between fields, validators, and converters).

Here's an example with a user field:

def user_visibility_validator(value: str, context: CleanCatContext):
  repo = UserRepo(session=context.session):
  try:
    user = repo.fetch_by_id(value)
  except DoesNotExist:
    raise ValidationError
  if not user_visible_to_user(user, context.current_user):
    raise ValidationError

def user_converter(value: str, context: CleanCatContext) -> User:
  repo = UserRepo(session=context.session):
  return repo.fetch_by_id(value)

class UserField(cleancat.String):
  def clean(self, value):
    super().clean(value)
    if len(value) != EXPECTED_USER_ID_LENGTH or not value.startswith('user_'):
      raise ValidationError

class MySchema(cleancat.Schema):
  user = UserField(
    validators=[user_visibility_validator], converters=[user_converter]
  )

# somewhere in a view
schema = MySchema({'user': ...})
schema.full_clean()
with atomic() as session:
  # calls all validators, then calls all converters
  schema.validate(
    context=CleanCatContext(current_user=current_user, session=session)
  )

Jan 19 '21 15:01 AlecRosenbaum

Looks good.

I can't immediately think of any cases, but are there cases where you would need the old data or other fields from the schema to run the validator (i.e. where you want raw access to the schema instance)?
Is there anything special to CleanCatContext that it needs its own type, or could this be any object? Is the idea that cleancat provides a base class and you would subclass it for most use cases?

Jan 19 '21 15:01 thomasst

If you need to validate across fields it would probably make sense to just extend schema.validate.

My thought with CleanCatContext was that cleancat provides a base class which is subclassed for most use cases.

Jan 19 '21 15:01 AlecRosenbaum