PynamoDB Can't use the discriminator as a primary key in indexes.

Can't use the discriminator as a primary key in indexes.

Open abcydybcy opened this issue 2 years ago • 11 comments

Hi, I've run into a problem with PynamoDB: I wanted to list all objects of a given type, but avoid using scan. I can't use an index, in which the discriminator is the primary key.

Example output (notice the first line).

BUG:pynamodb.connection.base:Calling Query with arguments {'TableName': 'test-table-1234', 'KeyConditionExpression': '#0 = :0', 'FilterExpression': '#0 IN (:1, :2)', 'IndexName': 'type_index', 'ExpressionAttributeNames': {'#0': 'type'}, 'ExpressionAttributeValues': {':0': {'S': 'user'}, ':1': {'S': 'user'}, ':2': {'S': 'group'}}, 'ReturnConsumedCapacity': 'TOTAL'}
DEBUG:botocore.hooks:Event before-parameter-build.dynamodb.Query: calling handler <function generate_idempotent_uuid at 0x103777a60>
DEBUG:botocore.hooks:Event before-parameter-build.dynamodb.Query: calling handler <function block_endpoint_discovery_required_operations at 0x1036c4d30>
DEBUG:botocore.auth:Calculating signature using v4 auth.
DEBUG:botocore.auth:CanonicalRequest:
POST
/

content-type:application/x-amz-json-1.0
host:dynamodb.eu-west-1.amazonaws.com
x-amz-date:20211129T163717Z
x-amz-security-token: ---

content-type;host;x-amz-date;x-amz-security-token;x-amz-target
a59ab9b0c3b122603f12e5cfe46020899ce8049b56f790b4f57624d06dbd4874
DEBUG:botocore.auth:StringToSign:
AWS4-HMAC-SHA256
20211129T163717Z
20211129/eu-west-1/dynamodb/aws4_request
b0edb2380f43edb04697e68de7150e6743491d52b379e6df5512eee3ab1f704b
DEBUG:botocore.auth:Signature:
df460887d908a2db232bcb69042424b37f6b227eb53a77dad760069a711c9be2
DEBUG:botocore.httpsession:Certificate path: /Users/f0ff/Code/tameshi/framework/venv/lib/python3.8/site-packages/botocore/cacert.pem
DEBUG:urllib3.connectionpool:https://dynamodb.eu-west-1.amazonaws.com:443 "POST / HTTP/1.1" 400 161
DEBUG:pynamodb.connection.base:Calling DeleteTable with arguments {'TableName': 'test-table-1234'}
DEBUG:botocore.hooks:Event before-parameter-build.dynamodb.DeleteTable: calling handler <function generate_idempotent_uuid at 0x103777a60>
DEBUG:botocore.hooks:Event before-parameter-build.dynamodb.DeleteTable: calling handler <function block_endpoint_discovery_required_operations at 0x1036c4d30>
DEBUG:botocore.auth:Calculating signature using v4 auth.
DEBUG:botocore.auth:CanonicalRequest:
POST
/

content-type:application/x-amz-json-1.0
host:dynamodb.eu-west-1.amazonaws.com
x-amz-date:20211129T163717Z
x-amz-security-token: ---
x-amz-target:DynamoDB_20120810.DeleteTable

content-type;host;x-amz-date;x-amz-security-token;x-amz-target
06d198ac62ba8fda04ad1ea0035c6f0d13afa201e775f979a968d05da3334522
DEBUG:botocore.auth:StringToSign:
AWS4-HMAC-SHA256
20211129T163717Z
20211129/eu-west-1/dynamodb/aws4_request
364ed68b24bcc49e0afbc665c3a01e40374cd0691b5ecd7b2cb7722a44dcd778
DEBUG:botocore.auth:Signature:
94262ec8d585ab0cd4bfad5d9e5dcc3c429af853f0d74f566253d835d026d133
DEBUG:botocore.httpsession:Certificate path: /Users/f0ff/Code/tameshi/framework/venv/lib/python3.8/site-packages/botocore/cacert.pem
DEBUG:urllib3.connectionpool:https://dynamodb.eu-west-1.amazonaws.com:443 "POST / HTTP/1.1" 200 332
Traceback (most recent call last):
  File "/Users/f0ff/Code/tameshi/framework/venv/lib/python3.8/site-packages/pynamodb/connection/base.py", line 1333, in query
    return self.dispatch(QUERY, operation_kwargs, settings)
  File "/Users/f0ff/Code/tameshi/framework/venv/lib/python3.8/site-packages/pynamodb/connection/base.py", line 329, in dispatch
    data = self._make_api_call(operation_name, operation_kwargs, settings)
  File "/Users/f0ff/Code/tameshi/framework/venv/lib/python3.8/site-packages/pynamodb/connection/base.py", line 440, in _make_api_call
    raise VerboseClientError(botocore_expected_format, operation_name, verbose_properties)
pynamodb.exceptions.VerboseClientError: An error occurred (ValidationException) on request (V50EJSD64U1GCHQ6B499PCNVEFVV4KQNSO5AEMVJF66Q9ASUAAJG) on table (test-table-1234) when calling the Query operation: Filter Expression can only contain non-primary key attributes: Primary key attribute: type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bug.py", line 49, in <module>
    print([user for user in User.type_index.query("user")])
  File "bug.py", line 49, in <listcomp>
    print([user for user in User.type_index.query("user")])
  File "/Users/f0ff/Code/tameshi/framework/venv/lib/python3.8/site-packages/pynamodb/pagination.py", line 194, in __next__
    self._get_next_page()
  File "/Users/f0ff/Code/tameshi/framework/venv/lib/python3.8/site-packages/pynamodb/pagination.py", line 179, in _get_next_page
    page = next(self.page_iter)
  File "/Users/f0ff/Code/tameshi/framework/venv/lib/python3.8/site-packages/pynamodb/pagination.py", line 115, in __next__
    page = self._operation(*self._args, settings=self._settings, **self._kwargs)
  File "/Users/f0ff/Code/tameshi/framework/venv/lib/python3.8/site-packages/pynamodb/connection/table.py", line 270, in query
    return self.connection.query(
  File "/Users/f0ff/Code/tameshi/framework/venv/lib/python3.8/site-packages/pynamodb/connection/base.py", line 1335, in query
    raise QueryError("Failed to query items: {}".format(e), e)
pynamodb.exceptions.QueryError: Failed to query items: An error occurred (ValidationException) on request (V50EJSD64U1GCHQ6B499PCNVEFVV4KQNSO5AEMVJF66Q9ASUAAJG) on table (test-table-1234) when calling the Query operation: Filter Expression can only contain non-primary key attributes: Primary key attribute: type

Code to recreate:

import logging
from pynamodb.models import Model
from pynamodb.attributes import UnicodeAttribute, DiscriminatorAttribute
from pynamodb.indexes import GlobalSecondaryIndex, IncludeProjection


class TypeIndex(GlobalSecondaryIndex):
    class Meta:
        index_name = "type_index"
        read_capacity_units = 1
        write_capacity_units = 1
        region = "eu-west-1"
        projection = IncludeProjection(["type", "path"])

    type = UnicodeAttribute(hash_key=True)


class CoreModel(Model):
    class Meta:
        table_name = "test-table-1234"
        read_capacity_units = 2
        write_capacity_units = 1
        region = "eu-west-1"

    path = UnicodeAttribute(hash_key=True)
    type = DiscriminatorAttribute()

    type_index = TypeIndex()


class User(CoreModel, discriminator="user"):
    pass


class Group(CoreModel, discriminator="group"):
    pass


if __name__ == "__main__":
    CoreModel.create_table(wait=True)

    # Load some test data
    user1 = User("user.user1")
    user1.save()

    logging.basicConfig(level=logging.DEBUG)

    try:
        print([user for user in User.type_index.query("user")])
    finally:
        CoreModel.delete_table()

Nov 29 '21 16:11 abcydybcy

The design isn't 'best practice' since it'll limit your horizontal scaling (the set of discriminator values is limited so they wouldn't be good hash keys), but I agree it shouldn't result in an API error.

The syntax User.type_index.query("user") is also a little awkward. I wouldn't expect you'd need to repeat "user" yet hash_key has no default so you obviously have to.

Some ideas:

If discriminator is a hash key (of the index), do not add to filter conditions but raise a local exception if hash_key != discriminator.
If discriminator is a range key (of the index):
- add key condition
- raise a local exception if range_key is passed to query

@jpinner-lyft WDYT

Nov 29 '21 18:11 ikonst

@ikonst @abcydybcy I just ran into this issue, as well.

If Pynamodb really wants to prevent this behavior, I suggest throwing an exception on Model._get_schema() when encountering the behavior.

If you want to allow it at the user's discretion, I would suggesting adding a boolean kwarg for 'add_discriminator_attr_to_filter_condition' to pynamodb.models.Model.query that could be passed from pynamodb.indexes.Index.query. When True, just skip

        # If this class has a discriminator attribute, filter the query to only return instances of this class.
        discriminator_attr = cls._get_discriminator_attribute()
        if discriminator_attr:
            filter_condition &= discriminator_attr.is_in(*discriminator_attr.get_registered_subclasses(cls))

Nov 13 '22 10:11 ilanjb

👋 @ilanjb

Notice that @abcydybcy

has a GSI where the type S attribute is the partition key
is querying that GSI, not the model

You cannot have keys in a FilterExpression. Instead you have to provide them in a KeyConditionExpression:

for partition keys, it can only be an equivalence (e.g. no "is_in")
for sort keys, it can only be comparison or prefix matching (e.g. no "is_in")

(And you can obviously design a GSI with a single hash key value, and then query for the discriminator attribute in FilterExpression, but that'd be equivalent to simply doing a table scan, except that you'd be paying for a GSI and having a hot shard.)

Nov 13 '22 21:11 ikonst

Ahh, I think I see what you're saying, though maybe you meant the inverse? i.e. "When False, just skip"?

Nov 13 '22 21:11 ikonst

@ikonst yes I'm talking abouy querrying a GSI that has a "discriminator_attr" as the hash or sort key.

false / true are relative. Just a way to control whether on not the the discriminator_attr is added to the filter condition.

Another option would be to have cls._get_discriminator_attribute() resolve to None on the parent class, since addding all of the possible discriminator_values does not help the query in any way.

Nov 14 '22 07:11 ilanjb

@ikonst want me to create an MR?

Nov 24 '22 14:11 ilanjb

For the discussion, let's assume:

class MyIndex(GlobalSecondaryIndex):
  hk = UnicodeAttribute(hash_key=True)
  cls = DiscriminatorAttribute(range_key=True)


class MyBase(Model):
   hk = UnicodeAttribute(hash_key=True)
   cls = DiscriminatorAttribute()

   index = MyIndex()


class MyFoo(MyBase, discriminator='foo'):
   ...


class MyBar(MyBase, discriminator='bar'):
   ...


class MyBarChild(MyBar, discriminator='bar_child'):
   ...

I took a closer look and I found that we intentionally fail deserialization when the discriminator value is unknown. The alternative would be to deserialize as MyBase, and that'd mean reading an item with unknown discriminator value "baz" as MyBase. Currently unknown values raise a ValueError on deserialization. Also, a discriminator attribute currently cannot be null=True so we cannot have the semantic where unset discriminator = base model.

Why am I mentioning this? Because it means that filtering only for "known" models is important to avoid ValueError: Unknown discriminator value: baz on deserialization.

Potential approaches:

We can add a do_not_add_discriminator_attr_to_filter_conditions: bool parameter to the query method (or something shorter, but with those semantics). It's an "escape hatch", for sure, and somebody could take advantage of it, but if possible we should try and have a higher-level solution (since the whole motivation behind PynamoDB and it's polymorphism is to provide a higher-level abstraction).

We could change

-if discriminator_attr:
+if discriminator_attr and not discriminator_attr.is_hash_key and not discriminator_attr.is_range_key:
    filter_condition &= discriminator_attr.is_in(*discriminator_attr.get_registered_subclasses(cls))

but it means that querying on the shared index would retrieve "foo"s, "bar"s and unknown "baz"s (which would raise ValueError on deserialization). This only a problem for a discriminator that's an index range key, since a hash key by its nature requires specifying a single value.

The user could have a wrapper like this to iterate "safely".

def iterator_ignoring_errors(gen):
  while True:
    try:
      yield next(gen)
    except StopIteration:
      break
    except ValueError:
      pass

items = MyBase.my_index.query('some_hash_key')
for item in iterator_ignoring_errors(items):
  ...

(Note: It's not ideal to except ValueError - we could probably create a specific exception if that's to be a pattern.)

We could expect callers to specify the range_key_condition themselves:
```
MyBase.my_index.query('some_hash_key', MyBase.cls == MyFoo)
```
It would be up to the caller to iterate "safely".

We could do (3) but require a range_key_condition to be specified and have that specific form (equality) when the discriminator is a range key, i.e. raise a local error when it's not specified without issuing an API call.

if discriminator_attr.is_hash_key:
  pass
elif discriminator_attr.is_range_key:
  if not isinstance(range_key_condition, Comparison):
    raise ValueError('Range key condition must be a comparison')
  subclasses = discriminator_attr.get_registered_subclasses(cls)
  subclass = discriminator_attr.deserialize(range_key_condition.values[1])
  if subclass not in subclasses:
    raise ValueError(f'Range key condition must match exactly one of the registered subclasses of {cls.__name__}')
else:
  filter_condition &= discriminator_attr.is_in(*discriminator_attr.get_registered_subclasses(cls))

We could form the range_key_condition within Model.query:

if discriminator_attr:
  if discriminator_attr.is_hash_key:
    pass
  elif discriminator_attr.is_range_key:
    if cls._discriminator is not None and range_key_condition is None:
      subclasses = discriminator_attr.get_registered_subclasses(cls)
      if len(subclasses) != 1:
        raise RuntimeError("...")
      range_key_condition = discriminator_attr == subclasses[0]
 else:
   filter_condition &= discriminator_attr.is_in(*discriminator_attr.get_registered_subclasses(cls))

such that:

MyBase.my_index.query('some hash key')  # returns all types, potentially raises on iteration
MyFoo.my_index.query('some hash key')  # yields MyFoos only
MyBarChild.my_index.query('some hash key')  # yields MyBarChilds only
MyBar.my_index.query('some hash key')  # raises RuntimeError since there's MyBarChild

# We allow the "escape hatch" / "foot gun" of specifying the range key condition yourself
# The following case is nonsensical since if any "bar"s are found,
# none will be compatible with MyFoo and all will fail to deserialize.
MyFoo.my_index.query('some hash key', MyBase.cls == MyBar)

As in previous approaches, there are times when this approach will result in no condition and it's up to the caller to iterate "safely".

Nov 27 '22 06:11 ikonst

P.S. currently it's only possible to do all of this by hacking

 class MyIndex(GlobalSecondaryIndex):
   hk = UnicodeAttribute(hash_key=True)
-  cls = DiscriminatorAttribute(range_key=True)
+  cls = UnicodeAttribute(range_key=True)

since DiscriminatorAttribute intentionally does not accept a hash_key or range_key parameter.

Another potential approach is to decide this isn't acceptable and raise an error in this condition (as @ilanjb mentions here). I've yet to see good motivation to use a discriminator as a hash or range key.

Nov 27 '22 06:11 ikonst

@ikonst I noticed I neglected to thank you for your thoughtful reply. Thanks! If you get a chance, just raising an error would definitely be a good enough solution.

Apr 09 '23 07:04 ilanjb

@ilanjb Interesting issue just came up in #1178. I'm thinking perhaps it's be best to:

removing the filter expression
make PynamoDB not halt result iteration on "Unknown discriminator value" errors

The case of an unknown discriminator value is intended to be an edge case, not something that occurs often, so the "filter expression" wouldn't be a meaningful optimization.

Apr 26 '23 11:04 ikonst

PynamoDB PynamoDB copied to clipboard

Can't use the discriminator as a primary key in indexes.

PynamoDB
PynamoDB copied to clipboard