Single Corrupt Record Breaks Scan

Open SteggyLeggy opened this issue 1 year ago • 5 comments

We currently have an issue where our API doesn't support the same set of fields that our DynamoDB records do, so we have a lot of employees manually editing/creating records in the DynamoDB tables by hand.

Occasionally they forget to add required fields, or enter the wrong data for a field. These employees are in a separate department so sometimes they're not aware of new fields added recently.

Some of our tables only contain a few hundred records, and our production systems routinely scan the full table using PynamoDB. This means that a single corrupt record (e.g. one missing a field) raises an exception, and we cannot do any of the required work on the remaining valid records.

Is there a way in PynamoDB to defer deserialization until later on, i.e. allow the scan to happen, but deserialize each record individually so that we can catch the exception and ignore the problem records?

Would really appreciate some help on this; we have some scope to help improve PynamoDB in this scenario if there isn't currently a way to skip over problem records.

SteggyLeggy avatar Jul 18 '22 11:07 SteggyLeggy

Hi, I don't know if it applies to your use case, but you can override the default behavior of a PynamoDB model to accept None for specific attributes by setting null=True: https://pynamodb.readthedocs.io/en/latest/tutorial.html?highlight=null#defining-model-attributes You can then iterate over the gathered records and find out which ones are corrupted.
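
For example, a minimal sketch of a model with nullable attributes (the table and attribute names here are made up for illustration):

    from pynamodb.models import Model
    from pynamodb.attributes import NumberAttribute, UnicodeAttribute

    class EmployeeRecord(Model):
        """Illustrative model; the table and attribute names are invented."""

        class Meta:
            table_name = "employee-records"

        employee_id = UnicodeAttribute(hash_key=True)
        # null=True lets deserialization succeed even when the field is missing.
        department = UnicodeAttribute(null=True)
        salary = NumberAttribute(null=True)

    # The scan no longer raises for missing fields; they just come back as None.
    for record in EmployeeRecord.scan():
        if record.department is None or record.salary is None:
            print(f"Possibly corrupt record: {record.employee_id}")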

mateuszciosek avatar Jul 18 '22 14:07 mateuszciosek

Thanks for your suggestion @mateuszciosek, but unfortunately it isn't just missing fields that are the problem.

We have more complex Map fields that aren't formatted correctly when edited manually in the AWS console.

Trying to make the PynamoDB model less strict isn't really what we're after; we'd rather just skip over the dodgy records.

Thanks again for your help though.

SteggyLeggy avatar Jul 18 '22 16:07 SteggyLeggy

That's a great question. Perhaps we can add a try: bool = False parameter that would cause the iteration to be over Union[Model, DeserializationFailure] (sketched roughly below), where DeserializationFailure could include:

  • raw_data: Dict[str, Any]
  • deserialization_exception: Exception
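
For what it's worth, a rough sketch of what the caller side might look like if we did this (DeserializationFailure and the keyword argument below are hypothetical, not part of PynamoDB today):

    from dataclasses import dataclass
    from typing import Any, Dict

    @dataclass
    class DeserializationFailure:
        """Hypothetical wrapper yielded for items that could not be deserialized."""

        raw_data: Dict[str, Any]
        deserialization_exception: Exception

    # Hypothetical caller code, assuming the proposed opt-in flag existed
    # (the flag and model names are illustrative only):
    #
    # for item in MyModel.scan(skip_failures=True):
    #     if isinstance(item, DeserializationFailure):
    #         print(f"Corrupt item: {item.deserialization_exception!r}")
    #         continue
    #     handle(item)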

Another thing I was thinking about is allowing models to be defined as lazily deserializable, mostly for the deserialization performance of non-corrupt models, but it might help here too.

ikonst avatar Jul 18 '22 17:07 ikonst

@ikonst that sounds like a good idea; then in the loop I can just check what type the iterated item is.

Was wondering, as a short-term fix, if I could get away with overriding from_raw_data in my model class and putting a try ... except around the call to super().from_raw_data(...).

SteggyLeggy avatar Jul 20 '22 08:07 SteggyLeggy

Yeah, that might work. Make sure you keep some way in your model to indicate that it wasn't initialized.
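
For example, a minimal sketch of that workaround (it assumes from_raw_data is the deserialization hook in your PynamoDB version, and the was_deserialized flag is purely illustrative):

    from typing import Any, Dict

    from pynamodb.attributes import UnicodeAttribute
    from pynamodb.models import Model

    class EmployeeRecord(Model):
        """Illustrative model; table and attribute names are invented."""

        class Meta:
            table_name = "employee-records"

        employee_id = UnicodeAttribute(hash_key=True)
        department = UnicodeAttribute()

        # Illustrative flag so callers can tell a real record from a placeholder.
        was_deserialized = True

        @classmethod
        def from_raw_data(cls, data: Dict[str, Any]) -> "EmployeeRecord":
            try:
                return super().from_raw_data(data)
            except Exception:
                # Swallow the deserialization error and return a placeholder
                # flagged as not initialized, so the scan keeps going.
                placeholder = cls()
                placeholder.was_deserialized = False
                return placeholder

    # Usage: skip the placeholders that stand in for corrupt records.
    valid_records = [r for r in EmployeeRecord.scan() if r.was_deserialized]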

ikonst avatar Jul 20 '22 17:07 ikonst