mwdb-core icon indicating copy to clipboard operation
mwdb-core copied to clipboard

Non-string attributes are allowed but not supported by schema

Open yankovs opened this issue 1 year ago • 3 comments

Environment information

  • MWDB version (from /about): 2.10.3
  • Installation method:
    • [ ] mwdb.cert.pl service
    • [ ] From PyPi (pip install mwdb-core)
    • [X] From docker-compose
    • [ ] Other (please explain)
  • Plugins installed:

Behaviour the bug (what happened?)

Attribute value should be a string, as defined in the schema: https://github.com/CERT-Polska/mwdb-core/blob/ee5cb1235d6b282bcd2c543128e0a1cda0ae0315/mwdb/schema/metakey.py#L25-L31 However, the introduction of the rich attributes feature opened the possibility for attributes that aren't strings. To do this, all we need to do is define a rich attribute template of the form {{value}}. Then we can put in the value section of the attribute some non-string value, for example the number 8, and what will happen is that the object will have the integer value 8 as an attribute value. Not a string but a number. This can be verified by the query functionality, you can treat this attribute as a number in a query and use feature like number ranges for it. This also works for float, so a value of 5.76 works as expected when doing this trick.

The problem here is that as seen in the schema, attribute values are expected to be strings and this can lead to weird behavior when using this trick. Let's say I define an attribute called vt-detections for samples, this is naturally an integer. And let's say I use this trick to make it treated like an integer. Then, when a sample comes in and it has 0 VT detections, it will fail the schema validation. In validate_value we have value==0 thus not value == True and this results in a nice error message. Despite 0 being a legitimate value in this case.

Expected behaviour

I think that instead of enforcing users not to use non-string values, this should be embraced and added as a system feature. I would do the following:

  • Change this attribute value validation to a switch statement based on the type returned by isinstance. Have different validators for strings, integers, floats, and rich attributes
  • Add support for non-strings in the UI, right now when you manually add an attribute through the UI it only has the option of string or object

Also, maybe add to an each attribute its (user defined) type in the DB to prevent the user from having multiple different types. For example not to have some attribute sometimes as a string and sometimes as an integer, as this makes a difference in searching.

Reproduction Steps

  • Add a new rich attribute with the template {{value}}
  • Try to add it to a sample through the UI as an object with the value 0
  • Be greeted with a validation error

yankovs avatar Feb 03 '24 13:02 yankovs

@yankovs Actually they're supported but by Attribute API (https://github.com/CERT-Polska/mwdb-core/blob/master/mwdb/schema/attribute.py#L25-L32). And range queries for integer values are already supported by attribute.*:... query field!

Metakey API you're referencing is an old thing that should not be used and will be removed in next major version.

Add support for non-strings in the UI, right now when you manually add an attribute through the UI it only has the option of string or object

Object is actually the non-string type you're looking for. I see that right now UI is pretty confusing and it's too easy to add numerical IDs both as strings and numbers.

psrok1 avatar Feb 19 '24 18:02 psrok1

Ah and I see another problem. Yes, 0 as a number is considered an empty value

image

psrok1 avatar Feb 23 '24 17:02 psrok1

@psrok1 Thanks for taking the time to look into this! Your solution solves the issue I presented in this issue.

However, I still think that in general a better validation/typing model should take place for attributes. Let's say I want to have an attribute for PE files called compilation_timestamp. With current MWDB my options are basically either to save a timestamp (integer) or a formatted string of this timestamp (string). With both approaches I don't have:

  • validation of the date, even basic one
  • ability to work with date queries, doing attribute.compilation_timestamp>=1970-01-01 makes no sense

The issue is more complex when taking a look at a json attribute. This compilation timestamp might come as a key-value pair in a more general case:

{
    ...
    "compilation_timestamp": ...
    ...
}

yankovs avatar Mar 01 '24 09:03 yankovs