
Handling null values in JSON input

Open obermeier opened this issue 2 years ago • 6 comments

Apache StreamPipes version

dev (current development state)

Affected StreamPipes components

Backend, Connect, Processing Elements, UI

What happened?

Some of my JSON data consumed by an input adapter have null values. E.g. {"name": 1, "age": null}

This leads to an NPE (NullPointerException) when guessing the schema.

In my local SP version I just replaced the null value after parsing with a java.util.Optional and a custom runtime type. This works most of the time, especially if the field data type is string.
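A minimal sketch of what such a replacement step might look like, assuming the parsed event is available as a plain `Map<String, Object>`. The class and method names are hypothetical and not part of StreamPipes, and the custom runtime type mentioned above is not shown:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration, not StreamPipes code.
class NullWrappingSketch {

  // Wraps every value of a parsed event in an Optional, so that a JSON null
  // becomes Optional.empty() instead of a raw null reference.
  static Map<String, Optional<Object>> wrapNulls(Map<String, Object> event) {
    Map<String, Optional<Object>> wrapped = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : event.entrySet()) {
      wrapped.put(entry.getKey(), Optional.ofNullable(entry.getValue()));
    }
    return wrapped;
  }

  public static void main(String[] args) {
    // Stand-in for the parsed form of {"name": 1, "age": null}
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("name", 1);
    event.put("age", null);

    Map<String, Optional<Object>> wrapped = wrapNulls(event);
    System.out.println("age present: " + wrapped.get("age").isPresent());
  }
}
```

Downstream code then has to unwrap each value explicitly, which is where the undefined behavior in processors and output adapters comes in.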

But I think this is not a complete solution since the behavior in processors and output adapters is not defined.

How do you think this kind of data should be handled?

How to reproduce?

No response

Expected behavior

No response

Additional technical information

No response

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

obermeier avatar Dec 07 '22 16:12 obermeier

Hi Stefan, thanks for opening the issue. I think this is a great question. I guess we have to distinguish between the 'guess schema' phase and the 'runtime' phase:

  • Guess schema:
    • If a value is null, we have the problem that the data type cannot be inferred. So I would say we have two options: either remove the property, or notify the user to select a data type manually. (I think I would prefer the second option, because the user can decide what should happen. This would require changes in the API and the UI.)
  • Runtime:
    • Here I see three options to deal with missing data: remove the whole event, remove the missing value, or provide a default value. I would prefer the second option. We recently implemented something similar for the data lake. The problem with this solution is that processing elements expect events to be complete (to have values for all properties). If we start to remove property values from events, then we need a mechanism to deal with incomplete events. Therefore, we would have to adapt the API for processing elements.
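For the guess schema phase, the second option (asking the user for a type) could be sketched roughly as follows. This is only an illustration: `GuessResult`, `guess`, and the type names are made up for the example and do not reflect the actual StreamPipes Connect API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not StreamPipes code.
class GuessSchemaSketch {

  // Holds the inferred type names plus the fields whose type the user must pick.
  static class GuessResult {
    final Map<String, String> inferredTypes = new LinkedHashMap<>();
    final List<String> needsUserInput = new ArrayList<>();
  }

  static GuessResult guess(Map<String, Object> sample) {
    GuessResult result = new GuessResult();
    for (Map.Entry<String, Object> entry : sample.entrySet()) {
      Object value = entry.getValue();
      if (value == null) {
        // No type can be inferred from null: flag the field instead of dereferencing it.
        result.needsUserInput.add(entry.getKey());
      } else if (value instanceof Number) {
        result.inferredTypes.put(entry.getKey(), "number");
      } else if (value instanceof Boolean) {
        result.inferredTypes.put(entry.getKey(), "boolean");
      } else {
        // Everything else is treated as a string in this simplified sketch.
        result.inferredTypes.put(entry.getKey(), "string");
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Object> sample = new LinkedHashMap<>();
    sample.put("name", 1);
    sample.put("age", null);

    GuessResult result = guess(sample);
    System.out.println("inferred: " + result.inferredTypes);
    System.out.println("ask user: " + result.needsUserInput);
  }
}
```

The UI would then have to present the flagged fields so the user can choose a data type, which is the API and UI change mentioned above.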

What are your thoughts?

tenthe avatar Dec 07 '22 18:12 tenthe

Hi Philipp,

thank you for the classification and the interesting thoughts.

  • Regarding the guess schema topic, I prefer the second option too.
    This raises interesting questions related to the runtime part.

  • Regarding the runtime topic, I like the second option too, but I have no clear idea how to handle the missing values.
    I did some experiments with a custom processor and realized that some of my external systems require all properties of the JSON object. In this case I could perhaps reconstruct the null values from the schema. If properties are removed at runtime, an extension of the output adapters could be useful that adds the missing values for external systems which need these properties.

    Because of these two problems I came up with the “default value” (Optional value) solution. But I am not sure what good default values could be, especially if they should be ignored in the processing. Using an optional attribute (which seems to exist in the schema!?) seems promising to me but appears to have many implications: checks in the processing components, handling in the semantic layer, …
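To illustrate the idea of reconstructing null values from the schema before an event leaves through an output adapter, here is a rough sketch. It assumes the schema is known as a list of field names; `restoreNulls` is a hypothetical helper, not an existing StreamPipes method.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not StreamPipes code.
class NullReconstructionSketch {

  // Rebuilds a complete event in schema order. Properties that were removed
  // at runtime are re-added with an explicit null value.
  // Note: fields that are not part of the schema are dropped in this sketch.
  static Map<String, Object> restoreNulls(List<String> schemaFields, Map<String, Object> event) {
    Map<String, Object> complete = new LinkedHashMap<>();
    for (String field : schemaFields) {
      // Map.get returns null for an absent key, which is the value to restore.
      complete.put(field, event.get(field));
    }
    return complete;
  }

  public static void main(String[] args) {
    List<String> schemaFields = Arrays.asList("name", "age");

    // Event after the null-valued "age" property was removed at runtime
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("name", 1);

    Map<String, Object> complete = restoreNulls(schemaFields, event);
    System.out.println("contains age: " + complete.containsKey("age"));
  }
}
```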

What do you think about these ideas?

obermeier avatar Dec 08 '22 08:12 obermeier

Thanks for reporting this, @obermeier! I think this raises the question of how we want to handle null values in general, doesn't it?

Thanks a lot, @tenthe, for already sketching some concepts to tackle this problem. For the schema guessing part, I agree with you both that option two is the cleanest option, and it is also my preferred one. Given that, I would prefer (speaking from a user or high-level perspective) the third option (providing a default value) at runtime. After all, if we are already asking the user to choose the correct data type, we should go that route completely and handle null values correctly. If we don't want to do that, we could just go with option one for schema guessing, imho. That being said, this issue is turning into a larger one, right? So I would suggest creating a discussion where we can brainstorm how to solve this technically.
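For reference, the three runtime options mentioned in this thread (remove the whole event, remove the null property, provide a default value) could be compared with a small sketch like the one below. The `Strategy` enum and `apply` method are invented for illustration only and say nothing about how StreamPipes would implement this.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration, not StreamPipes code.
class RuntimeNullHandlingSketch {

  enum Strategy { DROP_EVENT, DROP_PROPERTY, DEFAULT_VALUE }

  // Returns the processed event, or Optional.empty() if the event is discarded.
  static Optional<Map<String, Object>> apply(
      Strategy strategy, Map<String, Object> event, Map<String, Object> defaults) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Object> entry : event.entrySet()) {
      if (entry.getValue() != null) {
        out.put(entry.getKey(), entry.getValue());
        continue;
      }
      switch (strategy) {
        case DROP_EVENT:
          // Option one: a single null discards the whole event.
          return Optional.empty();
        case DROP_PROPERTY:
          // Option two: skip the property, leaving an incomplete event.
          break;
        case DEFAULT_VALUE:
          // Option three: substitute the configured default for this field.
          // If no default is configured, the value stays null in this sketch.
          out.put(entry.getKey(), defaults.get(entry.getKey()));
          break;
      }
    }
    return Optional.of(out);
  }

  public static void main(String[] args) {
    Map<String, Object> event = new LinkedHashMap<>();
    event.put("name", 1);
    event.put("age", null);

    Map<String, Object> defaults = new LinkedHashMap<>();
    defaults.put("age", 0);

    for (Strategy strategy : Strategy.values()) {
      System.out.println(strategy + " -> " + apply(strategy, event, defaults));
    }
  }
}
```

Option two produces events that no longer match the schema, which is why processing elements would need a way to deal with incomplete events; option three keeps events complete but requires the user to pick a sensible default.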

What are your thoughts? Any other opinions?

bossenti avatar Dec 08 '22 18:12 bossenti

Thanks to both of you.

@obermeier I also like the idea of optional values. But as you stated, this has multiple implications that we should discuss. I am not sure if we already support this in the event schema. There is a field `required` in the class EventProperty, but I do not know if we can use this field for this purpose or if we need a new attribute for that.

@bossenti I totally agree with you. This is a bigger topic that we should discuss in detail to come up with a good solution. +1 to start a discussion to collect all the implications of the changes and come up with a strategy to implement it.

tenthe avatar Dec 09 '22 10:12 tenthe

I totally agree with you two (@tenthe @bossenti )! Discussion +1

obermeier avatar Dec 10 '22 10:12 obermeier

@tenthe @obermeier I've started a discussion thread (#860) and summarized our discussion so far. Feel free to adjust anything or add your thoughts.

bossenti avatar Dec 10 '22 12:12 bossenti