enrich
enrich copied to clipboard
Add incomplete events
This PR adds incomplete events to Enrich flow.
Before, enriching an event was resulting in either a BadRow or an EnrichedEvent. What's new is that now a BadRow can have an EnrichedEvent associated with it, which is the incomplete event.
All the changes are related to the fact that now we don't short circuit the processing as soon as there is an error but we keep going with what is valid.
As a result we're not using ValidatedNel and EitherT any more.
To make things easier to follow, I reorganized the flow in 3 steps:
- Mapping and validating the input.
- Running the enrichments. This step now takes care of validating the enrichments contexts.
- Validating the atomic fields lengths.
Thanks for the review @istreeter ! I will address your feedback.
I do think it has made the code harder to read compared with how it was before
Yeah I feel the same :smiling_face_with_tear: But the logic has become more complicated, so I'm afraid it's an inevitable outcome. Now after each step we need to check if there are bad rows and the value of emitIncomplete.
The point about Ior is that it has a Monad on the right hand side
I'm not sure that we will benefit from it, as we need to check for bad rows after each step, and not only at the end. But I'll try Ior to see if it can simplify the workflow.
Hi @istreeter , I addressed your feedback, please have a look. I agree that having Ior[BadRow, EnrichedEvent] as the return type of enrichEvent looks nicer. Then I tried to use Ior everywhere in the flow but I didn't like it. One problem is that to get the errors you need to pattern match and to have case Ior.Left whereas we know already that we will never be in this case.
for { a <- doSomethingThatMightYieldBadRows(collectorPayload) b <- doSomethingElseThatMightYieldBadRows(a) event <- andSomethingElse(b) } yield event ...so you accumulate failures on the left hand side, while continuing to process the event on the right hand side.
I might be missing something because I don't see how this can work. Before executing doSomethingElseThatMightYieldBadRows we need to check the errors that are in a (not everyone will have incomplete events activated), how would that work? I want to be able to accumulate the errors and to continue the processing (so no short circuiting with IorT).
An alternative that I envisaged was to use IorT and to have a Ref[F, List[BadRow]] . We would pass it to each function, along with emitIncomplete. Then inside each function, when there is an error, if emitIncomplete is set to false, we return a Left which short circuis the rest of the processing. If set to true, we add the bad row to the Ref and we return Right, which will continue the processing. And at the end we look at what is inside the Ref to decide what to return. Do you think that it would be nicer? We can jump on a call if it's unclear.
Hey @benjben
Before executing doSomethingElseThatMightYieldBadRows we need to check the errors that are in a (not everyone will have incomplete events activated), how would that work?
Imagine for a second that everyone has incomplete events activated. Would that remove the need to check errors after each step, and therefore remove the part that makes it messy?
Under what circumstances would someone have this feature disabled? And in those cases, could we continue to process the event, but simply not emit it in the final step?
(Not saying this is the right solution, but it's worth exploring while we're trying to find options available to us).
Hey @istreeter
Imagine for a second that everyone has incomplete events activated. Would that remove the need to check errors after each step, and therefore remove the part that makes it messy?
Indeed that would remove the need for that and would make the flow look nicer.
Under what circumstances would someone have this feature disabled?
This feature will cost money as it will require an additional stream and additional apps to run, so I can imagine customers not wanting it.
@colmsnowplow should we consider that once this feature exists we will activate it for all our customers ?
And in those cases, could we continue to process the event, but simply not emit it in the final step?
For sure we can but that would waste resources. That's not very satisfying to force features only to make the code look nicer.
I think I can offer a point of view here.
First just to double check I'm reading it right:
Before executing doSomethingElseThatMightYieldBadRows we need to check the errors that are in a (not everyone will have incomplete events activated), how would that work?
I'm not 100% clear why we need to check the errors, but it sounds like it's so that we can decide not to continue processing the event if we hit an error. Just stating this in case it's not correct. :)
Under what circumstances would someone have this feature disabled?
This could happen! An example - one of the customers in the research uses the JS enrichment for custom bot filtering. They specifically don't want that data to go to the db. First version of this they will most likely want to just disable it. Some customers might simply not want the additional cost of an extra stream and an extra loader.
And in those cases, could we continue to process the event, but simply not emit it in the final step?
This would be acceptable. But perhaps we can find something cleaner. (Not sure as haven't yet carved out the time to read this PR properly)
@colmsnowplow should we consider that once this feature exists we will activate it for all our customers ?
I think we were editing at the same time, think I answered that at least at first I don't think we can. @stanch, FYI, think you'll agree. :)
I'm not 100% sure I understand the code enough to weigh in heavily, but I think at this point it is worth considering the other option - the goal is to choose the right design, so if there's a trade-off like this I think we should consider it.
On this point:
For sure we can but that would waste resources. That's not very satisfying to force features only to make the code look nicer.
I wouldn't say it's just about making the code look nice. It's about making the code easier to understand and easier to work with. Which in turn reduces risk of bugs.
I don't know which option is better but I do think that simpler code has tangible value, and it's worth consideration. If one option is margnially less efficient but significantly less complex, IMO it is often the better choice. (But I don't know enough to say if that applies here)
I had already approved it, but it looks even better now ✅