[don't merge to main yet] Don't flush after step completes
The motivation for this PR is to avoid flushing every time we complete a step. There should only be two times we flush to disk/upload:
- When we reach the graphObjectBufferThresholdInBytes of data in memory
- When all steps are done.
Also, I don't think it is optimal to split the data by step; that would generate many more unnecessary uploads.
Tried it on dev, on instance jupiterone-integration-dev: we went from 432 uploads in a single job to 3.
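The flush policy described above (flush when the in-memory threshold is reached or when all steps are done, never just because a step finished) can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual implementation: `BufferedGraphObjectStore`, `add`, `stepCompleted`, `flushAll`, and the synchronous `upload` callback are hypothetical names, and object size is approximated by the length of the serialized JSON. Only `graphObjectBufferThresholdInBytes` comes from the description.

```typescript
// Hypothetical sketch; only graphObjectBufferThresholdInBytes is named in the PR.
type GraphObject = { _type: string; _key: string };

class BufferedGraphObjectStore {
  private buffer: GraphObject[] = [];
  private bufferedBytes = 0;
  private readonly graphObjectBufferThresholdInBytes: number;
  private readonly upload: (batch: GraphObject[]) => void;

  constructor(
    graphObjectBufferThresholdInBytes: number,
    upload: (batch: GraphObject[]) => void,
  ) {
    this.graphObjectBufferThresholdInBytes = graphObjectBufferThresholdInBytes;
    this.upload = upload;
  }

  // Buffer an object; flush only if the size threshold is reached.
  add(obj: GraphObject): void {
    this.buffer.push(obj);
    // Assumption: serialized JSON length approximates the in-memory size.
    this.bufferedBytes += JSON.stringify(obj).length;
    if (this.bufferedBytes >= this.graphObjectBufferThresholdInBytes) {
      this.flushAll();
    }
  }

  // Completing a step intentionally does NOT trigger a flush.
  stepCompleted(): void {}

  // Called on threshold, and once more after all steps are done.
  flushAll(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.bufferedBytes = 0;
    this.upload(batch);
  }
}
```

With this shape, the number of uploads depends on the volume of data rather than on the number of steps, which matches the intent of reducing per-step uploads.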
Looks good to me. Let's make an alpha version.
@Gonzalo-Avalos-Ribas I think a lot of our changes in this were made to figure out which _types we should mark as partial if an upload fails. What if we just iterated over the entities/relationships after an upload fails and marked those as partial, instead of trying to infer it from the step? I think it might simplify some of the other code, where we could get rid of things like stepsInvolvedInUploads.
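A minimal sketch of that suggestion, assuming the failed batch is still available when the upload error is handled. `UploadedObject` and `collectPartialTypes` are hypothetical names for illustration, not existing SDK functions; the only assumption about the data is that each entity or relationship carries a `_type`.

```typescript
// Hypothetical sketch: derive partial _types directly from a failed batch.
type UploadedObject = { _type: string; _key: string };

function collectPartialTypes(failedBatch: UploadedObject[]): string[] {
  const types = new Set<string>();
  for (const obj of failedBatch) {
    types.add(obj._type); // each distinct _type in the failed upload is partial
  }
  // Sorted only to make the result deterministic.
  return Array.from(types).sort();
}
```

The result could then be recorded as the set of partial types for the job, without tracking which steps contributed to each upload.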
@zemberdotnet Actually, the changes are for failing the steps that have graphObjects in the uploads that fail, but marking them as partial could also work. How do the steps show up in the event logs if they are marked as partial? How can we mark an entity as partial during execution?