Atum redesign

Open yruslan opened this issue 4 years ago • 6 comments

Background

Currently, Atum relies on the global state of a Spark application. This complicates using Atum for jobs that are even slightly more complicated than a single-dataframe pipeline. If a job has several dataframes and several reads and writes, and not every read or write is associated with control measurements, Atum will still try to process all dataframes as if they all required measurements.

The current workaround for such use cases is the disableControlMeasuresTracking() method, which is invoked before writing a dataframe that does not require control measurements.

Feature

  • [ ] Control measurements should be attached to a dataframe, not to the Spark session. E.g., to turn on control measurements users should call df.enableControlMeasuresTracking() instead of spark.enableControlMeasuresTracking(). The same applies to switching control measurements off.
  • [ ] Measurements should apply only to the dataframe on which tracking was enabled and to the dataframes derived from it. Other dataframes shouldn't be affected.
  • [ ] Checkpoints and other housekeeping information should not be kept in the global state.
  • [ ] Adding metadata should be done as dataframe implicits (e.g. df.setAdditionalInfo(...)).
  • [ ] Atum should keep checkpoints for each registered dataframe separately.
  • [ ] Atum plugins should have an event that is guaranteed to be sent last. Atum should guarantee that no more events are sent after that.
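
To make the first five items more concrete, here is a minimal sketch of what per-dataframe tracking could look like. It is not the Atum API: `Frame` is a stand-in for a Spark DataFrame so the sketch runs without Spark, and `TrackedFrame`, `Checkpoint`, `transform`, `setCheckpoint` and the `(key, value)` signature of `setAdditionalInfo` are all hypothetical names chosen for illustration. Only `enableControlMeasuresTracking` and `setAdditionalInfo` are taken from the list above.

```scala
// Hypothetical sketch; none of these types are the real Atum API.
final case class Checkpoint(name: String, recordCount: Long)

// Stand-in for a Spark DataFrame (assumption), so this runs without Spark.
final case class Frame(rows: Vector[Int]) {
  def filter(p: Int => Boolean): Frame = Frame(rows.filter(p))
}

// Tracking state travels with the frame instead of living in global state.
final case class TrackedFrame(
    frame: Frame,
    checkpoints: Vector[Checkpoint] = Vector.empty,
    additionalInfo: Map[String, String] = Map.empty) {

  // A derived frame keeps the checkpoints and metadata of its parent.
  def transform(f: Frame => Frame): TrackedFrame = copy(frame = f(frame))

  // Records a record-count measurement for the current frame.
  def setCheckpoint(name: String): TrackedFrame =
    copy(checkpoints = checkpoints :+ Checkpoint(name, frame.rows.size.toLong))

  // Attaches metadata to this frame only.
  def setAdditionalInfo(key: String, value: String): TrackedFrame =
    copy(additionalInfo = additionalInfo + (key -> value))
}

object AtumSketch {
  // Tracking is switched on per frame through an implicit extension method.
  implicit class FrameOps(frame: Frame) {
    def enableControlMeasuresTracking(): TrackedFrame = TrackedFrame(frame)
  }
}

object Demo extends App {
  import AtumSketch._

  val tracked = Frame(Vector(1, 2, 3, 4))
    .enableControlMeasuresTracking()
    .setAdditionalInfo("source", "example")
    .setCheckpoint("raw")
    .transform(_.filter(_ % 2 == 0))
    .setCheckpoint("evens")

  // This frame never opted in, so nothing is measured for it.
  val untracked = Frame(Vector(10, 20)).filter(_ > 10)

  println(tracked.checkpoints)
}
```

Because every operation returns a new `TrackedFrame`, each registered dataframe carries its own checkpoints and nothing is shared between unrelated dataframes. How the state would be carried across real Spark transformations is left open here.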

Additional context

After the new design is confirmed, this issue can be converted to an epic and all subitems to tasks.

yruslan avatar Apr 17 '20 07:04 yruslan

Makes sense.

lokm01 avatar Apr 17 '20 07:04 lokm01

I would also propose redesigning some parts so that they are immutable and written in a functional style. What do you think?

AdrianOlosutean avatar Apr 17 '20 12:04 AdrianOlosutean

Absolutely.

lokm01 avatar Apr 18 '20 06:04 lokm01

Not sure about the last one as it's described, particularly in light of the changes above. If Atum were "attached" to a dataset, it would make sense to send a "last message" for that dataframe. But I am not sure there would be anything to reliably hook such an event to. 🤔

benedeki avatar Apr 20 '20 07:04 benedeki

Yeah, it would probably be hard to implement an event that is sent last per dataset. But an event that is sent last during the lifetime of the application could be useful.
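
A rough sketch of the application-lifetime variant, under stated assumptions: `Plugin`, `EventDispatcher`, `send` and `close` are hypothetical names, not the existing Atum plugin API. The dispatcher sends the last event at most once and drops anything that arrives afterwards. A JVM shutdown hook is used as the trigger here purely for illustration; a Spark application-end listener might be another option.

```scala
// Hypothetical plugin interface; not the real Atum plugin API.
trait Plugin {
  def onEvent(name: String): Unit
  def onLastEvent(): Unit
}

final class EventDispatcher(plugins: Seq[Plugin]) {
  private var closed = false

  // Forwards the event and returns true, or returns false and sends
  // nothing if the last event has already gone out.
  def send(name: String): Boolean = this.synchronized {
    if (closed) false
    else {
      plugins.foreach(_.onEvent(name))
      true
    }
  }

  // Sends the last event exactly once; repeated calls are no-ops.
  def close(): Unit = this.synchronized {
    if (!closed) {
      closed = true
      plugins.foreach(_.onLastEvent())
    }
  }
}

object LastEventDemo extends App {
  val printer = new Plugin {
    def onEvent(name: String): Unit = println(s"event: $name")
    def onLastEvent(): Unit = println("last event")
  }
  val dispatcher = new EventDispatcher(Seq(printer))

  // Trigger the last event when the JVM shuts down (illustrative choice).
  sys.addShutdownHook(dispatcher.close())

  dispatcher.send("checkpoint written")
}
```

Both methods synchronize on the dispatcher, so an event cannot slip in between setting the flag and notifying the plugins.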

yruslan avatar Apr 20 '20 11:04 yruslan

Fields such as Country and others should be made optional, and only the functional ones should be mandatory.

AdrianOlosutean avatar Jul 02 '20 08:07 AdrianOlosutean