gleam
gleam copied to clipboard
Broadcast variables
Hi,
I'm interested in switching away from Spark, but many use cases I've encountered with MapReduce, Spark, and Flink, benefit massively from broadcasting variables for example a map of reference data.
Does Gleam already support this? If not, how would you add such behaviour? I'd be happy to contribute if you can point me at how conceptually this would be achieved.
G
Gleam does not support this yet. I was thinking to refactor the dataset.Map() function a bit to support an interface, with before() and after() functions. The broadcast variables can be initialized and persisted there, talking to an external key-value store.
Users could use something like https://github.com/boltdb/bolt to avoid setting up external stores just to broadcast maps.
How much work would be involved with this refactoring? How would one ensure that the before() and after() are not required for each Map() defined?
How would one ensure that the work done in before() is not repeated for each call to the Map() function? For example in Apache Spark users have the mapPartitions() function which is called once for each partition of the data allowing expensive objects to be set up there. Additionally broadcast objects are only copied over to each executor once. These are both useful properties for a broadcast variable.
Apache Flink maintains an external store, rocksdb, for managing state.