gleam icon indicating copy to clipboard operation
gleam copied to clipboard

Broadcast variables

Open guyneedham opened this issue 7 years ago • 2 comments

Hi,

I'm interested in switching away from Spark, but many use cases I've encountered with MapReduce, Spark, and Flink, benefit massively from broadcasting variables for example a map of reference data.

Does Gleam already support this? If not, how would you add such behaviour? I'd be happy to contribute if you can point me at how conceptually this would be achieved.

G

guyneedham avatar Jan 04 '18 11:01 guyneedham

Gleam does not support this yet. I was thinking to refactor the dataset.Map() function a bit to support an interface, with before() and after() functions. The broadcast variables can be initialized and persisted there, talking to an external key-value store.

chrislusf avatar Jan 04 '18 21:01 chrislusf

Users could use something like https://github.com/boltdb/bolt to avoid setting up external stores just to broadcast maps.

How much work would be involved with this refactoring? How would one ensure that the before() and after() are not required for each Map() defined?

How would one ensure that the work done in before() is not repeated for each call to the Map() function? For example in Apache Spark users have the mapPartitions() function which is called once for each partition of the data allowing expensive objects to be set up there. Additionally broadcast objects are only copied over to each executor once. These are both useful properties for a broadcast variable.

Apache Flink maintains an external store, rocksdb, for managing state.

guyneedham avatar Jan 05 '18 14:01 guyneedham