How to prioritize executors running on different agents when allocating jobs?
I am wondering if there is a way in gleam to better distribute jobs across available agents rather than just executors. Consider the following example:
- read a text file containing (key, value) pairs, with many keys repeated (a sample is shown after this list)
- partition by key, process each value and transform it into something else
- write the values to disk, distributed evenly by key across the agents
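
To make the input concrete, the file contents look roughly like this (the tab separator and the particular keys/values are made up for illustration):

```
key3	valueA
key1	valueB
key3	valueC
key6	valueD
```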
Let's say there are 6 possible keys, ["key1", "key2", ..., "key6"], and I am running 3 agents on different machines. Ideally, I would like to have the following distribution after my flow:
- agent1: only has values with two different keys
- agent2: only has values with two different keys
- agent3: only has values with two different keys
For the example, consider the following flow:
```go
const NUM_AGENTS = 3
...
f := flow.New("example").
	Read(file.Txt("data/*", 4)).
	Map("process the values", ProcessValues).
	PartitionByKey("shard by key", NUM_AGENTS).
	Map("write to local file on agent's filesystem", WriteToFile)
...
```
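
ProcessValues and WriteToFile are mappers registered with gio; a simplified sketch of them is below (the tab-separated split and the output path are placeholders for what my real mappers do):

```go
import (
	"fmt"
	"os"
	"strings"

	"github.com/chrislusf/gleam/gio"
)

// Mapper IDs referenced by the flow above; gio.Init() runs at the start of
// main so these mappers can also execute on remote executors.
var (
	ProcessValues = gio.RegisterMapper(processValues)
	WriteToFile   = gio.RegisterMapper(writeToFile)
)

// processValues splits an input line into (key, value), transforms the value,
// and emits the pair with the key as the first field.
func processValues(row []interface{}) error {
	parts := strings.SplitN(gio.ToString(row[0]), "\t", 2)
	key, value := parts[0], parts[1]
	// ... actual transformation of value goes here ...
	return gio.Emit(key, value)
}

// writeToFile appends each (key, value) pair to a file on the local
// filesystem of whichever agent the executor happens to be running on.
func writeToFile(row []interface{}) error {
	key := gio.ToString(row[0])
	f, err := os.OpenFile("/data/out/"+key+".txt",
		os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "%s\t%v\n", key, row[1])
	return err
}
```

As far as I understand, PartitionByKey hashes the first field of each row, which is why the key is kept in position 0.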
Using this flow, I don't get the results I'm looking for. Instead, gleam simply finds available executors, regardless of the agent they're running on, so I might get something that looks like this:
- agent1: has values with three different keys
- agent2: has values with one key
- agent3: has values with two different keys
Is there a way to prioritize executors based on the agent they run on when jobs are allocated? In particular, I would like the partition step to shard the 6 keys as 2, 2, and 2 onto separate agents, and to use only the executors available on those agents.
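
For completeness, this is how the flow is launched; the behaviour above is in distributed mode, where the master hands tasks to whichever executors are free. This is a minimal sketch of my main, assuming the mapper registrations from the previous snippet:

```go
package main

import (
	"github.com/chrislusf/gleam/distributed"
	"github.com/chrislusf/gleam/flow"
	"github.com/chrislusf/gleam/gio"
	"github.com/chrislusf/gleam/plugins/file"
)

const NUM_AGENTS = 3

func main() {
	// gio.Init() has to run first so the registered mappers can be invoked
	// when this binary is shipped to the agents' executors.
	gio.Init()

	f := flow.New("example").
		Read(file.Txt("data/*", 4)).
		Map("process the values", ProcessValues).
		PartitionByKey("shard by key", NUM_AGENTS).
		Map("write to local file on agent's filesystem", WriteToFile)

	// Distributed mode: the gleam master schedules each task onto any
	// available executor; I cannot find an option to pin a shard to a
	// particular agent, which is exactly what I am asking about.
	f.Run(distributed.Option())
}
```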