How to prioritize executors running on different agents when allocating jobs?

Open svmhdvn opened this issue 6 years ago • 0 comments

I am wondering if there is a way in gleam to better distribute jobs across available agents rather than just executors. Consider the following example:

  1. read a text file containing (key, value) pairs, with many keys being repeated
  2. partition by keys, process the value and transform it to something
  3. write values to disk distributed evenly by keys across agents

Let's say there are 6 possible keys, ["key1", "key2", ... , "key6"], and I am running 3 agents on different machines. Ideally, I would like to have the following distribution after my flow:

  - agent1: only has values with two different keys
  - agent2: only has values with two different keys
  - agent3: only has values with two different keys
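To make the target distribution concrete, here is a minimal sketch in plain Go. It does not use gleam at all: `assignKeys` is a hypothetical helper introduced only for illustration, and it says nothing about how gleam actually places executors on agents. It just deals the sorted keys out round-robin so that each agent ends up with an equal share.

```go
package main

import (
	"fmt"
	"sort"
)

// assignKeys is a hypothetical helper (NOT part of gleam).
// It sorts the keys and deals them out round-robin, so every agent
// index in [0, numAgents) receives an equal share when the key count
// is a multiple of numAgents.
func assignKeys(keys []string, numAgents int) map[int][]string {
	if numAgents <= 0 {
		return nil
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]string(nil), keys...)
	sort.Strings(sorted)

	out := make(map[int][]string, numAgents)
	for i, k := range sorted {
		agent := i % numAgents
		out[agent] = append(out[agent], k)
	}
	return out
}

func main() {
	keys := []string{"key1", "key2", "key3", "key4", "key5", "key6"}
	const numAgents = 3

	assignment := assignKeys(keys, numAgents)
	// Iterate by index rather than ranging over the map,
	// because Go map iteration order is not deterministic.
	for a := 0; a < numAgents; a++ {
		fmt.Printf("agent%d: %v\n", a+1, assignment[a])
	}
}
```

With 6 keys and 3 agents, each agent gets exactly 2 keys. What I am missing is a way to make the flow pin each of those shards to a distinct agent.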

For the example, consider the following flow:


const NUM_AGENTS = 3
...
    f := flow.New("example").
        Read(file.Txt("data/*", 4)).
        Map("process the values", ProcessValues).
        PartitionByKey("shard by key", NUM_AGENTS).
        Map("write to local file on agent's filesystem", WriteToFile)
...

Using this flow, I don't get the results I'm looking for. Instead, gleam simply finds available executors, regardless of the agent they're running on, so I might get something that looks like this:

  - agent1: has values with three different keys
  - agent2: has values with one key
  - agent3: has values with two different keys

Is there a way to prioritize agents that do not already have jobs running on them? In particular, I would like the partition step to shard the 6 keys into groups of 2, 2, and 2, with each group running on a separate agent and using only the executors available on that agent.

svmhdvn, Nov 20 '18 16:11