
Feed stdin to all code blocks in run-parallel

Open simonrouse9461 opened this issue 4 years ago • 7 comments

Sometimes it's convenient to reuse data from a pipeline in multiple parallel jobs, for example:

cat data.csv | <some preprocessing> | run-parallel {
    each [line]{
        <some job>
    }
} {
    <some other job>
} {
    # bypass to screen
    cat
}

However, this piece of code won't work, because stdin is consumed by just one (arbitrarily chosen) of the code blocks in run-parallel. I feel it would make more sense to fork stdin to all code blocks simultaneously.
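One workaround under the current semantics (just a sketch; the temporary file path is illustrative) is to materialize the preprocessed stream once, then let every block read its own copy of the file:

cat data.csv | <some preprocessing> > /tmp/preprocessed
run-parallel {
    each [line]{
        <some job>
    } < /tmp/preprocessed
} {
    <some other job> < /tmp/preprocessed
} {
    # bypass to screen
    cat < /tmp/preprocessed
}

This avoids the contention entirely, at the cost of writing the intermediate data to disk.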

simonrouse9461 avatar Jul 15 '21 20:07 simonrouse9461

What a coincidence - just yesterday I was thinking about "multipipes" (my temporary name for it), an improved concept of pipes that allows reusing arbitrarily created/deleted streams (or variables) at arbitrary stages of a pipeline. It borrows ideas from synchronous programming, ECS (entity-component-system) architectures, compile-time handling of ephemeral (non-reified) optionals for error handling, etc.

Your issue seems to be about stream multiplexing (or publishing, as in pub-sub, if you want each parallel worker to get a copy of all the data), which is basically what would happen at each multipipe boundary.

I wanted to make a nice write-up of the multipipe concept but haven't had time to do that yet. So take this as a short sneak peek :wink:.
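As a rough sketch of the "copy of all the data in each worker" behavior in today's Elvish (assuming the data fits in memory), one can capture the value stream into a list once and hand every parallel block the same list:

# capture each line of the file as a value in a list
var lines = [(from-lines < data.csv)]
run-parallel {
    each [l]{ echo "worker 1: "$l } $lines
} {
    each [l]{ echo "worker 2: "$l } $lines
}

Here each worker iterates over its own reference to the full list, so no block starves another.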

dumblob avatar Jul 16 '21 08:07 dumblob

@simonrouse9461, How do you expect your proposal to work given that byte streams have no message boundaries? Consider that the cat parallel job will be reading from the stdin stream in chunks of unknown size; even if it read the byte stream one byte at a time, it would still interact in unpredictable ways with all the other readers. I don't see any way for this proposal to produce useful, predictable behavior. It seems to me the only practical approach is an explicit conversion from bytes to values prior to the run-parallel command. For example, this will distribute handling of the lines in data.csv to each of two jobs:

from-lines < data.csv | run-parallel ^
    { each [l]{ echo "Job 1:"$l } } ^
    { each [l]{ echo "Job 2:"$l } }

krader1961 avatar Jul 19 '21 04:07 krader1961

Sure, it'd need to be "values" in the pipe (and not bytes) as I understood the proposal.

dumblob avatar Jul 19 '21 07:07 dumblob

@krader1961 I think the current behavior is to read one line at a time already? If you try this:

while (sleep 1) { echo "a line\nanother line\nyet another line" } | run-parallel { 
    each [x]{
        echo value: $x
    }
}

the output will be:

value: a line
value: another line
value: yet another line
value: a line
value: another line
value: yet another line
value: a line
value: another line
value: yet another line
...

To me, the only problem is that it doesn't support multiple parallel blocks.

simonrouse9461 avatar Jul 19 '21 17:07 simonrouse9461

@simonrouse9461, Yes, the each command reads the byte stream one line at a time. But that isn't true for other commands and definitely isn't true for external commands like cat. Even if you restricted yourself to using each there would need to be a mutex that restricts reading from the byte stream to a single parallel job at any given instant. It's simpler and more efficient to use from-lines as I showed in my prior comment to convert the byte stream to a value stream. The value stream can be read efficiently by each run-parallel job.
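Note that with from-lines the values are distributed, not duplicated: each line is consumed by exactly one of the parallel jobs. A sketch (the interleaving of output is nondeterministic):

echo "a\nb\nc" | from-lines | run-parallel ^
    { each [l]{ echo "job 1 got "$l } } ^
    { each [l]{ echo "job 2 got "$l } }
# each of a, b, c appears exactly once, under either job 1 or job 2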

krader1961 avatar Jul 19 '21 23:07 krader1961

@simonrouse9461: Do you still feel like the example code block in your first comment should work? As opposed to leveraging the from-lines command and having each code block run in parallel, reading from the value channel? If you still believe your original example code block should work, I would be interested in how you think commands other than each (both builtin and external) would be modified to produce deterministic behavior.

krader1961 avatar Sep 08 '21 03:09 krader1961

@xiaq, I think this should be closed as "working as intended". There is no way to efficiently, or safely, read lines from the byte stream in parallel. Possibly the only thing to do with respect to this issue is include something like my example using from-lines in the documentation. However, it's far from clear where to place such an example and whether it would be useful for future Elvish users who don't understand how byte streams work.

krader1961 avatar Jun 13 '22 01:06 krader1961