CLI Indexer: silently ignore brokenpipe signal
This PR makes the cli indexer silently ignore the broken pipe signal (SIGPIPE) if the output is sent to a Unix pipe and not entirely consumed, e.g.
$> warcio index not-too-small.warc.gz | head
...
Traceback (most recent call last):
File ".../bin/warcio", line 33, in <module>
sys.exit(load_entry_point('warcio==1.7.4', 'console_scripts', 'warcio')())
File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/cli.py", line 55, in main
File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/cli.py", line 68, in indexer
File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/indexer.py", line 33, in process_all
File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/indexer.py", line 41, in process_one
File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/indexer.py", line 53, in process_index_entry
File ".../lib/python3.9/site-packages/warcio-1.7.4-py3.9.egg/warcio/indexer.py", line 87, in _write_line
BrokenPipeError: [Errno 32] Broken pipe
# alternatively
$> warcio index -o >(head) not-too-small.warc.gz
Sebastian, most Unix things do print the EPIPE, and it's sometimes very helpful when the indexing process has a separate error logfile from the other end of the pipe.
What situation do you have in mind for making it a silent failure?
What situation do you have in mind for making it a silent failure?
Using the indexer output as input for head or grep.
Of course, the indexer could show a short message about the "end of pipe" signal. But a stack trace isn't really useful, a broken pipe signal is hardly an issue of the indexer.
most Unix things do print the EPIPE
Good point. Yes, should exit with EPIPE. I'll update the PR. Thanks!
Usually (unless configured otherwise, eg. set -o pipefail for the bash), the shell ignores this error code for commands in the pipeline except the last one. But I'm not aware of any Unix tool (cat, sort, grep, etc.) which "prints" a message that it received a SIGPIPE. In any case, a stack trace is hardly useful as the reason for the SIGPIPE is outside warcio.
I think I'm remembering EPIPE being useful for things like scp and other network tools. But you're right, normal on-node tools like head have always been silent about EPIPE.
This is kind of similar to the situation of ^C being a silent situation for everything but Python.
Yes, Python is special here:
- SIGPIPE (see man 7 pipe is used to avoid errors ahead of the write call
- but Python explicitly ignores the signal (see signal.SIGPIPE) in order to handle write errors as ordinary Python exceptions
But also Unix tools handling report the SIGPIPE in their exit value as 128 + 13 = 141 (SIGPIPE is defined as 13 in /usr/include/asm-generic/signal.h):
bash -c '/bin/seq 50000 | head >/dev/null; echo ${PIPESTATUS[@]} "=>" $?'
141 0 => 0
If the shell is configured with pipefail, the entire pipeline fails:
bash -c 'set -o pipefail; /bin/seq 50000 | head >/dev/null; echo ${PIPESTATUS[@]} "=>" $?'
141 0 => 141
But I'm still not sure whether the exit value 141 is appropriate for a Python program - that was the last point I thought about. Btw., the post effective handling of the SIGPIPE signal provides a good summary of the problem.
Note: I'm also fine with a short error message BrokenPipeError: [Errno 32] Broken pipe but without the long stacktrace.