zsh-async
Notify feature for non-interactive scripts is unreliable with a large number of jobs [1000+]
I have created a batch conversion script to convert thousands of songs in zsh using this package. I am noticing, however, that after several hundred jobs the notify feature fails and I get stuck in an infinite loop waiting for a job to finish that is actually already done.
I know the job is finished because I am using the 'watch' utility to watch the subprocess run to completion. My setup is more complex than, but similar to, your example. I use a loop to process only '$(nproc) * 2' jobs at a time so as not to overload the CPU, so when this bug occurs the script can't continue to the next batch and I have to kill it to continue.
I have tried killing the worker after each iteration and starting a new one at the beginning of the loop, but the same behavior continues. I have also tried modifying the script to run without limiting the number of jobs, and the same behavior still occurs after 200-300 jobs.
This is a big pain because once it crosses the threshold, the script constantly fails, and I can only process 2 or 3 iterations on the next try before it fails again (my script checks for files that are already converted, so a skipped file still counts as a job even though no conversion takes place).
Here is the relevant code section:
readonly -i cores=$(( $(nproc) * 2 ))
for (( i=1, j=cores; i <= (${#files} + cores); i+=cores, j+=cores )); do
  if [[ $j -gt ${#files} ]]; then
    j=${#files}
  fi
  async_start_worker async_conversion -n # <-- same behavior even if this is outside the loop
  async_register_callback async_conversion completed_callback
  for (( goo=i; goo <= j; goo++ )); do
    if [[ $goo -le ${#files} ]]; then
      async_job async_conversion convert ${files[$goo]}
    fi
  done
  while (( COMPLETED_JOBS < j )); do # <-- infinite loop starts here
    echo $COMPLETED_JOBS -- $j
    sleep 0.001
  done
  async_stop_worker async_conversion
done
Hi @Asumerodi, and thanks for your report! I find it a bit hard to understand exactly how this code is executing; could you post a more complete sample?
Also, what version of zsh are you running? There are problems on at least 5.0.2 and 5.0.8 that I can't do anything about. Furthermore, the use of kill-signals (before zsh 5.2) is not optimal for a lot of simultaneous job execution.
One thing to keep in mind is that eventually the shell will not be able to create more forks, so there's a limit to how many async_jobs can run simultaneously, especially if your job creates new forks.
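As a quick sanity check on that limit, you can compare the per-user process ceiling against how many processes are already running. A minimal sketch (assumes a Linux-like system where `ulimit -u` reports the per-user process limit):

```shell
# Per-user limit on simultaneous processes; every async_job costs at least
# one fork, more if the job itself spawns subprocesses (ffmpeg, etc.).
limit=$(ulimit -u)
echo "process limit: $limit"

# Rough count of processes this user is already running.
current=$(ps -u "$(id -un)" -o pid= | wc -l)
echo "already running: $current"
```

If `current` plus the number of simultaneous jobs (times the forks each job makes) approaches `limit`, fork failures become a plausible culprit.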
You could try changing this line to redirect errors inside the worker to a file, e.g. exec 2>> /tmp/err.log (note the double >> for append). This might give some helpful hint as to what is going wrong.
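For illustration, this is how that redirection behaves in isolation (a minimal sketch; the temp file stands in for /tmp/err.log, and the subshell stands in for the worker):

```shell
log=$(mktemp)

# Inside the subshell, exec 2>> rebinds stderr so that everything written
# to fd 2 from then on is appended to the log file.
(
  exec 2>>"$log"
  echo "first error" >&2
  echo "second error" >&2
)

cat "$log"   # -> first error
             # -> second error
rm -f "$log"
```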
Sorry I've got nothing concrete for you yet, need to understand the problem you're having better first 😄.
Re: complete sample. You don't really need to share your business logic, code that runs with a "noop" function would help.
Actually, I might in fact understand the problem here. Since you're using async in a script, ZLE is not available, and thus -n (notify) will only work via kill.
I recommend you change a few bits:
test_large_number_of_jobs() {
  readonly -i cores=4
  readonly -i files=1000
  integer COMPLETED_JOBS=0
  integer i j goo

  completed_callback() {
    print cb $@
    COMPLETED_JOBS=$((COMPLETED_JOBS + 1))
  }
  convert() {
    print $1
  }

  async_start_worker async_conversion
  for (( i=1, j=cores; i <= (files + cores); i+=cores, j+=cores )); do
    print $i $j
    if [[ $j -gt $files ]]; then
      j=$files
    fi
    for (( goo=i; goo <= j; goo++ )); do
      print $i $j $goo
      if [[ $goo -le $files ]]; then
        async_job async_conversion convert $goo
      fi
    done
    while (( COMPLETED_JOBS < j )); do
      sleep 0.001
      async_process_results async_conversion completed_callback
      echo $COMPLETED_JOBS -- $j
    done
  done
  async_stop_worker async_conversion
}
I changed some things to keep it simple when writing the test function, but the biggest change is that we don't use -n (notify) in the worker and instead call async_process_results async_conversion completed_callback manually in the loop. The way your code was structured, there was never any need for the notify part 😄.
I know the code is ugly and I was even reluctant to post it in that form, and I appreciate the workaround; I will certainly be using it for now. The zsh version is 5.3.1 on Arch Linux, so unless there is a regression in this version I dunno if what you mention applies. Essentially the loop runs (number of detected cores) × 2 jobs at a time; on my system that is 16 jobs running concurrently. Once all 16 finish, 16 more are started, until all conversions are finished.
The only other meaningful code in the script, besides option parsing, is this function, which handles the conversion.
convert () {
  local -r file=$1
  if [[ -n ${outputdir} ]]; then
    local -r outdir=${outputdir[3]} # <-- user-defined directory to put converted files
  else
    local -r outdir=$MUSICDIR/ALAC # <-- if not specified on the command line, use the default
  fi
  local -r outfile=${outdir}/${file%flac}m4a

  # error check, unimportant to execution
  if [[ -n ${musicdir} ]] && [[ $(dirname ${file%/*}) != "." ]]; then
    mkdir -p ${outdir}/${file%/*}
  else
    mkdir -p ${outdir}
  fi

  # the print calls are just for async_process_results
  if ! [[ -f ${outfile} ]]; then
    if ! ffmpeg -i ${file} -vn -c:a alac ${outfile}; then
      print -u2 "'${outfile##*/}'"
      return 1
    fi
    print "'${outfile##*/}'"
  else
    print -u2 "'${outfile##*/}': exists"
    return 3
  fi
  return 0
}
Sorry if it's still hard to follow; this was just a brain dump that I want to clean up after I get it working correctly.
I know the code is ugly and was even reluctant to post in that form, and I appreciate the work around.
Don't be so hard on yourself, it is fine 😄.
Zsh version is 5.3.1 on Arch Linux so unless there is a regression in this version I dunno if what you mention applies.
No, you're fine in that regard. I honestly didn't consider at first that this was running as a script (without ZLE).
the only other meaningful code in the script besides option parsing is this function which handles conversion.
Thanks for posting it. Now, I don't see anything obvious; however, would you mind changing $(dirname ${file%/*}) to ${${file:h}%/*} to see if it has any effect? I'm curious to see if not using a subshell here would help (fewer forks).
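The difference, for anyone following along: zsh's :h modifier is a built-in "head" (dirname) operation, so the whole expression stays inside the shell, whereas $(dirname ...) forks a subshell and execs a binary for every file. A POSIX-portable sketch of the same idea (the sample path is made up):

```shell
file="artist/album/01 track.flac"

# With a fork: command substitution spawns a subshell that runs dirname.
with_fork=$(dirname "${file%/*}")

# Fork-free: two parameter expansions, each stripping the last /component.
dir=${file%/*}          # artist/album  (what zsh writes as ${file:h})
without_fork=${dir%/*}  # artist        (the outer %/* in ${${file:h}%/*})

echo "$with_fork"       # -> artist
echo "$without_fork"    # -> artist
```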
I really wish I could make the notification (-n) feature more reliable when run in scripts, but kill-signals are not reliable in zsh, at least not when received in rapid succession; some will be lost along the way 😞.
There is one way to utilize the ZLE watcher in scripts as well, but you'd have to start an interactive shell to accomplish that, there's an example of how to do it in async_test.zsh#L482-L512.
PS. Out of curiosity, did you try the exec 2>> /tmp/err.log to see if it spat out any errors?
I did try appending to '/tmp/err.log' on the line you mentioned; however, after the script gets stuck in the loop there is no error in that file, it's just empty. I also tried changing that 'dirname' to a parameter expansion as you suggested, and I am still getting the same behavior. Fortunately, though, your workaround of calling async_process_results directly works perfectly, so at least I can get the job done. The next step is to generalize the script to convert to arbitrary formats.
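For the generalization step, one possible shape (a sketch, not part of the script; the codec flags are standard ffmpeg options but the format list and the codec_args name are illustrative) is a small format-to-codec table:

```shell
# Map a target format to the matching ffmpeg audio-codec arguments.
codec_args() {
  case $1 in
    m4a)  echo "-c:a alac" ;;              # Apple Lossless, as in the original script
    flac) echo "-c:a flac" ;;
    mp3)  echo "-c:a libmp3lame -q:a 0" ;;
    ogg)  echo "-c:a libvorbis -q:a 6" ;;
    *)    echo "unsupported format: $1" >&2; return 1 ;;
  esac
}

codec_args m4a   # -> -c:a alac
codec_args mp3   # -> -c:a libmp3lame -q:a 0
```

A generalized convert could then substitute the table lookup for the hard-coded -c:a alac, and derive the output extension from the same format name.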
Thanks for testing those cases @Asumerodi, it means I can cross some of my suspicions off the list. I'll try to look into this more deeply when I have some extra time. In the meantime, I'm glad that the work-around helped 🙂. Good luck with your script!
@Asumerodi it may be 2 years too late, but I might have a fix for your original issue in #45. If you're still using async for these kinds of things, feel free to give it a spin. If not, just thought I'd let you know.
Update 1: It turns out zpty -w $worker "$@"$'\0' failed after too many jobs had been sent.
Behavior
@mafredri I may have encountered a similar problem in my async zsh prompt. I previously made 22 sections in the prompt async, but I found that 3 of the sections are never rendered in my prompt.
What I've dug into
I added an echo to the function _async_send_job, and I'm sure zpty -w $worker "$@"$'\0' is called and the job is passed to the worker. Later, I tried enabling debug in GH-45, and I found no info about the last three sections in the debug log.
_async_send_job() {
  echo "_async_send_job: $@"
  echo "_async_send_job: $@" >> /tmp/sz.log
  zpty -w $worker "$@"$'\0'
}
_async_send_job: _async_eval setopt EXTENDED_GLOB
_async_send_job: sz::async_wrapper sz::vcs vcs·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::package package·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::node node·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::ruby ruby·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::elm elm·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::elixir elixir·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::xcode xcode·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::swift swift·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::golang golang·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::php php·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::rust rust·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::haskell haskell·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::julia julia·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::vagrant vagrant·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::docker docker·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::aws aws·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::conda conda·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::pyenv pyenv·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::dotnet dotnet·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::ember ember·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::kubectl kubectl·\|·prompt·\|·
_async_send_job: sz::async_wrapper sz::terraform terraform·\|·prompt·\|·
The keywords ember, kubectl, and terraform didn't appear in the debug log assigned by ASYNC_DEBUG_WORKER_STDERR=/tmp/sz.log.
To check whether the worker is killed by my prompt code, I tried intercepting the function async_stop_worker:
async_stop_worker() {
  echo "----------------" >> /tmp/sz.log
  echo "Stopping $@" >> /tmp/sz.log
  # ... original async_stop_worker body ...
}
And the worker is only killed, and the message printed, once in the precmd hook, which is the expected behavior.
Seems like a bug in zpty?
Reproduce the bug
Load my prompt spacezsh and enable all async sections by modifying the default settings in default.zsh.
https://github.com/laggardkernel/spacezsh-prompt/blob/31ea7c92519efc8248330eb8d83bb9f17e002c85/presets/default.zsh#L9-L45
if [ -z "$SPACESHIP_PROMPT_ORDER" ]; then
  SPACESHIP_PROMPT_ORDER=(
    user              # Username section
    host              # Hostname section
    dir               # Current directory section
    vcs::async        # Version control system section (git, hg, svn)
    # git::async      # deprecated
    # hg::async       # deprecated
    package::async    # Package version
    node::async       # Node.js section
    ruby::async       # Ruby section
    elm::async        # Elm section
    elixir::async     # Elixir section
    xcode::async      # Xcode section
    swift::async      # Swift section
    golang::async     # Go section
    php::async        # PHP section
    rust::async       # Rust section
    haskell::async    # Haskell Stack section
    julia::async      # Julia section
    vagrant::async    # Vagrant section
    docker::async     # Docker section
    aws::async        # Amazon Web Services section
    venv              # virtualenv section
    conda::async      # conda virtualenv section
    pyenv::async      # Pyenv section
    dotnet::async     # .NET section
    ember::async      # Ember.js section
    kubectl::async    # Kubectl context section
    terraform::async  # Terraform workspace section
    line_sep          # Line break
    # vi_mode         # deprecated
    char              # Prompt character, with vi-mode indicator integrated
  )
fi
Temporary Solution
Since not all of the sections are time-consuming, I'll make sections async as needed, depending on the execution time of the async job that I get in the callback function.
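A sketch of that decision (assumptions: GNU date with %N nanosecond support, and a made-up 50 ms threshold; in zsh itself, $EPOCHREALTIME from the zsh/datetime module would be the natural way to take the timestamps):

```shell
# Decide whether a prompt section is slow enough to be worth running async.
threshold_ms=50

elapsed_ms() {
  # Wall-clock duration of a command, in milliseconds (GNU date %N).
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# "sleep 0.1" stands in for a prompt section that takes ~100 ms.
ms=$(elapsed_ms sleep 0.1)
if [ "$ms" -ge "$threshold_ms" ]; then
  echo "section -> async ($ms ms)"
else
  echo "section -> sync ($ms ms)"
fi
```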