cwltool
cwltool copied to clipboard
regression of cwltool performance by 5-7 minutes around 1st of june
Our PGAPX tool (ncbi/pgap on GitHub) uses cwltool as a CWL runner. We noticed that around 1st of June this year or TeamCity builds are running 5-7 minutes longer.
We tracked the issue to slow down in transition between calling a workflow and command line tool. For example:
[2023-09-29 01:38:51] DEBUG [job cluster_blastp_wnode_3] initial work dir {}
[2023-09-29 01:40:22] INFO [job cluster_blastp_wnode_3] /pgap/output/debug/tmp-outdir/kf1wjucm$ cluster_blastp_wnode \
See 1.5 min delay that used to be significantly smaller. The cwltool versions of "slow" and "fast" runs are correspondingly:
$ docker run -it ncbi/pgap-dev:2023-05-15.build4593 cwltool --version
/root/venv/bin/cwltool 3.1.20230513155734
$ docker run -it ncbi/pgap-dev:2023-06-06.build4641 cwltool --version
/root/venv/bin/cwltool 3.1.20230601100705
We directly tested that reversion of cwltool version to May value fixes the performance degradation issue.
I have cwltool logs before and after and an Excel file comparing stage execution between the two, but I do not see a way to attach those to the ticket, please let me know how can I deliver them to you.
@azat-badretdin Make a zip archive, then you can drag-and-drop that into a comment here.
If you know that your workflow has no errors and is CWL v1.2 only, then --fast-parser should speed up the startup.
Do you have nodejs installed in your container?
If you know that your workflow has no errors and is CWL v1.2 only, then --fast-parser should speed up the startup.
We did transition to CWL v1.2 some time ago. Thanks for the tip, Michael!
Do you have nodejs installed in your container?
Yes. In our May release we had
$ docker run ncbi/pgap:2023-05-17.build6771 nodejs --version
v6.14.3
in our pending release we have
$ docker run ncbi/pgap-dev:2023-09-28.build4972 nodejs --version
v6.17.1
newer version.
Make a zip archive, then you can drag-and-drop that into a comment here. cwl-issue-269.tar.zip
Note that the suspect here is the volume of referenced input data, not parsing per se.
We opened our internal JIRA ticket PGAPX-1208 to investigate cwltool --fast-parser suggestion
We directly tested that reversion of cwltool version to May value fixes the performance degradation issue.
Everything else was the same? I'm seeing a lot of differences between the two docker images :-)
The workflows inside docker image are the same. The binaries are different, not much, my point is that I do not see differences between [job X] execution times. If you filter the Excel file by job you won't see any large changes. All the time differences are on higher level.
Here is a better attempt at naming steps, I also added the order of execution of stages (steps/workflows/jobs)
All increment happens in workflows/steps transitions, all job duration differences are within 1 sec
Reminder: in case of government shutdown that might happen today at midnight, I probably won't be legally able to comment here (given that I am officially government at this account) and I will certainly not be able to actually use government equipment to answer questions here.