cwltool icon indicating copy to clipboard operation
cwltool copied to clipboard

regression of cwltool performance by 5-7 minutes around 1st of june

Open azat-badretdin opened this issue 2 years ago • 9 comments

Our PGAPX tool (ncbi/pgap on GitHub) uses cwltool as a CWL runner. We noticed that around 1st of June this year or TeamCity builds are running 5-7 minutes longer.

We tracked the issue to slow down in transition between calling a workflow and command line tool. For example:

[2023-09-29 01:38:51] DEBUG [job cluster_blastp_wnode_3] initial work dir {}
[2023-09-29 01:40:22] INFO [job cluster_blastp_wnode_3] /pgap/output/debug/tmp-outdir/kf1wjucm$ cluster_blastp_wnode \

See 1.5 min delay that used to be significantly smaller. The cwltool versions of "slow" and "fast" runs are correspondingly:

$ docker run -it ncbi/pgap-dev:2023-05-15.build4593 cwltool --version
/root/venv/bin/cwltool 3.1.20230513155734
$ docker run -it ncbi/pgap-dev:2023-06-06.build4641 cwltool --version
/root/venv/bin/cwltool 3.1.20230601100705

We directly tested that reversion of cwltool version to May value fixes the performance degradation issue.

azat-badretdin avatar Sep 29 '23 14:09 azat-badretdin

I have cwltool logs before and after and an Excel file comparing stage execution between the two, but I do not see a way to attach those to the ticket, please let me know how can I deliver them to you.

azat-badretdin avatar Sep 29 '23 14:09 azat-badretdin

@azat-badretdin Make a zip archive, then you can drag-and-drop that into a comment here.

If you know that your workflow has no errors and is CWL v1.2 only, then --fast-parser should speed up the startup.

Do you have nodejs installed in your container?

mr-c avatar Sep 29 '23 14:09 mr-c

If you know that your workflow has no errors and is CWL v1.2 only, then --fast-parser should speed up the startup.

We did transition to CWL v1.2 some time ago. Thanks for the tip, Michael!

Do you have nodejs installed in your container?

Yes. In our May release we had

$ docker run ncbi/pgap:2023-05-17.build6771 nodejs --version   
v6.14.3

in our pending release we have

$ docker run ncbi/pgap-dev:2023-09-28.build4972 nodejs --version
v6.17.1

newer version.

Make a zip archive, then you can drag-and-drop that into a comment here. cwl-issue-269.tar.zip

azat-badretdin avatar Sep 29 '23 15:09 azat-badretdin

Note that the suspect here is the volume of referenced input data, not parsing per se.

azat-badretdin avatar Sep 29 '23 15:09 azat-badretdin

We opened our internal JIRA ticket PGAPX-1208 to investigate cwltool --fast-parser suggestion

azat-badretdin avatar Sep 29 '23 15:09 azat-badretdin

We directly tested that reversion of cwltool version to May value fixes the performance degradation issue.

Everything else was the same? I'm seeing a lot of differences between the two docker images :-)

mr-c avatar Sep 29 '23 17:09 mr-c

The workflows inside docker image are the same. The binaries are different, not much, my point is that I do not see differences between [job X] execution times. If you filter the Excel file by job you won't see any large changes. All the time differences are on higher level.

azat-badretdin avatar Sep 29 '23 18:09 azat-badretdin

Here is a better attempt at naming steps, I also added the order of execution of stages (steps/workflows/jobs)

comparison.2b.xlsx

All increment happens in workflows/steps transitions, all job duration differences are within 1 sec

azat-badretdin avatar Sep 30 '23 15:09 azat-badretdin

Reminder: in case of government shutdown that might happen today at midnight, I probably won't be legally able to comment here (given that I am officially government at this account) and I will certainly not be able to actually use government equipment to answer questions here.

azat-badretdin avatar Sep 30 '23 15:09 azat-badretdin