
adding microbiome analysis workflows to IWC with test data

EngyNasr opened this issue 1 year ago • 48 comments

I tried to reduce the test-data in this PR, hope it works.

Thanks a lot, Engy <3

EngyNasr avatar Mar 07 '23 12:03 EngyNasr

Thanks! Can you add README, Changelog, and dockstore.yml files? (https://github.com/galaxyproject/iwc/blob/main/workflows/README.md#structure-of-the-directory)
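For reference, the .dockstore.yml for a single workflow folder is just a short YAML file roughly along these lines (a minimal sketch; the exact required fields are described in the linked README, and the file names below are placeholders, not the actual ones from this PR):

```yaml
# Hypothetical .dockstore.yml for one workflow folder
version: 1.2
workflows:
- name: main
  subclass: Galaxy
  publish: true
  primaryDescriptorPath: /pathogen-detection-nanopore-preprocessing.ga
  testParameterFiles:
  - /pathogen-detection-nanopore-preprocessing-tests.yml
```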

mvdbeek avatar Mar 07 '23 14:03 mvdbeek

@mvdbeek did I miss something else?

Thanks a lot for helping me out :)

EngyNasr avatar Mar 10 '23 09:03 EngyNasr

@wm75 Can you help me revise and merge this PR? Thanks a lot <3

EngyNasr avatar Mar 14 '23 14:03 EngyNasr

To do as discussed with @wm75 :

1- remove "latest" from workflow names 2- keep only one file to test for all the workflows, except the pre-processing 3- pre-processing tests we can use the txt file to compare content 4- make a readme inside each folder as well 5- make the changelog only inside each folder 6- remove the big docker.yml in the main folder and keep only the one in every folder

EngyNasr avatar Mar 27 '23 08:03 EngyNasr

@bebatut I have added the 5 workflows for the single-sample runs and the 4 workflows for the collection-of-samples runs, so 9 workflows in total with their test data. I have chosen the smallest sample data that still contain VFs, contigs, etc. The maximum file size is 50 MB; the other files are in the bytes or KB range.

EngyNasr avatar May 31 '23 20:05 EngyNasr

@EngyNasr @bebatut I don't see much value in offering the single-sample workflows, when collection-based flavors exist that could be run with 1-element collections. Having the single-sample WFs published would just mean more maintenance and synchronization efforts. Or is there anything that can be achieved with the single-sample versions, that the collection-based versions don't do?

wm75 avatar Jun 13 '23 14:06 wm75

@EngyNasr can you please run some JSON reformatter tool over your workflows? Single-line JSON is just not very nice to review and prevents meaningful diffs. Use, e.g., `python3 -m json.tool collection-version/pathogen-detection-nanopore-pre-processing-collection/Pathogen-Detection-Nanopore-Pre-Processing-collection.ga collection-version/pathogen-detection-nanopore-pre-processing-collection/Pathogen-Detection-Nanopore-Pre-Processing-collection_pretty.ga` or any other tool you like.

wm75 avatar Jun 13 '23 14:06 wm75

> @EngyNasr @bebatut I don't see much value in offering the single-sample workflows, when collection-based flavors exist that could be run with 1-element collections. Having the single-sample WFs published would just mean more maintenance and synchronization efforts. Or is there anything that can be achieved with the single-sample versions, that the collection-based versions don't do?

It was just the old way we used to do the analysis, and we use these workflows in the current training material; that's why we wanted to have both as two versions of the workflow. But they are definitely redundant now, since the collection version does the exact same job, although it will never take a single file; it always has to be a collection.

EngyNasr avatar Jun 14 '23 09:06 EngyNasr

This tool should also work here: toolshed.g2.bx.psu.edu/repos/iuc/krakentools_extract_kraken_reads/krakentools_extract_kraken_reads/1.2+galaxy0. It can use fastq.gz.

paulzierep avatar Jun 15 '23 13:06 paulzierep

I need help with the tests @wm75 @paulzierep @bebatut:

I noticed that the tool ID for these tools is not like that of the rest of the tools, e.g. toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_getfastabed/2.30.0+galaxy1

Is there a way to solve that so that these tests succeed?

EngyNasr avatar Jun 22 '23 11:06 EngyNasr

@EngyNasr once https://github.com/galaxyproject/planemo/pull/1377 is merged and a new version is released it should work.

mvdbeek avatar Jun 28 '23 14:06 mvdbeek

> @EngyNasr once galaxyproject/planemo#1377 is merged and a new version is released it should work.

thank you so much :)

EngyNasr avatar Jun 28 '23 15:06 EngyNasr

> @EngyNasr once galaxyproject/planemo#1377 is merged and a new version is released it should work.

@mvdbeek, is it possible to also do that for FILTER_EMPTY_DATASETS?

EngyNasr avatar Jun 28 '23 15:06 EngyNasr

I think it was closed by mistake

bebatut avatar Jun 29 '23 11:06 bebatut

Looks like it worked and you only need to work on your test assertions.

mvdbeek avatar Jun 29 '23 15:06 mvdbeek

@EngyNasr two questions on the latest preprocessing version:

  • Why is the host ID now not a WF parameter anymore? You could use the "Map parameter value" expression tool to map a user-selected species name to the required taxid.
  • Is there a specific reason why you're using FastQC when you're already running fastp? MultiQC can also visualize the fastp JSON output dataset, and it doesn't look too different from what two FastQC reports (before and after trimming) would show.

wm75 avatar Jun 30 '23 14:06 wm75

The Pathogen-Detection-Nanopore-All-Samples-Analysis WF needs much better annotations and input dataset labels to make it understandable what it is good for. Right now, understanding the purpose without knowing the tutorial is really hard. Even the README file only talks about VF genes, but many people wouldn't know what that is.

wm75 avatar Jun 30 '23 15:06 wm75

The VFs accessions branch of that WF also seems to have quite some room for simplification:

  • "Replace parts of text" has a repeat element, so steps 14, 17 and 19 can be collapsed into a single step (step 21 most likely too, maybe even 15 and 23)

Another step that looks like you could drop: 13 (if at step 8 you don't include the header). A step I do not fully understand is 20: when will ClustalW produce an empty output?

wm75 avatar Jun 30 '23 15:06 wm75

> @EngyNasr two questions on the latest preprocessing version:
>
> * Why is the host ID now not a WF parameter anymore? You could use the "Map parameter value" expression tool to map a user-selected species name to the required taxid.
>
> * Is there a specific reason why you're using FastQC when you're already running fastp? MultiQC can also visualize the fastp JSON output dataset, and it doesn't look too different from what two FastQC reports (before and after trimming) would show.
  • Regarding the host ID, it still exists; I marked it in bold below:

    "tool_state": "{"exclude": true, "fastq_output": true, "include_children": true, "include_parents": true, "library": {"type": "single", "current_case": 0, "input_1": {"class": "RuntimeValue"}}, "max": "100000000", "report": {"class": "RuntimeValue"}, "results": {"class": "RuntimeValue"}, **"taxid": "9031 9606 9913"**

    We can add the "Map parameter value" step in a new version, along with the other changes we also have planned.

  • Yeah, true, one could remove it. I just wanted to also show the HTML output of FastQC for the sample, since the fastp HTML looks more like a summary.

EngyNasr avatar Jul 04 '23 19:07 EngyNasr

> The Pathogen-Detection-Nanopore-All-Samples-Analysis WF needs much better annotations and input dataset labels to make it understandable what it is good for. Right now, understanding the purpose without knowing the tutorial is really hard. Even the README file only talks about VF genes, but many people wouldn't know what that is.

I am working on that now. Is there a specific recommendation for the annotations?

EngyNasr avatar Jul 04 '23 19:07 EngyNasr

> The VFs accessions branch of that WF also seems to have quite some room for simplification:
>
> * "Replace parts of text" has a repeat element, so steps 14, 17 and 19 can be collapsed into a single step (step 21 most likely too, maybe even 15 and 23)
>
> Another step that looks like you could drop: 13 (if at step 8 you don't include the header). A step I do not fully understand is 20: when will ClustalW produce an empty output?

I used to get errors in the workflow runs when I did that; sometimes the replace did not pick up the columns, and when I separated the steps the workflow worked correctly. I will try it again.

EngyNasr avatar Jul 04 '23 19:07 EngyNasr

> Looks like it worked and you only need to work on your test assertions.

@mvdbeek I can break down the failed tests into two problems:

  1. the last HTML workflow output still fails because of the Filter empty datasets and Build list tools
  2. the ones failing due to assertion differences fail because the test is checking a different output than the one I have in the test file. For example: in the test I am asserting Spike3Barcode10, but the test fails because it is comparing the test file (Spike3Barcode10) to Spike3Barcode12, which is a different output file (see the sketch below for where this mapping is defined)
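For reference, the link between a workflow output and the file or assertion it is compared against lives in the workflow's -tests.yml file. A minimal sketch of such an entry (the input name, output label, and file names below are illustrative, not the actual ones from this PR):

```yaml
# Hypothetical excerpt from a <workflow>-tests.yml file
- doc: Test the preprocessing workflow on a small nanopore sample
  job:
    nanopore_reads:
      class: Collection
      collection_type: list
      elements:
      - class: File
        identifier: Spike3Barcode10
        path: test-data/Spike3Barcode10.fastq.gz
  outputs:
    filtered_reads:              # workflow output label
      element_tests:
        Spike3Barcode10:         # collection element the assertion applies to
          asserts:
          - that: has_text
            text: "Spike3Barcode10"
```

The element identifiers under element_tests have to match the elements the workflow actually produces for that output label, otherwise the assertion is run against a different element than intended.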

EngyNasr avatar Jul 05 '23 03:07 EngyNasr

@wm75 @mvdbeek

Now I have updated all the workflows so that all input and output names contain no spaces (they use underscores instead) and are all lowercase. I have also made sure that the file checked here by the tests is the same as the file I include in testing, so the test-assertion issue is solved.

However, I still get an error and I need your help, please.

EngyNasr avatar Aug 03 '23 13:08 EngyNasr

This is the error that I get:

Loading database information...Failed attempt to allocate 65426688000bytes;
you may not have enough free memory to load this database.
If your computer has enough RAM, perhaps reducing memory usage from
other programs could help you load this database?
classify: unable to allocate hash table memory

With this tool: toolshed.g2.bx.psu.edu/repos/iuc/kraken2/kraken2/2.1.1+galaxy1. @EngyNasr, you can see this error in the error report here on GitHub.

On EU we allocate 70 GB of memory (https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/files/galaxy/tpv/tools.yml#L631). Not sure how much ORG is allocating.

@mvdbeek what is the procedure here? Following up with Nate, or is there a way we can run those tests against EU? I will try to move the kraken2 memory allocation to the shared TPV database, but it would be nice to have a procedure in place for those cases.
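For context, a rule in the shared TPV database is just a per-tool entry keyed by a tool ID regex; bumping kraken2 would look roughly like this (a sketch only, the values shown are illustrative and not the actual shared defaults):

```yaml
# Hypothetical TPV entry raising the kraken2 memory allocation (mem is in GB)
tools:
  toolshed.g2.bx.psu.edu/repos/iuc/kraken2/kraken2/.*:
    cores: 8
    mem: 70
```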

bgruening avatar Aug 06 '23 07:08 bgruening

mh, shared TPV is using 64GB https://github.com/galaxyproject/tpv-shared-database/blob/main/tools.yml#L1307

bgruening avatar Aug 06 '23 07:08 bgruening

> This is the error that I get:
>
> Loading database information...Failed attempt to allocate 65426688000bytes;
> you may not have enough free memory to load this database.
> If your computer has enough RAM, perhaps reducing memory usage from
> other programs could help you load this database?
> classify: unable to allocate hash table memory
>
> With this tool: toolshed.g2.bx.psu.edu/repos/iuc/kraken2/kraken2/2.1.1+galaxy1. @EngyNasr, you can see this error in the error report here on GitHub.
>
> On EU we allocate 70 GB of memory (https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/files/galaxy/tpv/tools.yml#L631). Not sure how much ORG is allocating.
>
> @mvdbeek what is the procedure here? Following up with Nate, or is there a way we can run those tests against EU? I will try to move the kraken2 memory allocation to the shared TPV database, but it would be nice to have a procedure in place for those cases.

Thank you so much @bgruening, I have seen it now. In a previous push (https://github.com/galaxyproject/iwc/pull/182/commits/b4a671c0e150ce550882b1f6253ef2b390360be4) I tried using Kalamari, as I am using it in the preprocessing workflow where it is working, but I got the same error on GitHub for the taxonomy profiling. Currently I am using the MinusB database and it still fails; I will try another database now.

In any case, I explained in the taxonomy profiling README that the database can easily be chosen and changed on Galaxy based on the input datasets.

EngyNasr avatar Aug 07 '23 11:08 EngyNasr

In this last commit:

  • I reverted the database for the taxonomy profiling back to StandardPF
  • I reduced the size of the test dataset samples, keeping only some of the reads that contain the virulence factor detected later on by the workflows, to make the test runs faster (thanks to @wm75 :))

EngyNasr avatar Aug 10 '23 11:08 EngyNasr

> what is the procedure here?

My preferred approach would be to create a smaller kraken database; I see we did this for the tool tests as well. The other option needs some development time that I don't have currently, which is to send jobs to some remote resource (TES comes to mind, but I don't know if we can get that much memory). But this makes local development, debugging and maintenance near impossible, so I would much prefer the smaller database.

mvdbeek avatar Aug 11 '23 07:08 mvdbeek

but with the smaller DB, the WF would not be production-ready? I guess an "easy" way to shrink the DB size for this case here would be to exclude human data, i.e. to build a StandardPFminusHuman DB, but how would we make this available to users of the workflow on any Galaxy instance? Via CVMFS maybe?

wm75 avatar Aug 11 '23 07:08 wm75

> but with the smaller DB, the WF would not be production-ready?

Shouldn't the DB be selectable?

> Via CVMFS maybe?

that, or we can provide extra data via planemo

mvdbeek avatar Aug 11 '23 07:08 mvdbeek