tools-iuc icon indicating copy to clipboard operation
tools-iuc copied to clipboard

GetOrganelle tool

Open o-william-white opened this issue 2 years ago • 31 comments

Hello,

I have tried to create an .xml wrapper for GetOrganelle https://github.com/Kinggerm/GetOrganelle. I think this is a useful tool and it would be great if it was widely accessible on Galaxy.

At the moment I have included the main input parameters and a selection of more advanced parameters that are commonly used. There are many other parameters and I was not sure how important these would be to the average user. However, I am more than happy to keep adding to this as I am keen to learn how to develop galaxy tools. In terms of output, I only kept the main assembly output files and not included the output from the seed and spades output.

There are a couple of issues I am already aware of. For example, the --reduce-reads-for-coverage is coded as an integer in the xml, but you can also set this as inf on the command line to denote no limit. Also, if you select the anonymous organelle type, you must specifiy -s and --genes inputs. I will work on this next.

I created this pull request to let you know I am working on this and perhaps get some initial feedback.

Best wishes Ollie

FOR CONTRIBUTOR:

  • [x] - I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • [x] - License permits unrestricted use (educational + commercial)
  • [x] - This PR adds a new tool or tool collection
  • [ ] - This PR updates an existing tool or tool collection
  • [ ] - This PR does something else (explain below)

o-william-white avatar Mar 15 '22 22:03 o-william-white

Seems that the tool also fails without the profile with:

############################################################################
ERROR: default embplant_pt,embplant_mt database not added yet!

Install it by: get_organelle_config.py -a embplant_pt,embplant_mt
or
Install all types by: get_organelle_config.py -a all

Probably on you local system you have the data bases in your home dir.. which is why it works for you.

I previously reported this here: https://github.com/Kinggerm/GetOrganelle/issues/64

So the seed DBs need to be downloaded before running the tool. The preferable way would be to store them in a data table (https://galaxyproject.org/admin/tools/data-tables/). For now the data table could be managed manually,

So the Galaxy admin would need to initialize the db with: get_organelle_config.py --use-version 0.0.1 --config-dir GALAXY_TOOL_DATA_DIR/get_organelle/0.0.1/ -a all and create an entry in the data table specifying the version and path. Alternatively one could also think about a data manager tool .. but this may come later (if needed).

Problem is that the seed databases are quite large .. I will do some experiments.

There is also small test data that we might use https://github.com/Kinggerm/GetOrganelle/issues/64#issuecomment-763482665

Last thing that I noticed is that you created the PR from your master branch .. on the long run branches would be better :)

bernt-matthias avatar Mar 19 '22 11:03 bernt-matthias

Hi @bernt-matthias

I understand now, yes I had installed the get organelle seed databses locally. Are the seed databases large enough to be problematic for usage on galaxy?

Regarding the test data, I am already using the smallest test data availble from get organelle, specifically example 1 from here https://github.com/Kinggerm/GetOrganelle/wiki/Examples

Apologies I didn't realise I shouldn't create a pull request from the master branch. If I understand correctly, should I create a new branch and create a new PR from there?

I will also make the changes suggested above this week, specifically:

  • add output filters
  • add inf option for max reads

Thanks for your pateince with this

Cheers Ollie

o-william-white avatar Mar 21 '22 11:03 o-william-white

I understand now, yes I had installed the get organelle seed databses locally. Are the seed databases large enough to be problematic for usage on galaxy?

They are not problematic for Galaxy, but too large for this repo :( We have a limit of 1MB per file and the seed databases are much larger.

Not sure if we can downsample one of the given ones. Such that still results are produced? I also asked at the upstream issue.

Maybe we can just proceed and add a data table and use the original (large) seed database for now just to get the tool and tests finished for now and try to reduce test data later...

Regarding the test data, I am already using the smallest test data availble from get organelle, specifically example 1 from here https://github.com/Kinggerm/GetOrganelle/wiki/Examples

Then its fine.

Apologies I didn't realise I shouldn't create a pull request from the master branch. If I understand correctly, should I create a new branch and create a new PR from there?

Its fine for now. But after the PR is merged you will need to fix/reset your branches (don't do it before since all changes will be lost):

git fetch --all
git reset --hard upstream/master
git push -f origin master

(assuming that your remotes are called origin and master)

I will also make the changes suggested above this week, specifically:

* add output filters

* add `inf` option for max reads

Wonderful.

bernt-matthias avatar Mar 21 '22 11:03 bernt-matthias

Hi @bernt-matthias

Apologies for my slow reply on this.

I have added output filters for the advanced settings and the inf option for --reduce-reads-for-coverage.

However, I have tested the .xml and when I set an advanced option in galaxy (for example changing the word size or spades kmer sizes), it does not change the command, instead using only the simple input paramters.

Can you see where I might be going wrong https://github.com/o-william-white/tools-iuc/blob/master/tools/getorganelle/get_organelle_from_reads.xml

I have been looking for the source of the error this afternoon but can't find it. I expect I have missed something simple!

Best wishes Ollie

o-william-white avatar May 04 '22 17:05 o-william-white

I think there is now version 1.7.6.1 available :)

Thanks @o-william-white for working on it.

bgruening avatar May 07 '22 19:05 bgruening

I think there is now version 1.7.6.1 available :)

Thanks @o-william-white for working on it.

Indeed. And it seems that the authors provided a minimal test database which is just a bit larger than 1mb.

bernt-matthias avatar May 09 '22 08:05 bernt-matthias

Hi @bernt-matthias and @bgruening

Thanks both for the comments, pleased this is making progress. I upgraded the version to 1.7.6.1

And it seems that the authors provided a minimal test database which is just a bit larger than 1mb.

Is the this the test data discussed here https://github.com/Kinggerm/GetOrganelle/wiki/Example-1? I used this data in my test.

Is there anything else needed for this tool? @bernt-matthias mentioned a data manager for reference datasets, I can have a look at this next, perhaps you could recommend an example and I will try to replicate for get organelle.

Best wishes Ollie

o-william-white avatar May 09 '22 12:05 o-william-white

Is the this the test data discussed here https://github.com/Kinggerm/GetOrganelle/wiki/Example-1? I used this data in my test.

See discussion here https://github.com/Kinggerm/GetOrganelle/issues/64 .. inparticular at the end

Is there anything else needed for this tool? @bernt-matthias mentioned a data manager for reference datasets, I can have a look at this next, perhaps you could recommend an example and I will try to replicate for get organelle.

Most data managers in https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/ will be a good example .. maybe https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/data_manager_mitos .. just because it also considers mitogenomes

bernt-matthias avatar May 09 '22 14:05 bernt-matthias

Hi @bernt-matthias

Thanks again for the additional details. If I understand correctly, the discussion at the end of https://github.com/Kinggerm/GetOrganelle/issues/64 refers to the seed and label database used to generate the assembly. It does indeed look a lot smaller in size to the orginal seed and label databases.

The orginal seed/label databse is 1.3M and the smaller databse is 292K.

Would it be better to use the smaller seed/label database in data manager?

However, the data used in the test from here https://github.com/Kinggerm/GetOrganelle/wiki/Example-1 is still the same.

7.8M    test-data/Arabidopsis_simulated.1.fq.gz
8.1M    test-data/Arabidopsis_simulated.2.fq.gz

o-william-white avatar May 10 '22 12:05 o-william-white

The data manager should only download the real seed/label databases.

The smaller/minimal seed/label databases should be used for the tests only. The idea would be to create a data table containing an entry for these databases.

You need to create

  • tool_data_table_conf.xml.sample files
  • tool_data_table_conf.xml.test
  • and mock data table (referenced in the .test file .. as here .loc
  • and use them in a test like this https://github.com/galaxyproject/tools-iuc/blob/8b58f3f03d6430689d228029bb2eb46c16cfff23/tools/mitos/mitos2.xml#L169

See for instance here: https://github.com/galaxyproject/tools-iuc/tree/master/tools/mitos

bernt-matthias avatar May 10 '22 17:05 bernt-matthias

Hi @bernt-matthias

Thanks again for the additional details. If I understand correctly, the discussion at the end of Kinggerm/GetOrganelle#64 refers to the seed and label database used to generate the assembly. It does indeed look a lot smaller in size to the orginal seed and label databases.

The orginal seed/label databse is 1.3M and the smaller databse is 292K.

Would it be better to use the smaller seed/label database in data manager?

However, the data used in the test from here https://github.com/Kinggerm/GetOrganelle/wiki/Example-1 is still the same.

7.8M    test-data/Arabidopsis_simulated.1.fq.gz
8.1M    test-data/Arabidopsis_simulated.2.fq.gz

Actually, if you want the test data to be as small as possible, the top 20k reads (in the file size of 1Mb * 2) of the test-data/Arabidopsis_simulated set are enough to generate a successful complete genome assembly. The only difference from usual is that the tool will tune the parameters to an extreme for shallow coverage and generate some warnings indicating that the coverage is too shallow. Let me know if you need that smallest dataset, I can upload it to a public somewhere or you may simply get it through gunzip -c Arabidopsis_simulated.1.fq.gz|head -n 40000 > Arabidopsis_simulated.10k.1.fq

Kinggerm avatar May 10 '22 20:05 Kinggerm

Thanks @Kinggerm .. these are excellent suggestions. This should reduce the test data size sufficiently. We just try to have test data as small as possible and enforce a "strict" limit of 1MB per file. Otherwise to much test data will accumulate over time .. but we are also working on strategies to have test data in external repos like zenodo .. but this will need time to implement.

bernt-matthias avatar May 11 '22 16:05 bernt-matthias

Hi @bernt-matthias and @Kinggerm,

Thanks both for the comments. I have used the small input data set and seed/label databases as suggested. I made the aditional scripts required and my planemo test passed succesfully.

I only kept the embplant_pt and embplant_mt databses as these seem to be the minimimum requirement.

The test data directory is the following size.

2.2M    test-data/

Best wishes Ollie

o-william-white avatar May 16 '22 08:05 o-william-white

Great. I will have a look. One thing I noticed: You should also add a sample .loc file like here https://github.com/galaxyproject/tools-iuc/blob/master/tools/mitos/mitos.loc.sample

Another question is if the reference datasets are/will be versioned (and specific tool versions require reference data version), then we would need a version column. Do you have plans in this direction @Kinggerm ?

bernt-matthias avatar May 16 '22 10:05 bernt-matthias

Great. I will have a look. One thing I noticed: You should also add a sample .loc file like here https://github.com/galaxyproject/tools-iuc/blob/master/tools/mitos/mitos.loc.sample

Another question is if the reference datasets are/will be versioned (and specific tool versions require reference data version), then we would need a version column. Do you have plans in this direction @Kinggerm ?

Yes, the GetOrganelleDB was versioned already. But currently, we only have two main versions (0.0.0 and 0.0.1), where GetOrganelle <1.7.3 only works with GetOrganelleDB 0.0.0 while GetOrganelle 1.7.3+ work with both GetOrganelleDB 0.0.0 and GetOrganelleDB 0.0.1. I have plans for updating GetOrganelleDB but I cannot do it very soon. So 0.0.1 will last long.

Kinggerm avatar May 16 '22 20:05 Kinggerm

Yes, the GetOrganelleDB was versioned already. But currently, we only have two main versions (0.0.0 and 0.0.1), where GetOrganelle <1.7.3 only works with GetOrganelleDB 0.0.0 while GetOrganelle 1.7.3+ work with both GetOrganelleDB 0.0.0 and GetOrganelleDB 0.0.1. I have plans for updating GetOrganelleDB but I cannot do it very soon. So 0.0.1 will last long.

Thanks for the clarification.

bernt-matthias avatar May 18 '22 08:05 bernt-matthias

Great. I will have a look. One thing I noticed: You should also add a sample .loc file like here https://github.com/galaxyproject/tools-iuc/blob/master/tools/mitos/mitos.loc.sample Another question is if the reference datasets are/will be versioned (and specific tool versions require reference data version), then we would need a version column. Do you have plans in this direction @Kinggerm ?

Yes, the GetOrganelleDB was versioned already. But currently, we only have two main versions (0.0.0 and 0.0.1), where GetOrganelle <1.7.3 only works with GetOrganelleDB 0.0.0 while GetOrganelle 1.7.3+ work with both GetOrganelleDB 0.0.0 and GetOrganelleDB 0.0.1. I have plans for updating GetOrganelleDB but I cannot do it very soon. So 0.0.1 will last long.

I think this is mostly ready. But as you pointed out that the version might change, I think it does not hurt to add a version column in the loc file. What do you think?

bgruening avatar May 23 '22 11:05 bgruening

Hello @bernt-matthias and @bgruening,

Thanks both for the feedback and apologies for the delay in getting back to you. I have made the following changes:

  • added a getorganelle.loc.sample file
  • added a column for the get organelle database version to the test-data/getorganelle.loc file

I was not sure if tool_data_table_conf.xml.sample and tool_data_table_conf.xml.test needed to be adjusted to reflect this.

If you have any further comments or suggestions, please do let me know.

Best wishes Ollie

o-william-white avatar Jul 04 '22 11:07 o-william-white

image

I tried to test the tool in my local galaxy but it cannot find the reference data. I am assuming this is something that needs to be set up with the galaxy tool data manager?

o-william-white avatar Jul 04 '22 11:07 o-william-white

image

I tried to test the tool in my local galaxy but it cannot find the reference data. I am assuming this is something that needs to be set up with the galaxy tool data manager?

Actually it should be possible to edit the .loc file and entries should show up. During startup the log should show how many entries are loaded from each data table.

You can also see the content of the data tables in the admin interface.

bernt-matthias avatar Jul 04 '22 12:07 bernt-matthias

Hello again,

I upated the main xml to include the macro as suggested.

I tried to remove the column for type from test-data/getorganelle.loc, tool_data_table_conf.xml.sample and tool_data_table_conf.xml.test but when I did, I got the error below with planemo test

parameter 'config_dir': requires a value, but no legal values defined

Can you suggest how to edit the files to remove the type column, just in case I am missing something obvious?

o-william-white avatar Jul 04 '22 21:07 o-william-white

Hi @bernt-matthias,

Thanks for the suggestions, I made the changes and moved getorganelle.loc.sample to a directory called tools-data. But when I test the tool using planemo I get the same error below

galaxy.tool_util.verify.interactor.RunToolException: parameter 'config_dir': requires a value, but no legal values defined

I assume it is an issue with my data table and locating the reference data.

I refer to the data tables in the command section: --config-dir '$config_dir.fields.path'

and in the inputs section:

<param argument="--config-dir" label="Reference data" type="select" help="Contact the administrator of this Galaxy instance if you miss reference data">
            <options from_data_table="getorganelle">
                <filter type="static_value" value="getorganelle" column="2"/>
            </options>
            <validator message="No reference annotation is available for getorganelle" type="no_options" />
        </param>

I tried to follow the mitos2 example to get the tool working with data tables but the options may not be relavant to get organelle. Do you have any suggestions on how to resolve the error?

Best wishes Ollie

o-william-white avatar Jul 05 '22 09:07 o-william-white

Remove the filter <filter type="static_value" value="getorganelle" column="2"/> .. its removing all options that do not have the text getorganelle in the 3rd column.

bernt-matthias avatar Jul 05 '22 10:07 bernt-matthias

Remove the filter .. its removing all options that do not have the text getorganelle in the 3rd column.

Thanks @bernt-matthias this fixed the issue, I accepted all suggested changes.

Actually it should be possible to edit the .loc file and entries should show up. During startup the log should show how many entries are loaded from each data table.

When I load the tool in my local galaxy I still get the message that no reference data can be found. I am not sure how to edit the loc file to do this. I tried adding a file called tool-data/getorganelle.loc which looked like this:

getorganelle_refdata    getorganelle seed and label data        /home/owhite/.GetOrganelle/     0.0.1

Where ```/home/owhite/.GetOrganelle/```` is the path to my locally installed get organelle database. Can you suggest where I might be going wrong?

In general, are there any other edits that need to be made to the getorganelle tool before I start to focus on the data manager?

Cheers Ollie

o-william-white avatar Jul 05 '22 11:07 o-william-white

Did you restart after editing the loc file?

Can you execute find tool-data -name "getorganelle.loc" .. I guess the loc file is just at a different location. If I understand it correctly each tool version has its own loc file .. and the contents are merged during startup of Galaxy.

bernt-matthias avatar Jul 05 '22 11:07 bernt-matthias

Yes I restart galaxy each time I edit the files.

When I run find tool-data -name "getorganelle.loc" from the same directory as the getorganelle xml, I get the following output tool-data/getorganelle.loc

Note that for my local galaxy, I have the xml files in a separate folder within my tools directory. Below is the section block from my config/tool_conf.xml file with custom tools that I have worked on:

<section name="MyTools" id="mTools">
    <tool file="myTools/at_content.xml" />
    <tool file="myTools/split_fasta.xml"/>
    <tool file="myTools/get_organelle_from_reads.xml" />
    <tool file="myTools/count_kmers.xml" />
  </section>

o-william-white avatar Jul 05 '22 13:07 o-william-white

Hi @bernt-matthias

Thanks for the update.

The only when tag needed will be for anonymous (move the two parameters here .. I assume that they are only useful for this case?).

The -s and --genes options that are required when -F anonym are also useful when this parameter is not selected. So I thought it might be could to keep these variables separate.

Is it possible/sensible to have the same parameters being set a two different place in an xml?

As it stands, if someone were to set -F anonym without the -s and --genes options, there is a good error meesage to catch this.

get_organelle_from_reads.py -1 Arabidopsis_simulated.1.fq.gz -2 Arabidopsis_simulated.2.fq.gz -o test -F ano                                nym

############################################################################
ERROR: "-s" and "--genes" must be specified when "-F anonym"!


The easiest thing to do would be to ignore the anonym setting.

Best wishes Ollie

o-william-white avatar Aug 05 '22 13:08 o-william-white

So the two parameters are mandatory in the anonymous case and optional otherwise?

bernt-matthias avatar Aug 05 '22 14:08 bernt-matthias

So the two parameters are mandatory in the anonymous case and optional otherwise?

Yes exactly

o-william-white avatar Aug 05 '22 17:08 o-william-white

Then a conditional might be the best solution https://docs.galaxyproject.org/en/latest/dev/schema.html#tool-inputs-conditional

bernt-matthias avatar Aug 06 '22 07:08 bernt-matthias