tools-iuc
tools-iuc copied to clipboard
GetOrganelle tool
Hello,
I have tried to create an .xml wrapper for GetOrganelle https://github.com/Kinggerm/GetOrganelle. I think this is a useful tool and it would be great if it was widely accessible on Galaxy.
At the moment I have included the main input parameters and a selection of more advanced parameters that are commonly used. There are many other parameters and I was not sure how important these would be to the average user. However, I am more than happy to keep adding to this as I am keen to learn how to develop galaxy tools. In terms of output, I only kept the main assembly output files and not included the output from the seed and spades output.
There are a couple of issues I am already aware of. For example, the --reduce-reads-for-coverage is coded as an integer in the xml, but you can also set this as inf on the command line to denote no limit. Also, if you select the anonymous organelle type, you must specifiy -s and --genes inputs. I will work on this next.
I created this pull request to let you know I am working on this and perhaps get some initial feedback.
Best wishes Ollie
FOR CONTRIBUTOR:
- [x] - I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
- [x] - License permits unrestricted use (educational + commercial)
- [x] - This PR adds a new tool or tool collection
- [ ] - This PR updates an existing tool or tool collection
- [ ] - This PR does something else (explain below)
Seems that the tool also fails without the profile with:
############################################################################
ERROR: default embplant_pt,embplant_mt database not added yet!
Install it by: get_organelle_config.py -a embplant_pt,embplant_mt
or
Install all types by: get_organelle_config.py -a all
Probably on you local system you have the data bases in your home dir.. which is why it works for you.
I previously reported this here: https://github.com/Kinggerm/GetOrganelle/issues/64
So the seed DBs need to be downloaded before running the tool. The preferable way would be to store them in a data table (https://galaxyproject.org/admin/tools/data-tables/). For now the data table could be managed manually,
So the Galaxy admin would need to initialize the db with: get_organelle_config.py --use-version 0.0.1 --config-dir GALAXY_TOOL_DATA_DIR/get_organelle/0.0.1/ -a all
and create an entry in the data table specifying the version and path.
Alternatively one could also think about a data manager tool .. but this may come later (if needed).
Problem is that the seed databases are quite large .. I will do some experiments.
There is also small test data that we might use https://github.com/Kinggerm/GetOrganelle/issues/64#issuecomment-763482665
Last thing that I noticed is that you created the PR from your master
branch .. on the long run branches would be better :)
Hi @bernt-matthias
I understand now, yes I had installed the get organelle seed databses locally. Are the seed databases large enough to be problematic for usage on galaxy?
Regarding the test data, I am already using the smallest test data availble from get organelle, specifically example 1 from here https://github.com/Kinggerm/GetOrganelle/wiki/Examples
Apologies I didn't realise I shouldn't create a pull request from the master branch. If I understand correctly, should I create a new branch and create a new PR from there?
I will also make the changes suggested above this week, specifically:
- add output filters
- add
inf
option for max reads
Thanks for your pateince with this
Cheers Ollie
I understand now, yes I had installed the get organelle seed databses locally. Are the seed databases large enough to be problematic for usage on galaxy?
They are not problematic for Galaxy, but too large for this repo :( We have a limit of 1MB per file and the seed databases are much larger.
Not sure if we can downsample one of the given ones. Such that still results are produced? I also asked at the upstream issue.
Maybe we can just proceed and add a data table and use the original (large) seed database for now just to get the tool and tests finished for now and try to reduce test data later...
Regarding the test data, I am already using the smallest test data availble from get organelle, specifically example 1 from here https://github.com/Kinggerm/GetOrganelle/wiki/Examples
Then its fine.
Apologies I didn't realise I shouldn't create a pull request from the master branch. If I understand correctly, should I create a new branch and create a new PR from there?
Its fine for now. But after the PR is merged you will need to fix/reset your branches (don't do it before since all changes will be lost):
git fetch --all
git reset --hard upstream/master
git push -f origin master
(assuming that your remotes are called origin
and master
)
I will also make the changes suggested above this week, specifically:
* add output filters * add `inf` option for max reads
Wonderful.
Hi @bernt-matthias
Apologies for my slow reply on this.
I have added output filters for the advanced settings and the inf option for --reduce-reads-for-coverage.
However, I have tested the .xml and when I set an advanced option in galaxy (for example changing the word size or spades kmer sizes), it does not change the command, instead using only the simple input paramters.
Can you see where I might be going wrong https://github.com/o-william-white/tools-iuc/blob/master/tools/getorganelle/get_organelle_from_reads.xml
I have been looking for the source of the error this afternoon but can't find it. I expect I have missed something simple!
Best wishes Ollie
I think there is now version 1.7.6.1 available :)
Thanks @o-william-white for working on it.
I think there is now version 1.7.6.1 available :)
Thanks @o-william-white for working on it.
Indeed. And it seems that the authors provided a minimal test database which is just a bit larger than 1mb.
Hi @bernt-matthias and @bgruening
Thanks both for the comments, pleased this is making progress. I upgraded the version to 1.7.6.1
And it seems that the authors provided a minimal test database which is just a bit larger than 1mb.
Is the this the test data discussed here https://github.com/Kinggerm/GetOrganelle/wiki/Example-1? I used this data in my test.
Is there anything else needed for this tool? @bernt-matthias mentioned a data manager for reference datasets, I can have a look at this next, perhaps you could recommend an example and I will try to replicate for get organelle.
Best wishes Ollie
Is the this the test data discussed here https://github.com/Kinggerm/GetOrganelle/wiki/Example-1? I used this data in my test.
See discussion here https://github.com/Kinggerm/GetOrganelle/issues/64 .. inparticular at the end
Is there anything else needed for this tool? @bernt-matthias mentioned a data manager for reference datasets, I can have a look at this next, perhaps you could recommend an example and I will try to replicate for get organelle.
Most data managers in https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/ will be a good example .. maybe https://github.com/galaxyproject/tools-iuc/tree/master/data_managers/data_manager_mitos .. just because it also considers mitogenomes
Hi @bernt-matthias
Thanks again for the additional details. If I understand correctly, the discussion at the end of https://github.com/Kinggerm/GetOrganelle/issues/64 refers to the seed and label database used to generate the assembly. It does indeed look a lot smaller in size to the orginal seed and label databases.
The orginal seed/label databse is 1.3M and the smaller databse is 292K.
Would it be better to use the smaller seed/label database in data manager?
However, the data used in the test from here https://github.com/Kinggerm/GetOrganelle/wiki/Example-1 is still the same.
7.8M test-data/Arabidopsis_simulated.1.fq.gz
8.1M test-data/Arabidopsis_simulated.2.fq.gz
The data manager should only download the real seed/label databases.
The smaller/minimal seed/label databases should be used for the tests only. The idea would be to create a data table containing an entry for these databases.
You need to create
-
tool_data_table_conf.xml.sample
files -
tool_data_table_conf.xml.test
- and mock data table (referenced in the
.test
file .. as here .loc - and use them in a test like this https://github.com/galaxyproject/tools-iuc/blob/8b58f3f03d6430689d228029bb2eb46c16cfff23/tools/mitos/mitos2.xml#L169
See for instance here: https://github.com/galaxyproject/tools-iuc/tree/master/tools/mitos
Hi @bernt-matthias
Thanks again for the additional details. If I understand correctly, the discussion at the end of Kinggerm/GetOrganelle#64 refers to the seed and label database used to generate the assembly. It does indeed look a lot smaller in size to the orginal seed and label databases.
The orginal seed/label databse is 1.3M and the smaller databse is 292K.
Would it be better to use the smaller seed/label database in data manager?
However, the data used in the test from here https://github.com/Kinggerm/GetOrganelle/wiki/Example-1 is still the same.
7.8M test-data/Arabidopsis_simulated.1.fq.gz 8.1M test-data/Arabidopsis_simulated.2.fq.gz
Actually, if you want the test data to be as small as possible, the top 20k reads (in the file size of 1Mb * 2) of the test-data/Arabidopsis_simulated
set are enough to generate a successful complete genome assembly. The only difference from usual is that the tool will tune the parameters to an extreme for shallow coverage and generate some warnings indicating that the coverage is too shallow.
Let me know if you need that smallest dataset, I can upload it to a public somewhere or you may simply get it through gunzip -c Arabidopsis_simulated.1.fq.gz|head -n 40000 > Arabidopsis_simulated.10k.1.fq
Thanks @Kinggerm .. these are excellent suggestions. This should reduce the test data size sufficiently. We just try to have test data as small as possible and enforce a "strict" limit of 1MB per file. Otherwise to much test data will accumulate over time .. but we are also working on strategies to have test data in external repos like zenodo .. but this will need time to implement.
Hi @bernt-matthias and @Kinggerm,
Thanks both for the comments. I have used the small input data set and seed/label databases as suggested. I made the aditional scripts required and my planemo test passed succesfully.
I only kept the embplant_pt and embplant_mt databses as these seem to be the minimimum requirement.
The test data directory is the following size.
2.2M test-data/
Best wishes Ollie
Great. I will have a look. One thing I noticed: You should also add a sample .loc
file like here https://github.com/galaxyproject/tools-iuc/blob/master/tools/mitos/mitos.loc.sample
Another question is if the reference datasets are/will be versioned (and specific tool versions require reference data version), then we would need a version column. Do you have plans in this direction @Kinggerm ?
Great. I will have a look. One thing I noticed: You should also add a sample
.loc
file like here https://github.com/galaxyproject/tools-iuc/blob/master/tools/mitos/mitos.loc.sampleAnother question is if the reference datasets are/will be versioned (and specific tool versions require reference data version), then we would need a version column. Do you have plans in this direction @Kinggerm ?
Yes, the GetOrganelleDB was versioned already.
But currently, we only have two main versions (0.0.0
and 0.0.1
), where
GetOrganelle <1.7.3
only works with GetOrganelleDB 0.0.0
while GetOrganelle 1.7.3+
work with both GetOrganelleDB 0.0.0
and GetOrganelleDB 0.0.1
. I have plans for updating GetOrganelleDB but I cannot do it very soon. So 0.0.1
will last long.
Yes, the GetOrganelleDB was versioned already. But currently, we only have two main versions (
0.0.0
and0.0.1
), whereGetOrganelle <1.7.3
only works withGetOrganelleDB 0.0.0
whileGetOrganelle 1.7.3+
work with bothGetOrganelleDB 0.0.0
andGetOrganelleDB 0.0.1
. I have plans for updating GetOrganelleDB but I cannot do it very soon. So0.0.1
will last long.
Thanks for the clarification.
Great. I will have a look. One thing I noticed: You should also add a sample
.loc
file like here https://github.com/galaxyproject/tools-iuc/blob/master/tools/mitos/mitos.loc.sample Another question is if the reference datasets are/will be versioned (and specific tool versions require reference data version), then we would need a version column. Do you have plans in this direction @Kinggerm ?Yes, the GetOrganelleDB was versioned already. But currently, we only have two main versions (
0.0.0
and0.0.1
), whereGetOrganelle <1.7.3
only works withGetOrganelleDB 0.0.0
whileGetOrganelle 1.7.3+
work with bothGetOrganelleDB 0.0.0
andGetOrganelleDB 0.0.1
. I have plans for updating GetOrganelleDB but I cannot do it very soon. So0.0.1
will last long.
I think this is mostly ready. But as you pointed out that the version might change, I think it does not hurt to add a version column in the loc file. What do you think?
Hello @bernt-matthias and @bgruening,
Thanks both for the feedback and apologies for the delay in getting back to you. I have made the following changes:
- added a
getorganelle.loc.sample
file - added a column for the get organelle database version to the
test-data/getorganelle.loc
file
I was not sure if tool_data_table_conf.xml.sample
and tool_data_table_conf.xml.test
needed to be adjusted to reflect this.
If you have any further comments or suggestions, please do let me know.
Best wishes Ollie
I tried to test the tool in my local galaxy but it cannot find the reference data. I am assuming this is something that needs to be set up with the galaxy tool data manager?
I tried to test the tool in my local galaxy but it cannot find the reference data. I am assuming this is something that needs to be set up with the galaxy tool data manager?
Actually it should be possible to edit the .loc file and entries should show up. During startup the log should show how many entries are loaded from each data table.
You can also see the content of the data tables in the admin interface.
Hello again,
I upated the main xml to include the macro as suggested.
I tried to remove the column for type from test-data/getorganelle.loc
, tool_data_table_conf.xml.sample
and tool_data_table_conf.xml.test
but when I did, I got the error below with planemo test
parameter 'config_dir': requires a value, but no legal values defined
Can you suggest how to edit the files to remove the type column, just in case I am missing something obvious?
Hi @bernt-matthias,
Thanks for the suggestions, I made the changes and moved getorganelle.loc.sample
to a directory called tools-data
. But when I test the tool using planemo I get the same error below
galaxy.tool_util.verify.interactor.RunToolException: parameter 'config_dir': requires a value, but no legal values defined
I assume it is an issue with my data table and locating the reference data.
I refer to the data tables in the command section:
--config-dir '$config_dir.fields.path'
and in the inputs section:
<param argument="--config-dir" label="Reference data" type="select" help="Contact the administrator of this Galaxy instance if you miss reference data">
<options from_data_table="getorganelle">
<filter type="static_value" value="getorganelle" column="2"/>
</options>
<validator message="No reference annotation is available for getorganelle" type="no_options" />
</param>
I tried to follow the mitos2 example to get the tool working with data tables but the options may not be relavant to get organelle. Do you have any suggestions on how to resolve the error?
Best wishes Ollie
Remove the filter <filter type="static_value" value="getorganelle" column="2"/>
.. its removing all options that do not have the text getorganelle
in the 3rd column.
Remove the filter
.. its removing all options that do not have the text getorganelle in the 3rd column.
Thanks @bernt-matthias this fixed the issue, I accepted all suggested changes.
Actually it should be possible to edit the .loc file and entries should show up. During startup the log should show how many entries are loaded from each data table.
When I load the tool in my local galaxy I still get the message that no reference data can be found. I am not sure how to edit the loc file to do this. I tried adding a file called tool-data/getorganelle.loc
which looked like this:
getorganelle_refdata getorganelle seed and label data /home/owhite/.GetOrganelle/ 0.0.1
Where ```/home/owhite/.GetOrganelle/```` is the path to my locally installed get organelle database. Can you suggest where I might be going wrong?
In general, are there any other edits that need to be made to the getorganelle tool before I start to focus on the data manager?
Cheers Ollie
Did you restart after editing the loc file?
Can you execute find tool-data -name "getorganelle.loc"
.. I guess the loc file is just at a different location. If I understand it correctly each tool version has its own loc file .. and the contents are merged during startup of Galaxy.
Yes I restart galaxy each time I edit the files.
When I run find tool-data -name "getorganelle.loc"
from the same directory as the getorganelle xml, I get the following output tool-data/getorganelle.loc
Note that for my local galaxy, I have the xml files in a separate folder within my tools directory. Below is the section block from my config/tool_conf.xml
file with custom tools that I have worked on:
<section name="MyTools" id="mTools">
<tool file="myTools/at_content.xml" />
<tool file="myTools/split_fasta.xml"/>
<tool file="myTools/get_organelle_from_reads.xml" />
<tool file="myTools/count_kmers.xml" />
</section>
Hi @bernt-matthias
Thanks for the update.
The only when tag needed will be for anonymous (move the two parameters here .. I assume that they are only useful for this case?).
The -s
and --genes
options that are required when -F anonym
are also useful when this parameter is not selected. So I thought it might be could to keep these variables separate.
Is it possible/sensible to have the same parameters being set a two different place in an xml?
As it stands, if someone were to set -F anonym
without the -s
and --genes
options, there is a good error meesage to catch this.
get_organelle_from_reads.py -1 Arabidopsis_simulated.1.fq.gz -2 Arabidopsis_simulated.2.fq.gz -o test -F ano nym
############################################################################
ERROR: "-s" and "--genes" must be specified when "-F anonym"!
The easiest thing to do would be to ignore the anonym setting.
Best wishes Ollie
So the two parameters are mandatory in the anonymous case and optional otherwise?
So the two parameters are mandatory in the anonymous case and optional otherwise?
Yes exactly
Then a conditional might be the best solution https://docs.galaxyproject.org/en/latest/dev/schema.html#tool-inputs-conditional