Planemo test should not require the use of the Galaxy upload tool

Open gregvonkuster opened this issue 7 years ago • 6 comments

The primary process that planemo uses to test a Galaxy tool is to upload the tool inputs to the specified Galaxy test instance using the Galaxy upload tool. At least this is the approach defined within the travis.yml file in the tools-iuc repository as far as I can tell, and tools must be housed in that repository if they are installed on Galaxy main.

This general approach for planemo testing works in most cases, but newer tools are getting more sophisticated, and using the Galaxy upload tool to define tool inputs for testing is not always justified. An example is the following tool which produces an output with the IdeasPre datatype defined here: https://github.com/galaxyproject/galaxy/pull/5437.

https://testtoolshed.g2.bx.psu.edu/view/greg/ideas/475fa65d5138

The above tool takes a fairly complex input dataset that is produced by the following tool.

https://testtoolshed.g2.bx.psu.edu/view/greg/ideas_preprocessor/f7563bb242fc

There is no realistic scenario where a composite dataset with the IdeasPre datatype will be uploaded to a Galaxy instance using the upload tool. All realistic scenarios have the output of the ideas_preprocessor tool being used as the input to the ideas tool.

However, in order to add a functional test to the ideas tool, the IdeasPre datatype must be upload-able to Galaxy using the upload tool. This restriction has some undesirable effects.

  1. The entry for the datatype in datatypes_conf.xml.sample must have display_in_upload=True set, even though the dataset will likely never be uploaded into Galaxy via the upload tool.
  2. If the IdeasPre datatype did not have to be uploaded, the class could subclass the Html class and could be as simple as this:
class IdeasPre(Html):
    """
    This datatype defines the input format required by IDEAS:
    https://academic.oup.com/nar/article/44/14/6721/2468150
    The IDEAS preprocessor tool produces an output using this
    format.  The extra_files_path of the primary input dataset
    contains the following files and directories.
    - chromosome_windows.txt (optional)
    - chromosomes.bed (optional)
    - IDEAS_input_config.txt
    - tmp directory containing a number of compressed bed files.
    """

    composite_type = None
    allow_datatype_change = False
    file_ext = 'ideaspre'

    def set_meta(self, dataset, **kwd):
        Html.set_meta(self, dataset, **kwd)
        for fname in os.listdir(dataset.extra_files_path):
            if fname.startswith("chromosomes"):
                dataset.metadata.chrom_bed = os.path.join(dataset.extra_files_path, fname)
            elif fname.startswith("chromosome_windows"):
                dataset.metadata.chrom_windows = os.path.join(dataset.extra_files_path, fname)
            elif fname.startswith("IDEAS_input_config"):
                dataset.metadata.input_config = os.path.join(dataset.extra_files_path, fname)
            elif fname.startswith("tmp"):
                dataset.metadata.tmp_archive = os.path.join(dataset.extra_files_path, fname)

However, in order for the datatype to be upload-able, the class must subclass from the Rgenetics class and look like this:

class IdeasPre(Rgenetics):
    """
    This datatype defines the input format required by IDEAS:
    https://academic.oup.com/nar/article/44/14/6721/2468150
    The IDEAS preprocessor tool produces an output using this
    format.  The extra_files_path of the primary input dataset
    contains the following files and directories.
    - chromosome_windows.txt (optional)
    - chromosomes.bed (optional)
    - IDEAS_input_config.txt
    - compressed archived tmp directory containing a number of compressed bed files.
    """

    MetadataElement(name="base_name", desc="Base name for this dataset", default='IDEASData', readonly=True, set_in_upload=True)
    MetadataElement(name="chrom_bed", desc="Bed file specifying window positions", default=None, readonly=True)
    MetadataElement(name="chrom_windows", desc="Chromosome window positions", default=None, readonly=True)
    MetadataElement(name="input_config", desc="IDEAS input config", default=None, readonly=True)
    MetadataElement(name="tmp_archive", desc="Compressed archive of compressed bed files", default=None, readonly=True)

    composite_type = 'auto_primary_file'
    allow_datatype_change = False
    file_ext = 'ideaspre'

    def __init__(self, **kwd):
        Html.__init__(self, **kwd)
        self.add_composite_file('chromosome_windows.txt', description='Chromosome window positions', is_binary=False, optional=True)
        self.add_composite_file('chromosomes.bed', description='Bed file specifying window positions', is_binary=False, optional=True)
        self.add_composite_file('IDEAS_input_config.txt', description='IDEAS input config', is_binary=False)
        self.add_composite_file('tmp.tar.gz', description='Compressed archive of compressed bed files', is_binary=True)

    def set_meta(self, dataset, **kwd):
        Html.set_meta(self, dataset, **kwd)
        for fname in os.listdir(dataset.extra_files_path):
            if fname.startswith("chromosomes"):
                dataset.metadata.chrom_bed = os.path.join(dataset.extra_files_path, fname)
            elif fname.startswith("chromosome_windows"):
                dataset.metadata.chrom_windows = os.path.join(dataset.extra_files_path, fname)
            elif fname.startswith("IDEAS_input_config"):
                dataset.metadata.input_config = os.path.join(dataset.extra_files_path, fname)
            elif fname.startswith("tmp"):
                dataset.metadata.tmp_archive = os.path.join(dataset.extra_files_path, fname)
        self.regenerate_primary_file(dataset)

    def generate_primary_file(self, dataset=None):
        rval = ['<html><head></head><body>']
        rval.append('<h3>Files prepared for IDEAS</h3>')
        rval.append('<ul>')
        for composite_name, composite_file in self.get_composite_files(dataset=dataset).items():
            fn = composite_name
            rval.append('<li><a href="%s">%s</a></li>' % (fn, fn))
        rval.append('</ul></body></html>\n')
        return "\n".join(rval)

    def regenerate_primary_file(self, dataset):
        # Cannot do this until we are setting metadata.
        rval = ['<html><head></head><body>']
        rval.append('<h3>Files prepared for IDEAS</h3>')
        rval.append('<ul>')
        for fname in os.listdir(dataset.extra_files_path):
            fn = os.path.split(fname)[-1]
            rval.append('<li><a href="%s">%s</a></li>' % (fn, fn))
        rval.append('</ul></body></html>')
        with open(dataset.file_name, 'w') as f:
            f.write("\n".join(rval))
            f.write('\n')
  3. Html datatypes allow any number of files to be associated with the primary dataset, and directories are possible 1 level deep. So if the IdeasPre datatype could be subclassed from Html, the primary dataset could look like the following. The tmp link is a directory that, when clicked, will display the files within the directory. This provides nice transparency for this datatype.
Files prepared for IDEAS

    chromosome_windows.txt
    chromosomes.bed
    IDEAS_input_config.txt
    tmp/
        E001-H3K9ac.bed.gz
        ...N number of compressed bed files...

However, composite datatypes do not support an unknown number of files like those displayed in the tmp directory above, so the tmp directory must be archived so that it can be defined as a single composite file. The Galaxy upload tool requires a user to manually set the datatype to tar when uploading a tar file, but the following entry in a tool functional test will throw an exception because the datatype (i.e., ftype) is not allowed to be set for this tag.

<composite_data value='ideas_test1/tmp.tar' ftype="tar"/>

So the archive must be compressed so that the above tag can look like this:

<composite_data value='ideas_test1/tmp.tar.gz'/>

Forcing the directory of files to be a compressed archive is not ideal because it does not allow the user to see the contents.
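
For reference, the archiving step described above can be sketched with Python's standard tarfile module. This is only an illustration, not code from the ideas_preprocessor tool: the function names, the second file name, and the placeholder file contents are assumptions made for the example.

```python
import os
import tarfile
import tempfile


def archive_tmp_dir(tmp_dir, archive_path):
    """Pack tmp_dir into one gzip-compressed tar file (a single composite file)."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps member paths relative, e.g. "tmp/E001-H3K9ac.bed.gz".
        tar.add(tmp_dir, arcname="tmp")
    return archive_path


def list_archive(archive_path):
    """Return the sorted names of the regular files stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        tmp_dir = os.path.join(work, "tmp")
        os.mkdir(tmp_dir)
        # Placeholder bytes; real inputs would be compressed bed files.
        for name in ("E001-H3K9ac.bed.gz", "E002-H3K9ac.bed.gz"):
            with open(os.path.join(tmp_dir, name), "wb") as handle:
                handle.write(b"placeholder")
        archive = archive_tmp_dir(tmp_dir, os.path.join(work, "tmp.tar.gz"))
        print(list_archive(archive))
```

The contents can still be listed programmatically like this, but that does not help a user browsing the dataset in the Galaxy interface, which is the transparency concern raised here.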

A nice alternative to uploading tool inputs when testing might be to allow for a tool to define its inputs as the outputs produced by another tool. Then planemo test could run that tool first to acquire the outputs rather than uploading them.
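
One way to picture this proposal is a test runner that resolves a test's inputs by first running the tool that produces them. The sketch below is purely hypothetical: neither planemo nor Galaxy provides `ToolTest`, `run_test`, or an `inputs_from` field, and the "tools" here are stand-in functions that only return strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ToolTest:
    """Hypothetical test definition; not a planemo or Galaxy structure."""
    tool_id: str
    run: Callable[[Dict[str, str]], str]  # stand-in for executing the tool
    # Maps an input parameter name to the id of the test that produces it.
    inputs_from: Dict[str, str] = field(default_factory=dict)


def run_test(test_id, tests, cache=None, _active=None):
    """Run a test, first running any tests whose outputs it consumes."""
    cache = {} if cache is None else cache
    _active = set() if _active is None else _active
    if test_id in cache:
        return cache[test_id]  # each producer runs at most once
    if test_id in _active:
        raise ValueError("circular test dependency at %s" % test_id)
    _active.add(test_id)
    test = tests[test_id]
    inputs = {param: run_test(producer, tests, cache, _active)
              for param, producer in test.inputs_from.items()}
    _active.discard(test_id)
    cache[test_id] = test.run(inputs)
    return cache[test_id]


tests = {
    "ideas_preprocessor": ToolTest(
        "ideas_preprocessor", lambda inputs: "ideaspre-dataset"),
    "ideas": ToolTest(
        "ideas", lambda inputs: "ideas(%s)" % inputs["input"],
        inputs_from={"input": "ideas_preprocessor"}),
}
print(run_test("ideas", tests))  # prints "ideas(ideaspre-dataset)"
```

In this sketch the ideas test never uploads an IdeasPre dataset; it receives whatever the ideas_preprocessor stand-in produced.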

gregvonkuster, Feb 01 '18 19:02

I agree broadly that a variety of different upload scenarios should be supported that aren't currently supported. But the upload tool is how we get data into Galaxy: for libraries, for API tests, and even the upload 2.0 I'm working on is still backed by a tool (https://github.com/galaxyproject/galaxy/pull/5220). I'd just say that it should be easier to upload a variety of different types to Galaxy, not that it shouldn't use a tool.

I'll also add that I don't like the idea of chaining tools together to run tool tests; there should be a workflow test for that. I understand I haven't really moved on, say, https://github.com/galaxyproject/planemo/issues/685, but I really should, and I think I vastly prefer that to developing new tool-chaining XML syntaxes just for testing.

jmchilton, Feb 01 '18 19:02

@jmchilton Yes, I agree that support for planemo workflow testing would be ideal. I've looked at it in the past but was not able to get it working. However, in this particular case, @nekrut has told @yuzhang123 that these particular tools can be hosted on Galaxy main, so they must be housed within the tools-iuc repo (at least that is my current understanding). And in order to be accepted into the tools-iuc repo, they must pass functional tests. So if planemo workflow testing is going to be the approach here, then the tools-iuc travis.yml would have to be configured to use it, which I'm hopeful is possible.

gregvonkuster, Feb 01 '18 19:02

> can be hosted on Galaxy main, so they must be housed within the tools-iuc repo

I do not think this is true; we have many non-iuc (and non-devteam) tools on Main.

martenson, Feb 01 '18 19:02

Ah, this is good to know. However, @nekrut did tell me this recently, so I'll double check with him.

gregvonkuster, Feb 01 '18 20:02

I have another use case for this: testing BLAST tools with a database in the user's history, which cannot (currently) be uploaded, only created by running makeblastdb within Galaxy. Again, a workflow test would solve this. https://github.com/peterjc/galaxy_blast/issues/3

peterjc, Feb 02 '18 11:02

This is related to issue https://github.com/galaxyproject/planemo/issues/685

gregvonkuster, Aug 24 '18 14:08