galaxy icon indicating copy to clipboard operation
galaxy copied to clipboard

add path safe element_identifier

Open bernt-matthias opened this issue 5 years ago • 7 comments

As a follow up from discussions in https://github.com/galaxyproject/galaxy/pull/7372 I would like to suggest and discuss the following convenience function for tool developers.

Currently tool developers often use something like $safe_id = re.sub('[^\w\-\s]', '_', str($element_identifier)) for creating links to input data. This is often done because the input file name is used by programs to encode metadata, e.g. sample names. Such a replacement is only neccessary if the output of these programs (file names or file contents) contains these file names. Hence, by the replacement of the element identifiers changes the metadata that is output of the programs, which might be confusing to users or downstream tools.

So I currently implemented a function that just replaces the only character that must not be used in file names (even if quoted) .. in a bash (would be different on windows systems). In well programmed software this might be sufficient, but I guess often we can not make this assumption. Hence the replacement of [^\w\-\s] would be an alternative implementation or additional function.

Note that here https://github.com/galaxyproject/galaxy/blob/c82749688e06bdd3df496416e9a03c513bccdac6/lib/galaxy/util/init.py#L580 is already a function that implements this (besides allowing '-') and in my tests is (to my surprise) even a wee bit faster than the regular expression.

Looking forward to comments or reviews.

TODOs:

  • [ ] tests (using this in test/functional/tools/identifier... will only be useful once its possible in tests to supply the identifier/name in test data inputs)
  • [ ] documentation

bernt-matthias avatar May 12 '19 14:05 bernt-matthias

Thanks for the comments. I would like to continue the discussion a bit, since I'm really unsure what the best solution is.

For the discussion I would like to cite a comment of @mdvbeek (in https://github.com/galaxyproject/galaxy/pull/7372) about which I thought quite a bit:

So the input dataset needs to be named in certain way in the history ? I don't think we should be forcing users to do that.

This may seem disconnected to the question of naming files, but if you consider that the identifiers -- and thereby the filenames -- are metadata (sample names seem to be the main use case) they are connected. So a replacement suggested here https://planemo.readthedocs.io/en/latest/writing_advanced.html?highlight=element_identifier#processing-identifiers (and implemented in many toos) silently changes the metadata and influences the the output which may cause confusion. Hence, in my opinion the replacement should be as unintrusive as possible -- maybe it shouldn't be done at al.

As a side note, the "Note" at the end of the planemo docs section is contradictory to the statement above. Also I think that it should not be implemented at all.

There's more than just that, ;, ", ', $, (, ) and plenty more that shouldn't be allowed. For a first try I'd probably go with valid_filename_chars in galaxy.util

By implementing the current more extreme solution I wanted to create some discussion on the current state of the art that is implemented in many tools. What would be the technical reason for not allowing these characters? As long as the file name is in single quotes every character except '/' is allowed (in a linux bash), but I would need to replace ' with '"'"' in addition. So the suggested solution would be fine from the Galaxy side. Of course. the called programs might have problems with some of the special characters (e.g. @ in perl scripts). But I would consider this as a bug in the actual programs which should be reported upstream or fixed in the bioconda recipe. Alternatively the tool might just return an error telling the user that the program can not use certain characters and suggest the user to rename the samples.

bernt-matthias avatar May 12 '19 17:05 bernt-matthias

What would be the technical reason for not allowing these characters?

It'll either not work on the command line or be a security hazard. Tool authors can still choose custom or no sanitization of the element identifiers (for instance if you just template them into a config file).

This may seem disconnected to the question of naming files, but if you consider that the identifiers -- and thereby the filenames -- are metadata (sample names seem to be the main use case) they are connected. So a replacement suggested here https://planemo.readthedocs.io/en/latest/writing_advanced.html?highlight=element_identifier#processing-identifiers (and implemented in many toos) silently changes the metadata and influences the the output which may cause confusion.

I don't think anyone disputes this. But as long as the element identifier appears in a command line it needs to be escaped, there's simply no way around it that I can see.

mvdbeek avatar May 15 '19 14:05 mvdbeek

My reasoning was that there is the policy to use single quotes for filenames etc.. This removes the special meaning of every character (except for / and '), see https://www.gnu.org/software/bash/manual/html_node/Single-Quotes.html#Single-Quotes, which should lead to working tool scripts. .. but you are right that it might cause problems in called programs.

I think I will just add a function to replace `'[^\w-\s]' -- which is what tool developers currently do most often -- and add a bit of discussion to the planemo docs (at least highlighting that it should be documented).

Deal? :)

bernt-matthias avatar May 15 '19 16:05 bernt-matthias

yep, sounds good.

mvdbeek avatar May 15 '19 16:05 mvdbeek

How about this :) Note that compared to my previous suggestion I used instead of \s (see https://github.com/galaxyproject/tools-iuc/issues/2417)

Somehow I have the feeling that it would be handy to converge this with:

https://github.com/galaxyproject/galaxy/blob/c82749688e06bdd3df496416e9a03c513bccdac6/lib/galaxy/util/init.py#L580

bernt-matthias avatar May 15 '19 17:05 bernt-matthias

Somehow I have the feeling that it would be handy to converge this with:

absolutely, that'd be good.

mvdbeek avatar May 27 '19 07:05 mvdbeek

Heya .. just rediscovered this old one with @bgruening .. thought a bit about this.

How about to extend the <sanitizer> tag such that it works on data inputs, e.g. by modifying the element identifier (which is not a good idea I guess), or by modifying the safe_element_identifier function?

bernt-matthias avatar Feb 19 '21 13:02 bernt-matthias