galaxy
galaxy copied to clipboard
add path safe element_identifier
As a follow up from discussions in https://github.com/galaxyproject/galaxy/pull/7372 I would like to suggest and discuss the following convenience function for tool developers.
Currently tool developers often use something like $safe_id = re.sub('[^\w\-\s]', '_', str($element_identifier)) for creating links to input data. This is often done because the input file name is used by programs to encode metadata, e.g. sample names. Such a replacement is only neccessary if the output of these programs (file names or file contents) contains these file names. Hence, by the replacement of the element identifiers changes the metadata that is output of the programs, which might be confusing to users or downstream tools.
So I currently implemented a function that just replaces the only character that must not be used in file names (even if quoted) .. in a bash (would be different on windows systems). In well programmed software this might be sufficient, but I guess often we can not make this assumption. Hence the replacement of [^\w\-\s] would be an alternative implementation or additional function.
Note that here https://github.com/galaxyproject/galaxy/blob/c82749688e06bdd3df496416e9a03c513bccdac6/lib/galaxy/util/init.py#L580 is already a function that implements this (besides allowing '-') and in my tests is (to my surprise) even a wee bit faster than the regular expression.
Looking forward to comments or reviews.
TODOs:
- [ ] tests (using this in test/functional/tools/identifier... will only be useful once its possible in tests to supply the identifier/name in test data inputs)
- [ ] documentation
Thanks for the comments. I would like to continue the discussion a bit, since I'm really unsure what the best solution is.
For the discussion I would like to cite a comment of @mdvbeek (in https://github.com/galaxyproject/galaxy/pull/7372) about which I thought quite a bit:
So the input dataset needs to be named in certain way in the history ? I don't think we should be forcing users to do that.
This may seem disconnected to the question of naming files, but if you consider that the identifiers -- and thereby the filenames -- are metadata (sample names seem to be the main use case) they are connected. So a replacement suggested here https://planemo.readthedocs.io/en/latest/writing_advanced.html?highlight=element_identifier#processing-identifiers (and implemented in many toos) silently changes the metadata and influences the the output which may cause confusion. Hence, in my opinion the replacement should be as unintrusive as possible -- maybe it shouldn't be done at al.
As a side note, the "Note" at the end of the planemo docs section is contradictory to the statement above. Also I think that it should not be implemented at all.
There's more than just that,
;, ", ', $, (, )and plenty more that shouldn't be allowed. For a first try I'd probably go with valid_filename_chars in galaxy.util
By implementing the current more extreme solution I wanted to create some discussion on the current state of the art that is implemented in many tools. What would be the technical reason for not allowing these characters? As long as the file name is in single quotes every character except '/' is allowed (in a linux bash), but I would need to replace ' with '"'"' in addition. So the suggested solution would be fine from the Galaxy side. Of course. the called programs might have problems with some of the special characters (e.g. @ in perl scripts). But I would consider this as a bug in the actual programs which should be reported upstream or fixed in the bioconda recipe. Alternatively the tool might just return an error telling the user that the program can not use certain characters and suggest the user to rename the samples.
What would be the technical reason for not allowing these characters?
It'll either not work on the command line or be a security hazard. Tool authors can still choose custom or no sanitization of the element identifiers (for instance if you just template them into a config file).
This may seem disconnected to the question of naming files, but if you consider that the identifiers -- and thereby the filenames -- are metadata (sample names seem to be the main use case) they are connected. So a replacement suggested here https://planemo.readthedocs.io/en/latest/writing_advanced.html?highlight=element_identifier#processing-identifiers (and implemented in many toos) silently changes the metadata and influences the the output which may cause confusion.
I don't think anyone disputes this. But as long as the element identifier appears in a command line it needs to be escaped, there's simply no way around it that I can see.
My reasoning was that there is the policy to use single quotes for filenames etc.. This removes the special meaning of every character (except for / and '), see https://www.gnu.org/software/bash/manual/html_node/Single-Quotes.html#Single-Quotes, which should lead to working tool scripts. .. but you are right that it might cause problems in called programs.
I think I will just add a function to replace `'[^\w-\s]' -- which is what tool developers currently do most often -- and add a bit of discussion to the planemo docs (at least highlighting that it should be documented).
Deal? :)
yep, sounds good.
How about this :) Note that compared to my previous suggestion I used instead of \s (see https://github.com/galaxyproject/tools-iuc/issues/2417)
Somehow I have the feeling that it would be handy to converge this with:
https://github.com/galaxyproject/galaxy/blob/c82749688e06bdd3df496416e9a03c513bccdac6/lib/galaxy/util/init.py#L580
Somehow I have the feeling that it would be handy to converge this with:
absolutely, that'd be good.
Heya .. just rediscovered this old one with @bgruening .. thought a bit about this.
How about to extend the <sanitizer> tag such that it works on data inputs, e.g. by modifying the element identifier (which is not a good idea I guess), or by modifying the safe_element_identifier function?