tools-iuc
[Column remove by heading]: Tool fails when file contains non-ascii characters
I am writing a tutorial on data manipulation, using a dataset that contains lots of names of people, so many accented characters. Using this tool always fails with the following error message:
Traceback (most recent call last):
  File "/cvmfs/main.galaxyproject.org/shed_tools/toolshed.g2.bx.psu.edu/repos/iuc/column_remove_by_header/372967836e98/column_remove_by_header/column_remove_by_header.py", line 25, in <module>
    header = fh.readline().strip( '\r\n' )
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 128: ordinal not in range(128)
This is probably dependent on the locale settings of the machine the code is running on, but knowing that, of course, doesn't help you. I'll try to fix it in the wrapper by tomorrow. Is https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/bgruening/column_arrange_by_header/bg_column_arrange_by_header/0.2 also affected by this?
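To illustrate the locale dependence: when open() is called without an encoding argument, Python 3.6 falls back to the locale's preferred encoding, which is ASCII under a C/POSIX locale. A minimal sketch of reading the header with an explicit encoding follows. The helper name read_header and the sample data are hypothetical and not taken from the tool's script, and UTF-8 input is an assumption.

```python
import os
import tempfile


def read_header(path, encoding="utf-8"):
    """Return the first line of a text file without its line terminator.

    Hypothetical helper: the explicit encoding makes the result
    independent of the machine's locale settings.
    """
    with open(path, encoding=encoding) as fh:
        return fh.readline().strip("\r\n")


# Sample file: ASCII-only header, accented characters in the data rows.
sample = "name\tcity\nJos\u00e9\tM\u00e1laga\n"
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))

header = read_header(tmp.name)

# Forcing ASCII mimics a C/POSIX locale. Text files are decoded in chunks,
# so readline() can fail even though the header line itself is pure ASCII.
try:
    with open(tmp.name, encoding="ascii") as fh:
        fh.readline()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

os.unlink(tmp.name)
```

With the explicit encoding the header is read cleanly, while the ASCII read raises the same kind of UnicodeDecodeError as in the traceback above.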
On second thought: it's not too difficult to fix the encoding issue on the Python side, but you'd also need to be able to specify the column name in the tool interface, and I don't know of a way to make this work for non-ASCII characters (sanitization would always prevent that). What could be made to work is dropping or keeping only named columns with just ASCII symbols in their names (e.g. if you wanted to clean a file of weird columns that would likely create issues further on in an analysis). Is that what you are looking for, @shiltemann?
On third thought, one could allow unicode escapes in the tool interface text field and decode them on the Python side. Then, if your users are willing to type Voil\xe0, they could drop a Voilà column from their data.
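A minimal sketch of what decoding such escapes on the Python side could look like is given here. The function name decode_column_name is hypothetical, and this is not necessarily how the eventual fix works; it only shows the idea with the standard unicode_escape codec.

```python
def decode_column_name(raw):
    """Turn escapes such as \\xe0 or \\u00e0 in user input into characters.

    Hypothetical helper. The input is first encoded to ASCII with
    backslashreplace so that any non-ASCII characters already present
    become escapes too, instead of being mangled by the
    unicode_escape codec.
    """
    return raw.encode("ascii", "backslashreplace").decode("unicode_escape")


# The user types the eight ASCII characters V-o-i-l-\-x-e-0 ...
typed = "Voil\\xe0"
# ... and the tool compares the decoded name against the file's header.
column = decode_column_name(typed)
```

One caveat of this approach: every backslash in the input is interpreted as the start of an escape, so a column name containing a literal backslash would have to be typed with the backslash doubled.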
@wm75 in my case the column headers don't have non-ASCII characters, just the data in the columns. Could we just set the locale in the wrapper before we call awk?
That's an easy case then, but I think I already have a proper general solution ready. Going to open a PR soon.
Thanks a lot @wm75! And just to answer your earlier question: yes, the column_arrange_by_header tool has the same issue.
Should be addressed in https://github.com/galaxyproject/tools-iuc/pull/4662
Thanks for checking the other tool. I might fix that one too when I have a bit of spare time again.
How about using a column select parameter?