tools-iuc [Column remove by heading]: Tool fails when file contains non-ascii characters

I am writing a tutorial on data manipulation, using a dataset that contains many people's names and therefore many accented characters. Running this tool on it always fails with the following error message:

Traceback (most recent call last):
  File "/cvmfs/main.galaxyproject.org/shed_tools/toolshed.g2.bx.psu.edu/repos/iuc/column_remove_by_header/372967836e98/column_remove_by_header/column_remove_by_header.py", line 25, in <module>
    header = fh.readline().strip( '\r\n' )
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 128: ordinal not in range(128)

Jul 11 '22 15:07 shiltemann

This should depend on the locale settings of the machine the code is running on, but that, of course, doesn't help you. I'll try to fix it in the wrapper by tomorrow. Is https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/bgruening/column_arrange_by_header/bg_column_arrange_by_header/0.2 also affected by this?
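To illustrate the locale dependence: when Python opens a text file without an explicit encoding, it falls back to the locale's preferred encoding, which is ASCII under a C/POSIX locale on Python 3.6. Below is a minimal sketch of how the header could be read independently of the locale. The helper name `read_header` and the choice of UTF-8 are assumptions for illustration, not the code in the actual tool or fix.

```python
import io


def read_header(binary_stream, encoding="utf-8"):
    """Return the first line of a binary stream, decoded with an explicit encoding.

    Hypothetical helper: passing the encoding explicitly means the result
    no longer depends on the locale of the machine running the code.
    """
    # newline="" keeps line endings untranslated so they can be stripped by hand
    text = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    return text.readline().rstrip("\r\n")


# Tab-separated sample with accented characters in the data rows only
sample = "name\tcity\nZoë\tMálaga\n".encode("utf-8")
header = read_header(io.BytesIO(sample))
```

For a file on disk the equivalent would be `open(path, encoding="utf-8")`. This assumes the input really is UTF-8; a file in another encoding would still need to be handled separately.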

Jul 11 '22 16:07 wm75

On second thought: it's not too difficult to fix the encoding issue on the Python side, but you'd also need to be able to specify the column name in the tool interface, and I don't know of a way to make this work for non-ASCII characters (sanitization would always prevent that). What could be made to work is dropping or keeping only some named columns with just ASCII symbols in their names (e.g. if you wanted to clean a file of weird columns that would likely create issues further on in an analysis). Is that what you are looking for, @shiltemann?

Jul 11 '22 20:07 wm75

On third thought, one could allow Unicode escapes in the tool interface text field and decode them on the Python side. Then, if your users are willing to type Voil\xe0, they could drop a Voilà column from their data.
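A rough sketch of what that decoding step could look like, assuming the text field delivers a pure-ASCII string containing Python-style escapes. The function name `decode_column_name` is hypothetical and not taken from the tool.

```python
def decode_column_name(escaped):
    """Turn an ASCII string with Python-style escapes into the real column name.

    Hypothetical helper: assumes the tool interface only lets ASCII through,
    so encoding to ASCII first is safe and rejects anything unexpected.
    """
    return escaped.encode("ascii").decode("unicode_escape")


# A user types the escaped form in the text field
name = decode_column_name(r"Voil\xe0")
```

Plain ASCII names pass through unchanged, and `\u00e0` works the same way as `\xe0`. One caveat of this approach is that any literal backslash in a column name would also be interpreted as the start of an escape.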

Jul 11 '22 21:07 wm75

@wm75 in my case the column headers don't have non-ASCII characters, just the data in the columns. Could we just set the locale in the wrapper before we call awk?

Jul 12 '22 08:07 shiltemann

That's an easy case then, but I think I already have a proper general solution ready. Going to open a PR soon.

Jul 12 '22 08:07 wm75

Thanks a lot, @wm75! And just to answer your earlier question: yes, the column_arrange_by_header tool has the same issue.

Jul 12 '22 09:07 shiltemann

Should be addressed in https://github.com/galaxyproject/tools-iuc/pull/4662

Thanks for checking the other tool. Might fix that too if I have a bit of spare time again.

Jul 12 '22 12:07 wm75

How about using a column select parameter?

Jul 15 '22 15:07 bernt-matthias
