[Column remove by heading]: Tool fails when file contains non-ascii characters

Open shiltemann opened this issue 2 years ago • 8 comments

I am writing a tutorial on data manipulation, using a dataset that contains lots of people's names and therefore many accented characters. Using this tool always fails with the following error message:

Traceback (most recent call last):
  File "/cvmfs/main.galaxyproject.org/shed_tools/toolshed.g2.bx.psu.edu/repos/iuc/column_remove_by_header/372967836e98/column_remove_by_header/column_remove_by_header.py", line 25, in <module>
    header = fh.readline().strip( '\r\n' )
  File "/usr/local/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 128: ordinal not in range(128)

shiltemann avatar Jul 11 '22 15:07 shiltemann

This should depend on the locale settings of the machine the code is running on, but that, of course, doesn't help. I'll try to fix it in the wrapper by tomorrow. Is https://usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/bgruening/column_arrange_by_header/bg_column_arrange_by_header/0.2 also affected by this?

wm75 avatar Jul 11 '22 16:07 wm75

On second thought: it's not too difficult to fix the encoding issue on the Python side, but you'd also need to be able to specify the column name in the tool interface, and I don't know of a way to make this work for non-ASCII characters (sanitization would always prevent that). What could be made to work is dropping or keeping only some named columns with just ASCII symbols in their names (e.g. if you wanted to clean a file of weird columns that would likely create issues further on in an analysis). Is that what you are looking for, @shiltemann?

wm75 avatar Jul 11 '22 20:07 wm75
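
For illustration only, here is a minimal sketch of what a Python-side fix along these lines could look like. It assumes the input is UTF-8 encoded and tab-separated; `read_header` is a hypothetical helper, not the code from the tool or from any PR.

```python
import os
import tempfile


def read_header(path, encoding="utf-8"):
    """Return the column names from the first line of a tab-separated file.

    Hypothetical helper: the encoding is passed explicitly so that open()
    does not fall back to the locale's preferred encoding (ASCII under a
    C/POSIX locale, which is what the traceback above ran into).
    """
    with open(path, encoding=encoding, newline="") as fh:
        header = fh.readline().rstrip("\r\n")
    return header.split("\t")


# Small demo: ASCII-only header, accented characters in the data rows.
with tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False) as tmp:
    tmp.write("name\tcity\nRenée\tZürich\n".encode("utf-8"))
try:
    columns = read_header(tmp.name)
finally:
    os.remove(tmp.name)

print(columns)
```

Hard-coding UTF-8 is itself an assumption about the data; files in another encoding would still need the encoding exposed as a parameter.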

On third thought, one could allow Unicode escapes in the tool interface text field and decode them on the Python side. Then, if your users are willing to type Voil\xe0, they could drop a Voilà column from their data.

wm75 avatar Jul 11 '22 21:07 wm75
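
As a rough sketch of the escape-decoding idea, assuming the text field delivers an ASCII-only string containing literal backslash escapes: `decode_column_name` is a hypothetical name, and this is not necessarily how the tool ended up handling it.

```python
def decode_column_name(raw):
    """Turn literal escapes such as \\xe0 or \\u00e0 into the real characters.

    Hypothetical helper. The input is first encoded to ASCII bytes (this
    raises UnicodeEncodeError if non-ASCII characters slipped through), then
    Python's built-in 'unicode_escape' codec interprets the escapes.
    """
    return raw.encode("ascii").decode("unicode_escape")


typed = r"Voil\xe0"  # what a user would type into the text field
decoded = decode_column_name(typed)
print(decoded)  # Voilà
```

One caveat of this approach: any backslash in a column name is treated as the start of an escape, so a header that really contains a backslash would have to be typed with it doubled.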

@wm75 in my case the column headers don't have non-ASCII characters, just the data in the columns. Could we just set the locale in the wrapper before we call awk?

shiltemann avatar Jul 12 '22 08:07 shiltemann

That's an easy case then, but I think I already have a proper general solution ready. Going to open a PR soon.

wm75 avatar Jul 12 '22 08:07 wm75

thanks a lot @wm75! And just to answer your earlier question: yes the column_arrange_by_header tool has the same issue.

shiltemann avatar Jul 12 '22 09:07 shiltemann

Should be addressed in https://github.com/galaxyproject/tools-iuc/pull/4662

Thanks for checking the other tool. Might fix that too if I have a bit of spare time again.

wm75 avatar Jul 12 '22 12:07 wm75

How about using a column select parameter?

bernt-matthias avatar Jul 15 '22 15:07 bernt-matthias