TAB/CSV files read differently on different operating systems

Open PrimozGodec opened this issue 3 years ago • 0 comments

What's wrong? Data in TAB/CSV files is sometimes misread on Linux/macOS when a field starts with a ' character right after the delimiter (e.g. at the beginning of a sentence). See the next section for details.

How can we reproduce the problem?

  1. Open File widget
  2. Load https://github.com/biolab/orange3-text/blob/master/orangecontrib/text/datasets/book-excerpts.tab file
  3. Connect a Data Table widget and observe that in row 133 (in the Text column), the ' character at the beginning of the text is missing, although it is present in the TAB file. The error only appears on Linux/macOS; the file is read correctly on Windows (though it may fail on Windows too in some cases).

The reason is that the Sniffer (https://github.com/biolab/orange3/blob/99b196651b206b448a9c3120a34954179578d876/Orange/data/io.py#L147-L152) recognizes ' as the CSV quote character, so this character is removed while reading. For the same file, " is recognized as the quote character on Windows (so it works correctly there).
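
The effect of the chosen quote character can be sketched with Python's standard csv module. This is only an illustration, assuming the linked reader ends up passing the sniffed quote character to a csv-style parser; the sample row and the read_rows helper are made up for this example and are not taken from book-excerpts.tab or from Orange.

```python
import csv
import io

# Hypothetical tab-delimited sample; the text field starts with an apostrophe.
SAMPLE = "category\ttext\nchildren\t'Twas brillig,' said Alice.\n"


def read_rows(text, quotechar):
    """Parse tab-delimited text using the given quote character."""
    buffer = io.StringIO(text, newline="")
    reader = csv.reader(buffer, delimiter="\t", quotechar=quotechar)
    return list(reader)


# Quote character ': the leading apostrophe opens a quoted field and is dropped.
rows_single = read_rows(SAMPLE, "'")
# Quote character ": apostrophes are ordinary characters and are preserved.
rows_double = read_rows(SAMPLE, '"')

print(repr(rows_single[1][1]))
print(repr(rows_double[1][1]))
```

With ' as the quote character, the parsed text no longer starts with an apostrophe; with " it is read verbatim. This suggests that whichever quote character the sniffer settles on for a given sample decides whether the leading ' survives.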

What's your environment?

  • Operating system: macOS/Linux
  • Orange version: master/3.32

PrimozGodec, Jul 20 '22 13:07