orange3
TAB/CSV files read differently on different operating systems
What's wrong?
Data from TAB/CSV files that contain a ' character immediately after a delimiter (e.g. at the beginning of a sentence) is sometimes misread on Linux/macOS. See the next section for details.
How can we reproduce the problem?
- Open the File widget
- Load the https://github.com/biolab/orange3-text/blob/master/orangecontrib/text/datasets/book-excerpts.tab file
- Connect the Data Table and observe that in row 133 (in the Text column), the ' character at the beginning of the text is missing (it is present in the TAB file).
The error only appears on Linux/macOS; the file is read correctly on Windows, although there may be cases where it fails on Windows too.
The reason is that the Sniffer (https://github.com/biolab/orange3/blob/99b196651b206b448a9c3120a34954179578d876/Orange/data/io.py#L147-L152) recognizes ' as the CSV quote character, so this character is removed while reading. For the same file, " is recognized as the quote character on Windows, so the file is read correctly there.
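The effect can be sketched outside Orange with a minimal example, assuming the linked code relies on Python's standard `csv.Sniffer`. The sample text below is made up for illustration (it is not taken from book-excerpts.tab), and the names `sample`, `sniffed_rows`, and `literal_rows` are hypothetical.

```python
import csv

# Hypothetical two-column TAB sample (not taken from book-excerpts.tab).
sample = "Category\tText\nAdult\t'Twas brillig'\nChildren\tA plain sentence\n"

# Let the standard-library Sniffer guess the dialect from the sample.
dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter), repr(dialect.quotechar))

# Reading with the sniffed dialect treats ' as a quote character,
# so the apostrophes around the field are stripped.
sniffed_rows = list(csv.reader(sample.splitlines(), dialect))

# Reading with quoting disabled keeps every character of the field.
literal_rows = list(
    csv.reader(sample.splitlines(), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(sniffed_rows[1])
print(literal_rows[1])
```

In this sketch the sniffed dialect drops the apostrophes, while `csv.QUOTE_NONE` preserves them. Whether disabling quoting is an acceptable fix for Orange depends on whether other TAB/CSV files rely on quoted fields.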
What's your environment?
- Operating system: macOS/Linux
- Orange version: master/3.32