checklistbank icon indicating copy to clipboard operation
checklistbank copied to clipboard

Error when loading Simple Tab Delimited File into Postgres

Open ocourch opened this issue 1 year ago • 2 comments

Hi, I've been following the instructions here to copy the Simple Tab Delimited File to a Postgres instance. When running the command, I first got this error:

postgres=# \copy backbone FROM 'C:\Users\oscar\Downloads\simple.txt'
ERROR:  character with byte sequence 0x81 in encoding "WIN1252" has no equivalent in encoding "UTF8"
CONTEXT:  COPY backbone, line 737

I checked that line and the file, and it contained some Cyrillic characters that, when removed, shifted the error down the file:

postgres=# \copy backbone FROM 'C:\Users\oscar\Downloads\simple.txt'
ERROR:  character with byte sequence 0x9d in encoding "WIN1252" has no equivalent in encoding "UTF8"
CONTEXT:  COPY backbone, line 4735

Is this a known issue when attempting to do this in Windows? I've already made sure the Postgres instance and the file are all UTF-8, but I still get the error. Is there a version of the file that copies correctly on Windows?

ocourch avatar Dec 05 '23 23:12 ocourch

The file is in UTF-8 encoding:

$ head -n 737 simple.txt | tail -n 1
7518485 7895461 \N      f       ACCEPTED        GENUS   {}      7ddf754f-d193-4cc9-b351-99906754a03b    SOURCE  181472999       4  8376456  7434778 7378538 7895461 7518485 \N      35789   Abdullaevia Suleymanov, 1965    Abdullaevia     Abdullaevia     \N      \N \N       Suleymanov      1965    \N      \N      Suleymanov, I. S. (1965). О полиморфиноидных формах и о формах с ячеистым строением стенки среди агглютинированных фораминифер - About polymorphinoid forms and forms with a cellular structure of the wall among agglutinated foraminifera. Доклады Академии наук Узбекской ССР - Reports of the Academy of Sciences of the Uzbek SSR, 7: 46–48. https://www.marinespecies.org/foraminifera/aphia.php?p=sourcedetails&id=286417    {}

$ head -n 737 simple.txt | tail -n 1 | head -c 263 | tail -c 2
л

$ head -n 737 simple.txt | tail -n 1 | head -c 263 | tail -c 2 | hexdump -C
0000000 d0 bb

And the UTF-8 encoding of л is d0 bb

I don't think we have a Windows system with PostgreSQL installed, but it is probably something to do with the client configuration.

According to https://stackoverflow.com/questions/68627853/psql-copy-statement-error-says-character-is-encoded-win1252-despite-file-seeming via https://stackoverflow.com/questions/20952893/postgresql-encoding-problems-on-windows-when-using-psql-command-line-utility it is the terminal ("cmd") that needs to be configured. If this works we can add a note to our documentation.

MattBlissett avatar Dec 06 '23 09:12 MattBlissett

This worked! Thank you for the help. I had actually tried these commands previously but ran psql through the shortcut to the .bat rather than from the same cmd window. Following the stack overflow and calling the .bat from the same window allowed me to copy successfully.

C:\Users\oscar>"\Program Files\PostgreSQL\16\scripts\runpsql.bat"

I appreciate the quick help!

ocourch avatar Dec 06 '23 20:12 ocourch