Performance: slow CSV uploads
Hi, I have to import a big CSV into Postgres. The file has only 5 columns but a lot of rows; it is around 20 GB. The pgloader import into PostgreSQL takes 4h38m for the 20 GB file. These are my pgloader settings for the import:
LOAD CSV
     FROM {{FILEPATH}}
     HAVING FIELDS
     (
        NUM_TELEFONO,
        CODICE_CIC,
        DESCRIZIONE_CIC,
        DATA_ATTIVAZIONE_COMPONENTE [date format 'YYYY-MM-DD HH24:MI:SS'],
        DATA_CESSAZIONE_COMPONENTE [date format 'YYYY-MM-DD HH24:MI:SS']
     )
     INTO postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}
     TARGET TABLE TABLEDEST
     (
        NUM_TELEFONO text using (or NUM_TELEFONO "ND"),
        CODICE_CIC text using (or CODICE_CIC "ND"),
        DESCRIZIONE_CIC text using (or DESCRIZIONE_CIC "ND"),
        DATA_ATTIVAZIONE_COMPONENTE,
        DATA_CESSAZIONE_COMPONENTE,
        MESE_RIFERIMENTO text using "XXXXX",
        YYYYMMDD text using "YYYY"
     )
     WITH workers = 16, concurrency = 4,
          fields optionally enclosed by '"',
          batch rows = 100000,
          batch size = 200 MB,
          batch concurrency = 3,
          fields terminated by ';'
     SET client_encoding to 'utf8',
         work_mem to '12MB',
         standard_conforming_strings to 'on';
Here are the main details of my environment:
- the database is Cloud SQL on Google Cloud Platform: 8 vCPUs, 32 GB memory
- the VM that launches pgloader is a Google Cloud Platform VM: 8 vCPUs, 32 GB memory
- the VM and the database are inside the same VPC and communicate over the private network, not the internet
- the target table of the import has no indexes or triggers
- bad records are extremely rare, so I can configure pgloader to use big batches to improve performance
My questions are:
- can I improve the performance? Right now the 20 GB CSV imports in 4h38m (roughly 1.2 MB/s)
- do any of the settings configured above cause slowness?
- bearing in mind that I will have a powerful VM and a powerful database in production, which options can improve the performance of my import?
- what is the behavior of the batch concurrency option? I can't find any note about it in the docs
Thanks a lot, regards
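As a rough baseline sketch, a plain COPY of the same file from the same VM can be timed like this; the file path is a placeholder, the connection string reuses the template variables above, and it assumes the two constant columns (MESE_RIFERIMENTO, YYYYMMDD) are nullable or have defaults:

time psql "postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}" \
  -c "\copy TABLEDEST (NUM_TELEFONO, CODICE_CIC, DESCRIZIONE_CIC, DATA_ATTIVAZIONE_COMPONENTE, DATA_CESSAZIONE_COMPONENTE) FROM '/path/to/file.csv' WITH (FORMAT csv, DELIMITER ';')"

If the raw \copy finishes much faster over the same private network, the bottleneck is on the pgloader side rather than in the VPC or Cloud SQL.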
I am curious, have you found better parameters since then?
I just tried importing an 80 MB CSV and I get around 30 seconds with pgloader vs 30 seconds using \copy. I know that when there are no errors in the processing \copy wins, but I did not expect it to be twice as fast.
Here are my settings:
LOAD CSV
     FROM '/mnt/d/Tickets/genned100K.csv' (seq, source_artefact ...)
     INTO postgresql://db?table (seq, source_artefact ...)
     WITH skip header = 1,
          fields terminated by ',',
          batch rows = 100000,
          batch size = 100MB,
          workers = 8,
          concurrency = 8
     SET work_mem to '30MB',
         maintenance_work_mem to '30MB',
         standard_conforming_strings to 'on';
8 vCPUs and 32 GB of RAM as well.
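For reference, a sketch of the \copy side of that comparison; the table name is a placeholder (it is elided in the connection string above), and the HEADER and DELIMITER options mirror skip header = 1 and fields terminated by ',':

time psql "postgresql://db" \
  -c "\copy target_table FROM '/mnt/d/Tickets/genned100K.csv' WITH (FORMAT csv, HEADER true, DELIMITER ',')"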
@dario-frongillo do you really want to limit your pg session to work_mem = '12MB' when you are processing 20 GB? :-) Also beware of any indexes on your tables, which can slow down your import.
Hi @tigerfoot, I have also tried these settings: SET client_encoding to 'utf8', work_mem to '1000MB', maintenance_work_mem to '1600MB', but there is no improvement; it is still slow. I have no indexes on the table and no constraints. I also ran a test to measure network latency:
time nc -zw30 ip port
real    0m0.010s
user    0m0.008s
sys     0m0.002s
Do you have any advice?
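As a side note, nc -z only times the TCP connect. A sketch for timing a full round trip through Postgres itself, with a placeholder connection string:

time psql "postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}" -c "SELECT 1;"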
I have the same issue of slow imports from a GCP VM (CentOS 7). The rate is 2.1 MB/s, so it takes 30 s for a 60 MB file. COPY from the same CSV file takes less than 3 seconds. Have you found a solution? I think there is a problem with CPU usage. Thanks for your help.
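A quick way to test the CPU hypothesis, sketched here on the assumption that the sysstat package is available on the VM, is to watch pgloader's per-process CPU usage while the load runs:

# sample CPU usage of all pgloader processes every 5 seconds
pidstat -u -p "$(pgrep -d, pgloader)" 5

If pgloader sits at or near 100% of a single core while the database and network stay mostly idle, the load is CPU-bound in the client rather than limited by Cloud SQL.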