
Performance: slow CSV uploads

esoni88 opened this issue 4 years ago · 4 comments

Hi, I have to import a big CSV into Postgres. The file has only 5 columns but a lot of rows; it is around 20 GB. The pgloader import into PostgreSQL takes 4h38m for the 20 GB file. These are my pgloader settings for the import:

LOAD CSV
     FROM {{FILEPATH}}
     HAVING FIELDS
      (
         NUM_TELEFONO,
         CODICE_CIC,
         DESCRIZIONE_CIC,
         DATA_ATTIVAZIONE_COMPONENTE [date format 'YYYY-MM-DD HH24:MI:SS'],
         DATA_CESSAZIONE_COMPONENTE [date format 'YYYY-MM-DD HH24:MI:SS']
      )     
     INTO postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}
     TARGET TABLE TABLEDEST
        (
          NUM_TELEFONO text using (or NUM_TELEFONO "ND"),
          CODICE_CIC text using (or CODICE_CIC "ND"),
          DESCRIZIONE_CIC text using (or DESCRIZIONE_CIC "ND"),
          DATA_ATTIVAZIONE_COMPONENTE,
          DATA_CESSAZIONE_COMPONENTE,
          MESE_RIFERIMENTO text using "XXXXX",
          YYYYMMDD text using "YYYY"
        )
     WITH 
          workers = 16, concurrency = 4,
          fields optionally enclosed by '"',
          batch rows=100000,
          batch size = 200 MB,
          batch concurrency = 3,
          fields terminated by ';'

      SET client_encoding to 'utf8',
          work_mem to '12MB',
          standard_conforming_strings to 'on';
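
For comparison, a raw \copy of the same file into a plain five-column staging table gives a lower bound to measure the pgloader run against. This is only a rough sketch: the staging table tabledest_staging is an assumption, not part of my actual setup, and it keeps the date columns as text since it only measures throughput.

# Sketch: create an assumed five-column text staging table, then time a raw COPY of the same CSV
psql "postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}" -c "
  CREATE TABLE IF NOT EXISTS tabledest_staging (
    num_telefono text, codice_cic text, descrizione_cic text,
    data_attivazione_componente text, data_cessazione_componente text);"

time psql "postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}" \
     -c "\copy tabledest_staging FROM '{{FILEPATH}}' WITH (FORMAT csv, DELIMITER ';')"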

Here are the main details of my environment:

  • the database is Cloud SQL on Google Cloud Platform: 8 vCPUs, 32 GB memory
  • the VM that launches pgloader is a Google Cloud Platform VM: 8 vCPUs, 32 GB memory
  • the VM and the database are inside the same VPC and communicate over the private network, not over the internet
  • the target table of the import has no indexes or triggers
  • bad records are extremely rare, so I can configure pgloader to use big batches in order to improve performance

My questions are:

  • can I improve the performance? So far the 20 GB CSV has taken 4h38m to import
  • do any of the settings configured above cause slowness?
  • bearing in mind that I will have a powerful VM and a powerful DB in production, which options can improve the performance of my import?
  • what is the behavior of the batch concurrency option? I can't find any note about it in the docs

Thanks a lot, regards

esoni88 · Jan 16 '21 19:01

I am curious, have you found better parameters since then?

I just tried importing an 80 MB CSV and I get around 30 seconds with pgloader vs 30 seconds using \copy. I know that when there are no errors in the processing \copy wins, but I did not expect it to be twice as fast.

Here are my settings:

LOAD CSV
     FROM '/mnt/d/Tickets/genned100K.csv' (seq,source_artefact ...)
     INTO postgresql://db?table (seq,source_artefact ...)
     WITH skip header = 1,
          fields terminated by ',',
          batch rows = 100000,
          batch size = 100 MB,
          workers = 8,
          concurrency = 8
      SET work_mem to '30MB',  
          maintenance_work_mem to '30MB',
          standard_conforming_strings to 'on'
   ;

8 vCPUs and 32 GB of RAM as well.

DonHaul · Feb 11 '21 16:02

@dario-frongillo do you really want to limit your PG session with work_mem set to '12MB' when you want to process 20 GB? :-) Also beware of any indexes on your tables that can slow down your import.
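
If in doubt, something like this lists any indexes and user triggers still attached to the target table. A sketch only: the table name tabledest is assumed from the load command above, and $PGURI stands for the same connection string pgloader uses.

# Sketch: verify the target table really has no indexes or triggers
psql "$PGURI" -c "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'tabledest';"
psql "$PGURI" -c "SELECT tgname FROM pg_trigger WHERE tgrelid = 'tabledest'::regclass AND NOT tgisinternal;"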

tigerfoot · Feb 21 '21 17:02

Hi @tigerfoot, I have also tried with these settings: SET client_encoding to 'utf8', work_mem to '1000MB', maintenance_work_mem to '1600MB', but there is no improvement; it is still slow. I have no indexes on the table and no constraints. I have also run a test to measure network latency:

time nc -zw30 ip port

real    0m0.010s
user    0m0.008s
sys     0m0.002s
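
Since nc -z only times the TCP connect, I could also time a trivial query to see the full per-statement round trip, including authentication, roughly like this (connection string as in the load command):

# Sketch: time connection plus one trivial statement over the same private network
time psql "postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}" -c "SELECT 1;"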

Do you have any advice?

esoni · Feb 22 '21 07:02

I have the same issue of slow import from a GCP VM (CentOS 7). The rate is 2.1 MB/s, so it takes 30 s for a 60 MB file. COPY from the same CSV file takes less than 3 seconds. Have you found a solution? I think there is a problem with CPU usage. Thanks for your help.
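
One way to check the CPU suspicion is to watch the pgloader process while a load is running, for example with a rough sketch like this (it assumes pgloader is the only matching process, and pidstat needs the sysstat package):

# Sketch: sample pgloader's CPU usage every 5 seconds during a load
pidstat -u -p "$(pgrep -f pgloader | head -n 1)" 5
# or, without sysstat installed:
top -b -n 3 -p "$(pgrep -f pgloader | head -n 1)"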

jordanebordet · Sep 30 '21 19:09