
Performance: slow CSV uploads

esoni88 opened this issue 4 years ago · 4 comments

Hi, I have to import a big CSV into Postgres. The file has only 5 columns but a lot of rows; it is around 20 GB. The pgloader import into PostgreSQL takes 4h38m for the 20 GB file. These are my pgloader settings for the import:

LOAD CSV
     FROM {{FILEPATH}}
     HAVING FIELDS
      (
         NUM_TELEFONO,
         CODICE_CIC,
         DESCRIZIONE_CIC,
         DATA_ATTIVAZIONE_COMPONENTE [date format 'YYYY-MM-DD HH24:MI:SS'],
         DATA_CESSAZIONE_COMPONENTE [date format 'YYYY-MM-DD HH24:MI:SS']
      )     
     INTO postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}
     TARGET TABLE TABLEDEST
        (
          NUM_TELEFONO text using (or NUM_TELEFONO "ND"),
          CODICE_CIC text using (or CODICE_CIC "ND"),
          DESCRIZIONE_CIC text using (or DESCRIZIONE_CIC "ND"),
          DATA_ATTIVAZIONE_COMPONENTE,
          DATA_CESSAZIONE_COMPONENTE,
          MESE_RIFERIMENTO text using "XXXXX",
          YYYYMMDD text using "YYYY"
        )
     WITH 
          workers = 16, concurrency = 4,
          fields optionally enclosed by '"',
          batch rows=100000,
          batch size = 200 MB,
          batch concurrency = 3,
          fields terminated by ';'

      SET client_encoding to 'utf8',
          work_mem to '12MB',
          standard_conforming_strings to 'on';
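
For comparison, a raw \copy of the same file into a plain five-column staging table gives a lower bound to measure the pgloader run against. This is only a rough sketch: the staging table tabledest_staging is an assumption, not part of my actual setup, and it keeps the date columns as text since it only measures throughput.

# Sketch: create an assumed five-column text staging table, then time a raw COPY of the same CSV
psql "postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}" -c "
  CREATE TABLE IF NOT EXISTS tabledest_staging (
    num_telefono text, codice_cic text, descrizione_cic text,
    data_attivazione_componente text, data_cessazione_componente text);"

time psql "postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}" \
     -c "\copy tabledest_staging FROM '{{FILEPATH}}' WITH (FORMAT csv, DELIMITER ';')"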

Here are the main details of my environment:

  • the database is Cloud SQL on Google Cloud Platform: 8 vCPUs, 32 GB memory
  • the VM that launches pgloader is a Google Cloud Platform VM: 8 vCPUs, 32 GB memory
  • the VM and the database are inside the same VPC and communicate over the private network, not over the internet
  • the target table of the import has no indexes or triggers
  • bad records are extremely rare, so I can configure pgloader to use big batches in order to improve performance

My questions are:

  • can I improve the performance? So far the 20 GB CSV has taken 4h38m to import
  • do any of the settings configured above cause slowness?
  • bearing in mind that I will have a powerful VM and a powerful DB in production, which options can improve the performance of my import?
  • what is the behavior of the batch concurrency option? I can't find any note about it in the docs

Thanks a lot, regards

esoni88 · Jan 16 '21 19:01

I am curious, have you found better parameters since then?

I just tried importing an 80 MB CSV and I get around 30 seconds with pgloader vs 30 seconds using \copy. I know that when there are no errors in the processing \copy wins, but I did not expect it to be twice as fast.

Here are my settings:

LOAD CSV
     FROM '/mnt/d/Tickets/genned100K.csv' (seq,source_artefact ...)
     INTO postgresql://db?table (seq,source_artefact ...)
     WITH skip header = 1,
          fields terminated by ',',
          batch rows = 100000,
          batch size = 100 MB,
          workers = 8,
          concurrency = 8
      SET work_mem to '30MB',  
          maintenance_work_mem to '30MB',
          standard_conforming_strings to 'on'
   ;

8 vCPUs and 32 GB of RAM as well.

DonHaul · Feb 11 '21 16:02

@dario-frongillo do you really want to limit your PG session with work_mem set to '12MB' when you want to process 20 GB? :-) Also beware of any indexes on your tables that can slow down your import.
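
If in doubt, something like this lists any indexes and user triggers still attached to the target table. A sketch only: the table name tabledest is assumed from the load command above, and $PGURI stands for the same connection string pgloader uses.

# Sketch: verify the target table really has no indexes or triggers
psql "$PGURI" -c "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'tabledest';"
psql "$PGURI" -c "SELECT tgname FROM pg_trigger WHERE tgrelid = 'tabledest'::regclass AND NOT tgisinternal;"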

tigerfoot · Feb 21 '21 17:02

Hi @tigerfoot, I have also tried with these settings: SET client_encoding to 'utf8', work_mem to '1000MB', maintenance_work_mem to '1600MB', but there is no improvement; it is still slow. I have no indexes on the table and no constraints. I have also run a test to measure network latency:

time nc -zw30 ip port

real    0m0.010s
user    0m0.008s
sys     0m0.002s
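
Since nc -z only times the TCP connect, I could also time a trivial query to see the full per-statement round trip, including authentication, roughly like this (connection string as in the load command):

# Sketch: time connection plus one trivial statement over the same private network
time psql "postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}" -c "SELECT 1;"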

Do you have any advice?

esoni · Feb 22 '21 07:02

I have the same issue of slow import from a GCP VM (CentOS 7). The rate is 2.1 MB/s, so it takes 30 s for a 60 MB file. COPY from the same CSV file takes less than 3 seconds. Have you found a solution? I think there is a problem with CPU usage. Thanks for your help.
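
One way to check the CPU suspicion is to watch the pgloader process while a load is running, for example with a rough sketch like this (it assumes pgloader is the only matching process, and pidstat needs the sysstat package):

# Sketch: sample pgloader's CPU usage every 5 seconds during a load
pidstat -u -p "$(pgrep -f pgloader | head -n 1)" 5
# or, without sysstat installed:
top -b -n 3 -p "$(pgrep -f pgloader | head -n 1)"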

jordanebordet · Sep 30 '21 19:09