
Future as a pgloader extension for Foreign Data Wrapper

ppKrauss opened this issue on Jan 10, 2022 · 0 comments

FDW (Foreign Data Wrapper) is a built-in PostgreSQL solution for loading external files: Load and Transform, all in standard SQL. Perhaps pgloader's premise can be revisited, and a pgloader FDW extension can be developed as a "sister project".

This issue is a suggestion to "improve both the software and its user experience in general". Summarizing the three steps I followed to write the issue:

  1. What I need in the future is FDW. I do not need to "keep a separate file of rejected data".
    PS: but maybe a specialized PGLoader-FDW extension could do this as well.
  2. The result that I need: see the illustrative examples below.
  3. How the result is not what I expected: my dream for PostgreSQL is to use the standard SQL language in ETL specifications!

Rationale. The need: ETL operations loading "near Big Data" files on a good (cheaper) server with "low" disk space. Details:

  • complex ETL operations with minimal disk consumption;
  • no need for a full "AS IS copy"; use the original file as the "AS IS source";
  • standard SQL as the ETL language, with neither Unix commands nor exotic languages.

Illustrative examples:

ETL of CSV

A typical use, avoiding disk usage and expressing what you need in good SQL. CREATE FOREIGN TABLE tmp_tablename_fdw is a standard SQL command, so the generic ETL process will be something like

INSERT INTO tablename (x, y, z)
  SELECT f1(column1), f2(column2), f3(column3), f4(column1,column2,column3) -- the direct transforms
  FROM tmp_tablename_fdw  -- load without a full copy, without disk consumption
  WHERE my_conditions(column1,column2,column3)  -- the direct filter
;

Imagine a CSV with latitude-longitude columns (e.g. c1 and c2) and millions of lines, and the need to transform them into a geometry with ST_SetSRID(ST_MakePoint(c2,c1),4326), filtering by c2 < -71.104 AND c1 > 42.315.
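
For instance, a minimal sketch of that lat-long load, assuming PostGIS is installed and using hypothetical names (tmp_points_fdw for the foreign table over the CSV, places(geom) as the target table):

INSERT INTO places (geom)
  SELECT ST_SetSRID(ST_MakePoint(c2, c1), 4326)  -- lon/lat to a WGS84 point
  FROM tmp_points_fdw  -- file_fdw table exposing CSV columns c1 and c2
  WHERE c2 < -71.104 AND c1 > 42.315  -- filtered before anything is copied
;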

You need to prepare the FDW, but instead of a "static" tmp_tablename_fdw you can use a "dynamic" one, generated by an exec_fdw_csv(fdwName,columns,filename) function, whose parameters can be imagined as values for the {{placeholders}} in a kind of custom generator:

CREATE EXTENSION IF NOT EXISTS  file_fdw;
CREATE SERVER IF NOT EXISTS import FOREIGN DATA WRAPPER file_fdw;
DROP FOREIGN TABLE IF EXISTS tmp_{{fdwName}}_fdw;
CREATE FOREIGN TABLE tmp_{{fdwName}}_fdw(
  {{expand(columns)}}
) SERVER import OPTIONS ( filename '/{{path}}/{{filename}}', format 'csv');
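
A minimal PL/pgSQL sketch of such a generator, assuming the import server above already exists; the name and signature follow the imagined exec_fdw_csv(), with a p_path parameter standing in for the {{path}} placeholder:

CREATE or replace FUNCTION exec_fdw_csv(
  p_fdwname text,   -- becomes the tmp_{{fdwName}}_fdw table name
  p_columns text,   -- column list, e.g. 'c1 numeric, c2 numeric' (trusted input!)
  p_filename text,
  p_path text DEFAULT '/tmp/pg_io'
) RETURNS text AS $f$
DECLARE
  v_table text := format('tmp_%s_fdw', p_fdwname);
BEGIN
  EXECUTE format('DROP FOREIGN TABLE IF EXISTS %I', v_table);
  EXECUTE format(
    'CREATE FOREIGN TABLE %I (%s) SERVER import
       OPTIONS (filename %L, format %L)',
    v_table, p_columns, p_path ||'/'|| p_filename, 'csv'
  );
  RETURN v_table;  -- caller can now SELECT FROM the new foreign table
END
$f$ LANGUAGE plpgsql;

For example, SELECT exec_fdw_csv('points', 'c1 numeric, c2 numeric', 'latlon.csv'); would create tmp_points_fdw, ready for the INSERT ... SELECT above.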

ETL of JSON

This one needs no FDW to implement, though the FDW machinery could be used as an alternative.

Imagine a set of JSON files, for example myRawData1.json and myRawData2.json. They can be ingested without any extra copy, by standard SQL:

INSERT INTO tablename (fname, metadata, content)
 SELECT fname, meta, j  -- do any data transformation here
 FROM jsonb_read_files('myRawData%.json')
 -- WHERE any_condition_here
;

where the function jsonb_read_files() reads all files in a folder that match a filename mask:

CREATE or replace FUNCTION jsonb_read_files(
  p_flike text, p_fpath text DEFAULT '/tmp/pg_io'
) RETURNS TABLE (fid int, fname text, fmeta jsonb, j jsonb) AS $f$
  WITH t AS (
    SELECT (row_number() OVER ())::int AS id,
           f AS fname,
           p_fpath ||'/'|| f AS fpath  -- full path of each matching file
    FROM pg_ls_dir(p_fpath) t(f)
    WHERE f LIKE p_flike
  )
  SELECT id, fname,
         to_jsonb( pg_stat_file(fpath) ) || jsonb_build_object('fpath', p_fpath),
         pg_read_file(fpath)::jsonb
  FROM t
$f$ LANGUAGE SQL VOLATILE;  -- VOLATILE, not IMMUTABLE: it reads the filesystem
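
A hypothetical call, noting that pg_ls_dir() and pg_read_file() are restricted to superusers or members of the pg_read_server_files role:

SELECT fname, fmeta->>'size' AS bytes, j
FROM jsonb_read_files('myRawData%.json', '/tmp/pg_io');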

More gz dreams

All of the above, with the option to stream compressed content, reducing disk usage for the original files, which are typically gzip-compressed: file.json.gz and file.csv.gz. As noted in the Quickstart, this is already possible with pgloader streaming.
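
Today the closest thing with plain file_fdw is its program option (backed by COPY FROM PROGRAM, PostgreSQL 10+), which streams the decompressed CSV through a pipe, so the .gz file is never expanded on disk. It still leans on a shell command, which this proposal would prefer to avoid; names here are hypothetical:

CREATE FOREIGN TABLE tmp_points_gz_fdw (
  c1 numeric, c2 numeric
) SERVER import OPTIONS (
  program 'gzip -dc /tmp/pg_io/latlon.csv.gz',  -- decompress on the fly
  format 'csv'
);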

ppKrauss · Jan 10 '22, 14:01