pgloader
Future as pgloader extension for Foreign Data Wrapper
FDW (Foreign Data Wrapper) is a built-in PostgreSQL solution for loading external files... Load and Transform, all in standard SQL. Perhaps pgloader's premise can be reviewed, and a pgloader FDW extension can be developed as a "sister project".
This issue is a suggestion to "improve both the software and its user experience in general". Summarizing the 3 steps that I used to write the issue:
- what I will need in the future is FDW; I do not need to "keep a separate file of rejected data". PS: but, maybe, a specialized pgloader-FDW extension could do this as well.
- the result that I need: see the illustrative examples below.
- how the result is not what I expected: my dream for PostgreSQL is to use standard SQL as the ETL specification language!
Rationale. I need ETL operations that load "near Big Data" files, using a good (cheaper) server with "low" disk space. Details:
- complex ETL operations with minimal disk consumption;
- no need for a full "copy AS IS"; use the original file as the "AS IS source";
- use of standard SQL as the ETL language, neither Unix commands nor exotic languages.
Illustrative examples:
ETL of CSV
A typical use case, avoiding disk usage and expressing what you need in plain SQL. The CREATE FOREIGN TABLE tmp_tablename_fdw
is a standard SQL command, so the generic ETL process will be something like
INSERT INTO tablename (x, y, z)
SELECT f1(column1), f2(column2), f3(column3), f4(column1,column2,column3) -- the direct transforms
FROM tmp_tablename_fdw -- Load without full-copy, without disk consumption
WHERE my_conditions(column1,column2,column3) -- the direct filter
;
Imagine a CSV with latitude-longitude columns (e.g. c1 and c2) and millions of lines, and the need to transform them into a geometry by ST_SetSRID(ST_MakePoint(c2,c1),4326), filtering by c2 < -71.104 AND c1 > 42.315.
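Putting the pieces together, this latitude-longitude case could look as follows. This is only a sketch: the target table gpoints, the foreign-table name tmp_latlng_fdw and the file path are hypothetical, and PostGIS is assumed for the geometry functions:

```sql
-- Sketch only: assumes PostGIS and a target table gpoints(geom geometry(Point,4326)).
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER IF NOT EXISTS import FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE tmp_latlng_fdw (
  c1 numeric,  -- latitude
  c2 numeric   -- longitude
) SERVER import OPTIONS (filename '/tmp/pg_io/points.csv', format 'csv');

INSERT INTO gpoints (geom)
SELECT ST_SetSRID(ST_MakePoint(c2, c1), 4326)  -- the direct transform
FROM tmp_latlng_fdw                            -- load without full-copy
WHERE c2 < -71.104 AND c1 > 42.315;            -- the direct filter
```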
You need to prepare the FDW, but instead of a "static" tmp_tablename_fdw you can use a "dynamic" one, generated by an exec_fdw_csv(fdwName, columns, filename)
function, whose parameters can be imagined as values for {{placeholders}}
in a kind of custom generator:
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER IF NOT EXISTS import FOREIGN DATA WRAPPER file_fdw;
DROP FOREIGN TABLE IF EXISTS tmp_{{fdwName}}_fdw;
CREATE FOREIGN TABLE tmp_{{fdwName}}_fdw(
{{expand(columns)}}
) SERVER import OPTIONS ( filename '/{{path}}/{{filename}}', format 'csv');
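A minimal sketch of such a generator as a PL/pgSQL function. The name exec_fdw_csv comes from the text above, but the implementation here is an assumption, using format() for the placeholder expansion:

```sql
-- Hypothetical sketch of the exec_fdw_csv generator described above.
CREATE OR REPLACE FUNCTION exec_fdw_csv(
  p_fdwname  text,  -- expands {{fdwName}}
  p_columns  text,  -- expands {{expand(columns)}}, e.g. 'c1 numeric, c2 numeric'
  p_filename text,  -- expands {{filename}}
  p_path     text DEFAULT '/tmp/pg_io'  -- expands {{path}}
) RETURNS void AS $f$
BEGIN
  EXECUTE format('DROP FOREIGN TABLE IF EXISTS tmp_%I_fdw', p_fdwname);
  EXECUTE format(
    'CREATE FOREIGN TABLE tmp_%I_fdw (%s) SERVER import
       OPTIONS (filename %L, format %L)',
    p_fdwname, p_columns, p_path||'/'||p_filename, 'csv'
  );
END
$f$ LANGUAGE plpgsql;
```

Called as SELECT exec_fdw_csv('tablename', 'c1 numeric, c2 numeric', 'points.csv'); it would create tmp_tablename_fdw on the fly, assuming the import server already exists.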
ETL of JSON
No FDW is needed to implement this, but the FDW bureaucracy can be used as an alternative.
Imagine a set of JSON files, for example myRawData1.json
and myRawData2.json
. They can be ingested without any extra copy, by standard SQL:
INSERT INTO tablename (fname, metadata, content)
SELECT fname, meta, j -- do any data transformation here
FROM jsonb_read_files('myRawData%.json')
-- WHERE any_condition_here
;
where the function jsonb_read_files() reads all files of a folder that match a filename mask:
CREATE OR REPLACE FUNCTION jsonb_read_files(
  p_flike text, p_fpath text DEFAULT '/tmp/pg_io'
) RETURNS TABLE (fid int, fname text, fmeta jsonb, j jsonb) AS $f$
  WITH t AS (
    SELECT (row_number() OVER ())::int AS id,
           f AS fname,
           p_fpath ||'/'|| f AS f
    FROM pg_ls_dir(p_fpath) t(f)
    WHERE f LIKE p_flike
  )
  SELECT id, fname,
         to_jsonb( pg_stat_file(f) ) || jsonb_build_object('fpath', p_fpath),
         pg_read_file(f)::jsonb
  FROM t
$f$ LANGUAGE SQL VOLATILE;  -- VOLATILE, not IMMUTABLE: the function reads the filesystem
More gz dreams
All of this with the option to stream compressed content, to reduce disk usage for the original files, typically compressed by gzip: file.json.gz
and file.csv.gz
. As commented in the Quickstart, it is possible using pgloader streaming.
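For the CSV side, part of this dream is already possible with stock file_fdw: since PostgreSQL 10 it accepts a program option (the same mechanism as COPY FROM PROGRAM), so the foreign table can decompress on the fly. A sketch, with a hypothetical file /tmp/pg_io/points.csv.gz:

```sql
-- Sketch: stream-decompress a gzipped CSV through file_fdw.
-- The 'program' option runs on the database server and requires superuser
-- (or, from PostgreSQL 11, membership in the pg_execute_server_program role).
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER IF NOT EXISTS import FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE tmp_points_gz_fdw (
  c1 numeric,  -- latitude
  c2 numeric   -- longitude
) SERVER import OPTIONS (
  program 'gzip -dc /tmp/pg_io/points.csv.gz',
  format 'csv'
);
```

The gzipped file is decompressed as a stream each time the foreign table is scanned, so no uncompressed copy ever touches the disk.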