Checkbook
Greenplum database dependency in ETL process.
The "TL" (transform and load) part of the ETL process for loading data into Checkbook currently requires certain features of the Greenplum database. Greenplum is proprietary, but a free-for-noncommercial-use "community edition" is available as of this writing, so this dependency is not a showstopper. Removing it would still be a useful enhancement, because it would let Checkbook run entirely on open source software.
At one point I asked some Checkbook developers about this:
What are the Greenplum-specific parts of the ETL process (or of the "TL" part of that process, at least)? Is it just the "distributed by" clauses in the "create table" commands in the SQL files, for example in source/database/ETL/CREATE_NEW_DATABASE/NYCCheckbookETL_DDL.sql? Or does the dependency go deeper than that? (Postgres-XC supports a "distribute by" clause; one of the reasons I'm asking is to figure out if trying Postgres-XC as a substitute for Greenplum is worth a look.)
Tirupati Reddy answered that the ETL scripts use three Greenplum features that PostgreSQL does not have:
1. External tables, used for loading CSV data.
2. The "Distributed By" clause, used to distribute data across different nodes.
3. Columnar storage on big tables, used to make queries run faster when they filter on only some columns.
Tirupati said that changing the scripts to remove the need for (2) and (3) is probably not too hard, but (1) would take some time.
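To make items (2) and (3) concrete, here is a minimal Python sketch of one way the Greenplum-only clauses could be stripped from a "create table" statement so that plain PostgreSQL would accept it. It is an illustration under stated assumptions, not the project's actual approach: the table `demo_payments` and the function `strip_greenplum_clauses` are hypothetical, and the real DDL in NYCCheckbookETL_DDL.sql has not been checked against these patterns. The sketch assumes the clauses appear in the usual Greenplum forms `DISTRIBUTED BY (...)` / `DISTRIBUTED RANDOMLY` and `WITH (appendonly=true, orientation=column)`.

```python
import re

# Greenplum distribution clause: "DISTRIBUTED BY (col, ...)" or "DISTRIBUTED RANDOMLY".
DISTRIBUTED_RE = re.compile(
    r"\s*\bDISTRIBUTED\s+(?:BY\s*\([^)]*\)|RANDOMLY)",
    re.IGNORECASE,
)

# Greenplum storage clause: a "WITH (...)" list that mentions appendonly or
# orientation. Other WITH (...) option lists are deliberately left alone.
STORAGE_RE = re.compile(
    r"\s*\bWITH\s*\(\s*[^)]*\b(?:appendonly|orientation)\b[^)]*\)",
    re.IGNORECASE,
)


def strip_greenplum_clauses(ddl: str) -> str:
    """Return ddl with Greenplum-only distribution and storage clauses removed.

    Hypothetical helper: a regex pass like this is only a rough first cut and
    does not parse SQL, so unusual formatting or comments could defeat it.
    """
    ddl = DISTRIBUTED_RE.sub("", ddl)
    ddl = STORAGE_RE.sub("", ddl)
    return ddl


# Hypothetical table, written in Greenplum's CREATE TABLE syntax.
EXAMPLE_DDL = """CREATE TABLE demo_payments (
    payment_id bigint,
    amount numeric(16,2)
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (payment_id);"""

if __name__ == "__main__":
    print(strip_greenplum_clauses(EXAMPLE_DDL))
```

Dropping these clauses changes only how the data is stored and distributed, not the table's columns, which is consistent with Tirupati's estimate that (2) and (3) are the easier items. Item (1) is not covered by this sketch. PostgreSQL's `COPY` command can load CSV files, but whether it could stand in for the external tables in this ETL process is not something the discussion above establishes.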