gtfs-via-postgres icon indicating copy to clipboard operation
gtfs-via-postgres copied to clipboard

support for DuckDB or ClickHouse?

Open patricksheehan opened this issue 2 years ago • 10 comments

I think the arrivals_departures view is what I'm most interested in, but for the simple iOS app I'm building I'd like to avoid hosting a database. Would it be possible/reasonable to provide a SQLite port? FWIW I'm happy to work on this, I just haven't done such a port before.

patricksheehan avatar Dec 16 '22 05:12 patricksheehan

Thanks for asking; If SQLite can be supported in a reasonable way, I would like to support it! I assume that 99% can be ported easily, but there might be a few hard parts; I'm not sure though, because I don't know SQLite well enough to judge without looking into it. Ideally, I would like to avoid a class port (a fork that implements SQLite, while this repo only implements PostgreSQL), because keeping the two synchronised is labor-intensive and error-prone.

I would propose that you try porting it in a draft PR and leave comments on all sections that are unclear or unfinished. We can then think about how to implement a switch that

In the worst case, if the SQLite code path ends up being too different from the PostgreSQL one, we'll have to resort to a true fork.

derhuerst avatar Dec 16 '22 11:12 derhuerst

Makes sense! For now I'm going to pursue some other routes (Transitland API, working directly with a SQLite implementation) and will circle back if this seems like the quickest path :)

patricksheehan avatar Dec 19 '22 20:12 patricksheehan

duckdb could also be an interesting option and it looks like there's node bindings. Maybe an ORM or something like dbt to help with the cross-database compatibility?

atvaccaro avatar Feb 08 '23 15:02 atvaccaro

duckdb could also be an interesting option [...].

Indeed, DuckDB claims to cover the GTFS analysis use case very well. One could analyze how it performs with the queries used as benchmarks, which reflect (my personal, happy to receive some feedback) use cases.

Maybe an ORM or something like dbt to help with the cross-database compatibility?

AFAIK usually ORMs assume that the process connects to the DB and sends queries "interactively", which is unnecessarily slow for our use case (inserting millions of rows known up front). Is it possible to generate SQL (dialects) using dbt?

derhuerst avatar Feb 08 '23 15:02 derhuerst

AFAIK usually ORMs assume that the process connects to the DB and sends queries "interactively", which is unnecessarily slow for our use case (inserting millions of rows known up front).

I think the main use case would be to generate different dialects for the views/tables on top of the raw data, for example using SQLAlchemy to make stuff like service_days cross-db. I think the data loading is trickier and more database-specific. duckdb will gladly create a table for you based on an input CSV so you don't need to actually define the table up front like Postgres.

Is it possible to generate SQL (dialects) using dbt?

Yes it's essentially just a framework for templating SQL via Jinja (and providing a dependency graph it can execute) and developers use macros to provide cross-database support. dbt will compile to the appropriate dialect depending on your chosen database target; I could imagine either SQLAlchemy or dbt being useful for creating the views/tables on top of the initial tables.

I think you've looked at our project some already but we use dbt as the "source of truth" for our BigQuery data warehouse models.

atvaccaro avatar Feb 08 '23 19:02 atvaccaro

I'm hesitant to use an ORM, even if it "just" generates SQL dialects instead of connecting to the DBs, for the following reasons:

  1. complexity – Mapping the GTFS semantic model, which is quite dynamic (e.g. location_type in stops.txt, calendar_dates.txt & calendar.txt, or the GTFS time semanstics), to a set of ORM models seems to be tricky. How we make sure to support all GTFS use cases while keeping the code maintainable?
  2. performance
    • Currently, importing a large GTFS feed already takes a few minutes, which is long enough to throw you out of your flow, and long enough to be annoying in CI runs. If an ORM would increase this time e.g. 3x (which is not too unreasonable, given that they are usually quite complex), that would make gtfs-via-postgres a signifcantly less useful.
    • The arguably more important aspect is "runtime performance", i.e. the DB's query planner's ability to optimise queries using views like arrivals_departures/connections.
  3. usability without ORMs – I would like gtfs-via-postgres to stay a tool that is useful to people analysing GTFS using plain SQL queries, and to programs written in any programming lanauge. We would have to make sure that the SQL(s) generated by the ORM is idiomatic even from a non-ORM perspective.

While these are strengths of gtfs-via-postgres that I have strong opinions about, it is an unproven assumption that it is impossible to retain them using ORMs, so I think it's best to make an experiment to try one of them and see how they behave. I'd be very happy to be proven wrong! I can't promise to merge (or otherwise make use of) that work though.

derhuerst avatar Feb 10 '23 11:02 derhuerst

I think you've looked at our project some already but we use dbt as the "source of truth" for our BigQuery data warehouse models.

Didn't have a look at the DB-related code yet. Will have a look.

derhuerst avatar Feb 10 '23 11:02 derhuerst

FYI: I'm currently looking into how well DuckDB fits as a local in-process database.

also related: http://duckdb.org/2023/08/23/even-friendlier-sql.html

derhuerst avatar Apr 10 '23 21:04 derhuerst

I'd also recommend to have a look at https://clickhouse.com/ - I'm running a 80M x 78 database (rows x columns) on another topic and the query time went down from 1 seconds to less than 0.01 ms - even without indexes.

tobwen avatar Jul 09 '23 12:07 tobwen

I think adopting any other database as a second option is only worth it if it provides significant operational (e.g. SQLite/DuckDB with local in-process execution and great portability) or performance (e.g. parallel execution beyond what PostgreSQL can do) benefits.

I briefly looked into both DuckDB as well as ClickHouse, and both look promising with regards to gtfs-via-postgres's goals.

But I think I won't have enough time/motivation to investigate the necessary changes to support any one. You're very welcome to do this and submit a WIP Pull Request!

Please note though that I won't necessarily merge it, depending on the increase in code complexity it brings, so consider this an experiment. This is also why, if you decide to investigate one of the DB systems, I encourage you to give frequent update on "roadblocks" (where you notice that significant changes in the SQL generated gtfs-via-postgres are necessary), so that I can let you know if I'm not feeling comfortable anymore maintaining the resulting complexity long-term.

derhuerst avatar Jul 11 '23 13:07 derhuerst