
Support additional formats (e.g. fst, feather, etc.?)

Open cboettig opened this issue 6 years ago • 13 comments

  • [x] arkdb currently hardwires write_tsv/read_tsv for I/O. This should be pluggable.

done, now supports readr and base utils table I/O

It would be nice to take advantage of the faster I/O of fst::read_fst / fst::write_fst, but it looks like fst does not support appending, which makes it impossible to stream in chunks.
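For reference, a rough illustration of the limitation (untested; mtcars is used just as an example table):

```r
library(fst)
# fst can read an arbitrary row range from an existing file...
write_fst(mtcars, "mtcars.fst")
chunk <- read_fst("mtcars.fst", from = 1, to = 10)
# ...but write_fst() has no append argument, so chunks cannot be streamed
# into a single .fst file the way write_tsv(append = TRUE) allows
```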

cboettig avatar Jun 27 '18 22:06 cboettig

Relevant issue for fst: https://github.com/fstpackage/fst/issues/9

richfitz avatar Jul 18 '18 08:07 richfitz

...and https://github.com/fstpackage/fst/issues/24 https://github.com/fstpackage/fst/issues/153 https://github.com/fstpackage/fst/issues/91

richfitz avatar Jul 18 '18 08:07 richfitz

Can also consider supporting non-DBI-based backends! e.g. LMDB; see the example in https://github.com/cboettig/taxalight/

cboettig avatar Sep 17 '20 05:09 cboettig

+arrow!

1beb avatar Sep 22 '21 20:09 1beb

@1beb can you clarify your use case here?

The arrow R package can already stream between many text-based formats (csv, tsv, json, etc) and 'native' binary formats (feather, parquet, etc) with minimal memory use. Are you trying to go from an arrow data format to a DBI-compliant database format or vice-versa?
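For example, something along these lines (just a sketch; the directory names are placeholders) converts csv to parquet out of core:

```r
library(arrow)
# sketch: stream a directory of csv files into a parquet dataset without
# loading everything into memory ("csv_dir" and "parquet_dir" are placeholders)
ds <- open_dataset("csv_dir", format = "csv")
write_dataset(ds, "parquet_dir", format = "parquet")
```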

cboettig avatar Sep 22 '21 23:09 cboettig

Database <> Parquet would be nice.


1beb avatar Sep 22 '21 23:09 1beb

Just a note: if you have parquet already, the most recent version of duckdb can give you a native, full SQL / DBI interface directly to the parquet files, without even requiring an 'import' step: https://duckdb.org/docs/data/parquet. (If you have data in a native duckdb database, it can also write it out to parquet for you.)
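For example (a minimal sketch; the file and table names are placeholders):

```r
library(DBI)
# query a parquet file in place with duckdb, no import step required
con <- dbConnect(duckdb::duckdb())
dbGetQuery(con, "SELECT count(*) FROM parquet_scan('flights.parquet')")
# the reverse direction also works: COPY a duckdb table out to parquet
# (assumes a table called some_table already exists in the database)
dbExecute(con, "COPY some_table TO 'some_table.parquet' (FORMAT 'parquet')")
dbDisconnect(con, shutdown = TRUE)
```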

If you have data stuck in some other database (postgres etc.) and want to export it to parquet, I hear you. Unfortunately, parquet is a format that doesn't take well to processing in chunks (I believe a bunch of metadata is always placed at the end of the parquet file, so you can't stream / append to it, as far as I understand).

cboettig avatar Sep 23 '21 00:09 cboettig

The latter is one of my use cases (db to parquet). My current strategy is breaking the data into variable-based chunks and then writing, but when you chunk, field types often get dropped, so you have to be explicit about them. It would be nice to be able to stream to chunk_X.parquet and to specify a row count. I'm trying to avoid writing offset/fetch SQL with my bare hands. I was thinking the ark function might work if the write method could be parameterized as a function argument, but that doesn't look like how it's designed. Another problem is that it targets an entire table, whereas at my scale I'd need to target subsets of a table partition to fit in RAM, and I didn't notice a way to pass a filter.

What's the target size of DB that this package is generally used for?

PS: Great pointer to duckdb, I'm using it for some production cases and it's really great with parquet!

1beb avatar Sep 23 '21 00:09 1beb

Looking at the code, I'm thinking that here: https://github.com/ropensci/arkdb/blob/master/R/ark.R#L224 you could probably add an optional write to chunked parquet. You'd need to do some accounting for chunk naming if it mattered to you. I have a need, so I'll take a swing at it; if I get it working I'll post it as a PR.

1beb avatar Sep 23 '21 00:09 1beb

@1beb sweet. yeah, definitely agree that a type-safe export format would be super nice, and parquet seems the logical choice.

In principle, you can ark to any output format for which you can define a custom streamable_table() function: https://github.com/ropensci/arkdb/blob/master/R/streamable_table.R. For .csv & friends, that's essentially just an append = TRUE. I suppose with parquet you could just write each chunk out as a separate parquet file, as in the sketch below? I'm not particular about the naming. Definitely let me know how it goes!
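Something like this, maybe (completely untested, and assuming the write callback receives the same (data, path, omit_header) arguments as the built-in streamers; the chunk-numbering scheme is just one possibility):

```r
# untested sketch: write each chunk to its own numbered parquet file
streamable_parquet <- function() {
  chunk <- 0
  read <- function(file, ...) arrow::read_parquet(file, ...)
  write <- function(x, path, omit_header) {
    chunk <<- chunk + 1
    # e.g. mytable.parquet -> mytable-1.parquet, mytable-2.parquet, ...
    arrow::write_parquet(x, sub("\\.parquet$", paste0("-", chunk, ".parquet"), path))
  }
  arkdb::streamable_table(read, write, "parquet")
}
```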

cboettig avatar Sep 23 '21 03:09 cboettig

Ah! I see it now. Great. This looks much more straightforward than I was expecting. Are you aware of a package or script that does a good job of maintaining reasonable defaults between typical SQL schema and R's available types? Read a table, look at the schema, convert to colClasses.

Also, a type-safe format isn't just nice, it's required. Parquet files in a folder can be read with arrow::open_dataset(path), and it complains if you have schema mismatches between your component files. A common example would be a logical vs. an integer, where part of your db pull had sparse, missing data and the next part had at least one value.

1beb avatar Sep 23 '21 05:09 1beb

:+1:

Are you aware of a package or script that does a good job of maintaining reasonable defaults between typical SQL schema and R's available types? Read a table, look at the schema, convert to colClasses.

Not entirely sure I follow this. It should be built into the database interface, right? e.g. DBI::dbReadTable() / DBI::dbWriteTable() etc. handle the appropriate types.
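For a quick check of the mapping (illustration only, using an in-memory SQLite connection just as an example backend):

```r
library(DBI)
# DBI maps R column types to the backend's SQL types for you
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbDataType(con, data.frame(x = 1L, y = 2.5, z = "a"))
# returns something like INTEGER / REAL / TEXT (exact names vary by backend)
dbDisconnect(con)
```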

Also a type-safe format isn't just nice it's required.

I totally hear you on this. Still, the UNIX philosophy is "plain text is the universal interface," and the permanent-archive folks are still all in on it, since binary formats which encode types come and go. Date/time encodings in particular have had a rough time of it. The classicist would say: there is no conflict between logical vs. integer if you parse it all as text. :hand: But I hear you, I hear you, I live in the real world too and we need types. I think if we stick with DBI <-> parquet this is all handled pretty well already, right?

cboettig avatar Sep 23 '21 16:09 cboettig

Note to self: https://cran.r-project.org/web/packages/DBI/vignettes/spec.html

1beb avatar Sep 24 '21 16:09 1beb