dm icon indicating copy to clipboard operation
dm copied to clipboard

Store `dm` as xlsx or collection of csv files (zip)

Open TSchiefer opened this issue 4 years ago • 9 comments

...and read it again.

  • xlsx: one table per sheet plus one extra sheet with the metadata
  • csv: one table per file plus one extra file with the metadata(?)

CC @christophsax

TSchiefer avatar Feb 12 '20 13:02 TSchiefer

The associated PR has been sitting around for way too long. I think I now know why I've been pushing back.

Can we lose the meta information, and infer it entirely from the data only? We would only support the following logic:

  • key column names must match in primary and foreign key tables
  • all primary key columns must start at the front, we infer how many columns we actually need for a primary key
  • we only link to primary keys that we can infer unambigously

This might work out of the box with our internal data, at least we'll see if there are differences. What do you think?

In general, should we offer a macro that moves all key columns to the front?

Priority undecided.

krlmlr avatar Jul 05 '22 07:07 krlmlr

Sounds alright in general. Some caveats:

  • for bigger datasets, determining how many columns are needed as PK could take a while longer
  • for some tables we might not want a PK at all and would need to drop them in retrospect
  • roundtrips would only work for those special dm's
  • colors

I am not sure if these don't outweigh the advantage of simplicity...?

A macro that moves all key columns to the front sounds useful!

TSchiefer avatar Jul 05 '22 07:07 TSchiefer

The point is that we should be able to make sense of a folder full of CSVs, or of an Excel file, that somebody drops at us. This will hardly contain metadata, even less likely in the format we appreciate.

Can we handle anything related to additional metadata with code that we generate?

krlmlr avatar Jul 05 '22 07:07 krlmlr

I see, that is indeed a common use case. There will of course be the naming rules that the datasets have to follow, but ensuring that could be an acceptable effort for the people providing the data.

OK, let's do it.

I am just wondering if there could still be use cases for the functionality that was implemented in #485. There could be 2 sets of functions; or just 1 that looks for the additional files with metainfo.

TSchiefer avatar Jul 05 '22 07:07 TSchiefer

Can we handle anything related to additional metadata with code that we generate?

I don't understand. Are we talking about a different scenario here than a folder full of CSVs, or of an Excel file, that somebody drops at us?

TSchiefer avatar Jul 05 '22 07:07 TSchiefer

If I were to build a dm from a folder of csvs I think I would like an option to be guided through the process, maybe for each table we would print it as a tibble (or View() to see more), and suggest PKs to accept or not, or propose to provide manually (in case of compound pk). Then once done for all tables propose fks based on names, print code dm(foo = read.csv("foo.csv"), ...) %>% dm_add... and suggest to use dm_add_pk() and dm_add_fk() on top of it if some relationships are missing. This way I don't have to juggle with objects and make only one pass on the data. It's slow only once because in the final code there's no guessing of keys.

moodymudskipper avatar Jul 05 '22 10:07 moodymudskipper

Sounds good to me.

For the inverse operation of writing a dm to an Excel file or csv-files, we could then at least warn/inform the user, in case the reading function would not suggest the same key configuration as the original dm and suggest code to rename/relocate things so it would work.

TSchiefer avatar Jul 05 '22 10:07 TSchiefer

Maybe we could still produce (and try to read) a metadata file or sheet, if it's not there when reading we fall back on (optionally guided) automation and can even offer an option to create it. This way we stay flexible but still provide faithful round trips.

moodymudskipper avatar Jul 05 '22 11:07 moodymudskipper

Still low priority, let's discuss when we have a use case.

krlmlr avatar Jul 05 '22 17:07 krlmlr

The now closed PRs contain working implementations, need to decide on the functionality and integrate.

krlmlr avatar Aug 20 '23 15:08 krlmlr