DataFrames.jl

What metadata should be

Open pdeffebach opened this issue 4 years ago • 7 comments

This post outlines, briefly, how I would like metadata to work in DataFrames.

  • Metadata is a property of a column name in a data frame, not of the vector itself. As a consequence, df.income is simply a Vector and, in general, has no metadata attached to it. Conceptually, think of metadata as an extension of column names. If I pass df.income to a function, that function only knows it receives a Vector and does not know it has the name :income.

Metadata should work the same way.

  • Metadata is persistent. copy(df) preserves metadata, as does filter etc. It is also persistent across joins. For instance, if df1 and df2 both have the column :id, then
df = leftjoin(df1, df2, on = :id)

will preserve metadata for all columns. The entry for metadata(df, :id) will be the same as metadata(df1, :id) because in a leftjoin the left data frame is thought of as the master data frame and the right one as the using data frame, in Stata-speak.

  • Metadata should be easy to access but an ecosystem should not rely on particular naming conventions for metadata. For instance, we should not write any functions guaranteeing that the metadata of a data frame includes the field label. Rather, if someone wants to graph df.income, they should do
histogram(df.income, title = metadata(df, "income")["label"])

or similar.

My ideal API for this is implemented, at least partly, in #1458. It includes the functions

  • metadata! for setting metadata via metadata!(df, :income, :label, "Personal Income")
  • metadata for getting metadata for an object, via metadata(df, :income, :label) == "Personal Income".

Notice, again, that these are handled at the level of the data frame and are agnostic about what the columns are: you could change the vector corresponding to df.income and the metadata would stay the same.
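A minimal sketch of this name-keyed design, using only Base Julia. It is not the code from #1458: the LabeledTable type and the setmeta!/getmeta helpers are hypothetical stand-ins for a data frame and for metadata!/metadata, chosen only to show that metadata follows the column name rather than the vector.

```julia
# Hypothetical stand-in for a data frame: columns and metadata are both keyed
# by column name, so metadata never lives on the vectors themselves.
struct LabeledTable
    columns::Dict{Symbol, Vector}
    meta::Dict{Symbol, Dict{Symbol, Any}}
end

LabeledTable() = LabeledTable(Dict{Symbol, Vector}(), Dict{Symbol, Dict{Symbol, Any}}())

# Set one metadata field for a column name (illustrative analogue of metadata!).
function setmeta!(t::LabeledTable, col::Symbol, field::Symbol, value)
    haskey(t.columns, col) || throw(ArgumentError("no column named $col"))
    get!(() -> Dict{Symbol, Any}(), t.meta, col)[field] = value
    return t
end

# Read one metadata field for a column name (illustrative analogue of metadata).
getmeta(t::LabeledTable, col::Symbol, field::Symbol) = t.meta[col][field]

t = LabeledTable()
t.columns[:income] = [100, 200, 300]
setmeta!(t, :income, :label, "Personal Income")

# Replacing the vector stored under :income leaves the metadata untouched.
t.columns[:income] = [1.0, 2.0, 3.0]
label = getmeta(t, :income, :label)
```

Because the lookup goes through the name, a function that only receives t.columns[:income] sees a plain Vector and nothing else.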

Here are @bkamins's points, with my responses:

  • the result of df.col not to have any metadata attached

True. Just like df.col has no name attached to it.

  • but on the other hand by doing df.col = some_new_value then the metadata should be kept

Yes. Metadata is attached to a name in a data frame.

  • given the two rules above I was not clear for example what you wanted to happen in the following cases:

    • df.col2 = df.col (I guess you do not want col2 to have any metadata)

Yes, df.col2 has no metadata.

  • if you then do select!(df, :col => :col2, :col2 => :col) - then still :col should have metadata and :col2 should not have metadata

Exactly, :col is a name in the data frame. If the user wants to transfer the metadata from one column to another they can do

# metadata and metadata! are the accessors proposed in #1458
select!(df, :col => :col2, :col2 => :col)
metadata!(df, :col, :label, metadata(df, :col2, :label))

Or something along those lines. Presumably we can overload getindex for cleaner syntax. In Stata this would be

replace col = col2
label var col "`var label `col2''" // i forget the escaping rules at the moment

I used that kind of workflow a lot working with survey data.

Jun 02 '20 18:06 pdeffebach

Thank you for the comment. So I understand you essentially want metadata to be a Symbol-value mapping that is a property of a data frame. I assume you want it to satisfy the following restrictions:

  1. it is not allowed to have a key that is not a valid column name. In particular:
    • If some column is removed from a data frame, then we make sure to remove the respective entry in metadata.
    • If an automatic column renaming is done then metadata should be updated accordingly (this happens e.g. in joins or hcat)
    • but normal (manual) column renaming should just check after the operation for column names retained and just make sure that metadata is consistent
  2. It is sticky (as long as a data frame is processed with a multi-column selector, it is retained)
  3. I was not clear what should happen with metadata with vcat, append! and push! (where in the first case two vcatted data frames can have conflicting metadata; while in the second and third case operations can add columns to a data frame and potentially can contain metadata)
  4. Also it was not clear to me what should happen in the extension to join we plan to do that would combine columns while joining (https://github.com/JuliaData/DataFrames.jl/issues/2243)
  5. Finally should df1 .+ df2 just drop all metadata, or what should coalesce.(df1, df2) do?
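Restriction 1 in the list above can be sketched as a consistency pass run after any operation that changes the set of columns. This is only an illustration in Base Julia: prune_metadata! is a hypothetical helper, and the Dict-of-Dicts layout is an assumption about how the metadata is stored.

```julia
# Hypothetical consistency check: drop metadata entries whose key is no
# longer a column name, so every remaining key is a valid column.
function prune_metadata!(colmeta::Dict{Symbol, Dict{Symbol, Any}}, colnames)
    keep = Set(colnames)
    filter!(entry -> first(entry) in keep, colmeta)
    return colmeta
end

colmeta = Dict{Symbol, Dict{Symbol, Any}}(
    :income => Dict{Symbol, Any}(:label => "Personal Income"),
    :age    => Dict{Symbol, Any}(:label => "Age in years"),
)

# Suppose :age was removed from the data frame and :id was added.
prune_metadata!(colmeta, [:income, :id])
remaining = collect(keys(colmeta))
```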

Jun 03 '20 06:06 bkamins

Thanks for detailing this. I'm curious why you explicitly don't want vectors to carry metadata. That wasn't my priority either, but if we can have it for free thanks to how it's implemented, why reject it? Since DataFrames own their columns that doesn't prevent us from having df.col2 = df.col retain metadata from col2 if we want.

Likewise, why wouldn't plotting methods use the :label metadata of the column by default when it is set? In your example histogram(df.income, title = metadata(df, "income")["label"]) would become histogram(df.income), which is much nicer (and something that doesn't work in R).

Another point that I wonder about is what should happen when renaming columns. Wouldn't it make sense to preserve the metadata in that case, since the column's contents haven't changed so presumably the description, unit, etc. still apply? Isn't Stata's behavior just due to incomplete support for labels? BTW, if you have a reference describing how Stata propagates labels across operations, that would be interesting.

Probably in most tricky cases @bkamins listed (4 and 5) we should just drop the metadata unless it's the same in the input columns being combined? When only one column has metadata, we could use that, but that could be a bit risky (e.g. you concatenate data frames with an :income column, the first having label "2019 income" and the second having no label for some reason but referring to 2018).
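The conservative rule suggested here (keep metadata only when the inputs agree) could look roughly like the following Base Julia sketch. The function agreed_metadata is hypothetical, and metadata values are assumed to be strings for simplicity.

```julia
# Hypothetical helper: merge the metadata of columns being combined,
# keeping a field only if every input defines it with the same value.
function agreed_metadata(metas::Dict{Symbol, String}...)
    isempty(metas) && return Dict{Symbol, String}()
    out = Dict{Symbol, String}()
    for (field, value) in first(metas)
        # A field missing from any input yields `nothing`, which never equals a String.
        if all(m -> get(m, field, nothing) == value, metas)
            out[field] = value
        end
    end
    return out
end

a = Dict(:label => "2019 income", :unit => "USD")
b = Dict(:unit => "USD")   # no label, as in the risky case described above
combined = agreed_metadata(a, b)
```

Here the label is dropped rather than silently applied to rows it may not describe, while the unit, which both inputs share, is kept.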

Jun 03 '20 07:06 nalimilan

@bkamins yes that is how I envision metadata. Functionally, it's just a Dict of Dicts where the keys of the first Dict are column names and the keys of the second Dict are metadata fields.

3. I was not clear what should happen with metadata with vcat, append! and push! (where in the first case two vcatted data frames can have conflicting metadata; while in the second and third case operations can add columns to a data frame and potentially can contain metadata)

vcat, append! and push! should maintain the metadata of the column, since the symbol -> col mapping is unchanged. For vcat(df1, df2), the metadata of the new data frame should be that of df1, since it is the "master" data frame.

5. Finally should df1 .+ df2 just drop all metadata, or what should coalesce.(df1, df2) do?

I think we should have a rule where the left data frame dominates. But I think that we can maintain the gist of my goal while dropping metadata in some small edge cases.

@nalimilan My aversion to column specific metadata is that I feel like metadata only really makes sense at the dataframe level. Labels, for example, need to be interpreted in the context of a data frame. Take, for example, my technique in Stata of writing a "history" of operations in notes. If you see a vector with the note field

A standardized index of 4 variables: :consumption, :income, :durable_assets, and :savings

How should one interpret that with a vector on its own? Perhaps there are use cases for metadata that exists without a table, but that seems complicated and would result in a lot of extraneous, not-useful, information floating around.

Likewise, why wouldn't plotting methods use the :label metadata of the column by default when it is set? In your example histogram(df.income, title = metadata(df, "income")["label"]) would become histogram(df.income), which is much nicer (and something that doesn't work in R).

It would be very nice, but then we would have to agree on a standard, and there would be a lot of hidden behavior. For instance, what if the user wants the :label field to be an x-axis label instead of a title? What if people don't like the word :label? Plus, we have StatsPlots, which has these helper functions for DataFrames. It would be easier to have more flexible behavior for working with metadata, and convenient syntax, defined there rather than having many packages work with the unenforced convention :label.

Unfortunately, I can't find a full spec for Stata's behavior. However, I just played around with it and confirmed that

  1. Labels are preserved upon renaming
  2. When you merge, i.e. leftjoin, the labels on the left data set are preserved, even after an update or replace command is used, the behavior described in #2243. The exception is that when the variable on the left has no label, it gets the one on the right.
  3. When you append (vcat in DataFrames), the top data frame's labels are always preserved. Even when a variable on the left data frame has no label, the bottom data frame's labels never get added.
  4. collapse in Stata (our combine) always destroys variable labels. I have found this very annoying in practice, and in an ideal world we would preserve labels upon collapse.
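Rules 2 and 3 in the list above can be written down compactly. The sketch below uses Base Julia only and models labels as a Dict from column name to label, where a missing key means "no label". The function names join_labels and append_labels are hypothetical, and this mirrors the observed Stata behavior rather than anything DataFrames.jl implements.

```julia
# Rule 2 (merge / leftjoin): left labels win, right labels only fill the gaps.
# `merge` lets later arguments override earlier ones, so `left` goes last.
join_labels(left::Dict{Symbol, String}, right::Dict{Symbol, String}) = merge(right, left)

# Rule 3 (append / vcat): only the top data set's labels survive.
append_labels(top::Dict{Symbol, String}, bottom::Dict{Symbol, String}) = copy(top)

left   = Dict(:id => "Household ID", :income => "2019 income")
right  = Dict(:id => "HH identifier", :savings => "Total savings")

joined   = join_labels(left, right)
appended = append_labels(left, right)
```

With these inputs, joined keeps the left label for :id and picks up the right label for :savings, while appended contains the left labels only.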

Jun 03 '20 13:06 pdeffebach

It would be very nice, but then we would have to agree on a standard, and there would be a lot of hidden behavior. For instance, what if the user wants the :label field to be an x-axis label instead of a title? What if people don't like the word :label?

Given the way you describe this PR, it seems like DataFrames.jl doesn't need to be opinionated about this. If plotting recipe authors want their recipes to look for default labels when calling histogram, that should be up to them. Personally, I think allowing defaults in plot recipes that users can override would be ideal, but that is not germane to the df API.

Another point that I wonder about is what should happen when renaming columns. Wouldn't it make sense to preserve the metadata in that case, since the columns contents haven't changed so presumably the description, unit, etc. still apply?

Strongly agree here, at least in the case of calling rename!(). In other cases, where renaming happens as part of some operation, I'm less certain.

I was not clear what should happen with metadata with vcat, append! and push! (where in the first case two vcatted data frames can have conflicting metadata; while in the second and third case operations can add columns to a data frame and potentially can contain metadata)

This is complicated, but I agree that left should dominate. With things like cols=:union, it seems that any columns that get added to the left should maintain their metadata, but overlapping columns should take theirs from the left. It seems more complicated (and it is arguable whether it's a good idea) to look for metadata that exists in overlapping columns from the right but not the left. In such a case, I think keeping the lack of metadata in the left df (maybe with a warning) and making the user do it explicitly would be better. Or, again, have an option.

Jun 04 '20 15:06 kescobo

Given https://arrow.juliadata.org/dev/manual.html#Table-and-column-metadata the question is if we should not go the "easy" way and just support Dict{String, String} metadata for DataFrame and have custom types to take care of column-level metadata? (note in particular that top-level Dict{String,String} would be still good enough to store column-value mappings if needed at top level, the only limitation would be that we would not do any checking of the contents of the metadata). This approach would be simplest and allow easy serialization and deserialization of DataFrame as Arrow object.

The only thing that would need to be added is extension of Tables.jl API to allow metadata passing.

@pdeffebach @nalimilan @quinnj : what do you think?

Nov 16 '20 07:11 bkamins

I haven't read all the previous discussions again, but requiring custom array types to add column metadata sounds problematic. I think we need a way to attach at least a long name or description to columns without changing their types. As we discussed at https://github.com/JuliaData/DataAPI.jl/issues/22, this can be achieved either by storing the column metadata in the data frame itself, or by keeping a global table associating the object ID with their metadata (which could then be accessed separately from the data frame if needed). Anyway both scenarios can be supported for Arrow serialization and deserialization: we just need a protocol in DataAPI.jl or Tables.jl which allows extracting the metadata and passing it to Arrow.

I also think we should anticipate allowing metadata other than strings in the future, even if we don't allow that immediately.

Nov 16 '20 08:11 nalimilan

OK. So I am moving the discussion to https://github.com/JuliaData/DataAPI.jl/issues/22 as it is more general.

Nov 16 '20 11:11 bkamins

Closed with #3055

Sep 20 '22 07:09 bkamins