DataFrames.jl
What metadata should be
This post outlines, briefly, how I would like metadata to work in DataFrames.
- Metadata is a property of a column name in a data frame, not of the vector itself. As a consequence, `df.income` is simply a `Vector`, and in general no metadata is attached to it. Conceptually, think of metadata as an extension of column names. If I pass `df.income` to a function, that function only knows it receives a `Vector` and does not know it has the name `:income`. Metadata should work the same way.
- Metadata is persistent. `copy(df)` preserves metadata, as does `filter`, etc. It is also persistent across `join`s. For instance, if `df1` and `df2` both have the column `:id`, then `df = leftjoin(df1, df2, on = :id)` will preserve metadata for all columns. The entry for `metadata(df, :id)` will be the same as `metadata(df1, :id)`, because in a `leftjoin` the left data frame is thought of as the `master` data frame and the right one as the `using` data frame, in Stata-speak.
- Metadata should be easy to access, but an ecosystem should not rely on particular naming conventions for metadata. For instance, we should not write any functions guaranteeing that the metadata of a data frame includes the field `label`. Rather, if someone wants to graph `df.income`, they should do `histogram(df.income, title = metadata(df, "income")["label"])` or similar.
My ideal API for this is implemented, at least partly, in #1458. It includes the functions
- `metadata!` for setting metadata, via `metadata!(df, :income, :label, "Personal Income")`
- `metadata` for getting metadata for an object, via `metadata(df, :income, :label) == "Personal Income"`.
Notice, again, that these are handled at the level of the data frame and are agnostic about what the columns are: you could change the vector corresponding to `df.income` and the metadata would be the same.
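To make the name-attached model concrete, here is a minimal Julia sketch. The names `setmeta!` and `getmeta` and the toy table `t` are hypothetical, invented for illustration. This is not the DataFrames.jl implementation or the API from #1458, only a toy that keeps metadata next to column names rather than on the vectors.

```julia
# Toy stand-in for a data frame (an assumption for illustration only):
# `columns` maps a column name to its vector,
# `meta` maps a column name to its metadata fields.
t = (
    columns = Dict{Symbol,Vector}(:income => [10.0, 20.0, 30.0]),
    meta    = Dict{Symbol,Dict{Symbol,Any}}(),
)

# Hypothetical setter: attach one metadata field to a column *name*.
function setmeta!(t, col::Symbol, field::Symbol, value)
    haskey(t.columns, col) || throw(ArgumentError("no column named $col"))
    get!(() -> Dict{Symbol,Any}(), t.meta, col)[field] = value
    return t
end

# Hypothetical getter: look up one metadata field by column name.
getmeta(t, col::Symbol, field::Symbol) = t.meta[col][field]

setmeta!(t, :income, :label, "Personal Income")

v = t.columns[:income]            # a plain Vector: no name, no metadata
t.columns[:income] = [1.0, 2.0]   # swap the vector behind the name
label = getmeta(t, :income, :label)   # the metadata follows the name
```

In this toy, `v` knows nothing about labels, and replacing the vector stored under `:income` leaves the label untouched, which is the behavior argued for above.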
Here are my replies to @bkamins's points:
> - the result of `df.col` not to have any metadata attached

True. Just like `df.col` has no name attached to it.

> - but on the other hand by doing `df.col = some_new_value` then the metadata should be kept

Yes. Metadata is attached to a name in a data frame.
> given the two rules above I was not clear for example what you wanted to happen in the following cases:
> - `df.col2 = df.col` (I guess you do not want `col2` to have any metadata)

Yes, `df.col2` has no metadata.

> - if you then do `select!(df, :col => :col2, :col2 => :col)`
> - then still `:col` should have metadata and `:col2` should not have metadata

Exactly, `:col` is a name in the data frame. If the user wants to transfer the metadata from one column to another they can do
@pipe df |>
select!(df, :col => :col2, :col2 => :col)
metadata!(df, :col, :label, metadata(df, :col2, :label))
Or something along those lines. Presumably we can overload `getindex` for cleaner syntax. In Stata this would be
replace col = col2
label var col "`var label `col2''" // i forget the escaping rules at the moment
I used that kind of workflow a lot working with survey data.
Thank you for the comment. So I understand you essentially want metadata to be a `Symbol`-value mapping that is a property of a data frame. I assume you want it to follow the following restrictions:
1. It is not allowed to have a key that is not a valid column name. In particular:
   - If some column is removed from a data frame, then we make sure to remove the respective entry in metadata.
   - If an automatic column renaming is done, then metadata should be updated accordingly (this happens e.g. in `join`s or `hcat`), but normal (manual) column renaming should just check after the operation for column names retained and just make sure that metadata is consistent.
2. It is sticky (as long as a data frame is processed with a multi-column selector it is retained).
3. I was not clear what should happen with metadata with `vcat`, `append!` and `push!` (where in the first case two `vcat`ted data frames can have conflicting metadata, while in the second and third case operations can add columns to a data frame and potentially can contain metadata).
4. Also it was not clear to me what should happen in the extension to join we plan to do that would combine columns while joining (https://github.com/JuliaData/DataFrames.jl/issues/2243).
5. Finally, should `df1 .+ df2` just drop all metadata, or what should `coalesce.(df1, df2)` do?
Thanks for detailing this. I'm curious why you explicitly don't want vectors to carry metadata. That wasn't my priority either, but if we can have it for free thanks to how it's implemented, why reject it? Since DataFrames own their columns, that doesn't prevent us from having `df.col2 = df.col` retain metadata from `col2` if we want.
Likewise, why wouldn't plotting methods use the `:label` metadata of the column by default when it is set? In your example, `histogram(df.income, title = metadata(df, "income")["label"])` would become `histogram(df.income)`, which is much nicer (and something that doesn't work in R).
Another point that I wonder about is what should happen when renaming columns. Wouldn't it make sense to preserve the metadata in that case, since the column contents haven't changed, so presumably the description, unit, etc. still apply? Isn't Stata's behavior just due to incomplete support for labels? BTW, if you have a reference describing how Stata propagates labels across operations, that would be interesting.
Probably in most tricky cases @bkamins listed (4 and 5) we should just drop the metadata unless it's the same in the input columns being combined? When only one column has metadata, we could use that, but that could be a bit risky (e.g. you concatenate data frames with an `:income` column, the first having label "2019 income" and the second having no label for some reason but referring to 2018).
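The conservative rule suggested here (keep a metadata field only when every input agrees on it) can be sketched in a few lines of Julia. `agree_meta` is a hypothetical helper name, and representing a column's metadata as a `Dict{Symbol,Any}` is an assumption made for the example, not something DataFrames.jl prescribes.

```julia
# Hypothetical helper: combine the metadata of two input columns,
# keeping a field only when both inputs define it with the same value.
function agree_meta(a::Dict{Symbol,Any}, b::Dict{Symbol,Any})
    out = Dict{Symbol,Any}()
    for (field, value) in a
        if haskey(b, field) && isequal(b[field], value)
            out[field] = value
        end
    end
    return out
end

# The :income example: the first input is labelled, the second is not.
top    = Dict{Symbol,Any}(:label => "2019 income", :unit => "USD")
bottom = Dict{Symbol,Any}(:unit => "USD")

combined = agree_meta(top, bottom)   # the label is dropped, the unit survives
```

Under this rule the misleading "2019 income" label is not silently applied to the 2018 rows, at the cost of losing a label the user may have wanted to keep.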
@bkamins yes, that is how I envision metadata. Functionally, it's just a `Dict` of `Dict`s where the keys of the first `Dict` are column names and the keys of the second `Dict` are metadata fields.
> 3. I was not clear what should happen with metadata with `vcat`, `append!` and `push!` (where in the first case two `vcat`ted data frames can have conflicting metadata, while in the second and third case operations can add columns to a data frame and potentially can contain metadata)

`vcat`, `append!` and `push!` should maintain the metadata of the column, since the `symbol -> col` mapping is unchanged. For `vcat(df1, df2)`, the metadata of the new data frame should be that of `df1`, since it is the "master" data frame.
> 5. Finally should `df1 .+ df2` just drop all metadata, or what should `coalesce.(df1, df2)` do?

I think we should have a rule where the left data frame dominates. But I think that we can maintain the gist of my goal while dropping metadata in some small edge cases.
@nalimilan My aversion to column specific metadata is that I feel like metadata only really makes sense at the dataframe level. Labels, for example, need to be interpreted in the context of a data frame. Take, for example, my technique in Stata of writing a "history" of operations in notes. If you see a vector with the `note` field
A standardized index of 4 variables: :consumption, :income, :durable_assets, and :savings
How should one interpret that with a vector on its own? Perhaps there are use cases for metadata that exists without a table, but that seems complicated and would result in a lot of extraneous, not-useful, information floating around.
> Likewise, why wouldn't plotting methods use the `:label` metadata of the column by default when it is set? In your example, `histogram(df.income, title = metadata(df, "income")["label"])` would become `histogram(df.income)`, which is much nicer (and something that doesn't work in R).

It would be very nice, but then we would have to agree on a standard and there is a lot of hidden behavior. For instance, what if the user wants the `:label` field to be an x-axis instead of a table? What if people don't like the word `:label`? Plus, we have `StatsPlots`, which has these helper functions with DataFrames. It would be easier to have more flexible behavior for working with metadata, and convenient syntax, defined there rather than having many packages work with the unenforced convention `:label`.
Unfortunately I can't find a full spec for Stata's behavior. However, I just played around with it and confirmed that
- Labels are preserved upon renaming.
- When you `merge`, i.e. `leftjoin`, the labels on the left data set are preserved, even after an `update` or `replace` command is used, the behavior described in #2243. The exception is that when the variable on the left has no label, it gets the one on the right.
- When you append (`vcat` in DataFrames), the top data frame's labels are always preserved. Even when a variable in the top data frame has no label, the bottom data frame's labels never get added.
- `collapse` in Stata, our `combine`, always destroys variable labels. I have found this very annoying in practice and in an ideal world we would preserve labels upon collapse.
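The merge and append behaviors observed above differ only in whether the second data set may fill in missing labels. A minimal Julia sketch of the two rules follows. `merge_labels` and `append_labels` are hypothetical names, and a label table is assumed to be a plain `Dict{Symbol,String}` from column name to label, with an unlabelled variable simply absent from the dictionary.

```julia
# Merge (leftjoin) rule: the left label wins, the right label fills gaps.
# `merge` gives precedence to its last argument, so `left` goes last.
merge_labels(left::Dict{Symbol,String}, right::Dict{Symbol,String}) =
    merge(right, left)

# Append (vcat) rule: only the top data set's labels survive.
append_labels(top::Dict{Symbol,String}, bottom::Dict{Symbol,String}) =
    copy(top)

# :income is unlabelled in the first data set but labelled in the second.
first_labels  = Dict{Symbol,String}(:id => "Household ID")
second_labels = Dict{Symbol,String}(:id => "ID", :income => "Income (2018)")

merged   = merge_labels(first_labels, second_labels)
appended = append_labels(first_labels, second_labels)
```

Here `merged` keeps "Household ID" for `:id` and picks up the `:income` label from the right, while `appended` has no `:income` label at all.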
> It would be very nice, but then we would have to agree on a standard and there is a lot of hidden behavior. For instance, what if the user wants the `:label` field to be an x-axis instead of a table? What if people don't like the word `:label`?

Given the way you describe this PR, it seems like `DataFrames.jl` doesn't need to be opinionated about this. If plotting recipe authors want to set default labels that they look for when calling `histogram`, that should be up to them. Personally, I think allowing defaults in plot recipes that users can override would be ideal, but that is not germane to the df API.
> Another point that I wonder about is what should happen when renaming columns. Wouldn't it make sense to preserve the metadata in that case, since the columns contents haven't changed so presumably the description, unit, etc. still apply?

Strongly agree here, at least in the case of calling `rename!()`. About other cases, where renaming happens as part of some operation, I'm less certain.
> I was not clear what should happen with metadata with `vcat`, `append!` and `push!` (where in the first case two `vcat`ted data frames can have conflicting metadata; while in the second and third case operations can add columns to a data frame and potentially can contain metadata)

This is complicated, but I agree that left should dominate. With things like `cols=:union`, it seems that any columns that get added to the left should maintain their metadata, but overlapping columns should take from the left. It seems more complicated (and arguable whether it's a good idea) to look for metadata that exists in overlapping columns from the right but not the left. In such a case, I think keeping the lack of metadata in the left df (maybe throwing a warning) and making the user do it explicitly would be better. Or, have an option again.
Given https://arrow.juliadata.org/dev/manual.html#Table-and-column-metadata the question is whether we should go the "easy" way and just support `Dict{String, String}` metadata for `DataFrame` and have custom types take care of column-level metadata. (Note in particular that a top-level `Dict{String,String}` would still be good enough to store column-value mappings if needed at top level; the only limitation would be that we would not do any checking of the contents of the metadata.) This approach would be simplest and would allow easy serialization and deserialization of a `DataFrame` as an Arrow object.
The only thing that would need to be added is extension of Tables.jl API to allow metadata passing.
@pdeffebach @nalimilan @quinnj : what do you think?
I haven't read all the previous discussions again, but requiring custom array types to add column metadata sounds problematic. I think we need a way to attach at least a long name or description to columns without changing their types. As we discussed at https://github.com/JuliaData/DataAPI.jl/issues/22, this can be achieved either by storing the column metadata in the data frame itself, or by keeping a global table associating the object ID with their metadata (which could then be accessed separately from the data frame if needed). Anyway both scenarios can be supported for Arrow serialization and deserialization: we just need a protocol in DataAPI.jl or Tables.jl which allows extracting the metadata and passing it to Arrow.
I also think we should anticipate allowing metadata other than strings in the future, even if we don't allow that immediately.
OK. So I am moving the discussion to https://github.com/JuliaData/DataAPI.jl/issues/22 as it is more general.
Closed with #3055