DataFrames.jl
Metadata for columns and/or DataFrames
Should we leave room for metadata on structures? Frank Harrell's Hmisc package allows units and labels to be attached to data.frame columns.
People may want to attach other metadata like experimenter name or a DataFrame comment.
We could add a meta Dict to the DataFrame, the colindex, and/or at the DataVec level.
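As a rough sketch of that idea (the `ColumnMeta` type and accessor names here are invented for illustration and are not part of DataFrames.jl), a free-form per-column metadata store could be as simple as a Dict keyed by column name:

```julia
# Hypothetical sketch of a per-column metadata store: column => (attribute => value).
# None of these names exist in DataFrames.jl; they only illustrate the shape of the idea.
struct ColumnMeta
    data::Dict{Symbol,Dict{Symbol,Any}}
end

ColumnMeta() = ColumnMeta(Dict{Symbol,Dict{Symbol,Any}}())

# Set an attribute for a column, creating the column's entry if needed.
setmeta!(m::ColumnMeta, col::Symbol, key::Symbol, value) =
    (get!(m.data, col, Dict{Symbol,Any}())[key] = value)

# Look up an attribute, falling back to a default when it is absent.
getmeta(m::ColumnMeta, col::Symbol, key::Symbol, default=nothing) =
    get(get(m.data, col, Dict{Symbol,Any}()), key, default)

m = ColumnMeta()
setmeta!(m, :gdpg, :label, "Annual GDP growth")
setmeta!(m, :gdpg, :unit, "%")
getmeta(m, :gdpg, :label)  # "Annual GDP growth"
```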
Yes, I think this could be useful. At the DataVec level, we will need meta-data for factor-like behavior (#6). And some of the other things you suggest make sense too. On the other hand, we probably want to rely less on arbitrary attributes, as R does, and more on types, when there's the possibility of doing so.
I would love to see this. I wrote about my wish for better support for things like questionnaires, with the code book integrated with the code, here (towards the bottom): http://reganmian.net/blog/2013/10/02/likert-graphs-in-r-embedding-metadata-for-easier-plotting/
Of course this also raises the issue of serialization.
If we can make this work without performance degradation, I'm in.
Standardizing on a few meta-data attributes like variable label and unit would be wonderful. In R, Harrell's Hmisc offers this feature, but unfortunately very few packages use it since it's not standard at all. In contrast, SAS has built-in support for variable labels, which are used e.g. to label tables and plot axes automatically. Stata also has this concept, and even allows associating longer "notes" with variables, to make their meaning explicit.
More specialized attributes like question names would be useful, if there was an easy way for a separate package to create and use them.
Adding units should be trivial, especially if Julia settles on a standard unit package soon. What are the variable labels for: descriptions of the columns to supplement the brief names?
Yeah, variable labels are just the readable, complete name of the variable, as opposed to the abbreviated form used for variable names, which is practical to type (no spaces, no special characters...) but often cryptic and ugly. The most typical use of variable labels is when you want to provide a good default for axis labels, like "Annual GDP growth" rather than "GDPG". They could also be useful to describe the contents of a database, with a function like Hmisc's describe() [1] or SAS's proc contents.
1: http://www.inside-r.org/packages/cran/Hmisc/docs/describe
Yes, I agree with all of this. A long, human-readable Name (which could be leveraged for axes labels by plotting routines), Units, and Level-of-measurement would be very helpful. Possibly also Domain.
LoM could be very handy for statistical modeling routines and the creation of appropriate model matrices (or the throwing of warnings). I never intended PooledDataVector to be equivalent to Factor -- it's a representational optimization. Would be much better for statistical routines to look for Nominal or Ordinal types and act appropriately, even if the underlying type is a non-pooled integer or string.
All great ideas: long name, LoM (for example, I'd love to indicate that something is a Likert item, which is more specific than just categorical), etc. Not sure what is meant by domain? Units are of course useful for measurements. An open-ended comment field would be great for code book stuff (how data is collected, coded, etc.) -- I could see some great ways of showing this, especially in the web view.
Not sure how this would fit in, but in my R code, I also have the concept of grouping columns - for example having five groups of questions.
Also curious about how we serialize this - we can't just spit this out into CSV again. What's DataFrame's "native" format for storing all this metadata? Ideally it would be something that was compatible with other tools as well. HDF5?
I think separating Likert scales from other categorical variables might be too specific for something as generic as DataFrames: what functions would apply to them that don't apply to other categorical variables?
I believe domain is meant in the math sense of "allowable, but not necessarily present, values for entries in this column".
We once had grouped columns, but they were dropped because they proved difficult to maintain. They need to be added back in, but it takes a good chunk of work to do.
Serialization is kind of a nightmare. I think HDF5 may work, but that's a question for people with more expertise than I have in our current serialization infrastructure.
I was thinking of LoM as user defined, for example I might want to graph likert-scales differently from a demographic categorical variable... But this isn't super-important.
You could actually do that already: you'd just make a `DataArray{LikertResponse}`, where `LikertResponse` is a custom type. This is one of the virtues of our approach to `NA`: you can create a `DataArray` for any type in Julia, not just those we've built into the system.
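A minimal sketch of that approach in present-day Julia, where `Vector{Union{T,Missing}}` plays the role the thread's `DataArray` did (`LikertResponse` is a made-up type, not anything shipped by DataFrames.jl):

```julia
# Hypothetical element type for a 5-point Likert item.
struct LikertResponse
    value::Int
    function LikertResponse(v::Int)
        1 <= v <= 5 || throw(ArgumentError("Likert response must be in 1:5"))
        new(v)
    end
end

# A column of Likert responses with possible missing answers, analogous
# to the DataArray{LikertResponse} mentioned above.
answers = Union{LikertResponse,Missing}[LikertResponse(4), missing, LikertResponse(2)]

count(!ismissing, answers)  # 2
```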
I think serialization becomes an important issue. Many of these things can probably be done already by subclassing DataFrame etc. (and whether it's better to extend DataFrame or subclass it becomes a design question). However, the key question is how I can set up my data the way I want it (with full names, groups, etc.), and then store it for future analysis by other scripts...
If you add support for arbitrary meta-data attributes to DataFrames, it will be easy for separate packages to mark some columns as grouped using a group index. No need to hardcode support for every specific feature - just make it easy to extend.
What would arbitrary metadata consist of? A Dict called metadata that people can do anything with?
Sure, a Dict containing vectors with one value per column, or even just a DataFrame, since attributes would all have the same length. Only standard attributes would have a pre-specified type, others would be free. Of course setters and getters would make the whole process transparent.
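The "one value per column" layout might look like the following sketch (all names are invented here; real setters and getters would hide the index arithmetic):

```julia
# Sketch of the "one value per column" layout: each attribute is a vector
# aligned with the table's columns. Names are illustrative only.
cols = [:age, :gdpg]
meta = Dict{Symbol,Vector{Any}}(
    :label => Any["Age in years", "Annual GDP growth"],
    :unit  => Any["years", "%"],
)

# A getter that makes the lookup transparent:
label(col::Symbol) = meta[:label][findfirst(==(col), cols)]

label(:gdpg)  # "Annual GDP growth"
```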
If you're up for making a demo with that approach, it'd be nice to see. My instinct is that trying to avoid pre-specified types is going to make things slow, but I could be wrong.
I like the idea of metadata, but I'm worried that it complicates things, especially if applied to a DataFrame. As John said, a demo would be a great way to work things out. We once had a concept of column groupings that we eventually pulled out because it tended to complicate things. Trying out an implementation is the best way to judge the balance of additional complexity relative to its benefit.
Applying metadata to columns but embedding that data into the DataFrame structure has issues. For example, I may create a DataFrame column that points to a DataArray originally in a different DataFrame, like `df1["colX"] = df2["colY"]`. If `df2` had column labels or other metadata, it would be lost, because the DataArray `df2["colY"]` doesn't know about that. This type of column reuse is common in DataFrames.

It's easier to attach metadata to DataArrays or other column data. Then the metadata goes with the columns, and nothing really needs to change in the DataFrame structure.
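One way to sketch this column-attached variant is a thin wrapper that carries the metadata with the data, so the label survives being assigned into another table (`LabeledVector` is a toy type, not a real DataFrames.jl construct):

```julia
# Toy wrapper: a vector plus its own metadata, so the label travels
# wherever the column goes. Illustrative only.
struct LabeledVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{Symbol,Any}
end

Base.size(v::LabeledVector) = size(v.data)
Base.getindex(v::LabeledVector, i::Int) = v.data[i]

col = LabeledVector([1.2, 3.4], Dict{Symbol,Any}(:label => "Annual GDP growth"))

# Wherever the column is reused, its metadata comes along:
col.meta[:label]  # "Annual GDP growth"
```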
I've never really programmed in Julia yet, so I cannot promise anything...
Tom's point about storing meta-data directly in DataArrays sounds interesting for attributes that make sense when columns are taken in isolation (i.e. for label, unit...). It would not make much sense for column groupings, since a group index taken alone does not mean much. But that may not be an issue: if you take a column out of its original DataFrame, you know that you're breaking its grouping with other columns.
I kind of like this solution: it means the meta-data would be preserved when passing the DataArray directly to a function, which could happen in many cases.
Here's the metadata I'm on board with adding permanently:
- Nullable: Is this column a Vector or a DataVector? (Note that, if we make the changes described in a recent discussion regarding problems with PDAs never being able to capture all properties of categorical data, we'll only have Vector or DataVector going forward.)
- Column label/description: An arbitrary-length string describing the contents of that column in natural language.
Here's the metadata I like, but don't feel comfortable committing to just yet:
- Units of measurement: Saying whether a vector is measured in inches or feet or meters seems really awesome, but it seems like it might be done rarely enough that I'm not ready to commit to it just yet. Let's shoot for working this idea out for after the 0.3 release.
FWIW, I'm used to people storing a description of the levels of cryptic enums in the description field of column tables in RDBMS.
PDAs were originally intended to be a performance/memory optimization, not (just) a representation for categorical data. I missed the discussion of their limitations -- would you point me at that?
Agreed that PDAs were an optimization, but they've gotten used as factors.
I wrote about the limitations of PDAs after "an epiphany" described in https://github.com/JuliaStats/DataArrays.jl/issues/50.
Summary: R gets a lot of mileage out of storing information about factor levels in vectors, but that's because each subset (including singleton elements) retains information about the vector as a whole. Since Julia has proper scalars, factors need to be represented using a new scalar type, which will probably end up looking like Enums.
I don't get why we would need a `Nullable` attribute: shouldn't this be inferred from the type of the column vector (i.e. `Array` or `DataArray`)?
Starting with column labels and not supporting units is reasonable. The essential point is to make the system extendable so that new attributes can be added in the future (custom attributes too?).
Finally, there's the question of whether some meta-data should be stored in `DataArrays` directly. For factors, the levels will have to be. Conceptually, a variable label is also attached to the column rather than to the `DataFrame`. The problem is that standard `Arrays` do not support meta-data.
Regarding "Nullable", can't you just use `colwise` and extract that from the column type? Arrays can't have missing data and DataArrays can. Actually, Arrays could have missing data if the Array holds a type that can be an NA. In any case, you should still be able to tell by the type of `Array{T,N}` using `T`.
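In current Julia the same check is a one-liner on the element type (the thread's `Array`/`DataArray` split has since become `Vector{T}` vs `Vector{Union{T,Missing}}`; `isnullable` is an illustrative name, not an existing function):

```julia
# Determine from the element type whether a column can hold missing values.
isnullable(col::AbstractVector) = Missing <: eltype(col)

isnullable([1, 2, 3])                       # false
isnullable(Union{Int,Missing}[1, missing])  # true
```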
As far as what we store in metadata, maybe we can use a Dict for that to allow storing different fields, and standardize on a few common names.
Regarding where to store the metadata: in this thread above, I outlined adding metadata to the columns. That helps with `df[:newcol] = df2[:othercol]`. But what do you do with `df[:col] + 1`?

So, if we stick metadata in the DataFrame (or Index), we might need a structure that carries the data and metadata to handle the `df[:newcol] = df2[:othercol]` case. Or, we can just require the user to do `metadata(df)[:newcol] = metadata(df2)[:othercol]`.
We don't need to store a nullable attribute. We just need to expose that information through an interface. But it might be faster to check a BitVector than to check the type tag of each column. Let's worry about implementation later and focus on design first.
I'm not really ready to embrace custom attributes just yet, since it fragments the community if some people's DataFrames have properties that other DataFrames don't share. Let's think about whether we should have them later.
For factors, the levels of the factor will be stored in the type system, not in a DataArray.
I don't think we should store metadata in columns. AFAIK, RDBMS systems don't do that: these properties are attached to the specific table and don't come along for the ride with the values in that table. So I'd rather require the user to do `metadata(df)[:newcol] = metadata(df2)[:othercol]`.
Sounds good, John.
I'm not very familiar with database management systems, but it seems to me it would be convenient and completely logical to preserve column labels if you copy a column to another `DataFrame`, which is what `df[:newcol] = df2[:othercol]` is about. That said, a special function to copy a column could also be added if needed, which would handle this special case.
A more general issue I'm thinking about is that if meta-data is attached to the `DataFrame` and not the vector, then an (imaginary) call like `plot(df[:col1], df[:col2])` will not be able to access the column label to find meaningful default axis labels. An interface dedicated to `DataFrames` will have to be used, something like `plot(~ col1 + col2, df)`. This sounds fine to me (and even better than the first form), but it's worth checking that it would work in all common cases.
I think people should always specify their axes labels manually if they don't want defaults.
Of course, I don't deny that. I was talking about the impossibility for `plot(df[:col1], df[:col2])` to offer reasonable defaults when the user does not set them explicitly.
That's true. But we'll never offer anything as smooth as R's kind of defaults, where you have access to information about the calling context. I think people can get used to explicitness.
With column labels, we can actually offer something much more useful than R's defaults. In R, most of the time the default axis label is ugly or useless, e.g. `df[["datebrth"]]` or even `x[[3]]`. With `DataFrames` it could be `Date of birth` instead.
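A hypothetical plotting front end could then fall back on the label when present and the column name otherwise (the `labels` store and `axislabel` helper are invented for this sketch):

```julia
# Hypothetical default axis label: prefer a stored label, fall back to
# the column name. `labels` stands in for whatever metadata store is adopted.
labels = Dict(:datebrth => "Date of birth")
axislabel(col::Symbol) = get(labels, col, string(col))

axislabel(:datebrth)  # "Date of birth"
axislabel(:x3)        # "x3"
```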
We can use column labels for interfaces that take in DataFrames as arguments in the way that Gadfly does. For things that work with vectors, they should not assume labels will exist. Otherwise they're broken for normal Arrays.
Do we really want arbitrary metadata in the type? Seems like there are generally other ways to include metadata about your DataFrame w/o stuffing it into the type itself.
I think something like that would be useful, even if it's not the highest priority. How could you store metadata about columns without support in `DataFrame` itself?
I agree with the above. If the goal is to have easy plotting (automatic labels), and easy table creation, forcing a long list of packages to interact with a third labeling package would be far more difficult to maintain than incorporating metadata into dataframes.