StatsModels.jl

Disruptive Variable Names

Open Nosferican opened this issue 8 years ago • 7 comments

using Feather
using DataFrames
using StatsModels
df = Feather.read(joinpath(Pkg.dir("Feather"), "test/newdata/iris.feather"),
    use_mmap = false)
names(df)

The names are: Symbol[Symbol("Sepal.Length"), Symbol("Sepal.Width"), Symbol("Petal.Length"), Symbol("Petal.Width"), :Species]

Variable names containing . are parsed inside a formula as an Expr with ex.head == :(.) rather than as a Symbol, which causes errors in:

  1. dospecials
  2. Terms
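As a quick illustration of why the dot is a problem, the sketch below parses a dotted name outside of any formula machinery. It assumes Julia 1.x, where the parser is called Meta.parse (on 0.6 it was parse), and it does not touch StatsModels at all.

```julia
# Parse a dotted column name the way the Julia parser sees it.
ex = Meta.parse("Sepal.Length")

# The result is a field-access expression, not a Symbol.
println(typeof(ex))
println(ex.head)

# Converting the expression's printed form back to a Symbol recovers the column name.
name = Symbol(string(ex))
println(name)
```

The last step is the same idea as the Symbol(ex) call in the fix that follows.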

The following fix works:

function dospecials(ex::Expr)
    # A dotted name such as Sepal.Length arrives as an Expr with head :(.); collapse it to one Symbol.
    if ex.head == :(.) return Symbol(ex) end
    ...
end
function Terms(f::Formula)
    ...
    if haslhs
        unshift!(etrms, Any[Symbol(f.lhs)]) # Shouldn't typeof(etrms) == Vector{Symbol} ?
    end
end

Nosferican avatar Nov 13 '17 21:11 Nosferican

This fix only covers variable names that contain a dot. It might be wise to decide against supporting names that cannot be written as a :... literal, such as names containing StatsModels operators (e.g., Symbol("A*B")), Julia operators (e.g., Symbol("A.B")), spaces (e.g., Symbol("A B")), or edge cases such as Symbol(""). As long as this is documented, I think it would be fine. That approach would still allow _ as a separator for compound words.

Nosferican avatar Nov 13 '17 22:11 Nosferican

This seems like an issue with Feather rather than StatsModels. DataFrames provides an option to normalize column names to valid Julia identifiers upon reading. Feather should probably have an option like that as well.

ararslan avatar Nov 13 '17 22:11 ararslan

I verified that with CSV the default is to keep the variable names as they are in the original file. For example:

This is a var,Value
1,0

is imported by CSV.read with names Symbol[Symbol("This is a var"), :Value]

DataFrames deprecated readtable in favor of using I/O packages directly. Do you remember which function normalized column names? If it is still exported, it could be used, and that step could simply be documented.

Nosferican avatar Nov 13 '17 22:11 Nosferican

It wasn't a separate function but rather an argument to readtable in DataFrames. I'd be surprised if that functionality isn't available in CSV or Feather. cc @quinnj

ararslan avatar Nov 13 '17 22:11 ararslan

Neither CSV nor Feather seems to have it. I found it in DataFrames, in other/utils.jl. The function is identifier(s::AbstractString), which isn't exported. It might be worthwhile to add this to the documentation as a one-liner:

names!(df, Symbol.(DataFrames.identifier.(string.(names(df)))))
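For readers who would rather not rely on an unexported function, here is a rough sketch of what such a normalization step could look like using only Base Julia (1.x syntax). sanitize_name is a hypothetical helper, not DataFrames.identifier. Its rules (replace runs of non-alphanumeric characters with _, prefix names that start with a digit, fall back to :x for empty names) are assumptions chosen for illustration.

```julia
# Hypothetical helper: turn an arbitrary column name into a plain identifier-like Symbol.
function sanitize_name(name::AbstractString)
    # Trim surrounding whitespace, then replace each run of other characters with "_".
    cleaned = replace(strip(name), r"[^A-Za-z0-9_]+" => "_")
    # An empty name has no sensible identifier; use an arbitrary placeholder.
    isempty(cleaned) && return :x
    # Identifiers cannot start with a digit, so prefix an underscore.
    if isdigit(first(cleaned))
        cleaned = "_" * cleaned
    end
    return Symbol(cleaned)
end

raw_names = ["Sepal.Length", "This is a var", "A*B", "1abc", ""]
clean_names = sanitize_name.(raw_names)
println(clean_names)
```

Note that distinct raw names can collide after sanitizing (for example "A*B" and "A.B"), so a real implementation would also need to deduplicate.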

Nosferican avatar Nov 13 '17 22:11 Nosferican

Even if normalizing names by default is a good idea, it would be nice to support non-standard names if that's possible, as that makes generic code more robust.

nalimilan avatar Nov 14 '17 08:11 nalimilan

I believe that for handling non-normalized names the minimum requirements might be:

  1. Names can't have leading or trailing white spaces.
  2. Names must have a minimum character length of one (i.e., Symbol("") not allowed)
  3. Operators in a formula must have a leading and trailing space (i.e., @formula(y ~ a & b) rather than @formula(y ~ a&b))

Then Terms could potentially be used with colnames::Vector{Symbol} = names(df) in order to make a mapping and operate by column position rather than names.

I believe these requirements are not unreasonable. There might be some caveats. For example, a name such as Symbol("ID & Year") could be ambiguous for a parser, which would have to match greedily against the column names to parse the terms correctly in the off chance that names(df) == Symbol.(["ID", "Year", "ID & Year"]). This example might indicate a need to fall back on normalizing the names first and then mapping back to the originals.
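A minimal sketch of the "normalize first, then map back" idea, using only Base Julia (1.x syntax). The placeholder scheme (x1, x2, ...) and the variable names are assumptions for illustration; nothing here is part of StatsModels.

```julia
# Column names as they might come from a file, including an ambiguous one.
colnames = [Symbol("ID"), Symbol("Year"), Symbol("ID & Year")]

# Assign each column a safe placeholder based on its position.
placeholders = [Symbol("x", i) for i in eachindex(colnames)]

# Lookup tables: placeholder => original name, and original name => column position.
to_original = Dict(zip(placeholders, colnames))
position = Dict(name => i for (i, name) in enumerate(colnames))

println(to_original[:x3])
println(position[Symbol("ID & Year")])
```

A formula could then be written against the placeholders and translated back to the original names (or to column positions) before the model frame is built, which sidesteps the question of how to parse Symbol("ID & Year").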

Nosferican avatar Nov 14 '17 15:11 Nosferican