HDF5.jl
Opening Pytables/Pandas file
I am very interested in opening files created with Pandas/Pytables in Julia. I didn't see it mentioned anywhere that it was not supposed to work, so I tried. I can open the HDF5 file and see the contents, and read in a Pandas series, which works great.
However, when I try to read in the main table, I get the following:
julia> read(a["db"])
ERROR: no method hdf5_type_id(Type{FixedArray{Float64,(DimSize{17},)}})
in read at /home/stian/.julia/v0.3/HDF5/src/plain.jl:1240
in read at /home/stian/.julia/v0.3/HDF5/src/plain.jl:1060
in read at /home/stian/.julia/v0.3/HDF5/src/plain.jl:1048
in read at /home/stian/.julia/v0.3/HDF5/src/datafile.jl:45
The main table does indeed have 17 columns with floats. Ideally it would be possible to read these into a DataFrame... Am I doing something wrong? Is this supposed to be working, but there's a bug? Or is it not currently implemented (in which case I might play around with trying to get it to work)?
Thanks!
To my knowledge this hasn't been tried, but the goal is to get to the point where we can read any HDF5 file.
From where the error is occurring, this might be an easy fix or it could take some digging. This is a pretty long message (sorry), but most is background and the strategy (near the end) should be pretty simple.
Here's my suspicion of what's happening. It's reading an HDF5 Compound data type, perhaps corresponding to a row of the DataFrame. Compound types correspond to Julia immutables or C structs. In this case, one of the fields inside that compound data type is an array of 17 Float64s. In HDF5 parlance this is called an H5T_ARRAY type; these differ from more commonly-used arrays by having a fixed size (17 in this case).
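To make the layout concrete, here is a minimal sketch of a Julia type that mirrors such a compound row. The type name Row and its field names are hypothetical, and NTuple{17,Float64} is only a stand-in for the H5T_ARRAY member; current Julia spells the immutable keyword as struct.

```julia
# Hypothetical row layout: an Int64 index plus a fixed-size block of 17 floats.
# NTuple{17,Float64} stands in for the H5T_ARRAY member; all names are illustrative.
struct Row
    index::Int64
    values::NTuple{17,Float64}
end

# The struct is a plain bits type with a fixed, C-compatible layout.
@assert isbitstype(Row)
@assert sizeof(Row) == 8 + 17 * 8   # one Int64 plus 17 Float64 values

# Build one row and read the last element of the fixed-size block.
row = Row(1, ntuple(i -> Float64(i), 17))
println(row.values[17])
```

Because the type is a bits type, an array of Row values occupies one contiguous block of memory, which is what makes a byte-level reinterpretation of compound data plausible.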
Now some HDF5.jl background. Since Julia doesn't have a fixed-size array type, a FixedArray is just a "dummy type" internal to the HDF5 module that encapsulates the information about how the object should be represented. If you search for FixedArray in plain.jl, you'll find that when read they normally get loaded into a regular array. However, since in this case this is just one field of an H5T_COMPOUND type, that won't work; you'll need to read this in either as one field of an immutable or just as a set of bytes in a plain buffer.
HDF5.jl's support for H5T_COMPOUND objects is on the rudimentary side, but that may not be a bad thing here. What will happen is that your information will be returned as little more than an opaque buffer (an HDF5Compound object), but you could reinterpret it as an array of whatever immutable type you want, and from there convert to a DataFrame.
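The reinterpret step can be sketched in plain Julia without touching HDF5 at all. The Record type, its field names, and the byte buffer below are all assumptions made for illustration; in practice the bytes would come from the compound dataset, and the struct would have to match the on-disk layout exactly.

```julia
# Hypothetical struct matching an assumed compound layout:
# an Int64 index followed by a fixed-size block of two Float64 values.
struct Record
    index::Int64
    values::NTuple{2,Float64}
end

# Build a raw byte buffer that stands in for the opaque compound data.
records = [Record(i, (i + 0.5, 2.0i)) for i in 1:3]
buf = collect(reinterpret(UInt8, records))   # Vector{UInt8}, 3 * sizeof(Record) bytes

# Reinterpret the bytes as an array of the struct type.
decoded = reinterpret(Record, buf)

# Split the struct array into one vector per column.
columns = (
    index   = [r.index for r in decoded],
    value_1 = [r.values[1] for r in decoded],
    value_2 = [r.values[2] for r in decoded],
)
```

A NamedTuple of equal-length vectors like columns is a common column-table shape in Julia, so passing it to the DataFrame constructor should work if DataFrames.jl is installed; that last step is not exercised here.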
I'd guess that a great (and fairly easy) first step would be simply to define that missing version of hdf5_type_id. It's essentially the inverse of hdf5array, going from the Julia type to declaring an H5T_ARRAY with the proper information in it.
Found this issue via Google.
julia> h5 = h5open("example.h5","r")
julia> x = read(h5["/pandas/frame_df"])
Dict{String,Any} with 3 entries:
"meta" => Dict{String,Any}("values_block_2"=>Dict{String,Any}("meta"=>Dict{String,Any}("_i_table"=>Dict{String,Any}("values"=>Dict{String,Any}("mbounds"=>String[],"abounds"=>String[],"mr…
"_i_table" => Dict{String,Any}("index"=>Dict{String,Any}("mbounds"=>[512, 1536, 2560, 3584, 4608, 5632, 6656, 7680, 8704, 9728 … 178688, 179712, 180736, 181760, 182784, 183808, 184832, 185…
"table" => HDF5.HDF5Compound{4}[HDF5Compound{4}((0, [0.0, 0.0], [0, 0], Int8[0]), ("index", "values_block_0", "values_block_1", "values_block_2"), (Int64, FixedArray{Float64,(2,)}, FixedA…
julia> x = read(h5["/pandas/frame_df/table"])
346507-element Array{HDF5.HDF5Compound{4},1}:
HDF5.HDF5Compound{4}((0, [0.0, 0.0], [0, 0], Int8[0]), ("index", "values_block_0", "values_block_1", "values_block_2"), (Int64, HDF5.FixedArray{Float64,(2,)}, HDF5.FixedArray{Int64,(2,)}, HDF5.FixedArray{Int8,(1,)}))
[...]
julia> df = DataFrame(x);
julia> first(df,1)
1×3 DataFrame
│ Row │ data │ membername │ membertype │
│ │ Tuple… │ NTuple{4,String} │ NTuple{4,DataType} │
├─────┼──────────────────────────────┼─────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ (0, [0.0, 0.0], [0, 0], [0]) │ ("index", "values_block_0", "values_block_1", "values_block_2") │ (Int64, FixedArray{Float64,(2,)}, FixedArray{Int64,(2,)}, FixedArray{Int8,(1,)}) │
julia> names(attrs(h5["/pandas/frame_df"]))
16-element Array{String,1}:
"CLASS"
"TITLE"
"VERSION"
"data_columns"
"encoding"
"errors"
"index_cols"
"info"
"levels"
"metadata"
"nan_rep"
"non_index_axes"
"pandas_type"
"pandas_version"
"table_type"
"values_cols"
It seems the situation has improved! Everything can be read, at least. Should this now be a feature request on DataFrames.jl?
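As a rough illustration of what such a conversion would involve, the sketch below expands block-valued members into one column per element. The rows are hand-written stand-ins that only mimic the (index, values_block_0, values_block_1, values_block_2) tuples shown above, and flatten_blocks is a hypothetical helper, not part of HDF5.jl or DataFrames.jl. The real pandas column names are presumably recorded in the attributes listed above, which this sketch does not decode.

```julia
# Stand-ins for the member names and row data shown in the REPL output above.
membernames = ("index", "values_block_0", "values_block_1", "values_block_2")
rows = [
    (0, [0.0, 0.5], [1, 2], Int8[0]),
    (1, [1.0, 1.5], [3, 4], Int8[1]),
]

# Hypothetical helper: turn each scalar member into one column and each
# block member into one column per block element (suffix _1, _2, ...).
function flatten_blocks(membernames, rows)
    columns = Dict{String,Vector}()
    for (j, name) in enumerate(membernames)
        first_val = rows[1][j]
        if first_val isa AbstractVector
            for k in eachindex(first_val)
                columns["$(name)_$k"] = [row[j][k] for row in rows]
            end
        else
            columns[name] = [row[j] for row in rows]
        end
    end
    return columns
end

cols = flatten_blocks(membernames, rows)
```

A Dict does not preserve column order, so a real implementation would likely keep an ordered list of names alongside the data.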
Without the test files it's hard to know what is working and what isn't. Are we missing something on the HDF5.jl side here, or can I close this?