JuliaDB.jl
JuliaDB.jl copied to clipboard
Unable to Index Out-of-Core Table
I have ~8GB of BCI data that I just fed into JuliaDB and now I can't seem to get a look at it. I'm still very new to Julia so hopefully I'm not doing anything too stupid.
Julia Version 1.0.2 (2018-11-08)
Load CSV's into JuliaDB
using JuliaDB
loadtable(glob("*.csv", "/home/beboobop/Datasets/bci_data/train/csv/"), output="/home/beboobop/Datasets/bci_data/train/juliadb", chunks=10;);
Load JuliaDB data
tbl = load("/home/beboobop/Datasets/bci_data/train/juliadb");
Now I try to index the data to just look at it.
tbl[2]
Returns
MethodError: no method matching getindex(::JuliaDB.DNextTable{NamedTuple{(:Time, :Fp1, :Fp2, :AF7, :AF3, :AF4, :AF8, :F7, :F5, :F3, :F1, :Fz, :F2, :F4, :F6, :F8, :FT7, :FC5, :FC3, :FC1, :FCz, :FC2, :FC4, :FC6, :FT8, :T7, :C5, :C3, :C1, :Cz, :C2, :C4, :C6, :T8, :TP7, :CP5, :CP3, :CP1, :CPz, :CP2, :CP4, :CP6, :TP8, :P7, :P5, :P3, :P1, :Pz, :P2, :P4, :P6, :P8, :PO7, :POz, :P08, :O1, :O2, :EOG, :FeedBackEvent),Tuple{Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Int64}},Tuple{Int64}}, ::Int64)
Stacktrace:
[1] top-level scope at In[21]:1
For distributed tables, getindex (which gets called when you write tbl[2]) is not supported.
I think we do need a better way of taking a glance at a distributed table.
Would it be possible to define iteration on a distributed table? This way all the tools that only use iteration (like for example TableView) would work on a distributed table.
I don't know if this is feasible but what I am expecting to happen or would like to see is this Distributed Table act similar to a Dask DataFrame or Dask Array in Python.
Any update on this? I love JuliaDB but can't use it for most things because of this.
Not having row indexing renders this pretty much unusable for a lot of Pandas-type operations on large data - I was super, super excited to find this as a better alternative to Pandas and Dask for speedy, rowwise operations...until I couldn't do rowwise operations.
Not having row indexing renders this pretty much unusable for a lot of Pandas-type operations on large data - I was super, super excited to find this as a better alternative to Pandas and Dask for speedy, rowwise operations...until I couldn't do rowwise operations.
Note that, even though indexing is not supported, you can do row-wise operations using map or filter or using JuliaDBMeta macros.
Example:
julia> iris5 = table(iris, chunks = 5);
# row-wise operation parallelize out of the box
julia> @where iris5 :SepalLength == 4.9 && :Species == "setosa"
Distributed Table with 4 rows in 2 chunks:
SepalLength SepalWidth PetalLength PetalWidth Species
──────────────────────────────────────────────────────────
4.9 3.0 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
4.9 3.1 1.5 0.2 "setosa"
4.9 3.6 1.4 0.1 "setosa"
# call @applychunked to run function on all chunk: here I take
# first 5 entries of each chunk
julia> @applychunked iris5 _[1:5]
Distributed Table with 25 rows in 5 chunks:
SepalLength SepalWidth PetalLength PetalWidth Species
──────────────────────────────────────────────────────────────
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
4.8 3.1 1.6 0.2 "setosa"
5.4 3.4 1.5 0.4 "setosa"
5.2 4.1 1.5 0.1 "setosa"
5.5 4.2 1.4 0.2 "setosa"
4.9 3.1 1.5 0.2 "setosa"
5.0 2.0 3.5 1.0 "versicolor"
5.9 3.0 4.2 1.5 "versicolor"
6.0 2.2 4.0 1.0 "versicolor"
6.1 2.9 4.7 1.4 "versicolor"
5.6 2.9 3.6 1.3 "versicolor"
5.5 2.6 4.4 1.2 "versicolor"
6.1 3.0 4.6 1.4 "versicolor"
5.8 2.6 4.0 1.2 "versicolor"
5.0 2.3 3.3 1.0 "versicolor"
5.6 2.7 4.2 1.3 "versicolor"
6.9 3.2 5.7 2.3 "virginica"
5.6 2.8 4.9 2.0 "virginica"
7.7 2.8 6.7 2.0 "virginica"
6.3 2.7 4.9 1.8 "virginica"
6.7 3.3 5.7 2.1 "virginica"
@piever I love usingJuliaDBMeta as well - I should have said index-dependent row-wise operations like select(t,:col1)[4] = 16, like what I'm used to with df.loc[1,"col1"] = 16. The row-wise operations, especially with your macros, are excellent.
This comes up in constructing new columns based on more complex conditions, like:
dfA - a table of marketing responses
dbB - a table of sent marketing items
for all rows in dfA where customer ID responded,
dfBtemp = dfB rows that match that customer and where the contact date is before the response date
for each marketing response,
attribute that response to a sent marketing item with dfB.at[i,"response"] = 1
add the index of the marketing item to a list of successful responses
JuliaDB IMO is better than Pandas for these kinds of things because of piping and begin etc. but the lack of indexing holds me back from taking advantage of the excellent DIndexedTable in these scenarios. I'd love to see it happen in the future.
...that is unless I'm missing a major piece of how JuliaDB/JuliaDBMeta work, I'd be eager to learn more as I'm pretty new to it.