TypedTables.jl
Documentation and examples
This package needs a user guide.
There's some work-in-progress using Documenter.jl and github-pages now - you can find this with the "latest" documentation badge from the README.
Hey Andy,
I wrote the first two doc review parts to you in an email... but thought I'd continue here (for the syntax highlighting etc). I'm up to IO:
Input and output
- Why are headings Tables.jl and CSV.jl in italics?
Tables.jl
I'm not sure what information you're trying to convey in the Tables.jl section. Are you saying that Table and FlexTable integrate with it? It's not clear how this is related to IO until you get to the section on CSV.
Perhaps the discussion of how TypedTables relates to the Tables API should go into the Table types section?
CSV.jl
Now here's where it gets really practical. I'd suggest renaming this section something like "Reading delimited text files into a Table", and include an example with the actual data which can be pasted into the REPL, perhaps. For that you can put the string containing the delimited data inline:
using CSV, TypedTables
raw_data = IOBuffer("""
name,age
Alice,25
Bob,42
Charlie,37
""")
csvfile = CSV.File(raw_data, delim=',')
table = Table(csvfile)
BTW the following is an interesting read https://github.com/JuliaData/CSV.jl/issues/340
BTW, I'm continuing to love the simple composable design of this and related packages. At this stage it just needs good documentation to clearly show how nice it all is!
Thoughts on the next section
Basic data manipulation
Mapping rows of data
Using map
Extracting a column - perhaps too simplistic given that there's another much
better way to do this using t.name? I think you could probably cut this
example.
How about this example of adding a new column?
julia> map(row -> (row..., is_old=row.age > 40), t)
Table with 3 columns and 3 rows:
name age is_old
┌─────────────────────
1 │ Alice 25 false
2 │ Bob 42 true
3 │ Charlie 37 false
Generators
This is nice.
Generators and comprehensions also support filtering data and combining multiple datasets, which cover in Finding Data and Joining Data.
- covered
- Link to the Finding Data and Joining Data sections
Preselection
You're right, I couldn't see the point of getproperty(:name) out of context.
Finding data
Example can be expressed as
t[t.age .> 40]
I guess this is a frustration I'm having with a lot of the examples in the mapping and finding sections: it's good that they're simple, but on the other hand they're kinda unrealistically simple in the sense that you wouldn't express the code that way in practice. IMO the examples should be just complicated enough to show idiomatic use. Easy to say, I know.
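To make the comparison concrete, here is a minimal sketch, assuming TypedTables is installed and reusing the three-row data from the CSV example above (the variable names are just for illustration). It shows that the row-wise `filter` form and the column-wise mask form select the same rows:

```julia
using TypedTables

# Same toy data as in the CSV example, built directly from columns
t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])

# Row-wise predicate: each row is a NamedTuple
older_by_filter = filter(row -> row.age > 40, t)

# Column-wise mask: broadcast over the age column, then index the table
older_by_mask = t[t.age .> 40]

println(older_by_mask.name)
```

The mask form is shorter for a single-column condition; the `filter` form reads more naturally once the predicate involves several columns.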
Next section... one thing which strikes me here is that you're not really documenting TypedTables per se? But the Andyverse of data analysis? Which makes it kind of odd that the documentation is in TypedTables.jl.
Another thing which strikes me is that the grouping and joining sections seem quite polished. I especially enjoyed the grouping, which looks like it would address some of my frustrations having used the DataFrames groupby.
Grouping data
Spelling: Groupind in the index on the left
Using the group function
Ok, now I see why your curried version of getproperty is worthwhile. Perhaps you could link between the sections where getproperty is introduced/used. Actually, the curried getproperty should arguably be in Base.
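As a rough illustration of why the curried form pays off once grouping enters the picture, here is a sketch in plain Julia with no packages. Both `prop` and `group_rows` are hypothetical helpers written for this example; they are stand-ins for the package's curried getproperty and for the group function, not their actual implementations, and the `dept` column is made up.

```julia
# Hypothetical curried accessor: prop(:name) returns a function of a row
prop(name::Symbol) = row -> getproperty(row, name)

# Hypothetical, simplified grouping: collect rows into a Dict keyed by by(row)
function group_rows(by, rows)
    groups = Dict{Any, Vector{eltype(rows)}}()
    for row in rows
        # Create an empty group on first sight of a key, then append the row
        push!(get!(() -> eltype(rows)[], groups, by(row)), row)
    end
    return groups
end

rows = [
    (name = "Alice", dept = "eng", age = 25),
    (name = "Bob", dept = "ops", age = 42),
    (name = "Charlie", dept = "eng", age = 37),
]

all_names = map(prop(:name), rows)
by_dept = group_rows(prop(:dept), rows)
```

Passing `prop(:dept)` as the key function avoids writing `row -> row.dept` at every call site, which is where the curried form starts to feel worthwhile.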
Lazy grouping and Groupreduce
Very nice, you've got all the things!
Joining data
I do wonder whether product might better be named crossjoin. Not because I particularly like the latter name, but mainly because product is such a generic name, and crossjoin has better symmetry with innerjoin. Though product returns data which is naturally cartesian product shaped in contrast to leftjoin...
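For readers who haven't tried the two functions, a small sketch of the shape difference being discussed, assuming SplitApplyCombine is installed and using plain integer vectors as made-up data:

```julia
using SplitApplyCombine

left = [1, 2]
right = [1, 2, 3]

# product applies f to every pairing and keeps the 2-D cartesian shape
p = product(+, left, right)

# innerjoin keeps only the pairings whose keys match, as a flat collection
j = innerjoin(identity, identity, tuple, [1, 2, 3], [2, 3, 4])

println(size(p))
println(j)
```

The result of `product` is a matrix with one axis per input, which is the "cartesian product shaped" output mentioned above, whereas the join result has no such structure.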
Left-group-join
I thought the intro to this section could describe the operation itself rather than the analogy with SQL or LINQ (which I'm not very familiar with).
Acceleration indices
However, the second "magic" ingredient used by an RDBMS for performance are secondary "acceleration indices", which are pre-calculated views of the data.
I'm not sure the database people would agree; I get the impression that disk layout and caching are quite important ;-) Also, views can be built using indices, but they're not the same thing.
The user is free to write generic code to execute their query, and the presence of the acceleration index will only act to speed up [...]
I think it's worth making the point that this is also the power of indices in SQL: you can add them to speed up the execution, but they are a performance tool and the query stays the same. In the same way, in Julia the code which manipulates the arrays stays the same but things go faster. It's a great composable design.
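A minimal sketch of that point, assuming AcceleratedArrays is installed; the `query` function and the data are made up for illustration. The same generic code runs on the plain array and on the indexed one, and only the lookup strategy differs:

```julia
using AcceleratedArrays

ages = [25, 42, 37, 42]

# Attach a hash-based acceleration index; the element values are unchanged
indexed_ages = accelerate(ages, HashIndex)

# Generic query code that knows nothing about indices
query(a) = findall(isequal(42), a)

plain_result = query(ages)            # linear scan
indexed_result = query(indexed_ages)  # can use the hash index instead

println(plain_result)
```

This mirrors adding an index in SQL: the query text is untouched, and the index is purely a performance decision.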
Ok I think I've read through all the docs. Overall, great stuff! I want to start using these packages ASAP.
Random (likely useless) thought bubble — having just written that AcceleratedArrays is great because it decouples performance from query semantics — could we think of the types in Table in a similar way; extra data to improve performance, but which might be missing? Thus somehow folding FlexTable and Table together?
Haha - that's kind of interesting, actually. Until now I've been thinking of FlexTable as the slower Table. Should we define decellerate(t::Table) = FlexTable(t)? :smile:
And thank you very much Chris for the valuable feedback! (I now have to find the time to make some fixes).
one thing which strikes me here is that you're not really documenting TypedTables per se? But the Andyverse of data analysis? Which makes it kind of odd that the documentation is in TypedTables.jl.
Yes... well that is kind-of true. They were developed quite specifically to work together - a kind of "native" and "Julian" relational algebra interface. SplitApplyCombine deserves better documentation of its own (and I'd like to port it to Base).
Another thing which strikes me is that the grouping and joining sections seem quite polished. I especially enjoyed the grouping, which looks like it would address some of my frustrations having used the DataFrames groupby.
Thanks for the feedback! It's fair to say that SplitApplyCombine exists specifically because there is no Base.group, so yeah this is the bit which I definitely feel the most strongly about and have thought about the longest.
Should we define decellerate(t::Table) = FlexTable(t)?
Not quite what I was thinking :-) More like trying to define
const FlexTable{N} = Table{Placeholder, N, NamedTuple{<:Any, <:Tuple{Vararg{AbstractArray{<:Any,N}}}}}
where Placeholder might be Nothing or NamedTuple undecorated with column names and types. Or something. And trying to see if that can lead anywhere productive.
Regarding writing documentation, I liked this blog: https://www.divio.com/blog/documentation/
I feel like I should better factor my tutorials, explanations and how-tos. (At least the reference material is naturally docstrings in Julia).
:100: That's a really interesting article for framing the discussion around documentation. It's very interesting that they insist that these four types of documentation are really separate and should be written separately. In my mind, I suppose there were only two types: prose which has to function as all of tutorial, howto and explanation. And technical reference (docstrings).
In the language of the article, I'd say several sections of the TypedTables documentation had too much explanation. I think I made the same mistake with the Logging documentation which probably makes it read more like a design document than a practical guide. It was so much work! And yet people are still (understandably) confused about how to use it! Ack!
Yes, agreed. I already began a rewrite to create a much more focussed tutorial. Interestingly, this starkly highlighted a couple of the (known) missing features, so I'm looking into these as I go.
I'm not certain how to phrase the left-over design explanation without it being just a rant. Anyway; iterate, iterate, iterate...
Yeah the explanation of a design is hard to write and make useful. The abstract design arises from a bunch of concrete use cases and practical constraints... but writing those down without any organization leads to a pile. On the other hand, remove them and it feels like you're writing fluff without justification. Kind of like a rant, yes!
Maybe it would help to try to name the dimensions of the "use case space"? A given design satisfies the needs of a bunch of use cases, and so fills out some nontrivial volume in that space. At the boundaries of the volume are some particular extrema which the design only just satisfies... are these the use cases which matter and are worth discussing to keep things concrete?
On the other hand there are the design and performance spaces, which (looking at the literature) seem to be more standard concepts. But for software design the design space seems rather high dimensional, poorly defined and combinatoric rather than continuous. Probably like most real world design problems...
Oh, I'm sure some category theory will help us out. (Um. I have only the vaguest idea of what that paper is proposing.)
Oops, got the link wrong... here's the paper which talks about using category theory for Formal Design.
Thanks. Unfortunately, I haven't the time to look over something so... dense... at the moment ;)
Chris - there's a new "tutorial" section up now, and a basic API reference. The remainder of the docs still need refactoring. But I think I'm much happier with the tutorial - to me it now doesn't seem significantly worse than getting started guides for DataFrames.jl, Python Pandas, R's Data.Table, etc.
Having _.name syntax in Julia 1.1 instead of this package using getproperty(:name) would make it nicer (I had to add an "explanation" to the "tutorial", shudder).
I do wonder whether _.name will get into 1.1. The issue of binding tightness is really thorny. Reading back on the issue, I'm rather dissatisfied with tight binding and Stefan's counter proposal seems better but is quite complicated and lacks an implementation.
I agree. It all seems thorny enough to sink it (or at least delay it significantly).
Well, you totally nerd sniped me with the underscores business... Now there's MagicUnderscores.jl. You're the first to see it :-P