This package needs a user guide.

Aug 12 '18 19:08 andyferris

There's some work-in-progress using Documenter.jl and github-pages now - you can find this with the "latest" documentation badge from the README.

Oct 16 '18 05:10 andyferris

Hey Andy,

I wrote the first two doc review parts to you in an email... but thought I'd continue here (for the syntax highlighting etc). I'm up to IO:

Input and output

Why are headings Tables.jl and CSV.jl in italics?

Tables.jl

I'm not sure what information you're trying to convey in the Tables.jl section. Are you saying that Table and FlexTable integrate with it? It's not clear how this is related to IO until you get to the section on CSV.

Perhaps the discussion of how TypedTables relates to the Tables API should go into the Table types section?

CSV.jl

Now here's where it gets really practical. I'd suggest renaming this section something like "Reading delimited text files into a Table", and include an example with the actual data which can be pasted into the REPL, perhaps. For that you can put the string containing the delimited data inline:

using CSV, TypedTables

raw_data = IOBuffer("""
name,age
Alice,25
Bob,42
Charlie,37
""")

csvfile = CSV.File(raw_data, delim=',')

table = Table(csvfile)

BTW the following is an interesting read https://github.com/JuliaData/CSV.jl/issues/340

Oct 27 '18 02:10 c42f

BTW, I'm continuing to love the simple composable design of this and related packages. At this stage it just needs good documentation to clearly show how nice it all is!

Oct 27 '18 02:10 c42f

Thoughts on the next section

Basic data manipulation

Mapping rows of data

Using map

Extracting a column - perhaps too simplistic given that there's another much better way to do this using t.name? I think you could probably cut this example.

How about this example of adding a new column?

julia> map(row -> (row..., is_old=row.age > 40), t)
Table with 3 columns and 3 rows:
     name     age  is_old
   ┌─────────────────────
 1 │ Alice    25   false
 2 │ Bob      42   true
 3 │ Charlie  37   false
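
For readers unfamiliar with the `(row..., is_old=...)` form, here is a minimal sketch using only Base Julia. It assumes each row behaves like a NamedTuple; the `rows` vector is a stand-in for the table `t` and is not the TypedTables API.

```julia
# Stand-in for the rows of `t`: a plain Vector of NamedTuples (assumption for illustration).
rows = [(name = "Alice", age = 25), (name = "Bob", age = 42), (name = "Charlie", age = 37)]

# Splatting a NamedTuple into a new NamedTuple literal keeps its fields and appends `is_old`.
with_flag = map(row -> (row..., is_old = row.age > 40), rows)

# `merge` expresses the same thing without splatting.
with_flag_merge = map(row -> merge(row, (is_old = row.age > 40,)), rows)
```

Either spelling produces rows with three fields, which is why the result displays as a three-column table.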

Generators

This is nice.

Generators and comprehensions also support filtering data and combining multiple datasets, which cover in Finding Data and Joining Data.

Typo: "cover" should be "covered".
Link to the Finding Data and Joining Data sections

Preselection

You're right, I couldn't see the point of getproperty(:name) out of context.

Finding data

Example can be expressed as

t[t.age .> 40]

I guess this is a frustration I'm having with a lot of the examples in the mapping and finding sections: it's good that they're simple, but on the other hand they're kinda unrealistically simple in the sense that you wouldn't express the code that way in practice. IMO the examples should be just complicated enough to show idiomatic use. Easy to say, I know.
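
To make the comparison concrete, here is a rough Base-only sketch of the two styles. The `rows` vector of NamedTuples is an assumed stand-in for the table, so this shows the idea rather than the TypedTables indexing implementation.

```julia
# Stand-in data (assumption for illustration).
rows = [(name = "Alice", age = 25), (name = "Bob", age = 42), (name = "Charlie", age = 37)]

# Row-wise predicate, in the style of the simple documentation examples.
older_filter = filter(row -> row.age > 40, rows)

# Column-wise boolean mask, mirroring `t[t.age .> 40]`.
ages = [row.age for row in rows]
older_mask = rows[ages .> 40]
```

Both select the same rows; the mask form is the one that reads as idiomatic once columns are available as properties.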

Nov 02 '18 20:11 c42f

Next section... one thing which strikes me here is that you're not really documenting TypedTables per se? But the Andyverse of data analysis? Which makes it kind of odd that the documentation is in TypedTables.jl.

Another thing which strikes me is that the grouping and joining sections seem quite polished. I especially enjoyed the grouping, which looks like it would address some of my frustrations having used the DataFrames groupby.

Grouping data

Spelling: Groupind in the index on the left

Using the group function

Ok, now I see why your curried version of getproperty is worthwhile. Perhaps you could link between the sections where getproperty is introduced/used. Actually, the curried getproperty should arguably be in Base.
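
A small sketch of what "curried" means here, using a hypothetical helper named `field` so that it does not claim to be the package's own definition:

```julia
# Hypothetical helper (not the package's definition): returns a function that
# extracts the property `name` from whatever it is applied to.
field(name::Symbol) = row -> getproperty(row, name)

# Stand-in data (assumption for illustration).
rows = [(name = "Alice", age = 25), (name = "Bob", age = 42), (name = "Charlie", age = 37)]

# The curried form slots in wherever a key function is expected.
all_names = map(field(:name), rows)
youngest_first = sort(rows, by = field(:age))
```

The value of the curried form is that it can be passed directly as a key function to higher-order functions such as `map`, `sort`, or a grouping function.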

Lazy grouping and Groupreduce

Very nice, you've got all the things!

Joining data

I do wonder whether product might better be named crossjoin. Not because I particularly like the latter name, but mainly because product is such a generic name, and crossjoin has better symmetry with innerjoin. Though product returns data which is naturally cartesian product shaped in contrast to leftjoin...
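
The "cartesian product shaped" point can be illustrated with a Base-only sketch. This uses a two-variable comprehension, not the `product` function under discussion, and the `left` and `right` collections are made up for the example.

```julia
# Made-up inputs (assumption for illustration).
left = [(id = 1,), (id = 2,)]
right = [(tag = "a",), (tag = "b",), (tag = "c",)]

# A two-variable comprehension yields a Matrix with one axis per input.
pairs_matrix = [merge(l, r) for l in left, r in right]

# Flattening gives the one-row-per-pair layout that a "join" name would suggest.
pairs_flat = vec(pairs_matrix)
```

The matrix keeps track of which input each axis came from, whereas the flattened vector looks like the output of the other join functions.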

Left-group-join

I thought the intro to this section could describe the operation itself rather than the analogy with SQL or LINQ (which I'm not very familiar with).
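
As one possible way to describe the operation directly, here is a Base-only sketch. It assumes a left-group-join means "for each left row, collect the matching right rows, keeping left rows that have no match". The data and field names are invented, and this is not the library's implementation.

```julia
# Invented data (assumption for illustration).
customers = [(id = 1, name = "Alice"), (id = 2, name = "Bob"), (id = 3, name = "Charlie")]
orders = [(customer = 1, item = "book"), (customer = 1, item = "pen"), (customer = 3, item = "lamp")]

# One output group per left row; a left row with no matches gets an empty group.
grouped = [
    (name = c.name, items = [o.item for o in orders if o.customer == c.id])
    for c in customers
]
```

Here Bob still appears in the output, with an empty group, which is what distinguishes this from an inner join.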

Acceleration indices

However, the second "magic" ingredient used by an RDBMS for performance are secondary "acceleration indices", which are pre-calculated views of the data.

I'm not sure the database people would agree, I get the impression that disk layout and caching are quite important ;-) Also, views can be built using indices, but they're not the same thing.

The user is free to write generic code to execute their query, and the presence of the acceleration index will only act to speed up [...]

I think it's worth making the point that this is also the power of indices in SQL: you can add them to speed up the execution, but they are a performance tool and the query stays the same. In the same way, in Julia the code which manipulates the arrays stays the same but things go faster. It's a great composable design.
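
A toy sketch of that decoupling, using only Base Julia. The `IndexedVector` type and `find_equal` function are hypothetical names made up for this example and are not the AcceleratedArrays API.

```julia
# Hypothetical wrapper: the data plus a precomputed value => positions lookup.
struct IndexedVector{T}
    data::Vector{T}
    index::Dict{T, Vector{Int}}
end

function IndexedVector(data::Vector{T}) where {T}
    index = Dict{T, Vector{Int}}()
    for (i, x) in enumerate(data)
        push!(get!(() -> Int[], index, x), i)
    end
    return IndexedVector{T}(data, index)
end

# The same query, dispatched to a linear scan or to the index lookup.
find_equal(xs::Vector, target) = findall(x -> x == target, xs)
find_equal(xs::IndexedVector, target) = get(xs.index, target, Int[])

ages = [25, 42, 37, 42]
same_answer = find_equal(ages, 42) == find_equal(IndexedVector(ages), 42)
```

The calling code is identical in both cases; only the type of the container decides whether the index is used.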

Ok I think I've read through all the docs. Overall, great stuff! I want to start using these packages ASAP.

Nov 02 '18 22:11 c42f

Random (likely useless) thought bubble — having just written that AcceleratedArrays is great because it decouples performance from query semantics — could we think of the types in Table in a similar way; extra data to improve performance, but which might be missing? Thus somehow folding FlexTable and Table together?

Nov 02 '18 22:11 c42f

Haha - that's kind of interesting, actually. Until now I've been thinking of FlexTable as the slower Table. Should we define decellerate(t::Table) = FlexTable(t)? :smile:

Nov 03 '18 09:11 andyferris

And thank you very much Chris for the valuable feedback! (I now have to find the time to make some fixes).

one thing which strikes me here is that you're not really documenting TypedTables per se? But the Andyverse of data analysis? Which makes it kind of odd that the documentation is in TypedTables.jl.

Yes... well that is kind-of true. They were developed quite specifically to work together - a kind of "native" and "Julian" relational algebra interface. SplitApplyCombine deserves better documentation of its own (and I'd like to port it to Base).

Another thing which strikes me is that the grouping and joining sections seem quite polished. I especially enjoyed the grouping, which looks like it would address some of my frustrations having used the DataFrames groupby.

Thanks for the feedback! It's fair to say that SplitApplyCombine exists specifically because there is no Base.group, so yeah this is the bit which I definitely feel the most strongly about and have thought about the longest.

Nov 03 '18 09:11 andyferris

Should we define decellerate(t::Table) = FlexTable(t)?

Not quite what I was thinking :-) More like trying to define

const FlexTable{N} = Table{Placeholder, N, NamedTuple{<:Any, <:Tuple{Vararg{AbstractArray{<:Any,N}}}}}

where Placeholder might be Nothing or NamedTuple undecorated with column names and types. Or something. And trying to see if that can lead anywhere productive.

Nov 03 '18 22:11 c42f

Regarding writing documentation, I liked this blog: https://www.divio.com/blog/documentation/

I feel like I should better factor my tutorials, explanations and how-tos. (At least the reference material is naturally docstrings in Julia).

Nov 12 '18 06:11 andyferris

:100: That's a really interesting article for framing the discussion around documentation. It's very interesting that they insist that these four types of documentation are really separate and should be written separately. In my mind, I suppose there were only two types: prose which has to function as all of tutorial, howto and explanation. And technical reference (docstrings).

In the language of the article, I'd say several sections of the TypedTables documentation had too much explanation. I think I made the same mistake with the Logging documentation which probably makes it read more like a design document than a practical guide. It was so much work! And yet people are still (understandably) confused about how to use it! Ack!

Nov 13 '18 03:11 c42f

Yes, agreed. I already began a rewrite to create a much more focussed tutorial. Interestingly, this starkly highlighted a couple of the (known) missing features, so I'm looking into these as I go.

I'm not certain how to phrase the left-over design explanation without it being just a rant. Anyway; iterate, iterate, iterate...

Nov 13 '18 04:11 andyferris

Yeah the explanation of a design is hard to write and make useful. The abstract design arises from a bunch of concrete use cases and practical constraints... but writing those down without any organization leads to a pile. On the other hand, remove them and it feels like you're writing fluff without justification. Kind of like a rant, yes!

Maybe it would help to try to name the dimensions of the "use case space"? A given design satisfies the needs of a bunch of use cases, and so fills out some nontrivial volume in that space. At the boundaries of the volume are some particular extrema which the design only just satisfies... are these the use cases which matter and are worth discussing to keep things concrete?

On the other hand there are the design space and the performance space, which (looking at the literature) seem to be more standard concepts. But for software design the design space seems rather high dimensional, poorly defined and combinatoric rather than continuous. Probably like most real world design problems...

Oh, I'm sure some category theory will help us out. (Um. I have only the vaguest idea of what that paper is proposing.)

Nov 13 '18 05:11 c42f

Oops, got the link wrong... here's the paper which talks about using category theory for Formal Design.

Nov 13 '18 10:11 c42f

Thanks. Unfortunately, I haven't the time to look over something so... dense... at the moment ;)

Nov 14 '18 01:11 andyferris

Chris - there's a new "tutorial" section up now, and a basic API reference. The remainder of the docs still need refactoring. But I think I'm much happier with the tutorial - to me it now doesn't seem significantly worse than getting started guides for DataFrames.jl, Python Pandas, R's Data.Table, etc.

Having _.name syntax in Julia 1.1 instead of this package using getproperty(:name) would make it nicer (I had to add an "explanation" to the "tutorial", shudder).

Nov 14 '18 13:11 andyferris

I do wonder whether _.name will get into 1.1. The issue of binding tightness is really thorny. Reading back on the issue, I'm rather dissatisfied with tight binding and Stefan's counter proposal seems better but is quite complicated and lacks an implementation.

Nov 15 '18 20:11 c42f

I agree. It all seems thorny enough to sink it (or at least delay it significantly).

Nov 15 '18 22:11 andyferris

Well, you totally nerd sniped me with the underscores business... Now there's MagicUnderscores.jl. You're the first to see it :-P

Nov 16 '18 12:11 c42f
