AbstractTrees.jl icon indicating copy to clipboard operation
AbstractTrees.jl copied to clipboard

Add an abstract type for `AbstractTrees` to `AbstractTrees.jl`

Open roland-KA opened this issue 2 years ago • 28 comments

I'm currently implementing a plot recipe for decision trees in order to be able to visualize all decision trees within the MLJ-package.

The goal is a plot recipe that

  1. has no dependency on Plots.jl
  2. is independent of any specific decision tree implementation

Goal 1 is an inherent characteristic of plot recipes. So we have reached it just by using plot recipes. Goal 2 is realised by requiring each decision tree implementation which wants its trees to be visualized to implement a narrow part of the AbstractTrees-interface (namely children and printnode).

I have already done such an implementation for DecisionTree.jl (see https://github.com/JuliaAI/DecisionTree.jl/blob/dev/src/abstract_trees.jl). The plot recipe itself exists also (see JuliaAI/DecisionTree.jl#147; the recipe is btw so general that it can be applied to each tree implementing the above mentioned interface, it isn't restricted to decision trees).

The only missing part to make all that work, is a type for dispatch. If we have a tree which we would like to plot (e.g. an instance of DecisionTree) then we want to be able to call plot(tree). In order for plot to chose the correct plot recipe, tree must be of the type that is associated with the plot recipe.

As we want a recipe that is independent of any specific decision tree implementation, this type has to be independent of all these implementations. Therefore I have defined a new abstract type AbstractNode.

The question is now, where this new type should be placed. A very natural place would be AbstractTrees.jl (as the AbstractTrees-interface is already used as a common denominator). Therefore the question: Could such an abstract type be added to AbstractTrees.jl?

BTW: If AbstractTrees.jl wouldn't be "just" an interface but define also a type (like AbstractTree), the whole problem wouldn't exist at all).

@ablaom: Do you want to add some remarks?

roland-KA avatar Jul 31 '22 17:07 roland-KA

the hard thing here is that not all trees in abstract trees have a node, and those the do often have a more useful super type than tree node.

oscardssmith avatar Jul 31 '22 19:07 oscardssmith

@oscardssmith, the name of the new type doesn't matter for my issue. It's about having a type (in contrast to not having a type), because dispatch needs a type. We could also call it AbstractTree, if that makes things easier.

roland-KA avatar Jul 31 '22 19:07 roland-KA

I'm currently implementing the AbstractTrees-interface also for BetaML/DecisionTrees and noted that they define in that package an abstract type called AbstractNode.

So in order to avoid confusion, it would even be better to name the abstract type I need for the plot recipe AbstractTree.

Just to make the idea of the new type clearer: An object of (sub)type AbstractTree would be something that implements (partially) the AbstractTrees-interface.

The question here is, how much of the AbstractTrees-interface it should implement, in order to qualify for being of type AbstractTree?

  • For my plot recipe I need children as well as printnode.
  • But I would also be happy, if we could agree on a minimal interface, consisting only of children.

Comments on this are welcome 🤓.

roland-KA avatar Aug 05 '22 17:08 roland-KA

The interface section of the documentation currently says "All trees must define children" but could probably be clearer that this is the only thing they are required to implement.

jlumpe avatar Aug 06 '22 22:08 jlumpe

Thanks for the hint! In the top section it also states:

"Using this package involves implementing the abstract tree interface which, at a minimum, requires defining the function AbstractTrees.children for an object."

So my question from above is already clearly answered. That's good news! 😊

Then I would suggest that we add the new abstract type

abstract type AbstractTree end

to the AbstractTrees-package (presumably in base.jl?) and define that:

Any object x for which holds isa(x, AbstractTree) must implement at least children.

Would that be a solution everybody could agree on?

roland-KA avatar Aug 07 '22 12:08 roland-KA

@roland-KA what exactly is your use case for the AbstractTree type? I'm not familiar with DecisionTrees.jl, but I imagine that a decision tree has to define some additional attributes/methods besides just conforming to the AbstractTrees API. Why not just an AbstractDecisionTree type?

There is also an issue with the AbstractTree name - despite the name of this package the core type should really be AbstractNode. There is a whole tree-vs-nodes issue with this package that so far hasn't been addressed. Any node (something that implements children()) can be considered as a tree with itself as the root, but there are other ways a type could represent a tree without also being a node (IndexedTree is the current one, but I can envision others). I've definitely put some thought into having these base classes but that would be for an API which is aware of the tree-vs-node issue and needs to dispatch on one or the other.

jlumpe avatar Aug 10 '22 22:08 jlumpe

Use Case

@jlumpe the use case is, as described above, to enable a visualization of (arbitrary) decision trees. We are currently focused on decision tree implementations within MLJ, but the concept is not limited to that meta-package.

The visualization should meet the following goals:

  1. no dependency on Plots.jl
  2. independence of any specific decision tree implementation

We have realised these goals by using a plot recipe and by requiring each decision tree to implement the AbstractTrees interface (see https://github.com/JuliaAI/DecisionTree.jl/issues/147 for details and one recent version of the recipe).

A few months ago I already published this article on Towards Data Science, where I explain the whole concept.

but I imagine that a decision tree has to define some additional attributes/methods besides just conforming to the AbstractTrees API.

No, I could indeed implement a plot recipe that depends exclusively on the AbstractTrees-API (only children and printnode are required). Remark: But perhaps the explanation in "Use Case (Part 2)" (see below) answers this question completely.

Why not just an AbstractDecisionTree type?

There are at least two reasons for an AbstractTree-type:

  1. The decision trees implement the AbstractTrees-interface (not a, currently not existing, AbstractDecisionTree-interface) and I think it's a good idea to build such things on existing packages like AbstractTrees and not reinvent the wheel again.
  2. The plot recipe is actually so generic, that it can plot each tree conforming to the AbstractTrees-API (not only decision trees).

... and perhaps it is a good occasion to grow the user base of AbstractTrees.jl and thus appreciating the work that you and your fellow contributors have put into that package 😊.

Current state of implementation

Just to make the situation clear. We already have

  • an implementation of a plot recipe that meets the above mentioned requirements
  • an implementation of the AbstractTrees-API for JuliaAI/DecisionTree.jl
  • an implementation of the AbstractTrees-API for BetaML/DecisionTrees.jl

The only part that lacks in order to make the whole thing publicly available is this abstract type that allows us to call plot in the following way

tree :: TreeImplementingTheAbstractTreesAPI
plot(tree)

and thus enabling plot (via dispatch on that type) to chose our plot recipe.

Tree vs. Node

Yes there are indeed different ways to implement a tree. In some implementations the tree is simply represented by the top-most node. But for the use case described above, this distinction is irrelevant, as it relies only on children (and printnode to have some meaningful content when printing).

And from a more general software engineering perspective, I think this distinction shouldn't show up in a type definition. It is the very purpose of an abstract data type (like a tree) to hide details of the implementation.

But perhaps I just don't know what applications you have in mind, where this distinction should show up on a type level. Do you have any examples?

BTW: This wouldn't conflict with the definition of an AbstractTree-type. You could simply define subtypes like AbstractNodeTree and AbstractRealRootTree (or whatever that would be called).

roland-KA avatar Aug 11 '22 14:08 roland-KA

To make it easier to understand, I have created a diagram of the intended structure, which I described above:

image

Here we can see, that AbstractTrees.jl plays an important role. All tree implementations which want the ability to be visualized, have to implement the AbstractTrees-API - but only that! It's the only dependency they get in this scenario.

The plot recipe DecisionTreesRecipe.jl has also a dependency on AbstractTrees.jl (because it uses the AbstractTree type). But again, it's the only dependency it has.

Only a script that wants to visualize a tree has dependencies to several components: To the tree itself, to the plot recipe and to Plots.jl, which is finally responsible for plotting.

roland-KA avatar Aug 11 '22 19:08 roland-KA

Use Case (Part 2)

Perhaps I need to explain the use case a bit in more detail. Not everybody is familiar with plot recipes.

Each plot recipe is defined on a type (that's why I need something like AbstractTree). This type is used on the call to plot to select the right recipe. When plot gets passed a tree (as in the example above), and tree isn't an object which plot knows how to plot, then plot searches for a plot recipe which is defined on the type of tree.

For that reason all tree implementations not only have to implement children and printnode but they have to "present" themselves as a subtype of AbstractTree. E.g. in BetaML/DecisionTrees we added for that purpose the following type:

struct InfoNode{T} <: AbstractTree
    node    :: DecisionNode{T}
    info    :: NamedTuple
end

The decision tree from BetaML is in its basic form a DecisionNode, but we wrap that into this new structure InfoNode which is a subtype of AbstractTree.

And for plotting we pass actually an object of type InfoNode to plot (not a DecisionNode).

roland-KA avatar Aug 11 '22 19:08 roland-KA

Be good to get a decision here. @roland-KA 's suggestion is clear, and the use-case comprehensively explained.

In view of discussion at #116, AbstractNode sounds like the best name.

ablaom avatar Aug 17 '22 23:08 ablaom

Hi alltogether,

it's now almost one month since I explained my use case in detail and there hasn't been any feedback since. I understand that many may be (or have been) on vacation during this time. But it would be nice to get some answer, so we know how to proceed with our project, because this issue stalls our works completely.

If you need more details on the use case, feel free to ask. If you object the proposal, please say so (and explain why) and of course the most pleasing answer would be that you add the abstract type within a reasonable timeframe 😃.

roland-KA avatar Sep 05 '22 11:09 roland-KA

Hi @jlumpe, @ExpandingMan, @timholy, @oscardssmith you seem to be the most active contributors/maintainers of this package, so I would appreciate your opinion, decision or comment on this issue.

Can we add an abstract type AbstractTree to this package as proposed above?

roland-KA avatar Sep 19 '22 15:09 roland-KA

Sorry, we've been pretty distracted by solving iteration without abusing the compiler, so those of us who've worked on this package recently have been very focused on that (I also am not a maintainer... I don't think, so I wasn't getting notifications on this thread).

Anyway, I'm going to have to come back and read the above extremely carefully, but so far my take on your problem is that you have a whole bunch of types that all implement the AbstractTrees interface and you want them to have a common abstract type. In that case, the question this issue needs to answer is: should that type be defined in AbstractTrees.jl or somewhere in plot recipes world?

My initial reaction to this is pretty firmly toward the latter, that you should define an abstract type somewhere in your own packages and that AbstractTrees.jl should not implement this. The reason is that trees in general do not have and cannot have a common abstract type. For example, nested arrays are tree nodes, but they cannot descend from a node type since they clearly need to be AbstractArray. This is ultimately attributable to the broader problem with Julia that type hierarchies and dispatch are global rather than contextual (at least that's my take on the problem, others have suggested multiple inheritance which I personally don't want). Hence, the existence of non-dispatching interface packages such as AbstractTrees.jl (the naming of which is a bit unfortunate) and Tables.jl.

In spite of the above, it could still be argued that the AbstractTree you are looking for should be defined here. My main problem with that is that I don't see a clear use case for it. Maybe it can provide some clever solution to our ongoing type-stability recursion problem, but it's certainly not immediately obvious to me how it would do that.

So, my question at this point is: why not define an AbstractTree somewhere in the plotting ecosystem instead of here?

ExpandingMan avatar Sep 19 '22 16:09 ExpandingMan

The Julia package AbstractTrees.jl specifies (and partly implements) the abstract data type TREE. I refer here to the term "abstract data type" (ADT) in the computer science-sense (as defined by Liskov and Zilles in 1974), not in the Julia sense.

An ADT is specified by a set of operations (thus defining an interface) and it has a name.

Naming of such an ADT-interface can be done in Julia in two (complementary) ways:

  1. The interface of the ADT can be placed within a Julia module (which has a name), as is the case with AbstractTrees.jl
  2. A Julia abstract type can be defined for that ADT. This has the added benefit that multiple dispatch can be applied on that ADT (and is the reason why we need such an abstract type for the use case described above).

Now to the question "should that type be defined in AbstractTrees.jl or somewhere in plot recipes world?": In my opinion it should be placed within AbstractTrees.jl because

  • it's a about naming (and typing) of the ADT which has been specified within AbstractTrees.jl (so conceptually it inherently belongs to that package)
  • it has absolutely nothing plot recipe specific about it
  • it would be useful to other users of the AbstractTrees.jl package who also need multiple dispatch on that ADT
  • defining such an abstract data type outside of this package would require us to build (and register) a Julia package which consists only of this single definition, which would be quite strange and awkward

The example of the "nested arrays" above which could be used to implement a TREE is no contradiction to this approach, since nobody is forced to use an abstract type AbstractTree which would be part of AbstractTrees.jl; it's an option than can be used, if appropriate but can be left aside if it doesn't fit (AbstractTrees.jl/src/traits.jl implements a whole list of abstract types which are in this sense optional).

IMHO, the "nested array" example is an unfortunate mixture of interface and implementation (which unfortunately can be seen often in the Julia world) and contradicts the very idea of an ADT because it exhibits the implementation in the interface. You only get the problem described above because these aspects are mixed. Therefore the sentence "trees in general do not have and cannot have a common abstract type" is a contradiction in itself -- a TREE is an abstract data type (in the computer-science sense). But that's another discussion.

@ExpandingMan you write: "My main problem with that is that I don't see a clear use case for it."

Well, our use case is clearly stated above and I think there may be more cases where users of the AbstractTrees.jl package may need exactly such an abstract data type in order to apply multiple dispatch.

roland-KA avatar Sep 20 '22 17:09 roland-KA

IMHO, the "nested array" example is an unfortunate mixture of interface and implementation (which unfortunately can be seen often in the Julia world) and contradicts the very idea of an ADT because it exhibits the implementation in the interface.

I certainly agree that there is something lacking in the design of Julia itself where this is concerned, though it's difficult to see what the solution should be and this is a much larger discussion. In the meantime, we simply have no choice about this. Other abstract types such as AbstractArray need to be used in many cases. AbstractArray is an example of an abstract type which is so important that it's pretty much always going to win against any abstract tree type where applicable.

I am not dead set against adding an AbstractTree, and the part of your argument which I am most sympathetic to is that you are going to have to go define your abstract type somewhere, so why not here? My biggest problem with it is that it's simply not going to be the case that most stuff that implements the interface is going to be an AbstractTree. You may not like it, and I have my own misgivings about the situation, but this is not a problem we are going to have a better solution to any time soon, and I doubt it's even possible short of Julia 2.0.

In my view it makes it much harder to argue for an AbstractTree if the type is basically a naming convention, it's not even a priori clear to me what it should implement. It seems it would just be a wrapper that would have to extend children for its one field, if you have better suggestions I'm certainly open to them. I suspect you are being overoptimistic about how this can be used for dispatch since, again there are all sorts of trees that won't have this type hierarchy. That has me leaning of the direction of saying that you should define an AbstractTree yourself in whatever package you deem appropriate, that way you can implement it however you like.

There is some precedent here: Tables.jl is probably even more widely implemented than this package. It provides some special wrapper types for use cases that require transforming a generic table into something more specific, but there is no AbstractTable, and if one were added it's not clear how it could be used. This package seems analogous: we have a wrapper type for treemap which is a use case not so different from the use cases in Tables.jl

If there is some consensus to add an AbstractTree type, I would not stand in the way of it, again I have some sympathy for the argument "it has to go somewhere, why not here?". However at the moment I'm still quite dubious about it. If there is a good argument that such a type should implement something non-trivial I would find that compelling.

ExpandingMan avatar Sep 20 '22 17:09 ExpandingMan

I'm glad to hear that you are open to a pragmatic solution 😊.

it's not even a priori clear to me what it should implement

Of course the AbstractTree type wouldn't be just a naming convention. We already discussed above that a TREE of type AbstractTree would have to implement a minimal interface consisting of children.

Computer science text books on "data structures & algorithms" typically include in a minimal TREE interface also operations like isLeaf and isRoot. But as these operations are not part of AbstractTress.jl we would reduce the list to just children.

With respect to the question of usefulness: I was able to implement the complete plot recipe for plotting a tree on the basis of that function (plus printnode to get labels for the nodes).

roland-KA avatar Sep 21 '22 10:09 roland-KA

I think a good question would be - how would this type even be used within the package? Besides defining AbstractTree and having our existing types inherit from it, I'm not sure if anything would actually need to dispatch on it.

jlumpe avatar Sep 21 '22 22:09 jlumpe

No, it would not be used by AbstractTrees.jl. Nothing internal to AbstractTrees.jl would inherit from this. The purpose is not internal, but to allow more external packages to make use of AbstractTrees.jl.

This is not about plotting per se; it just so happens that in the context of plotting and the way plot recipes work, there does not appear to be any way to avoid introducing this type somewhere. Plots.jl is very widely used, mature, and unlikely to change the way recipes work.

I think all @roland-KA is requesting here is the addition of a single line of code:

abstract type AbstractTree end

and inviting the community to adopt the convention that any object subtyping AbstractTree implements AbstractTrees.children, with no obligation for objects implementing children to actually do so (in particular the children themselves need not subtype AbstractTree). And that's it. The only alternative appears to be a standalone package having that single line of code in it. Philosophical objections aside, what possible advantage could that have?

ablaom avatar Sep 21 '22 23:09 ablaom

The concern with these sorts of things is always about adding something dubious to a package with a ton of dependents. The fact that this hasn't been done in e.g. Tables.jl gives me some pause. I suppose my biggest concern is about code in other packages expecting to dispatch on AbstractTree when, again, this is just not realistic.

Indeed, I'm finding the arguments both for and against this proposition similarly weak. The only part of any of this discussion that I find compelling is that if it is going to be defined somewhere it might as well be here. I guess that means I've been convinced that this should be added, though I'm not without misgivings.

However, I don't think we are going to be able to call it AbstractTree, I think it's going to have to be AbstractNode or AbstractTreeNode, as what this package calls a tree is merely the node with no parent.

ExpandingMan avatar Sep 22 '22 00:09 ExpandingMan

However, I don't think we are going to be able to call it AbstractTree, I think it's going to have to be AbstractNode or AbstractTreeNode, as what this package calls a tree is merely the node with no parent.

Yes. The previous discussions made that clear, my bad.

ablaom avatar Sep 22 '22 02:09 ablaom

Then I would prepare a PR which adds an abstract type AbstractNode. The file base.jl would be probably the right place for it?

And about your concerns @ExpandingMan: I wouldn't call this small change "dubious". The purpose and its use case are clearly specified and not a single user of AbstractTrees.jl will notice this change. It doesn't brake anything and it doesn't prevent any future developments. It's just about introducing an additional service to the package for users who want it (and dispatch is really not an exotic use case in Julia ). The use of this new abstract type is purely optional. ... and in a first step, it will make the users of the MLJ package happy. So I hope, this will reduce your misgivings 😊.

roland-KA avatar Sep 22 '22 10:09 roland-KA

Well others will have to chime in on this as well, I'm not even an admin on this repo.

The very next thing that has to be done on this package is that we MUST fix this. I might have an idea how to solve it, I'm going to turn my attention to it as soon as I finish up another project. I can add AbstractNode in that PR.

ExpandingMan avatar Sep 22 '22 13:09 ExpandingMan

Ok, thanks for taking care of the issue!

So we can expect to get an AbstractNode included in the package, but we will have to wait a few weeks until issue #117 is fixed?

roland-KA avatar Sep 23 '22 12:09 roland-KA

I'd feel somewhat better about having that fixed first mainly so we can be more sure about what all the cursor type signatures are going to be, but it's probably not strictly necessary. I can put it in an upcoming PR for that fix, but if you want to make one sooner it's possible it'll get merged. Again, I'm not even the person who'll make the decision here.

@oscardssmith what's your take on adding this?

ExpandingMan avatar Sep 23 '22 13:09 ExpandingMan

I think I pretty much agree with you. I don't see it being especially useful, but I also don't think it will cause major problems.

oscardssmith avatar Sep 23 '22 13:09 oscardssmith

Oh, it would be great, if we could get it sooner in an upcoming PR (as it seems to me, that there is no easy solution for #117).

roland-KA avatar Sep 23 '22 18:09 roland-KA

I've added this in #120 which more importantly should solve #117.

I really don't feel great about it since I basically had to spend the entire docstring warning people not to use AbstractNode, but there are 2 AbstractNode types in the package (StableNode and MapNode).

ExpandingMan avatar Sep 26 '22 23:09 ExpandingMan

Thanks for advancing the issue! 👍

roland-KA avatar Sep 27 '22 15:09 roland-KA