DataStructures.jl
DataStructures.jl copied to clipboard
Maturing out the package (Path to DataStructures.jl v1.0)
What do we need to do before we would be happy to tag DataStructures.jl v1.0? I mean, we are one of the most common dependencies in the package ecosystem, and julia 1.0 is out.
My thoughts;
- [ ] Search the code base for anything written in outmoded styles (e.g. #472)
- [ ] Get rid of anything deprecated #477 (but that will need to be done again after)
- [ ] Maybe split up the package #310, but noting that all the packages sliced out that way will want to hit 1.0 before (or at the same time as) DataStructures.jl does.
It hurts people when we break things, and while we are good at deprecating, getting on the the "strict" side of semver is something we need to do for the sake of the ecosystem
People keep asking when will DataStructures 1.0 come out. The answer is as soon as we completely audit the package to make sure there is nothing that needs to be removed/corrected to follow modern julia conventions. And have actioned those corrections. A lot of the code in the package is pretty old, and follows e.g. conventions from julia 0.3. We need to identify and correct these.
I will post a list of files at the bottom of this post. Then people can go through this and use the checklist to make sure everything is good, one file at a time.
How to audit a file to check if 1.0 ready
- Answer the question of should we even keep this in the package. E.g. no data structure that somone just kind of made up (like the now removed `Classifier Collection). If ours is is a worse version of something in another package it should also go.
-
Type Names Correct/Sensible: for example make sure we don't have a type called
MagicVectors
that is a subtype of vector. That should be calledMagicVector
(singular). -
No constructors taking types as positional arguments. For example there should not be
MagicVector(::Type{T}) where T
, instead there should just beMagicVector{T}()
. -
No functions that duplicate constructors. For example there should not be a function
magic_vector(xs)
, onlyMagicVector(xs)
. -
No functions that are improperly cased and pretend to be constructors.
- E.g. no
MagicVector(xs) = Vector(xs) .* "Magic"
, that should just be a normal functionmagic_vector(xs) = Vector(xs) .* "Magic"
- E.g. no
MagicVector(xs)=Vector{Magic}(xs)
, that should be aconst
type aliasconst MagicVector=Vector{Magic}
- E.g. no
-
No exported functions that could be just done with overloads of
Base
. E.g. don't needtop
,peak
,head
all of which act the same asfirst
but on a particular data structure. - No internal helpers exported. If something isn't useful to the end user it should not be exported. In general The vast majority of the public API should be constructor or overloads of
Base
functions. Thus any export should be looked at closely. -
Correct return types
push!
andempty!
anddelete!
should return the collection. (this is relatively minor as it would be a bug if they didn't so it isn't breaking for use to fix it later.) -
No defining methods in ways that disagree with Base: for example because Base has
insert!(xs, index, value)
we should not have noinstert!(xs, val)
(push!
is for that). Conversely we should not havepush!(xs, key, value)
( we dio both of these right now in various trees, and in accumulators respectively)
All these things are things that have changed during the life-time of this package.
List of files that need to be checked
(Checking each will also require checking the top-level file for relevant exports)
- [ ] accumulator.jl
- [ ] default_dict.jl
- [ ] robin_dict.jl
- [ ] dict_support.jl
- [ ] trie.jl
- [ ] balanced_tree.jl
- [ ] circ_deque.jl
- [ ] circular_buffer.jl
- [ ] dibit_vector.jl
- [ ] disjoint_set.jl
- [ ] fenwick.jl
- [ ] int_set.jl
- [ ] sparse_int_set.jl
- [ ] list.jl
- [ ] multi_dict.jl
- [ ] mutable_list.jl
- [ ] deque.jl
- [ ] queue.jl
- [ ] stack.jl
- [ ] sorted_dict.jl
- [ ] sorted_multi_dict.jl
- [ ] sorted_set.jl
- [ ] tokens.jl
- [ ] tokens2.jl
- [ ] priorityqueue.jl
- [ ] heaps.jl
- [ ] heaps/arrays_as_heaps.jl
- [ ] heaps/binary_heap.jl
- [ ] heaps/minmax_heap.jl
- [ ] heaps/mutable_binary_heap.jl
- [ ] container_loops.jl
- [ ] delegate.jl
Many of these should be easy, because they have kind of already been done. E.g. Heaps were done for 0.18; robindict was created after julia 1.0 and so should follow conventions. But lets get them all and check them off using the list criteria listed above.
Higher level process:
0.19 will be the last release before 1.0. 0.19 will be the same as 1.0 but with deprecations. 1.0 will come out the same day as 0.19
We will stick to making patch releases on 0.18 with deprecations while we can. But once we can't we will swap master to 0.19-DEV, and say on that until we can release it and 1.0. That might take a 6 or more months; depends how many (if any) people are about to spend time helping with the audit. If necessary we will have a backport branch for 0.18 bug-fixes
Once @eulerkochy's GSOC work is complete we will not accept any new data structures until after 1.0 released.
If anyone want to help out, please feel strongly encouraged to do so.
No exported functions that could be just done with overloads of Base
How about reverse_iter
? Can this just be Iterators.reverse
? The only issue i see is that reverse_iter(s::Stack)
returns a DequeIterator
, not a Iterators.Reverse
, but idk if Iterator's API is really supposed to guarantee Iterators.reverse(::T) -> Iterators.Reverse{T}
or not. (edit: no it's not, that's why we have reverse
, rather than just Reverse
, so we don't have to guarantee returning a Reverse
)
No exported functions that could be just done with overloads of Base.
No defining methods in ways that disagree with Base
I think this suggests we should make a call on which side this falls: https://github.com/JuliaCollections/DataStructures.jl/issues/296
Could anyone explain in more detail the above-mentioned recommendation regarding the insert!
function? In particular, does the recommendation mean that insert!(ss, k)
, where ss is a SortedSet and k is a new key, should be deprecated? Note that push!
does not replicate this functionality because insert!(ss,k)
returns a bool to indicate whether the item was already present and a semitoken indexing the newly inserted item, whereas push!(ss,k)
returns the container.
Could anyone explain in more detail the above-mentioned recommendation regarding the insert! function?
Base defines insert!
as:
insert!(a::Vector, index::Integer, item)
Insert an item into a at the given index. index is the index of item in the resulting a.
and it returns the collection.
In particular, does the recommendation mean that insert!(ss, k), where ss is a SortedSet and k is a new key, should be deprecated?
yes, it doesn't match the description of the Base function.
Note that push! does not replicate this functionality because insert!(ss,k) returns a bool to indicate whether the item was already present and a semitoken indexing the newly inserted item, whereas push!(ss,k) returns the container.
Yeah, so in that case we can't deprecate it for push!
.
We would need to make some other plan.
Like st_push!
or something? or a submodule that defines a bunch of function names for things that return subtokens?
Maybe a case could be made for push!(::Type(SemiToken),::SortedSet, x)
or something?
Here is a more serious conflict with Base
: according to the Base docs, pop!(s)
is supposed to return the last item in s
if s
is an ordered collection. However, pop!
on the sorted containers currently returns the first item. Which is the lesser of two evils, a breaking change for the package or doing the opposite of the Base
documentation?
Maybe deprecate pop!(s)
for sorted containers to popfirst!(s)
, and introduce new functionality poplast!(s)
(the latter not present in Base)? Then, several years in the future, deprecate poplast!(s)
to just pop!(s)
?
Maybe deprecate pop!(s) for sorted containers to popfirst!(s), and introduce new functionality poplast!(s) (the latter not present in Base)? Then, several years in the future, deprecate poplast!(s) to just pop!(s)?
yeah this, sounds right.
but I would argue for putting a deprecation warning in poplast!
in 0.19 as it is defined saying this will just be pop!
in 1.0, which will be released not years in the future but rather days.
Doesn't this approach for deprecating pop!(s)
mean that everyone who upgrades from 0.18 to 1.0, skipping 0.19, will encounter a breaking change that is also silent?
With regard to renaming insert!
: since insert!
currently has a different API for each of the three containers, how about if we choose the three names ss_push!
, sd_push!
and smd_push!
?
Doesn't this approach for deprecating pop!(s) mean that everyone who upgrades from 0.18 to 1.0, skipping 0.19, will encounter a breaking change that is also silent?
yes, that is what they get for skipping versions. But we will write release notes etc.
With regard to renaming insert!: since insert! currently has a different API for each of the three containers, how about if we choose the three names ss_push!, sd_push! and smd_push! ?
Lets split this off to another issue or a PR and we can think about our options.
OK. I am currently working on a major overhaul of the sorted containers to address a number of open issues at once. I will include these new names in my forthcoming PR; and then you or another reviewer can comment on them.
With regard to attaching proper meanings to push!
, pop!
, and insert!
: Kevin Squire @kmsquire wrote a long message on discourse about how these functions are somewhat muddled in Base
:
https://discourse.julialang.org/t/taking-push-and-pop-seriously/34326
He proposed a more principled way to implement them in Base, but, as far as I can tell, did not get any traction for this proposal. Presumably this is because it is now too late to make changes to Base. If we try to follow Base conventions in the DataStructures package, then we will end up with the same muddled semantics.
I'd also like to point out open issue https://github.com/JuliaCollections/DataStructures.jl/issues/403 which is basically a request for clearer semantics for these functions.
One thing I've found is that sometimes I would like to use special ordering that I do not want to define with isless
, maybe because it would be type piracy or simply does not make sense outside the context of the algorithm implemented. Would it be possible for the data structures that need an ordering function to always accept something like lt
for sort
?
This is possible-- the sorted containers are parameterized by an order function that you can customize. Please see the documentation for an example in which strings are ordered by case-insensitive comparison.