adds the `nth` function for iterables
Hi,
I've turned the open ended issue #54454 into an actual PR. Tangentially related to #10092 ?
This PR introduces the nth(itr, n) function to iterators to give a getindex type of behaviour.
I've tried my best to optimize as much as possible by specializing on different types of iterators.
In the spirit of iterators any OOB access returns nothing. (edit: instead of throwing an error, i.e. first(itr, n) and last(itr, n))
here is the comparison of running the testsuite (~22 different iterators) using generic nth and specialized nth:
@btime begin
for (itr, n, _) in $testset
_fallback_nth(itr, n)
end
end
117.750 μs (366 allocations: 17.88 KiB)
@btime begin
for (itr, n, _) in $testset
nth(itr, n)
end
end
24.250 μs (341 allocations: 16.70 KiB)
how would this compare to a more naive implementation like
nth(itr, n) = first(Iterators.Rest(itr, n))
?
how would this compare to a more naive implementation like
nth(itr, n) = first(Iterators.Rest(itr, n))?
Rest requires knowing the state used by the iterator, which is often considered an implementation detail and hard to pick automatically (unless i am missing something!)
If the state was known first(Rest(itr, n)) would probably be the fastest, since you alwasy do at most one iteration.
but knowing the correct n-1 state means that you most likely calculate n state directly.
In that case then a specialization would be even better!
Seems like a lot of code.
I reproduced the above benchmark here:
https://gist.github.com/mcabbott/fe2e0821e9bfe5cc7643bb15adf445d0
I get 75.625 μs for first(Iterators.drop(itr, n-1)) vs 1.558 μs for the PR. However, this is entirely driven by one case, Cycle{Vector{Int64}}. Some other cases are faster, some slower (String). Maybe they ought to be discussed in follow-up PRs?
No strong position on whether this needs a name or not, but perhaps this first PR can focus on that, and let the implementation be just:
nth(itr, n::Integer) = first(Iterators.drop(itr, n-1))
nth(itr::AbstractArray, n::Integer) = itr[begin-1+n]
A lot of the code is for optimizing out of bound checking. If we go with davidantoff suggestion of letting nth just error on oob n most of the actual code can be scrapped while retaining the speed.
I disagree with throwing an error. In cases where you don't know if an nth element exists, that forces a try-catch which is both slow and brittle. I would imagine that most ordered iterators with a known length support indexing, so this would probably mostly be used precisely when the length is unknown.
I think another consideration here is consistency: the other functions we have that take an individual element from an iterator are first and last, and both throw an error if you ask them for something that doesn't exist (i.e. when you call them on an empty source). In my mind, first, nth and last should have the same kind of design, so that also speaks in favor of throwing an error.
I agree with @jakobnissen that in some situations being able to handle this without an exception would be nice, but on the flip side, I can also see scenarios where an error seems much better, in particular in interactive sessions where I might be playing around with some data and this function could be very useful. And especially in an interactive scenario it would be super inconvenient if Some was used...
Maybe the best design would be to allow for both scenarios. Say something like
nth(itr, n, nothrow=false)
So the default would be that an exception is thrown if the nth element doesn't exist, but when nothrow=true then nothing is returned if the element doesn't exist, and on success things are wrapped in Some.
We could also opt for relying on the IteratorEltype trait to check if an iterator can contain nothing elements.
With ::HasEltype() we can dispatch over eltype(itr) <: Union{Nothing, T} where T.
In that case nth can return Some(nothing) while otherwise just return nothing since that would be unambiguous for those collections that do not have nothing in them. Of course wrapping with Some would be the default in case we have a ::EltypeUnknown Iterator.
Some is already expected in workflows with Union{Nothing, T} so that wouldn't introduce any extra complexity.
Although I see the similarity with first and last I'd be more akin to accomunate nth(itr, n) to the their siblings first(itr, n) and last(itr, n) and, I would argue, that first is a bit of an outlier in throwing on an empty collection, since for example,
according to documentation (and code) last never really throws an error:
"Return the end point of an
AbstractRangeeven if it is empty."
the error in last([]) is from getindex that receives a 0 from
lastindex(a::AbstractArray) = (@inline; last(eachindex(IndexLinear(), a))) # equals to last(OneTo(0))
Similarly, both first(itr, n) and last(itr, n) rely on the take(itr, n) iterator which simply returns nothing when finishing the elements of the underlying iterator.
From this my idea that in principle iterators are non throwing by default, any throwing should be done one level higher and not at the iterator level itself (like how getindex and last interact). Iteration is a "low level" interface and I believe it should give the user the choice on how to handle "end states" of the iteration.
We could also opt for relying on the
IteratorEltypetrait
I have to admit, I think that is the option I like least of all of the proposed options so far :) It would make it very tricky to write generic code that uses the nth function, essentially now I would have to check the trait every time I call nth on something to be able to correctly interpret the return value from nth.
I'd be more akin to accomunate
nth(itr, n)to the their siblingsfirst(itr, n)andlast(itr, n)
To me nth(itr, n) is conceptually way closer to first(itr) and last(itr) because all of these produce single values, rather than streams of values. Whenever a function produces a stream of values there is a simple, natural way to return no value: namely an empty stream. But that is exactly not possible for functions that are supposed to return just one value.
From this my idea that in principle iterators are non throwing by default
Agreed, but the whole difference between first(itr) and first(itr, n) is the one does not produce an iterator, while the other one does.
I still think that my proposal with an argument like nothrow would be the cleanest solution here :), are there things that you think are problematic about it?
Is there any precedent for a nothrow keyword?
We could also follow get(A, key, default), as you suggested earlier. It seems [edit: in this case!] a little confusing that this goes by enumeration not indexing, so maybe it shouldn't be called Iterators.get, but is there a good suggestive name? getnth(iter, number, default) or getcount or something? Somehow using nth(iter, n, default) seems a bit at odds with first, last, at least to me.
It would make it very tricky to write generic code that uses the nth function, essentially now I would have to check the trait every time I call
nthon something to be able to correctly interpret the return value fromnth.
I think it's already hard to write generic code that covers both generic collections and Union{Nothing, Some{T}} collections.
Most likely you'd need a specialized function to handle the Somes anyway.
So my reasoning was that since handling of collections of Union{Nothing, Some{T}} already requires a special context that uses Somes, nth in that context would just behave like you'd expect.
But everywhere else, is always the same.
Whenever a function produces a stream of values there is a simple, natural way to return no value: namely an empty stream. But that is exactly not possible for functions that are supposed to return just one value.
Agreed, but the whole difference between
first(itr)andfirst(itr, n)is the one does not produce an iterator, while the other one does.
first(itr, n) and last(itr, n) are not iterators but just wrappers (like nth), they collect an iterator to produce a single value. The only difference is that this value is a collection of elements instead of just an element.
I still think that my proposal with an argument like
nothrowwould be the cleanest solution here :), are there things that you think are problematic about it?
Not really, I don't have particularly hard opinions about it. In the original issue I had proposed something similar with nth(itr, n, skip_checks=false). What I am most concerned about is forcing the use of Some.
It seems a little confusing that this goes by enumeration not indexing, so maybe it shouldn't be called
Iterators.get, but is there a good suggestive name?getnth(iter, number, default)or getcount or something? Somehow usingnth(iter, n, default)seems a bit at odds withfirst,last, at least to me.
My proposal for nth was an attempt to stay in the "cardinal numbering" semantic sphere like first or last.
Another inspiration was the iter.nth(n) from Rust, and in that case the n was actually used as an index so nth element == iter.nth(n-1) which i found very confusing but I digress.
I can get behind the default argument though!
throwing together some "PR litterature review" for cross reference since I think this PR can depend/interact on/with these:
- #49965
- #47353
- #48277
- #54187
EDIT:
- #43773
- #45080
Having thought about it, I do have some sympathy for the argument of @davidanthoff that it should behave like first and last.
I do see myself wanting to use it in code like:
fourth_field = @something nth(eachsplit(line, '\t'), 4) throw(FormatError(
lazy"Line $lineno does not contain four tab-separated fields fields"
))
Which would now instead be
fourth_field = first(@something iterate(drop(eachsplit(line, '\t'), 3)) throw(FormatError(
lazy"Line $lineno does not contain four tab-separated fields fields"
)))
That's certainly doable (especially since, for iterators of unknown length, most of the clever tricks that nth can do doesn't apply, so drop + iterate will do the same thing), but it's slightly less nice. But the consistency with first and last does perhaps weigh more heavy.
What is the semantic difference between this function and getindex or get?
nth and trynth seems a bit more julian than a nothrow kwarg
matching tryparse and trylock,
nthandtrynthseems a bit more julian than anothrowkwarg
Yes, agreed! Having two distinct functions probably also helps with type stability.
Another naming scheme I thought about is nth_or_nothing, or something along those lines, Rust has that kind of pattern a fair bit. But I think @adienes idea is actually better as there seems to be more precedence in Julia for that.
Julia has a bunch of patterns for handling this already, so one has some freedom to choose "consistent with what?" :)
For example, first and last error when there's no element. While findfirst returns nothing, exhibiting the same confusion between "not found" and "found nothing".
I see 4 ways of handling errors in Julia:
-
Union{T,Nothing}: These functions are convenient because they don't require unwrappingT, but imho more trouble than they're worth because in generic library programming you can't safely assumeTandNothingare disjoint, which means libraries have to document complex APIs or risk data corruption. -
throw(IndexError): This is convenient because it doesn't require any unwrapping and is unambiguous. It's notnoexcept, obviously, but that's often fine in many use cases. -
Option{Ok{T},Err{E}}: With Moshi.jl, Julia has the foundations for proper option types; we just need a package withOption{Ok{T},Err{E}}and some Rust-style functions likeand_then,or_elseimplemented on it. This style requires a bit more code but produces reliable and efficient systems because all branches can be covered with compile-time exhaustive pattern matching. -
Union{Some{T},Nothing}is just a lightweightOptionwith less-informative error values.
Personally I'd be happy with Base having both throwing functions and Option-returning functions. I'd prefer to minimize the amount of Union{T,Nothing} functions spreading through the ecosystem because they complicate correct generic programming.
The 5th option is Union{T,S} where you supply S -- like get. That, 2 (error, like getindex) and 4 (Some/Nothing) are the competitors.
What is the semantic difference between this function and
getindexorget?
It takes the iteration count, not the index. (Same on Vector, different on Dict, or OffsetArray.)
While
findfirstreturnsnothing, exhibiting the same confusion between "not found" and "foundnothing".
That's one's not as bad, as it's either an index or nothing. (You could make Dict(nothing=>2, missing=>3), but you could also name your team "who", "what" and "I don't know"...)
It takes the iteration count, not the index
I see, thanks.
That's one's not as bad, as it's either an index or nothing. (You could make Dict(nothing=>2, missing=>3), but you could also name your team "who", "what" and "I don't know"...)
The problem with "just assume users do what you expect" is that (1) nobody ever documents what they expect and (2) even documented it increases the complexity of usage. No library function using findfirst ever documents that passing a nothing key is undefined behavior. And if they did, the caller would have to increase the complexity of their code to work around the failure in the special case of nothing keys, instead of just using a simple procedure that always works. Imho this kind of API will increase the overall level of complexity and opportunity for mistakes.
The 5th option is Union{T,S} where you supply S -- like get. That, 2 (error, like getindex) and 4 (Some/Nothing) are the competitors.
get(c,k,d) is a good point. That's a good option where appropriate, though the 2-arg function still needs to do something, unless it's just not provided?
Might be nice to accept an AbstractVector{<:Integer}, similar to getindex, so that you could do nth(itr, 5:7) or nth(itr, [3, 17, 245]).
But I agree with nth(itr, n) throwing an error for out-of-bounds, and nth(itr, n, default) returning default if n is out-of-bounds. Similar to get.
@stevengj
Might be nice to accept an AbstractVector{<:Integer}, similar to getindex, so that you could do nth(itr, 5:7) or nth(itr, [3, 17, 245]).
Nonscalar indexing getindex(xs, ixs) diverges from modern julia practice of using explicit rather than implicit broadcasting when mapping a function over a collection. Discussed at length in https://github.com/JuliaLang/julia/issues/30845. Personally I'd rather this practice not be extended into new functions.
I presume the point of nth(iter, [3, 17, 245]) vs. broadcasting nth.(Ref(iter), [3, 17, 245]) is that it would run the iterator only once.
How well nth fits with the existence of iterators which don't like to be run twice, or are expensive to run through, that seems a bit awkward.
Questions for triage:
- Do we want some form of this feature?
- Do we want this specific API?
I'd love to be able to write map(nth(3), my_vec) instead of map(x->x[3], my_vec). Or map(third, v).
That could be obtained already using simpler nth.(my_vec, 3) no?
I think that adding the single argument variants would probably add too much to this PR (I have locally a version with both throwing and with default variants and it's a lot of code already), maybe as a separate one?
I'm ok with:
nth(itr, n) = first(drop(itr, n-1))
nth(n) = Fix2(nth, n)
as a spec.
@LilithHafner I saw you removed the tag, Is jeff's comment the desired spec from triage? So to recap:
- throw when out of bounds
- no
defaultvalue variant - fix2 wrappers
We talked about this two traiges ago and @JeffBezanson's comment is what came out of that. I wasn't at the meeting so couldn't say.
Removed a lot of code, now it just defaults to first(drop()) and has a couple of overrides for cycles, abstract arrays and a couple more.
I reproduced the above benchmark here: https://gist.github.com/mcabbott/fe2e0821e9bfe5cc7643bb15adf445d0 I get 75.625 μs for
first(Iterators.drop(itr, n-1))vs 1.558 μs for the PR. However, this is entirely driven by one case,Cycle{Vector{Int64}}. Some other cases are faster, some slower (String). Maybe they ought to be discussed in follow-up PRs?
Now it should be the best of both worlds, first(drop()) was on par on a lot of cases with the initial implementation.
And even better on a couple of cases, so threw away a lot of stuff and just kept the specializations.
Click to expand
this PR: 2.083 ns (0 allocations: 0 bytes) default: 152.322 ns (0 allocations: 0 bytes)
this PR: 3.666 ns (0 allocations: 0 bytes) default: 59.723 ns (0 allocations: 0 bytes)
this PR: 7.042 ns (0 allocations: 0 bytes) default: 7.083 ns (0 allocations: 0 bytes)
this PR: 3.000 ns (0 allocations: 0 bytes) default: 3.000 ns (0 allocations: 0 bytes)
this PR: 2.416 ns (0 allocations: 0 bytes) default: 16.032 ns (0 allocations: 0 bytes)
this PR: 2.125 ns (0 allocations: 0 bytes) default: 3.041 ns (0 allocations: 0 bytes)
this PR: 3.042 ns (0 allocations: 0 bytes) default: 3.041 ns (0 allocations: 0 bytes)
this PR: 3.083 ns (0 allocations: 0 bytes) default: 3.041 ns (0 allocations: 0 bytes)
this PR: 3.042 ns (0 allocations: 0 bytes) default: 3.041 ns (0 allocations: 0 bytes)
this PR: 3.000 ns (0 allocations: 0 bytes) default: 3.041 ns (0 allocations: 0 bytes)
this PR: 2.125 ns (0 allocations: 0 bytes) default: 3.000 ns (0 allocations: 0 bytes)
this PR: 2.750 ns (0 allocations: 0 bytes) default: 16.408 ns (0 allocations: 0 bytes)
this PR: 5.541 ns (0 allocations: 0 bytes) default: 5.541 ns (0 allocations: 0 bytes)
this PR: 7.042 ns (0 allocations: 0 bytes) default: 7.041 ns (0 allocations: 0 bytes)
this PR: 18.514 ns (0 allocations: 0 bytes) default: 18.496 ns (0 allocations: 0 bytes)
this PR: 7.708 ns (0 allocations: 0 bytes) default: 7.716 ns (0 allocations: 0 bytes)
this PR: 6.125 ns (0 allocations: 0 bytes) default: 6.125 ns (0 allocations: 0 bytes)
this PR: 6.166 ns (0 allocations: 0 bytes) default: 6.125 ns (0 allocations: 0 bytes)
this PR: 3.333 ns (0 allocations: 0 bytes) default: 3.333 ns (0 allocations: 0 bytes)
this PR: 3.333 ns (0 allocations: 0 bytes) default: 3.333 ns (0 allocations: 0 bytes)
this PR: 2.750 ns (0 allocations: 0 bytes) default: 92.833 μs (0 allocations: 0 bytes)
this PR: 3.041 ns (0 allocations: 0 bytes) default: 104.299 ns (25 allocations: 1.17 KiB)
Could someone check this? :)