julia
julia copied to clipboard
Collecting an `Iterators.PartitionIterator{String}` should return a vector of `SubStrings` and not a `Vector{Vector{Char}}`
This subject was initially discussed in https://discourse.julialang.org/t/efficient-way-to-split-string-at-specific-index/83115/17 regarding a way to split a string into equal-length substrings.
The solutions discussed involved (for a string composed of 10 blocks of 8 strings)
- using a comprehension
split8(str::String) = [str[i+1:8] for i in 0:8:length(str)]
- using a comprehension and
SubStrings
split8(str::String) = [SubString(str,i+1,i+8) for i in 0:8:length(str)-1]
- using
Iterators.partition
andjoin
split8(str::String) = join.(Iterators.partition(str, 8))
- using
Iterators.partition
andString
split8(str::String) = String.(Iterators.partition(str, 8))
With this configuration
Julia Version 1.7.3 Commit 742b9abb4d (2022-05-06 12:58 UTC) Platform Info: OS: Windows (x86_64-w64-mingw32) CPU: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake) Environment: JULIA_EDITOR = code JULIA_NUM_THREADS = 1
it seems that the first solution is the most efficient. However, Iterators.partition
seems to be the correct tool for that. Also as noted by DNF in the original discussion, given its description
partition(collection, n)
Iterate over a collection n elements at a time.
Examples ≡≡≡≡≡≡≡≡≡≡
julia> collect(Iterators.partition([1,2,3,4,5], 2)) 3-element Vector{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}: [1, 2] [3, 4] [5]
one could assume that collect
ing the iterator should return a vector of SubString
.
Yes, this certainly makes sense to me. We already have specialized methods for partition(array, n)
that return views, so why not a specialized method for partition(string, n)
?
Should just need something like:
eltype(::Type{PartitionIterator{T}}) where {T<:AbstractString} = SubString{T}
function iterate(itr::PartitionIterator{<:AbstractString}, state = firstindex(itr.c))
state > ncodeunits(itr.c) && return nothing
lastind = min(nextind(itr.c, state, itr.n - 1), lastindex(itr.c))
return SubString(itr.c, state, lastind), nextind(itr.c, lastind)
end
which gives:
julia> collect(Iterators.partition("foobars", 2))
4-element Vector{SubString{String}}:
"fo"
"ob"
"ar"
"s"
This is a user-visible change, but I don't think it would qualify as breaking since the only documented requirement of partition
is that it returns iterable collections of elements, I think?
Thanks for tackling this so quickly ! I just benchmarked it on my machine. This gives
julia> @benchmark collect(Iterators.partition($str,8))
BenchmarkTools.Trial: 10000 samples with 203 evaluations.
Range (min … max): 374.877 ns … 2.802 μs ┊ GC (min … max): 0.00% … 81.70%
Time (median): 433.005 ns ┊ GC (median): 0.00%
Time (mean ± σ): 446.508 ns ± 80.783 ns ┊ GC (mean ± σ): 0.56% ± 2.88%
▁ ▄▇▇▇█▇▅▃▄▄▃▃▃▃▃ ▁ ▁▁ ▁▂▁▂▂▁▁▁ ▂
█▆▆███▆▇▄▇██████████████████▇██████████████▇▆▆▅▅▆▄▃▅▄▂▃▂▄▃▄▄ █
375 ns Histogram: log(frequency) by time 613 ns <
Memory estimate: 304 bytes, allocs estimate: 1.
Median times are 2.3 times the ones obtained with the comprehension, but it's much better than with the join
or String
solutions !
With respect to the breaking characteristic, I believe this could brake only if methods rely on the exact type of the output, rather than the indexing itself which seems identical with SubString
and Vector{Char}
.
Just needs someone to add a test and create a pull request.
split8_6(str::String) = [@view str[i.+(1:8)] for i in 0:8:length(str)-1]
u guys are crazy lol
I have opened a PR for the above implementation. There is one odd case partition(enumerate(string),1)
, which I'm not sure about but I think the new behaviour makes sense (more in the PR).
At the moment it fails testall
but I believe that's unrelated to the PR (it fails on master too).
Any feedback would be greatly appreciated.