julia icon indicating copy to clipboard operation
julia copied to clipboard

Collecting an `Iterators.PartitionIterator{String}` should return a vector of `SubStrings` and not a `Vector{Vector{Char}}`

Open BambOoxX opened this issue 2 years ago • 8 comments

This subject was initially discussed in https://discourse.julialang.org/t/efficient-way-to-split-string-at-specific-index/83115/17 regarding a way to split a string into equal-length substrings.

The solutions discussed involved (for a string composed of 10 blocks of 8 strings)

  • using a comprehension split8(str::String) = [str[i+1:8] for i in 0:8:length(str)]
  • using a comprehension and SubStrings split8(str::String) = [SubString(str,i+1,i+8) for i in 0:8:length(str)-1]
  • using Iterators.partition and join split8(str::String) = join.(Iterators.partition(str, 8))
  • using Iterators.partition and String split8(str::String) = String.(Iterators.partition(str, 8))

With this configuration

Julia Version 1.7.3 Commit 742b9abb4d (2022-05-06 12:58 UTC) Platform Info: OS: Windows (x86_64-w64-mingw32) CPU: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-12.0.1 (ORCJIT, tigerlake) Environment: JULIA_EDITOR = code JULIA_NUM_THREADS = 1

it seems that the first solution is the most efficient. However, Iterators.partition seems to be the correct tool for that. Also as noted by DNF in the original discussion, given its description

partition(collection, n)

Iterate over a collection n elements at a time.

Examples ≡≡≡≡≡≡≡≡≡≡

julia> collect(Iterators.partition([1,2,3,4,5], 2)) 3-element Vector{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}: [1, 2] [3, 4] [5]

one could assume that collecting the iterator should return a vector of SubString.

BambOoxX avatar Jun 21 '22 17:06 BambOoxX

Yes, this certainly makes sense to me. We already have specialized methods for partition(array, n) that return views, so why not a specialized method for partition(string, n)?

stevengj avatar Jun 21 '22 19:06 stevengj

Should just need something like:

eltype(::Type{PartitionIterator{T}}) where {T<:AbstractString} = SubString{T}

function iterate(itr::PartitionIterator{<:AbstractString}, state = firstindex(itr.c))
    state > ncodeunits(itr.c) && return nothing
    lastind = min(nextind(itr.c, state, itr.n - 1), lastindex(itr.c))
    return SubString(itr.c, state, lastind), nextind(itr.c, lastind)
end

which gives:

julia> collect(Iterators.partition("foobars", 2))
4-element Vector{SubString{String}}:
 "fo"
 "ob"
 "ar"
 "s"

stevengj avatar Jun 21 '22 19:06 stevengj

This is a user-visible change, but I don't think it would qualify as breaking since the only documented requirement of partition is that it returns iterable collections of elements, I think?

stevengj avatar Jun 21 '22 19:06 stevengj

Thanks for tackling this so quickly ! I just benchmarked it on my machine. This gives

julia> @benchmark collect(Iterators.partition($str,8))
BenchmarkTools.Trial: 10000 samples with 203 evaluations.
 Range (min … max):  374.877 ns …  2.802 μs  ┊ GC (min … max): 0.00% … 81.70%
 Time  (median):     433.005 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   446.508 ns ± 80.783 ns  ┊ GC (mean ± σ):  0.56% ±  2.88%

     ▁      ▄▇▇▇█▇▅▃▄▄▃▃▃▃▃  ▁ ▁▁  ▁▂▁▂▂▁▁▁                    ▂
  █▆▆███▆▇▄▇██████████████████▇██████████████▇▆▆▅▅▆▄▃▅▄▂▃▂▄▃▄▄ █
  375 ns        Histogram: log(frequency) by time       613 ns <

 Memory estimate: 304 bytes, allocs estimate: 1.

Median times are 2.3 times the ones obtained with the comprehension, but it's much better than with the join or String solutions ! With respect to the breaking characteristic, I believe this could brake only if methods rely on the exact type of the output, rather than the indexing itself which seems identical with SubString and Vector{Char}.

BambOoxX avatar Jun 22 '22 07:06 BambOoxX

Just needs someone to add a test and create a pull request.

stevengj avatar Jun 24 '22 16:06 stevengj

split8_6(str::String) = [@view str[i.+(1:8)] for i in 0:8:length(str)-1]

Vladisvob avatar Jul 26 '22 00:07 Vladisvob

u guys are crazy lol

SonyK16 avatar Jul 26 '22 10:07 SonyK16

I have opened a PR for the above implementation. There is one odd case partition(enumerate(string),1), which I'm not sure about but I think the new behaviour makes sense (more in the PR).

At the moment it fails testall but I believe that's unrelated to the PR (it fails on master too).

Any feedback would be greatly appreciated.

svilupp avatar Jul 31 '22 20:07 svilupp