`stimes 0` is undefined on lazy text
Consider this example:
```
λ import Data.Text.Lazy
λ stimes @Data.Text.Lazy.Text 0 (Data.Text.Lazy.pack "x")
"*** Exception: stimes: positive multiplier expected
```
Compare with the behaviour of `stimes` on `String`:

```
λ stimes @String 0 "x"
""
```
Surely we can return an empty lazy text!
This happens because an implementation for `stimes` is not provided. A similar issue for strict text was addressed in #288.
There is also a blind spot in our test suite: it should reveal differences in behaviour between `Text` and `String`.
I drafted an implementation that resolves this issue, but there are certain fine points that we should address.
- For strict text, `stimes` works only on relatively small numbers, specifically `Int`. Should the same restriction apply for lazy text?
- Should we expect any performance gains or losses? There is no benchmark measuring this yet.
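A minimal sketch of one possible implementation, assuming delegation to `Data.Text.Lazy.replicate` (`stimesLazyText` is a hypothetical standalone name used for illustration; a real fix would override `stimes` in the `Semigroup` instance for lazy `Text`):

```haskell
import qualified Data.Text.Lazy as TL

-- Hypothetical sketch of stimes for lazy Text: error on negative counts,
-- otherwise delegate to Data.Text.Lazy.replicate, which yields empty for 0,
-- matching the String behaviour shown above.
stimesLazyText :: Integral b => b -> TL.Text -> TL.Text
stimesLazyText n t
  | n < 0     = error "stimes: positive multiplier expected"
  | otherwise = TL.replicate (fromIntegral n) t
```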
> For strict text, `stimes` works only on relatively small numbers, specifically `Int`. Should the same restriction apply for lazy text?
There's the pragmatic limitation that `Text.Lazy.replicate` currently only takes an `Int64`.
Beyond that, I'm not even sure what the status of text's handling of lazy text values of length more than 2^64 is (whether in chunks, in code points, or in bytes). For example, `length` overflows and there's no comment about it. So my guess is that nobody has thought about handling insanely long text values yet, much less making use of them. Until we have a clear policy about it, the conservative choice is to error, because it will be easier to replace an error with a defined value than the other way around, if we come to that.
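Under that conservative policy, a sketch could reject multipliers that don't fit in `Int64` instead of silently mishandling them (`stimesChecked` is a hypothetical name; the range check is an assumption, not library behaviour):

```haskell
import Data.Int (Int64)
import qualified Data.Text.Lazy as TL

-- Hypothetical sketch: error out when the multiplier exceeds what the
-- Int64-based Data.Text.Lazy.replicate can represent.
stimesChecked :: Integral b => b -> TL.Text -> TL.Text
stimesChecked n t
  | toInteger n < 0 = error "stimes: positive multiplier expected"
  | toInteger n > toInteger (maxBound :: Int64) =
      error "stimes: multiplier out of range"
  | otherwise = TL.replicate (fromIntegral n) t
```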
> Should we expect any performance gains or losses? There is no benchmark measuring this yet.
I would bet there is a performance gain to be made, but now that I think about how the default implementation of `stimes` works, I'm not so sure.
Alright, so who is making the next step on this issue?
You're a good candidate!
> I would bet there is a performance gain to be made, but now that I think about how the default implementation of `stimes` works, I'm not so sure.
~~There is! Since `(<>)` is linear in (the number of chunks of) the left argument, the default binary-exponentiation-style implementation is $O(n \log n)$. We can implement it in $O(n)$. `Semigroup [a]` implements a linear `stimes` too.~~
> For strict text, `stimes` works only on relatively small numbers, specifically `Int`. Should the same restriction apply for lazy text?
IMO no. I would consider having two implementations for performance reasons: if the value fits in `Int64`, call `replicate`; otherwise call a similar `Integer`-based implementation.
Though the current `replicate` implementation doesn't seem great: it allocates a list and delegates to `concat`.
> Until we have a clear policy about it, the conservative choice is to error, because it will be easier to replace an error with a defined value than the other way around, if we come to that.
My opinion differs (as I mentioned), but note that this option is a breaking change because it does not error today.
I had forgotten about the breaking change with the current version, silly me!
> Since `(<>)` is linear in (the number of chunks of) the left argument, the default binary-exponentiation-style implementation is $O(n \log n)$.
I'm not sure about that. Binary exponentiation makes $\log n$ recursive calls, and each call does one or two `(<>)` on lists of $O(2^i)$ chunks for $0 \le i \le \log n$, so the sum is linear, $O(n)$.
Furthermore, many of the concatenations are of the accumulator with itself. If you do it right, you can avoid allocating useless lists by only doing concatenations that contribute to the accumulator, resulting in a quite competitive algorithm. That's what I was unsure about, but now that I look at it again, the default implementation of `stimes` does not work that way, because it has two accumulators in most cases.
So it is already $O(n)$, but there should be a significant speed-up from not allocating lists that will be immediately garbage-collected.
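For concreteness, the geometric sum behind this claim: the doubling steps touch lists of $O(2^i)$ chunks, and

$$\sum_{i=0}^{\lfloor \log_2 n \rfloor} O\!\left(2^i\right) = O\!\left(2^{\lfloor \log_2 n \rfloor + 1}\right) = O(n).$$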
> Binary exponentiation makes $\log n$ recursive calls, and each call does one or two `(<>)` on lists of $O(2^i)$ chunks for $0 \le i \le \log n$, so the sum is linear, $O(n)$.
You are absolutely right, I wasn't thinking clearly 😅
> So it is already $O(n)$ but there should be a significant speed up from not allocating lists that will be immediately garbage collected.
Agreed, this should be a good improvement nevertheless.
> > For strict text, `stimes` works only on relatively small numbers, specifically `Int`. Should the same restriction apply for lazy text?
>
> IMO no. I would consider having two implementations for performance reasons: if the value fits in `Int64`, call `replicate`; otherwise call a similar `Integer`-based implementation.
An option could be to clamp the number of repetitions at `maxBound :: Int64`, making the second implementation unnecessary. This relies on the fact that one cannot realistically reach the end of the `Text` to prove that the result is incorrect.
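A sketch of the clamping idea (`stimesClamped` is a hypothetical name; saturating at `maxBound :: Int64` is the proposal here, not current library behaviour):

```haskell
import Data.Int (Int64)
import qualified Data.Text.Lazy as TL

-- Hypothetical sketch: saturate the multiplier at maxBound :: Int64, so a
-- single Int64-based replicate suffices even for Integer-sized counts.
stimesClamped :: Integral b => b -> TL.Text -> TL.Text
stimesClamped n t
  | toInteger n < 0 = error "stimes: positive multiplier expected"
  | otherwise       = TL.replicate clamped t
  where
    clamped = fromInteger (min (toInteger n) (toInteger (maxBound :: Int64)))
```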
> Agreed, this should be a good improvement nevertheless.
Missed this before: another benefit here is $O(1)$ access to the head, unlike $O(\log n)$ with the default `stimes`.
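A quick way to see that laziness benefit, assuming a `replicate`-based `stimes` like the sketches above: the head of a large replication is available without forcing the whole result.

```haskell
import qualified Data.Text.Lazy as TL

-- head forces only the first chunk of the lazily replicated text, so this
-- returns promptly even for a large multiplier.
example :: Char
example = TL.head (TL.replicate 1000000 (TL.pack "abc"))
```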