missingh
split is inefficient; can be sped up over 2x for single character delimiter
Data.List.Utils.split, if you profile it, does a lot of allocation.
With a 600 MB "data" file containing short words, I profiled this program:
import Data.List.Utils

main = do
    s <- readFile "data"
    let l = split " " s
    print (length l)
    Tue Jan 31 19:49 2017 Time and Allocation Profiling Report (Final)

       foo +RTS -p -RTS

    total time  =      117.11 secs   (117106 ticks @ 1000 us, 1 processor)
    total alloc = 139,419,374,656 bytes  (excludes profiling overheads)

COST CENTRE  MODULE           %time  %alloc

spanList     Data.List.Utils   46.2    59.3
startswith   Data.List.Utils   28.4    14.1
MAIN         MAIN              14.7    17.8
split        Data.List.Utils    8.3     6.4
breakList    Data.List.Utils    2.5     2.4
Compare with this program, using a splitc specialized to a single-character delimiter:

main = do
    s <- readFile "data"
    let l = splitc ' ' s
    print (length l)

splitc :: Char -> String -> [String]
splitc _ [] = []
splitc c s = case break (== c) s of
    (i, _:rest) -> i : splitc c rest
    (i, [])     -> [i]
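For reference, here is the edge-case behavior of this splitc (a self-contained sketch). Note that a trailing delimiter produces no trailing empty chunk; if I recall Data.List.Utils.split's semantics correctly, the library version does keep the trailing empty field, so a drop-in replacement would need to account for that:

```haskell
-- splitc as defined in the comment above, exercised on edge cases.
splitc :: Char -> String -> [String]
splitc _ [] = []
splitc c s = case break (== c) s of
    (i, _:rest) -> i : splitc c rest
    (i, [])     -> [i]

main :: IO ()
main = do
    print (splitc ' ' "a b c")  -- ["a","b","c"]
    print (splitc ' ' "a b ")   -- ["a","b"]  (trailing delimiter: no empty chunk)
    print (splitc ' ' " a")     -- ["","a"]
    print (splitc ' ' "")       -- []
```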
    Tue Jan 31 19:54 2017 Time and Allocation Profiling Report (Final)

       foo +RTS -p -RTS

    total time  =      54.44 secs   (54435 ticks @ 1000 us, 1 processor)
    total alloc = 99,505,766,960 bytes  (excludes profiling overheads)
So over twice as fast!
A simple improvement, then, would be to make split use splitc when the delimiter is a single character. But it might be better to investigate why the helper functions are doing so much allocation.
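A sketch of how that dispatch could look. splitGeneral here is a naive isPrefixOf-based stand-in for the existing general-case implementation, not MissingH's actual code, and splitc is adjusted to keep a trailing empty chunk so both paths agree on semantics:

```haskell
import Data.List (isPrefixOf)

-- Sketch of the proposed dispatch: single-element delimiters take the
-- fast break-based path; everything else falls through to the general case.
split :: Eq a => [a] -> [a] -> [[a]]
split _ []       = []
split [d] str    = splitc d str
split delim str  = splitGeneral delim str

-- Fast path: one break per chunk. A trailing delimiter yields a trailing
-- empty chunk, matching the general path.
splitc :: Eq a => a -> [a] -> [[a]]
splitc c s = case break (== c) s of
    (i, _:rest) -> i : splitc c rest
    (i, [])     -> [i]

-- Naive general case, a stand-in for the sketch only.
splitGeneral :: Eq a => [a] -> [a] -> [[a]]
splitGeneral delim = go
  where
    go s = case breakOn s of
        (chunk, Nothing)   -> [chunk]
        (chunk, Just rest) -> chunk : go rest
    breakOn [] = ([], Nothing)
    breakOn s@(x:xs)
        | delim `isPrefixOf` s = ([], Just (drop (length delim) s))
        | otherwise            = let (c, r) = breakOn xs in (x : c, r)

main :: IO ()
main = do
    print (split " " "a b c")     -- ["a","b","c"]
    print (split ", " "1, 2, 3")  -- ["1","2","3"]
    print (split "," "a,b,")      -- ["a","b",""]
```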
(Actually it's more like 3x as fast, since the readFile of 500 MB is responsible for 20 seconds of the runtime.)
Btw, splitOn from http://hackage.haskell.org/package/split is similarly slow to MissingH's split.
Wow, good catch. It would be easy enough to add an optimized splitc for a single-character delimiter, and have split use it automatically. Does that splitOn perform better than this split?
split's splitOn does not perform identically, but it's in a very close ballpark to the current MissingH split.
From way back in the MissingH history, dating to 2004, comes another definition of split. I have not verified its correctness, but I wonder how it performs in comparison?
split :: Eq a => [a] -> [a] -> [[a]]
split delim str =
    let splitworker :: Eq a => [a] -> [a] -> [a] -> [[a]]
        splitworker delim [] [] = []
        splitworker delim [] accum = [accum]
        splitworker delim str accum =
            if delim == str
               then accum : [] : []
               else if startswith delim str
                       then accum : splitworker delim (drop (length delim) str) []
                       else splitworker delim (tail str) (accum ++ [head str])
    in splitworker delim str []
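Since its correctness hasn't been verified, a quick spot check might look like this (a self-contained sketch; startswith is MissingH's alias for Data.List.isPrefixOf, which is used directly here, and the definition is restated with guards):

```haskell
import Data.List (isPrefixOf)

-- The 2004 definition from above, restated self-contained for checking.
split2004 :: Eq a => [a] -> [a] -> [[a]]
split2004 delim str = worker str []
  where
    worker [] []    = []
    worker [] accum = [accum]
    worker s accum
        | delim == s           = accum : [[]]
        | delim `isPrefixOf` s = accum : worker (drop (length delim) s) []
        | otherwise            = worker (tail s) (accum ++ [head s])

main :: IO ()
main = do
    print (split2004 "," "a,b,c")  -- ["a","b","c"]
    print (split2004 "," "a,")     -- ["a",""]  (trailing empty field kept)
    print (split2004 ", " "1, 2")  -- ["1","2"]
    print (split2004 "," "")       -- []
```

On these cases it matches what I'd expect from the current MissingH split, though a broader property-based comparison would be more convincing.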
On 02/14/2017 09:41 PM, Joey Hess wrote:
split's splitOn does not perform identically, but it's in a very close ballpark to the current MissingH split.
This time I used this data file:
perl -e 'for (1..50000000) { print $_." "}' > data
The current MissingH split profile with that:
    Wed Feb 15 12:14 2017 Time and Allocation Profiling Report (Final)

       foo +RTS -p -RTS

    total time  =      80.99 secs   (80991 ticks @ 1000 us, 1 processor)
    total alloc = 99,733,231,232 bytes  (excludes profiling overheads)

COST CENTRE  MODULE           %time  %alloc

spanList     Data.List.Utils   47.7    59.3
startswith   Data.List.Utils   27.0    14.1
MAIN         MAIN              14.7    17.8
split        Data.List.Utils    8.0     6.4
breakList    Data.List.Utils    2.6     2.4
Compare with the 2004 version:
    Wed Feb 15 12:17 2017 Time and Allocation Profiling Report (Final)

       foo +RTS -p -RTS

    total time  =      52.74 secs   (52739 ticks @ 1000 us, 1 processor)
    total alloc = 47,066,563,520 bytes  (excludes profiling overheads)

COST CENTRE  MODULE           %time  %alloc

MAIN         MAIN              93.1    70.2
startswith   Data.List.Utils    6.9    29.8
Compare with splitc:
    Wed Feb 15 12:19 2017 Time and Allocation Profiling Report (Final)

       foo +RTS -p -RTS

    total time  =      38.25 secs   (38251 ticks @ 1000 us, 1 processor)
    total alloc = 71,155,452,896 bytes  (excludes profiling overheads)

COST CENTRE  MODULE           %time  %alloc

MAIN         MAIN             100.0   100.0
splitc still wins by a reasonable margin; interestingly, it does more allocation. The 2004 split is a nice improvement, assuming it's correct.
-- see shy jo
Hello. I think the 2004 split is much cuter, and it could be further improved to something like this, for example:
split :: Eq a => [a] -> [a] -> [[a]]
split _ [] = []
split [x] str = splitc x str
split delim str =
    let splitworker :: Eq a => Int -> [a] -> [a] -> [a] -> [[a]]
        splitworker _ delim [] [] = []
        splitworker _ delim [] accum = [reverse accum]
        splitworker l delim str accum =
            let (a, t) = prefixLen 0 delim str
            in if a == l
                  -- a full delimiter was matched; emit the accumulated chunk
                  -- (reversed, since it was built by consing)
                  then reverse accum :
                       (if null t then [[]] else splitworker l delim t [])
                  else splitworker l delim (tail str) (head str : accum)
        prefixLen :: Eq a => Int -> [a] -> [a] -> (Int, [a])
        prefixLen acc (x:xs) l@(y:ys) =
            if x == y then prefixLen (acc+1) xs ys else (acc, l)
        prefixLen acc _ ys = (acc, ys)
    in splitworker (length delim) delim str []
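To sanity-check an optimized worker like this, a small agreement harness against the plain 2004 definition might look like the following (a sketch; both versions are restated self-contained, with the accumulator reversed wherever it is emitted, and the single-character splitc fast path elided so the multi-character worker itself is exercised):

```haskell
import Data.List (isPrefixOf)

-- The plain 2004 definition (startswith == isPrefixOf).
split2004 :: Eq a => [a] -> [a] -> [[a]]
split2004 delim str = worker str []
  where
    worker [] []    = []
    worker [] accum = [accum]
    worker s accum
        | delim == s           = accum : [[]]
        | delim `isPrefixOf` s = accum : worker (drop (length delim) s) []
        | otherwise            = worker (tail s) (accum ++ [head s])

-- The optimized version: chunks are built by consing and reversed when emitted.
splitOpt :: Eq a => [a] -> [a] -> [[a]]
splitOpt _ [] = []
splitOpt delim str = worker str []
  where
    l = length delim
    worker [] []    = []
    worker [] accum = [reverse accum]
    worker s accum =
        let (a, t) = prefixLen 0 delim s
        in if a == l
              then reverse accum : (if null t then [[]] else worker t [])
              else worker (tail s) (head s : accum)
    prefixLen acc (x:xs) r@(y:ys)
        | x == y    = prefixLen (acc + 1) xs ys
        | otherwise = (acc, r)
    prefixLen acc _ ys = (acc, ys)

main :: IO ()
main = do
    let inputs = ["", "abc", "xabcy", "abcabc", "x", "xyabc", "abcx"]
    -- both versions should agree on every input
    print (all (\s -> split2004 "abc" s == splitOpt "abc" s) inputs)  -- True
```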
(testing for equality and isPrefixOf have much in common, and accumulating with (++) is bad, I believe)
Actually, I also tried specializing split to a 2-character delimiter and didn't see significant improvements (unlike with splitc).
On my system the 2004 version runs in ~41 seconds, the version in this post in ~29 seconds, and splitc in ~23 seconds, for the data generated above and a single-character delimiter.
Interesting. I wonder about multi-character delimiters? (That is one of the reasons for this split.) If it's no worse than the other implementations for those, then this is a slam dunk.
I generated the test file like this (the version with 1000000 eats 3 GB of memory when used with a non-existent delimiter, be careful):
perl -e 'for (1..1000000) { print $_."abcdefghij"}' >many10CharDelims
I'd appreciate other tests in which the default version could be the fastest.
And I tested all 3 versions (default, 2004, 2004Opt): https://github.com/fyrchik/randomthings (pushed all the stuff there, with the test script and results).
For the delimiters 123 and abcdefghij, the default version is the slowest and 2004-with-optimizations is the fastest. I also tried a non-existent delimiter; there the 2004 version was the slowest, but it still consumed less memory than the default version. 2004-with-optimizations outperformed both other versions in tests with all 3 delimiters, in terms of both time and memory (though it sometimes loses to the non-optimized 2004).
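For anyone wanting to reproduce these comparisons without perl and a profiling build, a minimal timing harness along these lines works with only base (a sketch; the function under test and the sizes are illustrative, and criterion would give steadier numbers):

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)
import Text.Printf (printf)

-- splitc from earlier in the thread, as the function under test here.
splitc :: Char -> String -> [String]
splitc _ [] = []
splitc c s = case break (== c) s of
    (i, _:rest) -> i : splitc c rest
    (i, [])     -> [i]

-- Time one run by forcing the length of the result, as the test
-- programs in this thread do.
timed :: String -> Int -> IO ()
timed label n = do
    let input = concatMap (\i -> show i ++ " ") [1 .. n]  -- like the perl generator
    start <- getCPUTime
    len   <- evaluate (length (splitc ' ' input))
    end   <- getCPUTime
    printf "%s: %d chunks in %.3f s\n" label len
           (fromIntegral (end - start) / 1e12 :: Double)  -- picoseconds -> seconds

main :: IO ()
main = timed "splitc" 100000
```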
It would also be interesting to see whether spanList could be made more efficient.