SortingLab.jl
SortingLab.jl copied to clipboard
radixsort much slower than buit-in sort (regarding string sorting)
lines = open("f.txt") do file
readlines(file) #has approx 4.3m lines
end
@time sort(lines)
@time radixsort(lines)
delivers the following output:
2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time)
18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time)
Interesting can you let me know the length of each line? If very long then it will be slow I think.
On 6 Mar. 2018 10:40 am, "Aaron" [email protected] wrote:
lines = open("f.txt") do file readlines(file) #has approx 4.3m linesend@time sort(lines)@time radixsort(lines)
delivers the following output:
2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time) 18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time)
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/xiaodaigh/SortingLab.jl/issues/2, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJXwa63RlcjBqZzrXLXd0LBiwRQjJks5tbc0DgaJpZM4Sd27u .
Please also run it twice.
On 6 Mar. 2018 10:40 am, "Aaron" [email protected] wrote:
lines = open("f.txt") do file readlines(file) #has approx 4.3m linesend@time sort(lines)@time radixsort(lines)
delivers the following output:
2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time) 18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time)
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/xiaodaigh/SortingLab.jl/issues/2, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJXwa63RlcjBqZzrXLXd0LBiwRQjJks5tbc0DgaJpZM4Sd27u .
longest entry is 212 chars long but only ~8k out of 4.3m entries are longer than 50 chars...
Maybe I should just add in some comments and give a warning radixsort in its current implemebtation will be slow for strings longer than 16bytes. Apologies, no way around it at the moment. Should just use quicksort which should be the faster than r and python. I assume your strings are mostly unique? That is also not gonna be that fast with radixsort
On 6 Mar. 2018 10:50 am, "Aaron" [email protected] wrote:
longest entry is 212 chars long but only ~8k out of 4.3m entries are longer than 50 chars...
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaodaigh/SortingLab.jl/issues/2#issuecomment-370609388, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJaJRFv8K9T9w5vgRUf3aZ-Cpu8q1ks5tbc8-gaJpZM4Sd27u .
unique approx after the first 10 characters. Is there an easy way to sort only after the first 10 characters? alias you might want to extend the radixsort function by something similar to the "by" keyword :D
I need to think. You can use the by parameter and use quicksort
On 6 Mar. 2018 10:56 am, "Aaron" [email protected] wrote:
unique approx after the first 10 characters. Is there an easy way to sort only after the first 10 characters?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaodaigh/SortingLab.jl/issues/2#issuecomment-370610643, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJarnJy85mbfrOTEpAMQ-BO13yIk3ks5tbdC5gaJpZM4Sd27u .
for the Base.sort it results in that pretty bad performance while i guess your radix sort would outperform this easily...
@time sort(lines, by=x->x[1:min(end,10)])
11.629258 seconds (188.29 M allocations: 5.660 GiB, 12.06% gc time)
@time sort(lines)
2.552390 seconds (9 allocations: 49.539 MiB, 0.22% gc time)
You sorted upto first 10 characters. So you only want to sort by first 10 characters? Or after? If after radixsort will also be slow. It's slow at sorting long strings.
Unfortunately there is no by functionality in radixsort yet. I will see if I can whip something later.
Also 2 second ain't bad. Unless you need to sort by thing that are 100 times larger
On 6 Mar. 2018 11:07 am, "Aaron" [email protected] wrote:
for the Base.sort it results in that pretty bad performance while i guess your radix sort would outperform this easily...
@time sort(lines, by=x->x[1:min(end,10)]) 11.629258 seconds (188.29 M allocations: 5.660 GiB, 12.06% gc time) @time sort(lines) 2.552390 seconds (9 allocations: 49.539 MiB, 0.22% gc time)
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaodaigh/SortingLab.jl/issues/2#issuecomment-370612935, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJbrhd_H_702aYnHGBzh1zqmwtdxxks5tbdMqgaJpZM4Sd27u .
well i got more than 100 of these files -.- you can guess that my RAM isn't big enough -.- what i was thinking about is sorting all of these files only after the first 2 chars and then saving each beginning into its own file again. and after that sorting the smaller files by quicksort. thus it'd be cool if there was a way to append each "bucket" to its own file. :D
honestly, i am a bit confused by the description. for each you file you want sort by the first 2 characters or sort by every character after the first 2?
Then you want to break it up into smaller files? IO is a huge bottleneck, so I don't see why that might be faster.
as i was saying before, i still only want to sort by the first 2 caracters, not those after the first 2! bcs that way i want to set up an external sort: sort every file (only for 2 chars instead of whole length should be faster) regroup parts of the files (some sort of multiway merge) finally sort the contents of the resulting files while maintaining a good file structure based on the first 2 chars
In that case you can just use SortingAlgorithms.jl and not use SortingLab.jl. I can get a 2x speed up.
a = randstring()
rand(2:128)
nn = abs.(randn(4_300_000))
x = randstring.(Int.(round.(nn .* 106)).+2)
@time sort(x, by = x-> x |> pointer |> Ptr{UInt16} |> unsafe_load |> hton); # 2
@time sort(x); # 4
using SortingAlgorithms
@time sort(x, by = x-> x |> pointer |> Ptr{UInt16} |> unsafe_load |> hton, alg = RadixSort); # 1.1