SortingLab.jl icon indicating copy to clipboard operation
SortingLab.jl copied to clipboard

radixsort much slower than buit-in sort (regarding string sorting)

Open rapus95 opened this issue 7 years ago • 12 comments

lines = open("f.txt") do file
    readlines(file) #has approx 4.3m lines
end
@time sort(lines)
@time radixsort(lines)

delivers the following output:

2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time)
18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time)

rapus95 avatar Mar 05 '18 23:03 rapus95

Interesting can you let me know the length of each line? If very long then it will be slow I think.

On 6 Mar. 2018 10:40 am, "Aaron" [email protected] wrote:

lines = open("f.txt") do file readlines(file) #has approx 4.3m linesend@time sort(lines)@time radixsort(lines)

delivers the following output:

2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time) 18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/xiaodaigh/SortingLab.jl/issues/2, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJXwa63RlcjBqZzrXLXd0LBiwRQjJks5tbc0DgaJpZM4Sd27u .

xiaodaigh avatar Mar 05 '18 23:03 xiaodaigh

Please also run it twice.

On 6 Mar. 2018 10:40 am, "Aaron" [email protected] wrote:

lines = open("f.txt") do file readlines(file) #has approx 4.3m linesend@time sort(lines)@time radixsort(lines)

delivers the following output:

2.585242 seconds (9 allocations: 49.539 MiB, 0.42% gc time) 18.793147 seconds (14.83 k allocations: 2.916 GiB, 36.62% gc time)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/xiaodaigh/SortingLab.jl/issues/2, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJXwa63RlcjBqZzrXLXd0LBiwRQjJks5tbc0DgaJpZM4Sd27u .

xiaodaigh avatar Mar 05 '18 23:03 xiaodaigh

longest entry is 212 chars long but only ~8k out of 4.3m entries are longer than 50 chars...

rapus95 avatar Mar 05 '18 23:03 rapus95

Maybe I should just add in some comments and give a warning radixsort in its current implemebtation will be slow for strings longer than 16bytes. Apologies, no way around it at the moment. Should just use quicksort which should be the faster than r and python. I assume your strings are mostly unique? That is also not gonna be that fast with radixsort

On 6 Mar. 2018 10:50 am, "Aaron" [email protected] wrote:

longest entry is 212 chars long but only ~8k out of 4.3m entries are longer than 50 chars...

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaodaigh/SortingLab.jl/issues/2#issuecomment-370609388, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJaJRFv8K9T9w5vgRUf3aZ-Cpu8q1ks5tbc8-gaJpZM4Sd27u .

xiaodaigh avatar Mar 05 '18 23:03 xiaodaigh

unique approx after the first 10 characters. Is there an easy way to sort only after the first 10 characters? alias you might want to extend the radixsort function by something similar to the "by" keyword :D

rapus95 avatar Mar 05 '18 23:03 rapus95

I need to think. You can use the by parameter and use quicksort

On 6 Mar. 2018 10:56 am, "Aaron" [email protected] wrote:

unique approx after the first 10 characters. Is there an easy way to sort only after the first 10 characters?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaodaigh/SortingLab.jl/issues/2#issuecomment-370610643, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJarnJy85mbfrOTEpAMQ-BO13yIk3ks5tbdC5gaJpZM4Sd27u .

xiaodaigh avatar Mar 06 '18 00:03 xiaodaigh

for the Base.sort it results in that pretty bad performance while i guess your radix sort would outperform this easily...

@time sort(lines, by=x->x[1:min(end,10)])
11.629258 seconds (188.29 M allocations: 5.660 GiB, 12.06% gc time)
@time sort(lines)
 2.552390 seconds (9 allocations: 49.539 MiB, 0.22% gc time)

rapus95 avatar Mar 06 '18 00:03 rapus95

You sorted upto first 10 characters. So you only want to sort by first 10 characters? Or after? If after radixsort will also be slow. It's slow at sorting long strings.

Unfortunately there is no by functionality in radixsort yet. I will see if I can whip something later.

Also 2 second ain't bad. Unless you need to sort by thing that are 100 times larger

On 6 Mar. 2018 11:07 am, "Aaron" [email protected] wrote:

for the Base.sort it results in that pretty bad performance while i guess your radix sort would outperform this easily...

@time sort(lines, by=x->x[1:min(end,10)]) 11.629258 seconds (188.29 M allocations: 5.660 GiB, 12.06% gc time) @time sort(lines) 2.552390 seconds (9 allocations: 49.539 MiB, 0.22% gc time)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/xiaodaigh/SortingLab.jl/issues/2#issuecomment-370612935, or mute the thread https://github.com/notifications/unsubscribe-auth/AESfJbrhd_H_702aYnHGBzh1zqmwtdxxks5tbdMqgaJpZM4Sd27u .

xiaodaigh avatar Mar 06 '18 00:03 xiaodaigh

well i got more than 100 of these files -.- you can guess that my RAM isn't big enough -.- what i was thinking about is sorting all of these files only after the first 2 chars and then saving each beginning into its own file again. and after that sorting the smaller files by quicksort. thus it'd be cool if there was a way to append each "bucket" to its own file. :D

rapus95 avatar Mar 06 '18 00:03 rapus95

honestly, i am a bit confused by the description. for each you file you want sort by the first 2 characters or sort by every character after the first 2?

Then you want to break it up into smaller files? IO is a huge bottleneck, so I don't see why that might be faster.

xiaodaigh avatar Mar 06 '18 05:03 xiaodaigh

as i was saying before, i still only want to sort by the first 2 caracters, not those after the first 2! bcs that way i want to set up an external sort: sort every file (only for 2 chars instead of whole length should be faster) regroup parts of the files (some sort of multiway merge) finally sort the contents of the resulting files while maintaining a good file structure based on the first 2 chars

rapus95 avatar Mar 06 '18 12:03 rapus95

In that case you can just use SortingAlgorithms.jl and not use SortingLab.jl. I can get a 2x speed up.

a = randstring()
rand(2:128)
nn = abs.(randn(4_300_000))
x = randstring.(Int.(round.(nn .* 106)).+2)
@time sort(x, by = x-> x |> pointer |> Ptr{UInt16} |> unsafe_load |> hton); # 2
@time sort(x); # 4
using SortingAlgorithms
@time sort(x, by = x-> x |> pointer |> Ptr{UInt16} |> unsafe_load |> hton, alg = RadixSort); # 1.1

xiaodaigh avatar Mar 06 '18 12:03 xiaodaigh