s5cmd
command: use external sort for comparison in sync
It uses an external sort instead of an in-memory sort to dramatically reduce memory usage, at the expense of some speed. It uses the encoding/gob format to store records to disk and restore them from there.
Fixes #441 Fixes #447
I ran a few tests with sync (when it was using encoding/xml; it now uses encoding/json). I created 1,000,000 files, each 16 bytes in size (using truncate & split). Then I ran the sync command with the old version and the current (extsort) version, copying these 1,000,000 files to an empty S3 bucket (prefix).
- Using the old version, the sync operation took about 500 seconds and used about 10 GB RAM.
- Using external sort with a chunk size[^chunk] of 50,000, the sync operation took about 600 seconds and used up to 1 GB RAM.
- Using external sort with a chunk size of 100,000, the sync operation took about 600 seconds and used up to 2 GB RAM.

External sort's channel buffer size was 1,000 by default; increasing it to 5,000 did not make an observable change in speed.
I'm planning to set the default chunk size to 100,000, which uses up to 2 GB RAM. Together with the files' contents (when real files of significant size are used, not merely 16 bytes), I guess 8 GB RAM would be sufficient in general. I also plan to add flags to specify the chunk size and the channel buffer size.
[^chunk]: "amount of records to store in each chunk [subset] which will be written to disk". see also
Updates
It was using encoding/xml, but now uses the encoding/json format to write to disk. It now allows specifying the chunk size. If a nonpositive chunk size is provided, it sorts objects in memory.
1,000,000 objects, each 16 bytes
For syncing 1,000,000 local objects to S3:
- Internal sort uses up to 13 GB RAM.
- External sort with chunk size 100,000 uses up to 1.5 GB RAM.
When all those 1,000,000 objects are supposed to be uploaded:
- External sort takes about 600±20 seconds (current default)
- lanrat/extsort's internal sort takes about 575±20 seconds
- Ordinary internal sort[^int] takes about 575±20 seconds (current internal sort algorithm)
- Previous algorithm takes 490±20 seconds
When none of the objects are supposed to be uploaded:
- External sort takes about 90±5 seconds (current default)
- lanrat/extsort's internal sort takes about 85±5 seconds
- my custom internal sort takes about 80±5 seconds (current internal sort algorithm)
- Previous algorithm takes 75±5 seconds
Extra notes
- changing the size of the channel buffers doesn't have a measurable effect,
- changing extsort's NumWorkers has no measurable effect (2 vs 16)
9.1 GB semi-natural directory
A directory of 9.1 GB containing 52,250 objects. The directory content consists of some repositories cloned from GitHub, an .iso, and a script file.
Hyperfine single-run results (no reproducible difference):

'Old internal sort' ran
  1.01 times faster than 'New internal sort'
  1.02 times faster than 'External Sort'
```
hyperfine --show-output -w 0 -r 1 \
  -n 'Old internal sort' './s5cmdOLD --stat --log error sync ~/repos/ s3://s5cmd-benchmark///dest0/' \
  -n 'External Sort' './s5cmdNEW --stat --log error sync ./repos s3://s5cmd-benchmark///dest1/' \
  -n 'New internal sort' './s5cmdNEW --stat --log error sync --chunk-size "-1" ./repos s3://s5cmd-benchmark///dest2/'
```
Benchmark 1: Old internal sort
Operation Total Error Success
cp 52250 0 52250
sync 1 0 1
Time (abs ≡): 42.306 s [User: 178.126 s, System: 41.943 s]
Benchmark 2: External Sort
ERROR "sync ./repos s3://s5cmd-benchmark///dest1/": no object found
Operation Total Error Success
cp 52250 0 52250
sync 1 0 1
Time (abs ≡): 43.070 s [User: 195.419 s, System: 44.529 s]
Benchmark 3: New internal sort
ERROR "sync --chunk-size=-1 ./repos s3://s5cmd-benchmark///dest2/": no object found
Operation Total Error Success
cp 52250 0 52250
sync 1 0 1
Time (abs ≡): 42.603 s [User: 182.621 s, System: 42.220 s]
Summary
'Old internal sort' ran
1.01 times faster than 'New internal sort'
1.02 times faster than 'External Sort'
'External Sort' ran
  1.07 times faster than 'Old internal sort'
  1.10 times faster than 'New internal sort'
```
hyperfine --show-output -w 0 -r 1 \
  -n 'Old internal sort' './s5cmdOLD --stat --log error sync ~/repos/ s3://s5cmd-benchmark///dest0f/' \
  -n 'External Sort' './s5cmdNEW --stat --log error sync ./repos s3://s5cmd-benchmark///dest1f/' \
  -n 'New internal sort' './s5cmdNEW --stat --log error sync --chunk-size "-1" ./repos s3://s5cmd-benchmark///dest2f/'
```
Benchmark 1: Old internal sort
Operation Total Error Success
cp 52250 0 52250
sync 1 0 1
Time (abs ≡): 43.098 s [User: 179.197 s, System: 40.396 s]
Benchmark 2: External Sort
ERROR "sync ./repos s3://s5cmd-benchmark///dest1f/": no object found
Operation Total Error Success
cp 52250 0 52250
sync 1 0 1
Time (abs ≡): 40.405 s [User: 196.294 s, System: 44.523 s]
Benchmark 3: New internal sort
ERROR "sync --chunk-size=-1 ./repos s3://s5cmd-benchmark///dest2f/": no object found
Operation Total Error Success
cp 52250 0 52250
sync 1 0 1
Time (abs ≡): 44.270 s [User: 176.710 s, System: 40.735 s]
Summary
'External Sort' ran
1.07 times faster than 'Old internal sort'
1.10 times faster than 'New internal sort'
[^int]: I didn't know how to say "just using Go's sort library as it was done earlier, except reading from channels and sending to a channel".
I guess using StableSort would make a difference only if filepath.ToSlash were called before the sort operation, not after.
Furthermore, I think using filepath.ToSlash is problematic, because it potentially breaks the assumption that both of the lists/channels are sorted. For example, on a Windows device we may have the order "ab", "a\b", but in compareObjects we use them after applying filepath.ToSlash, which converts them to "ab", "a/b"; in ASCII "/" comes before "b", so practically the input is not sorted as we expect and assume.
So I propose these changes as fix: https://github.com/Kucukaslan/s5cmd/commit/40a22c91df08cea3f150455f6c2d060480ba745c
Warning: I'm wrong. filepath.ToSlash is already called in url.New, so the path separators are converted to slashes well in advance. My so-called fix would be redundant, so I won't include it in this PR. For the record, I'm not sure whether this behaviour of url.New is good practice or not. https://github.com/peak/s5cmd/blob/2a6c7cc858cc350965bd0bc51abf81498118ef02/storage/url/url.go#L78-L81
I've implemented another variant that uses gob encoding instead of the json encoding in this branch.
I've also written a test code to compare their respective encoding/decoding[^bytes] speeds. The gob encoding uses much less space than the json encoding (88 vs ~150 bytes for the test input).
In my first tests, the json encoding was much faster than the gob encoding: for 1 million objects, the json encoding took about 8.5 seconds, while the gob encoding took about 30 seconds. After perusing their implementations[^almost], I identified the main reason for the difference: gob's default time.Time encoding seems to be slower than storing the time as an RFC3339Nano-formatted string. With that change, gob takes about 10 seconds for the same test values.
I currently do not know how the encoded slice sizes will affect the external sorting.
Update: tested syncing 500,000 local files to an S3 bucket that has only one file.

Benchmark 1: s5cmdgob --dry-run sync /tmp/s5cmd/small500K/ s3://...
Time (abs ≡): 1266.216 s [User: 803.825 s, System: 1065.423 s]

Benchmark 2: s5cmdjson --dry-run sync /tmp/s5cmd/small500K/ s3://...
Time (abs ≡): 1299.062 s [User: 956.893 s, System: 1111.326 s]

Summary: 's5cmdgob' ran 1.03 times faster than 's5cmdjson'
test code:

```go
// Lives in the storage package, so Object, ObjectType, and FromBytes are unqualified.
import (
	"fmt"
	"testing"
	"time"

	"github.com/peak/s5cmd/storage/url"
)

func TestFromToBytes(t *testing.T) {
	key := "s3://bucket/key/w*ldcard/\net"
	size := int64(53861)
	modTime := time.Now()
	ty := ObjectType{1}
	count := 1000000

	u, err := url.New(key, url.WithRaw(true))
	if err != nil {
		t.Fail()
	}
	o1 := Object{
		URL:     u,
		Size:    size,
		ModTime: &modTime,
		Type:    ty,
	}

	b := o1.ToBytes()
	fmt.Println(len(b), b)

	start := time.Now()
	for i := 0; i < count; i++ {
		_ = FromBytes(o1.ToBytes()).(Object)
	}
	elapsed := time.Since(start)
	fmt.Printf("Processing %d objects took %s", count, elapsed)
	t.Fail() // deliberate: forces `go test` to print the timing output
}
```
benchmark
cpu: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
| case | iterations (b.N) | time per op | allocated bytes per op | allocations per op |
|---|---|---|---|---|
| json encoding | 125584 | 8531 ns/op | 3996 B/op | 77 allocs/op |
| improved gob encoding | 107913 | 10309 ns/op | 5856 B/op | 113 allocs/op |
| raw gob encoding | 26679 | 41970 ns/op | 11462 B/op | 263 allocs/op |
```go
import (
	"testing"
	"time"

	"github.com/peak/s5cmd/storage/url"
)

func BenchmarkFromToBytes(b *testing.B) {
	key := "s3://bucket/key/w*ldcard/\net"
	size := int64(53861)
	modTime := time.Now()
	ty := ObjectType{1}

	u, err := url.New(key, url.WithRaw(true))
	if err != nil {
		b.Fail()
	}
	o1 := Object{
		URL:     u,
		Size:    size,
		ModTime: &modTime,
		Type:    ty,
	}

	b.ResetTimer() // exclude setup from the measured time
	for i := 0; i < b.N; i++ {
		_ = FromBytes(o1.ToBytes()).(Object)
	}
}
```
[^bytes]: More precisely, the (Object).ToBytes and storage.FromBytes methods.
[^almost]: almost by luck
Thanks a lot for trying gob for encoding/decoding. Let's go with gob, because:
- ToBytes and FromBytes functions are more readable and type-safe.
- The gob implementation uses less space although it has almost the same speed as json.
My pleasure. I've rebased the gob encoding branch to this PR.
P.S. Time is converted to an RFC3339Nano-formatted string before gob encoding is called, and parsed back from that string after gob decoding.
hello. any chance this fix will be merged and released? it's very useful for us
I'd also be interested in seeing this merged!
Sorry for the long wait @Kucukaslan, and thank you for your hard work on this 😄
It was my pleasure, thanks 😊
I love you for this change