
command: use external sort for comparison in sync

Open kucukaslan opened this issue 2 years ago • 6 comments

It uses an external sort instead of an in-memory sort to dramatically reduce memory usage at the expense of some speed. Sorted chunks are written to disk and restored from there using the encoding/gob format.

Fixes #441 Fixes #447
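
For reference, here is a self-contained sketch of the external merge-sort idea (not the PR's implementation, which builds on the lanrat/extsort library; all names here, such as record, spillChunk and mergeChunks, are made up for illustration): records are gathered into fixed-size chunks, each chunk is sorted in memory and spilled to a temporary file with encoding/gob, and the sorted chunk files are then merge-read back, so only one chunk (plus the merge heads) has to be held in memory at a time.

package main

import (
	"encoding/gob"
	"fmt"
	"io"
	"os"
	"sort"
)

type record struct{ Key string }

// spillChunk sorts one chunk in memory and writes it to a temp file with gob.
func spillChunk(chunk []record) (string, error) {
	sort.Slice(chunk, func(i, j int) bool { return chunk[i].Key < chunk[j].Key })
	f, err := os.CreateTemp("", "extsort-chunk-*")
	if err != nil {
		return "", err
	}
	defer f.Close()
	enc := gob.NewEncoder(f)
	for _, r := range chunk {
		if err := enc.Encode(r); err != nil {
			return "", err
		}
	}
	return f.Name(), nil
}

// mergeChunks streams the sorted chunk files back and emits records in global order.
func mergeChunks(paths []string, emit func(record)) error {
	decs := make([]*gob.Decoder, len(paths))
	heads := make([]*record, len(paths))
	for i, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer os.Remove(p)
		defer f.Close()
		decs[i] = gob.NewDecoder(f)
	}
	// advance reads the next record of chunk i, or marks it exhausted on EOF.
	advance := func(i int) error {
		var r record
		switch err := decs[i].Decode(&r); err {
		case nil:
			heads[i] = &r
		case io.EOF:
			heads[i] = nil
		default:
			return err
		}
		return nil
	}
	for i := range decs {
		if err := advance(i); err != nil {
			return err
		}
	}
	for {
		// pick the smallest head; a real implementation would use a heap
		smallest := -1
		for i, h := range heads {
			if h != nil && (smallest == -1 || h.Key < heads[smallest].Key) {
				smallest = i
			}
		}
		if smallest == -1 {
			return nil // all chunk files are exhausted
		}
		emit(*heads[smallest])
		if err := advance(smallest); err != nil {
			return err
		}
	}
}

func main() {
	input := []record{{"delta"}, {"alpha"}, {"echo"}, {"charlie"}, {"bravo"}}
	chunkSize := 2 // only chunkSize records are sorted in memory at once
	var paths []string
	for start := 0; start < len(input); start += chunkSize {
		end := start + chunkSize
		if end > len(input) {
			end = len(input)
		}
		path, err := spillChunk(append([]record(nil), input[start:end]...))
		if err != nil {
			panic(err)
		}
		paths = append(paths, path)
	}
	if err := mergeChunks(paths, func(r record) { fmt.Println(r.Key) }); err != nil {
		panic(err)
	}
}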

kucukaslan avatar Aug 09 '22 15:08 kucukaslan

I ran a few tests with sync (when it was still using encoding/xml; it now uses encoding/json).

I created 1,000,000 files, each 16 bytes in size (using truncate & split). Then I ran the sync command with the old version and with the current (extsort) version to copy these 1,000,000 files to an empty S3 bucket (prefix).

With the old version, the sync operation took about 500 seconds and used about 10 GB RAM. With the external sort and a chunk size[^chunk] of 50,000, the sync operation took about 600 seconds and used up to 1 GB RAM. With a chunk size of 100,000, it took about 600 seconds and used up to 2 GB RAM. The external sort's channel buffer size was 1,000 by default; increasing it to 5,000 did not produce an observable change in speed.

I'm planning to set the default chunk size to 100,000, which uses up to 2 GB RAM. Together with the files' content (when real files of significant size are used, not merely 16 bytes), I guess 8 GB RAM would be sufficient in general. I also plan to add flags to specify the chunk size and the channel buffer size.

[^chunk]: "amount of records to store in each chunk [subset] which will be written to disk". see also

kucukaslan avatar Aug 10 '22 10:08 kucukaslan

Updates

It was using encoding/xml but now uses the encoding/json format to write to disk. It now allows specifying the chunk size. If a nonpositive chunk size is provided, then it sorts objects in memory.
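
For illustration, a tiny sketch of that fallback rule (sortStrategy is a made-up helper name, not the PR's API; the real decision lives in the sync command's sort setup):

package main

import "fmt"

// sortStrategy illustrates the rule above: a nonpositive chunk size means
// "sort everything in memory", any positive value selects the external,
// on-disk sort with that many records per chunk.
func sortStrategy(chunkSize int) string {
	if chunkSize <= 0 {
		return "in-memory sort"
	}
	return fmt.Sprintf("external sort, %d records per on-disk chunk", chunkSize)
}

func main() {
	fmt.Println(sortStrategy(-1))      // in-memory sort
	fmt.Println(sortStrategy(100_000)) // external sort, 100000 records per on-disk chunk
}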

1,000,000 objects, 16 bytes each

For syncing 1,000,000 local objects to S3:

  • Internal sort uses up to 13 GB RAM.
  • External sort with chunk size 100,000 uses up to 1.5 GB RAM.

When all 1,000,000 objects are to be uploaded:

  • External sort takes about 600±20 seconds (current default)
  • lanrat/extsort's internal sort takes about 575±20 seconds
  • Ordinary internal sort[^int] takes about 575±20 seconds (current internal sort algorithm)
  • Previous algorithm takes 490±20 seconds

When none of the objects are to be uploaded:

  • External sort takes about 90±5 seconds (current default)
  • lanrat/extsort's internal sort takes about 85±5 seconds
  • my custom internal sort takes about 80±5 seconds (current internal sort algorithm)
  • Previous algorithm takes 75±5 seconds

Extra commentaries

  • changing the size of the channel buffers doesn't have a measurable effect,
  • changing extsort's NumWorkers has no measurable effect (2 vs 16)

9.1 GB semi-natural directory

A 9.1 GB directory containing 52,250 objects. The directory content consists of some repositories cloned from GitHub, an .iso, and a script file:

QuantumLibraries Telegram digitalgov.gov fireship.io go openimage.py s5cmd sql-server-samples ubuntu.iso

Hyperfine single run results (no reproducible difference):

'Old internal sort' ran 1.01 times faster than 'New internal sort' and 1.02 times faster than 'External Sort'.

hyperfine --show-output -w 0 -r 1 \
  -n 'Old internal sort' './s5cmdOLD --stat --log error sync ~/repos/ s3://s5cmd-benchmark///dest0/' \
  -n 'External Sort' './s5cmdNEW --stat --log error sync ./repos s3://s5cmd-benchmark///dest1/' \
  -n 'New internal sort' './s5cmdNEW --stat --log error sync --chunk-size "-1" ./repos s3://s5cmd-benchmark///dest2/'

Benchmark 1: Old internal sort

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        42.306 s               [User: 178.126 s, System: 41.943 s]

Benchmark 2: External Sort
ERROR "sync ./repos s3://s5cmd-benchmark///dest1/": no object found

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        43.070 s               [User: 195.419 s, System: 44.529 s]

Benchmark 3: New internal sort
ERROR "sync --chunk-size=-1 ./repos s3://s5cmd-benchmark///dest2/": no object found

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        42.603 s               [User: 182.621 s, System: 42.220 s]

Summary
  'Old internal sort' ran
    1.01 times faster than 'New internal sort'
    1.02 times faster than 'External Sort'
'External Sort' ran 1.07 times faster than 'Old internal sort' and 1.10 times faster than 'New internal sort'.

hyperfine --show-output -w 0 -r 1 \
  -n 'Old internal sort' './s5cmdOLD --stat --log error sync ~/repos/ s3://s5cmd-benchmark///dest0f/' \
  -n 'External Sort' './s5cmdNEW --stat --log error sync ./repos s3://s5cmd-benchmark///dest1f/' \
  -n 'New internal sort' './s5cmdNEW --stat --log error sync --chunk-size "-1" ./repos s3://s5cmd-benchmark///dest2f/'

Benchmark 1: Old internal sort

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        43.098 s               [User: 179.197 s, System: 40.396 s]

Benchmark 2: External Sort
ERROR "sync ./repos s3://s5cmd-benchmark///dest1f/": no object found

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        40.405 s               [User: 196.294 s, System: 44.523 s]

Benchmark 3: New internal sort
ERROR "sync --chunk-size=-1 ./repos s3://s5cmd-benchmark///dest2f/": no object found

Operation	Total	Error	Success
cp		52250	0	52250
sync		1	0	1

  Time (abs ≡):        44.270 s               [User: 176.710 s, System: 40.735 s]

Summary
  'External Sort' ran
    1.07 times faster than 'Old internal sort'
    1.10 times faster than 'New internal sort'

[^int]: I didn't know how to say "just using Go's sort library as it was done earlier, except reading from channels and sending to a channel".

kucukaslan avatar Aug 11 '22 16:08 kucukaslan

I guess using StableSort would make a difference only if filepath.ToSlash were called before the sort operation, not after.

Furthermore, I think using filepath.ToSlash is problematic, because it potentially breaks the assumption that both of the lists/channels are sorted. For example, on a Windows device we may have the order "ab", "a\b", but in compareObjects we use them after applying filepath.ToSlash, which converts them to "ab", "a/b"; in ASCII "/" comes before "b", so in practice the list is not sorted as we expect and assume.
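
To make the concern concrete, here is a small hypothetical demo (my own example keys, not taken from s5cmd). strings.ReplaceAll stands in for what filepath.ToSlash does on Windows; on Unix the separator is already a slash, so ToSlash itself would be a no-op there.

package main

import (
	"fmt"
	"sort"
	"strings"
)

func main() {
	// Sorted by raw bytes: 'X' (0x58) sorts before '\' (0x5C).
	keys := []string{`aXb`, `a\b`}
	fmt.Println(sort.StringsAreSorted(keys)) // true

	// Convert the separators *after* sorting: '/' (0x2F) sorts before 'X',
	// so the converted slice is no longer in sorted order.
	for i, k := range keys {
		keys[i] = strings.ReplaceAll(k, `\`, "/")
	}
	fmt.Println(keys, sort.StringsAreSorted(keys)) // [aXb a/b] false
}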

So I propose these changes as fix: https://github.com/Kucukaslan/s5cmd/commit/40a22c91df08cea3f150455f6c2d060480ba745c

Warning: I'm wrong. filepath.ToSlash is already called in url.New, so the path separators are converted to slashes well in advance. My so-called fix would be redundant, so I won't include it in this PR. For the record, I'm not sure whether this behaviour of url.New is good practice or not. https://github.com/peak/s5cmd/blob/2a6c7cc858cc350965bd0bc51abf81498118ef02/storage/url/url.go#L78-L81

kucukaslan avatar Aug 15 '22 14:08 kucukaslan

I've implemented another variant that uses gob encoding instead of the json encoding in this branch.

I've also written a test code to compare their respective encoding/decoding[^bytes] speeds. The gob encoding uses much less space than the json encoding (88 vs ~150 bytes for the test input).

In my first tests, json encoding was much faster than gob encoding: for 1 million objects, json encoding took about 8.5 seconds, while gob encoding took about 30 seconds. After perusing their implementations[^almost], I identified the main reason for the difference: gob's default time.Time encoding seems to be slower than storing the time as an RFC3339Nano-formatted string. Now gob takes about 10 seconds for the same test values.

I currently do not know how the encoded slice sizes will affect the external sorting.
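
For anyone who wants to reproduce the size comparison locally, here is a self-contained sketch (my own struct, not the PR's Object type) that encodes the same fields three ways: with encoding/json, with gob and a time.Time field, and with gob and an RFC3339Nano string field.

package main

import (
	"bytes"
	"encoding/gob"
	"encoding/json"
	"fmt"
	"time"
)

type objTime struct {
	Key     string
	Size    int64
	ModTime time.Time
}

type objString struct {
	Key     string
	Size    int64
	ModTime string // RFC3339Nano-formatted
}

func main() {
	now := time.Now()
	key := "s3://bucket/key/w*ldcard/\net"

	j, _ := json.Marshal(objTime{Key: key, Size: 53861, ModTime: now})

	var g1 bytes.Buffer
	_ = gob.NewEncoder(&g1).Encode(objTime{Key: key, Size: 53861, ModTime: now})

	var g2 bytes.Buffer
	_ = gob.NewEncoder(&g2).Encode(objString{Key: key, Size: 53861, ModTime: now.Format(time.RFC3339Nano)})

	fmt.Println("json, time.Time field:      ", len(j), "bytes")
	fmt.Println("gob, time.Time field:       ", g1.Len(), "bytes")
	fmt.Println("gob, RFC3339Nano string:    ", g2.Len(), "bytes")
}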

Update: Tested syncing 500,000 local files to an S3 bucket which has only one file.

Benchmark 1: s5cmdgob --dry-run sync /tmp/s5cmd/small500K/ s3://...
  Time (abs ≡):        1266.216 s               [User: 803.825 s, System: 1065.423 s]

Benchmark 2: s5cmdjson --dry-run sync /tmp/s5cmd/small500K/ s3://...
  Time (abs ≡):        1299.062 s               [User: 956.893 s, System: 1111.326 s]

Summary
  's5cmdgob' ran
    1.03 times faster than 's5cmdjson'

test code:

func TestFromToBytes(t *testing.T) {
	key := "s3://bucket/key/w*ldcard/\net"
	size := int64(53861)
	modTime := time.Now()
	ty := ObjectType{1}

	count := 1000000

	u, err := url.New(key, url.WithRaw(true))
	if err != nil {
		t.Fail()
	}

	o1 := Object{
		URL:     u,
		Size:    size,
		ModTime: &modTime,
		Type:    ty,
	}
	b := o1.ToBytes()
	fmt.Println(len(b), b)
	start := time.Now()
	for i := 0; i < count; i++ {
		_ = FromBytes(o1.ToBytes()).(Object)

	}
	elapsed := time.Since(start)
	fmt.Printf("Processing %d objects took %s", count, elapsed)
	// fail intentionally so `go test` prints the output above
	t.Fail()
}

benchmark

cpu: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz

case                     iterations (b.N)    time per op     memory per op    allocations per op
json encoding            125584              8531 ns/op      3996 B/op        77 allocs/op
improved gob encoding    107913              10309 ns/op     5856 B/op        113 allocs/op
raw gob encoding         26679               41970 ns/op     11462 B/op       263 allocs/op

func BenchmarkFromToBytes(b *testing.B) {
	key := "s3://bucket/key/w*ldcard/\net"
	size := int64(53861)
	modTime := time.Now()
	ty := ObjectType{1}

	u, err := url.New(key, url.WithRaw(true))
	if err != nil {
		b.Fail()
	}

	o1 := Object{
		URL:     u,
		Size:    size,
		ModTime: &modTime,
		Type:    ty,
	}

	b.ResetTimer()

	for i := 0; i < b.N; i++ {
		_ = FromBytes(o1.ToBytes()).(Object)
	}
}

[^bytes]: More precisely, the (Object).ToBytes and storage.FromBytes methods.
[^almost]: Almost by luck.

kucukaslan avatar Sep 07 '22 08:09 kucukaslan

Thanks a lot for trying gob for encoding/decoding. Let's go with gob, because:

  • ToBytes and FromBytes functions are more readable and type-safe.
  • Gob implementation uses less space although it has almost the same speed as json.

ilkinulas avatar Sep 15 '22 19:09 ilkinulas

Thanks a lot for trying gob for encoding/decoding. Let's go with gob, because:

  • ToBytes and FromBytes functions are more readable and type-safe.
  • Gob implementation uses less space although it has almost the same speed as json.

My pleasure. I've rebased the gob encoding branch to this PR.

PS: Time is converted to an RFC3339Nano-formatted string before gob encoding is called, and parsed back from that string after gob decode.
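
In case it helps reviewers, a minimal sketch of that round trip (assumed shapes and names, not the PR's actual (Object).ToBytes / storage.FromBytes):

package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"time"
)

// wireObject is a hypothetical on-disk representation of an object record.
type wireObject struct {
	Key     string
	Size    int64
	ModTime string // RFC3339Nano, converted back to time.Time on decode
}

func toBytes(key string, size int64, modTime time.Time) []byte {
	var buf bytes.Buffer
	_ = gob.NewEncoder(&buf).Encode(wireObject{
		Key:     key,
		Size:    size,
		ModTime: modTime.Format(time.RFC3339Nano),
	})
	return buf.Bytes()
}

func fromBytes(data []byte) (string, int64, time.Time) {
	var w wireObject
	_ = gob.NewDecoder(bytes.NewReader(data)).Decode(&w)
	t, _ := time.Parse(time.RFC3339Nano, w.ModTime)
	return w.Key, w.Size, t
}

func main() {
	b := toBytes("s3://bucket/key", 53861, time.Now())
	key, size, mod := fromBytes(b)
	fmt.Println(len(b), key, size, mod)
}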

kucukaslan avatar Sep 15 '22 19:09 kucukaslan

Hello, any chance this fix will be merged and released? It's very useful for us.

arturrez avatar Jan 23 '23 19:01 arturrez

I'd also be interested in seeing this merged!

kbarrette avatar Apr 27 '23 17:04 kbarrette

Sorry for the long wait @Kucukaslan, and thank you for your hard work on this 😄

igungor avatar Jun 16 '23 08:06 igungor

Sorry for the long wait @Kucukaslan, and thank you for your hard work on this 😄

It was my pleasure, thanks 😊

kucukaslan avatar Jun 16 '23 23:06 kucukaslan

I love you for this change

antonsmelyanskiy avatar Jul 20 '23 12:07 antonsmelyanskiy