s5cmd
runtime: failed to create new OS thread
Hi,
I'm using this tool to stream data from S3 into PyTorch DataLoader workers (multiple workers via multiprocessing). It appears to work for a while before throwing the following failure:
runtime stack:
runtime.throw(0xb5e9bc, 0x9)
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/panic.go:1116 +0x72
runtime.newosproc(0xc000180700)
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/os_linux.go:161 +0x1ba
runtime.newm1(0xc000180700)
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/proc.go:1779 +0xdc
runtime.templateThread()
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/proc.go:1830 +0x71
runtime.mstart1()
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/proc.go:1112 +0xc3
runtime.mstart()
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/proc.go:1077 +0x6e
goroutine 1 [runnable]:
internal/poll.errnoErr(...)
/opt/hostedtoolcache/go/1.14.12/x64/src/internal/poll/errno_unix.go:32
internal/poll.(*pollDesc).init(0xc00014ceb8, 0xc00014cea0, 0xc000027401, 0xc00014cea0)
/opt/hostedtoolcache/go/1.14.12/x64/src/internal/poll/fd_poll_runtime.go:45 +0x8c
internal/poll.(*FD).Init(0xc00014cea0, 0xb5a3fb, 0x4, 0x1, 0x0, 0x0)
/opt/hostedtoolcache/go/1.14.12/x64/src/internal/poll/fd_unix.go:63 +0x5f
os.newFile(0x3, 0xc000027460, 0x11, 0x1, 0x0)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/file_unix.go:155 +0xf6
os.openFileNolog(0xc000027460, 0x11, 0x0, 0x0, 0xad0fa0, 0x0, 0xc0002d9868)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/file_unix.go:226 +0x18d
os.OpenFile(0xc000027460, 0x11, 0x0, 0x0, 0x7f83ca746108, 0x0, 0x9)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/file.go:307 +0x63
os.Open(...)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/file.go:287
github.com/aws/aws-sdk-go/internal/ini.OpenFile(0xc000027460, 0x11, 0x0, 0x0, 0x0)
/home/runner/go/pkg/mod/github.com/aws/[email protected]/internal/ini/ini.go:13 +0x83
github.com/aws/aws-sdk-go/aws/session.loadSharedConfigIniFiles(0xc0002da398, 0x2, 0x2, 0x80, 0xad0e60, 0x1, 0xc000204880, 0xc0002d9ba8)
/home/runner/go/pkg/mod/github.com/aws/[email protected]/aws/session/shared_config.go:163 +0xd2
github.com/aws/aws-sdk-go/aws/session.loadSharedConfig(0xb5cc0b, 0x7, 0xc0002da398, 0x2, 0x2, 0x1, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/go/pkg/mod/github.com/aws/[email protected]/aws/session/shared_config.go:145 +0xb8
github.com/aws/aws-sdk-go/aws/session.newSession(0x0, 0x0, 0xc000396c50, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/go/pkg/mod/github.com/aws/[email protected]/aws/session/session.go:446 +0x1d1
github.com/aws/aws-sdk-go/aws/session.NewSessionWithOptions(0x0, 0x0, 0xc000396c50, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/go/pkg/mod/github.com/aws/[email protected]/aws/session/session.go:333 +0x282
github.com/peak/s5cmd/storage.(*SessionCache).newSession(0x11d9410, 0xd8e560, 0xc00015cf40, 0xa, 0x0, 0x0, 0x0, 0x7ffeb7b374b3, 0x23, 0x0, ...)
/home/runner/work/s5cmd/s5cmd/storage/s3.go:668 +0x4a9
github.com/peak/s5cmd/storage.newS3Storage(0xd8e560, 0xc00015cf40, 0xa, 0x0, 0x0, 0x0, 0x7ffeb7b374b3, 0x23, 0xc0002dba78, 0x5a8f06, ...)
/home/runner/work/s5cmd/s5cmd/storage/s3.go:84 +0xda
github.com/peak/s5cmd/storage.NewRemoteClient(0xd8e560, 0xc00015cf40, 0xc000204580, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/work/s5cmd/s5cmd/storage/storage.go:57 +0xeb
github.com/peak/s5cmd/command.Cat.Run(0xc000204580, 0xb59d05, 0x3, 0xc00002e140, 0x9f, 0xa, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/work/s5cmd/s5cmd/command/cat.go:74 +0x9c
github.com/peak/s5cmd/command.glob..func5(0xc00015d400, 0x0, 0x0)
/home/runner/work/s5cmd/s5cmd/command/cat.go:59 +0x308
github.com/urfave/cli/v2.(*Command).Run(0x11d1a20, 0xc00015d340, 0x0, 0x0)
/home/runner/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:164 +0x4e0
github.com/urfave/cli/v2.(*App).RunContext(0x11d2a80, 0xd8e560, 0xc00015cf40, 0xc00001e0a0, 0x5, 0x5, 0x0, 0x0)
/home/runner/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:306 +0x814
github.com/peak/s5cmd/command.Main(0xd8e560, 0xc00015cf40, 0xc00001e0a0, 0x5, 0x5, 0x0, 0x0)
/home/runner/work/s5cmd/s5cmd/command/app.go:140 +0x197
main.main()
/home/runner/work/s5cmd/s5cmd/main.go:23 +0xaf
goroutine 6 [runnable]:
os/signal.signal_enable(0x3bb5377700000002)
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/sigqueue.go:219 +0x6c
os/signal.enableSignal(...)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/signal/signal_unix.go:51
os/signal.Notify.func2(0x2)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/signal/signal.go:150 +0x8e
os/signal.Notify(0xc000194000, 0xc00012c7a8, 0x2, 0x2)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/signal/signal.go:162 +0x170
main.main.func1(0xc0003965d0)
/home/runner/work/s5cmd/s5cmd/main.go:17 +0xa6
created by main.main
/home/runner/work/s5cmd/s5cmd/main.go:15 +0x73
goroutine 17 [syscall]:
os/signal.signal_recv(0x0)
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()
/opt/hostedtoolcache/go/1.14.12/x64/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.Notify.func1
/opt/hostedtoolcache/go/1.14.12/x64/src/os/signal/signal.go:127 +0x44
goroutine 7 [chan receive]:
github.com/peak/s5cmd/log.(*Logger).out(0xc00000dd60)
/home/runner/work/s5cmd/s5cmd/log/log.go:88 +0x127
created by github.com/peak/s5cmd/log.New
/home/runner/work/s5cmd/s5cmd/log/log.go:61 +0xc7
runtime: failed to create new OS thread (have 6 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 2 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
I'm not sure why s5cmd is throwing this error, since the memory and CPU of the server I'm using (an AWS p4d EC2 instance) are nowhere close to being fully utilized. Also, I'm running this in a Docker container within Kubernetes.
Would appreciate any pointers, thank you!
Hi,
Would you mind sharing the exact command that produced this error? Please also share your s5cmd version and the output of ulimit -n.
Hi @igungor ,
Thanks for your reply. The s5cmd version was 1.2.1, and the output of ulimit -a when run on the actual host node of the Docker container is:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 30446
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I thought my ulimits were fine because ulimit -u, the limit mentioned in the error message, is set to unlimited. But since you asked about ulimit -n, does that mean I need to increase the ulimit -n value on the host?
Also, the command I was running was: s5cmd --numworkers 1 cat <s3 file>
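On the container-vs-host point: since I'm running inside Docker/k8s, the limits the s5cmd process actually sees may differ from the host's ulimit output. Here is a minimal Python sketch (standard library only) I can run inside the container to check:

import resource

# Limits as seen by processes inside the container; these can differ from
# the host's ulimit -a output.
for label, rlimit in [("max user processes (RLIMIT_NPROC)", resource.RLIMIT_NPROC),
                      ("open files (RLIMIT_NOFILE)", resource.RLIMIT_NOFILE)]:
    soft, hard = resource.getrlimit(rlimit)
    print(f"{label}: soft={soft} hard={hard}")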
Some background context: I'm using a library called WebDataset (https://github.com/webdataset/webdataset/), which provides an efficient way to stream large amounts of data (in my case stored in S3) into PyTorch multiprocessing DataLoader workers. https://github.com/webdataset/webdataset/blob/master/notebooks/sources.ipynb shows an example of loading data from GCS (cell 4); I do something similar:
import webdataset as wds

dataset = wds.WebDataset([
    "pipe:s5cmd --numworkers 1 cat s3://somebucket/dataset-000.tar",
    "pipe:s5cmd --numworkers 1 cat s3://somebucket/dataset-001.tar",
    "pipe:s5cmd --numworkers 1 cat s3://somebucket/dataset-002.tar",
    # etc. -- in my case, I'm using tens of thousands of such files
])
During my training and validation phases, I use 12 multiprocessing DataLoader workers per GPU on each server, and each server has 8 GPUs, so there are 96 data-loading processes in total using s5cmd to pull data from S3.
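For reference, the loader setup looks roughly like this (a simplified sketch; the shard count and naming are illustrative, not my exact code):

import webdataset as wds
from torch.utils.data import DataLoader

# Illustrative shard list; in practice there are tens of thousands of shards.
urls = [f"pipe:s5cmd --numworkers 1 cat s3://somebucket/dataset-{i:03d}.tar"
        for i in range(1000)]
dataset = wds.WebDataset(urls)

# One DataLoader per GPU process, 12 workers each; with 8 GPU processes per
# server that is up to 96 concurrent s5cmd invocations per node.
loader = DataLoader(dataset, num_workers=12, batch_size=None)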
Somehow, this s5cmd error never shows up during or after the first epoch; it only appears as soon as the second epoch finishes.
Any updates on this issue? Or any workarounds or solutions? @igungor
Hi,
I can report that I am facing the same issue in a pretty minimal context. The command I am running is just:
/home/leonidbelyaev/s5cmd-install/s5cmd --endpoint-url=https://my.s3.host cp my_dir/ s3://my/s3/prefix
Version is:
v2.2.2-48f7e59
ulimit for user procs is:
max user processes (-u) 300
I get the error:
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
This is on a somewhat restricted HPC login node, and the directory is sprawling and massive, so I can appreciate why this might happen. It's not an ideal UX, though.
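For what it's worth, here is a rough workaround sketch in Python: raise the soft "max user processes" limit to the hard limit before launching s5cmd. This is only an assumption that the hard limit is higher than the soft limit of 300, which may not hold on this login node.

import resource
import subprocess

# Raise the soft RLIMIT_NPROC to the hard limit if possible; child processes
# (like s5cmd) inherit the raised limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
if soft != hard:
    resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))

subprocess.run(
    ["s5cmd", "--endpoint-url=https://my.s3.host",
     "cp", "my_dir/", "s3://my/s3/prefix"],
    check=True,
)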