s5cmd
runtime: failed to create new OS thread
Hi,
I'm using this tool to stream data from S3 into PyTorch DataLoader workers (multiple workers via multiprocessing). It appears to work for a while before throwing the following failure:
runtime stack:
runtime.throw(0xb5e9bc, 0x9)
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/panic.go:1116 +0x72
runtime.newosproc(0xc000180700)
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/os_linux.go:161 +0x1ba
runtime.newm1(0xc000180700)
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/proc.go:1779 +0xdc
runtime.templateThread()
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/proc.go:1830 +0x71
runtime.mstart1()
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/proc.go:1112 +0xc3
runtime.mstart()
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/proc.go:1077 +0x6e
goroutine 1 [runnable]:
internal/poll.errnoErr(...)
/opt/hostedtoolcache/go/1.14.12/x64/src/internal/poll/errno_unix.go:32
internal/poll.(*pollDesc).init(0xc00014ceb8, 0xc00014cea0, 0xc000027401, 0xc00014cea0)
/opt/hostedtoolcache/go/1.14.12/x64/src/internal/poll/fd_poll_runtime.go:45 +0x8c
internal/poll.(*FD).Init(0xc00014cea0, 0xb5a3fb, 0x4, 0x1, 0x0, 0x0)
/opt/hostedtoolcache/go/1.14.12/x64/src/internal/poll/fd_unix.go:63 +0x5f
os.newFile(0x3, 0xc000027460, 0x11, 0x1, 0x0)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/file_unix.go:155 +0xf6
os.openFileNolog(0xc000027460, 0x11, 0x0, 0x0, 0xad0fa0, 0x0, 0xc0002d9868)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/file_unix.go:226 +0x18d
os.OpenFile(0xc000027460, 0x11, 0x0, 0x0, 0x7f83ca746108, 0x0, 0x9)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/file.go:307 +0x63
os.Open(...)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/file.go:287
github.com/aws/aws-sdk-go/internal/ini.OpenFile(0xc000027460, 0x11, 0x0, 0x0, 0x0)
/home/runner/go/pkg/mod/github.com/aws/[email protected]/internal/ini/ini.go:13 +0x83
github.com/aws/aws-sdk-go/aws/session.loadSharedConfigIniFiles(0xc0002da398, 0x2, 0x2, 0x80, 0xad0e60, 0x1, 0xc000204880, 0xc0002d9ba8)
/home/runner/go/pkg/mod/github.com/aws/[email protected]/aws/session/shared_config.go:163 +0xd2
github.com/aws/aws-sdk-go/aws/session.loadSharedConfig(0xb5cc0b, 0x7, 0xc0002da398, 0x2, 0x2, 0x1, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/go/pkg/mod/github.com/aws/[email protected]/aws/session/shared_config.go:145 +0xb8
github.com/aws/aws-sdk-go/aws/session.newSession(0x0, 0x0, 0xc000396c50, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/go/pkg/mod/github.com/aws/[email protected]/aws/session/session.go:446 +0x1d1
github.com/aws/aws-sdk-go/aws/session.NewSessionWithOptions(0x0, 0x0, 0xc000396c50, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/go/pkg/mod/github.com/aws/[email protected]/aws/session/session.go:333 +0x282
github.com/peak/s5cmd/storage.(*SessionCache).newSession(0x11d9410, 0xd8e560, 0xc00015cf40, 0xa, 0x0, 0x0, 0x0, 0x7ffeb7b374b3, 0x23, 0x0, ...)
/home/runner/work/s5cmd/s5cmd/storage/s3.go:668 +0x4a9
github.com/peak/s5cmd/storage.newS3Storage(0xd8e560, 0xc00015cf40, 0xa, 0x0, 0x0, 0x0, 0x7ffeb7b374b3, 0x23, 0xc0002dba78, 0x5a8f06, ...)
/home/runner/work/s5cmd/s5cmd/storage/s3.go:84 +0xda
github.com/peak/s5cmd/storage.NewRemoteClient(0xd8e560, 0xc00015cf40, 0xc000204580, 0xa, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/work/s5cmd/s5cmd/storage/storage.go:57 +0xeb
github.com/peak/s5cmd/command.Cat.Run(0xc000204580, 0xb59d05, 0x3, 0xc00002e140, 0x9f, 0xa, 0x0, 0x0, 0x0, 0x0, ...)
/home/runner/work/s5cmd/s5cmd/command/cat.go:74 +0x9c
github.com/peak/s5cmd/command.glob..func5(0xc00015d400, 0x0, 0x0)
/home/runner/work/s5cmd/s5cmd/command/cat.go:59 +0x308
github.com/urfave/cli/v2.(*Command).Run(0x11d1a20, 0xc00015d340, 0x0, 0x0)
/home/runner/go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:164 +0x4e0
github.com/urfave/cli/v2.(*App).RunContext(0x11d2a80, 0xd8e560, 0xc00015cf40, 0xc00001e0a0, 0x5, 0x5, 0x0, 0x0)
/home/runner/go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:306 +0x814
github.com/peak/s5cmd/command.Main(0xd8e560, 0xc00015cf40, 0xc00001e0a0, 0x5, 0x5, 0x0, 0x0)
/home/runner/work/s5cmd/s5cmd/command/app.go:140 +0x197
main.main()
/home/runner/work/s5cmd/s5cmd/main.go:23 +0xaf
goroutine 6 [runnable]:
os/signal.signal_enable(0x3bb5377700000002)
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/sigqueue.go:219 +0x6c
os/signal.enableSignal(...)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/signal/signal_unix.go:51
os/signal.Notify.func2(0x2)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/signal/signal.go:150 +0x8e
os/signal.Notify(0xc000194000, 0xc00012c7a8, 0x2, 0x2)
/opt/hostedtoolcache/go/1.14.12/x64/src/os/signal/signal.go:162 +0x170
main.main.func1(0xc0003965d0)
/home/runner/work/s5cmd/s5cmd/main.go:17 +0xa6
created by main.main
/home/runner/work/s5cmd/s5cmd/main.go:15 +0x73
goroutine 17 [syscall]:
os/signal.signal_recv(0x0)
/opt/hostedtoolcache/go/1.14.12/x64/src/runtime/sigqueue.go:147 +0x9c
os/signal.loop()
/opt/hostedtoolcache/go/1.14.12/x64/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.Notify.func1
/opt/hostedtoolcache/go/1.14.12/x64/src/os/signal/signal.go:127 +0x44
goroutine 7 [chan receive]:
github.com/peak/s5cmd/log.(*Logger).out(0xc00000dd60)
/home/runner/work/s5cmd/s5cmd/log/log.go:88 +0x127
created by github.com/peak/s5cmd/log.New
/home/runner/work/s5cmd/s5cmd/log/log.go:61 +0xc7
runtime: failed to create new OS thread (have 6 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 2 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
I'm not sure why s5cmd is throwing this error, since the memory and CPU of the server I'm using (an AWS p4d EC2 instance) are nowhere close to being fully utilized. Also, I'm running this in a Docker container within Kubernetes.
Would appreciate any pointers, thank you!
Hi,
Would you mind sharing the exact command that produced this error? Please also share your s5cmd version and the output of ulimit -n.
Hi @igungor ,
Thanks for your reply. The s5cmd version was 1.2.1, and the output of ulimit -a when run on the actual host node of the Docker container is:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 30446
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I thought my ulimits were fine because ulimit -u, the limit mentioned in the error message, is set to unlimited. But since you asked about ulimit -n, does that mean I need to increase the ulimit -n value on the host?
Also, the command I was running was: s5cmd --numworkers 1 cat <s3 file>
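On the container-vs-host point: since I'm running inside Docker/k8s, the limits the s5cmd process actually sees may differ from the host's ulimit output. Here is a minimal Python sketch (standard library only) I can run inside the container to check:

import resource

# Limits as seen by processes inside the container; these can differ from
# the host's ulimit -a output.
for label, rlimit in [("max user processes (RLIMIT_NPROC)", resource.RLIMIT_NPROC),
                      ("open files (RLIMIT_NOFILE)", resource.RLIMIT_NOFILE)]:
    soft, hard = resource.getrlimit(rlimit)
    print(f"{label}: soft={soft} hard={hard}")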
Some background context: I'm using a library called WebDataset (https://github.com/webdataset/webdataset/), which provides an efficient way to stream large amounts of data (in my case stored in S3) into PyTorch multiprocessing DataLoader workers. https://github.com/webdataset/webdataset/blob/master/notebooks/sources.ipynb shows an example of loading data from GCS (cell 4); I do something similar:
import webdataset as wds

dataset = wds.WebDataset([
    "pipe:s5cmd --numworkers 1 cat s3://somebucket/dataset-000.tar",
    "pipe:s5cmd --numworkers 1 cat s3://somebucket/dataset-001.tar",
    "pipe:s5cmd --numworkers 1 cat s3://somebucket/dataset-002.tar",
    # etc. -- in my case, I'm using tens of thousands of such files
])
During my training and validation phases, I use 12 multiprocessing DataLoader workers per GPU on each server, and each server has 8 GPUs, so there are 96 data-loading processes in total using s5cmd to pull data from S3.
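For reference, the loader setup looks roughly like this (a simplified sketch; the shard count and naming are illustrative, not my exact code):

import webdataset as wds
from torch.utils.data import DataLoader

# Illustrative shard list; in practice there are tens of thousands of shards.
urls = [f"pipe:s5cmd --numworkers 1 cat s3://somebucket/dataset-{i:03d}.tar"
        for i in range(1000)]
dataset = wds.WebDataset(urls)

# One DataLoader per GPU process, 12 workers each; with 8 GPU processes per
# server that is up to 96 concurrent s5cmd invocations per node.
loader = DataLoader(dataset, num_workers=12, batch_size=None)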
Somehow, this s5cmd error never shows up during or after the first epoch; it only appears as soon as the second epoch finishes.
Any updates on this issue? Or any workarounds or solutions? @igungor
Hi,
I can report that I am facing the same issue in a pretty minimal context. The command I am running is just:
/home/leonidbelyaev/s5cmd-install/s5cmd --endpoint-url=https://my.s3.host cp my_dir/ s3://my/s3/prefix
Version is:
v2.2.2-48f7e59
ulimit for user procs is:
max user processes (-u) 300
I get the error:
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
runtime: failed to create new OS thread (have 298 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc
This is on a somewhat restricted HPC login node, and the directory is sprawling and massive, so I can appreciate why this might happen. It's not an ideal UX, though.
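For what it's worth, here is a rough workaround sketch in Python: raise the soft "max user processes" limit to the hard limit before launching s5cmd. This is only an assumption that the hard limit is higher than the soft limit of 300, which may not hold on this login node.

import resource
import subprocess

# Raise the soft RLIMIT_NPROC to the hard limit if possible; child processes
# (like s5cmd) inherit the raised limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
if soft != hard:
    resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))

subprocess.run(
    ["s5cmd", "--endpoint-url=https://my.s3.host",
     "cp", "my_dir/", "s3://my/s3/prefix"],
    check=True,
)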