
fallocate is interrupted by signal at startup

Open chrhong opened this issue 3 years ago • 8 comments

A pool creation failure was detected in our system. The error shows that the fallocate system call was interrupted: "odp_ishm.c:707:create_file():Huge page memory allocation failed: fd=582, file=/dev/hugepages/0/odp-16-ishm-pool_008_pkt-rx:7-0, err="Interrupted system call""

Would it be better to retry the system call after getting this error return? Why the signal is raised is not known yet...

chrhong avatar Sep 08 '21 08:09 chrhong

@MatiasElo Do you have any comments on this?

chrhong avatar Sep 08 '21 08:09 chrhong

Hmm, this is the first time I've seen this failure. Does this happen constantly or was it a random occurrence? Also, what was the return code of fallocate() and the size of allocated shm block?

MatiasElo avatar Sep 08 '21 08:09 MatiasElo

The error occurs easily in a k8s environment, with about a 10% recurrence rate. I think the fallocate return code is EINTR (Interrupted system call). The size is around 4M.

chrhong avatar Sep 08 '21 08:09 chrhong

Thanks for the info. Looks like a good solution would be to add a number of retries if EINTR is received.
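
A minimal sketch of such a retry loop, assuming a bounded number of attempts is acceptable. The names fallocate_retry, alloc_retry and MAX_EINTR_RETRIES are hypothetical and are not taken from the ODP sources; the limit of 10 is an arbitrary choice for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical retry limit, chosen arbitrarily for this sketch */
#define MAX_EINTR_RETRIES 10

/* Same signature as fallocate(2), so a stub can be substituted */
typedef int (*alloc_fn)(int fd, int mode, off_t offset, off_t len);

/*
 * Call fn and repeat it while it fails with EINTR, for at most
 * MAX_EINTR_RETRIES retries. Any other result is returned as-is,
 * with errno left as set by the last call.
 */
static int alloc_retry(alloc_fn fn, int fd, off_t len)
{
	int ret;
	int retries = 0;

	do {
		ret = fn(fd, 0, 0, len);
	} while (ret == -1 && errno == EINTR && retries++ < MAX_EINTR_RETRIES);

	return ret;
}

/* Hypothetical wrapper: fallocate() that retries on EINTR */
static int fallocate_retry(int fd, off_t len)
{
	return alloc_retry(fallocate, fd, len);
}
```

Errors other than EINTR (for example ENOSPC) are returned on the first attempt, so the caller's existing error handling would stay unchanged.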

MatiasElo avatar Sep 08 '21 10:09 MatiasElo

Does this change fix the issue you are seeing?

MatiasElo avatar Sep 08 '21 12:09 MatiasElo

Strange, the issue is no longer reproduced after I recompiled... I will update later.

chrhong avatar Sep 10 '21 09:09 chrhong

Update:

  1. When I recompile odp and copy the new libs into my docker container, the issue cannot be reproduced even after hundreds of restarts;
  2. When I do not update odp, the issue occurs easily.

The most important thing is that nothing related to startup has changed between the new and old odp libs. Matias, do you know any method to trace which signal interrupts the system call, and why? I want to dig into why the call is only interrupted with the older libs. I used Linux strace to trace my process, but didn't see any signal in it:
mkdir("/dev/hugepages/0", 0744)         = -1 EEXIST (File exists)
open("/dev/hugepages/0/odp-48-ishm-far_pool", O_RDWR|O_CREAT|O_TRUNC, 0644) = 602
fallocate(602, 0, 0, 618659840)         = -1 EINTR (**Interrupted system call**)
write(2, "odp_ishm.c:707:create_file():Hug"..., 151) = 151
close(602)                              = 0
unlink("/dev/hugepages/0/odp-48-ishm-far_pool") = 0
write(2, "odp_ishm.c:1168:_odp_ishm_reserv"..., 112) = 112
mkdir("/dev/shm/0", 0744)               = -1 EEXIST (File exists)
open("/dev/shm/0/odp-48-ishm-far_pool", O_RDWR|O_CREAT|O_TRUNC, 0644) = 602
fallocate(602, 0, 0, 618139648)         = -1 ENOSPC (No space left on device)
write(2, "odp_ishm.c:707:create_file():Nor"..., 147) = 147
close(602)                              = 0
unlink("/dev/shm/0/odp-48-ishm-far_pool") = 0

Another, similar issue is that I sometimes hit a SIGSEGV in DPDK code called from odp_pktio_start() at startup. Since the pktio handle is created by odp_pktio_open(), I do not think this is an application code issue. I wonder if this is related to my environment initialization? Do you have an environment initialization example? Currently, we only create hugepages and load the PMD for DPDK.

Thanks.

chrhong avatar Sep 16 '21 07:09 chrhong

Hmm, I haven't had to trace signals before, so unfortunately I cannot help much. Usually I just isolate the data plane cores and redirect all signals to a set of control cores.

One thing which pops out in your log is the "No space left on device" error. Perhaps you are running out of space in /dev/shm. In the ODP CI Docker images we set --shm-size 8g to be on the safe side. I don't do any special environment setup for DPDK. I just map the huge pages and bind NICs as you have done.

MatiasElo avatar Sep 17 '21 10:09 MatiasElo