XRT
CR-1128916 Download xclbin failed for pre-emptible kernel
Problem solved by the commit
While downloading an xclbin on a pre-emptible kernel, copy_from_user fails to copy buffers larger than 4K bytes. Fixed by copying the buffer in 4K chunks when preemption is enabled.
Bug / issue (if any) fixed, which PR introduced the bug, how it was discovered
Fixed failures when copying xclbin sections from a user pointer.
How problem was solved, alternative solutions (if any) and why they were rejected
Copying the buffer in 4K chunks resolved the issue.
Risks (if any) associated with the changes in the commit
None, as the change only takes effect when preemption is enabled.
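The chunked copy described above can be sketched roughly as follows. This is a user-space illustration, not the code from the commit: `mock_copy_from_user`, `copy_in_chunks`, and `CHUNK_SIZE` are hypothetical names, the mock simply wraps `memcpy` while keeping the kernel contract of returning the number of bytes not copied, and the chunk size is assumed to equal PAGE_SIZE (4096).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 4096u /* assumption: equals PAGE_SIZE on the target */

/* Hypothetical user-space stand-in for copy_from_user(). It keeps the
 * same contract: the return value is the number of bytes that could
 * NOT be copied, so 0 means success. */
static unsigned long mock_copy_from_user(void *to, const void *from,
                                         unsigned long n)
{
    memcpy(to, from, n);
    return 0;
}

/* Copy `size` bytes from src to dst in pieces of at most CHUNK_SIZE.
 * Returns 0 on success and -1 if any piece is short (a driver would
 * typically return -EFAULT here). */
static int copy_in_chunks(void *dst, const void *src, size_t size)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    assert(dst != NULL && src != NULL);

    while (off < size) {
        size_t len = size - off;

        if (len > CHUNK_SIZE)
            len = CHUNK_SIZE;
        if (mock_copy_from_user(d + off, s + off, len) != 0)
            return -1;
        off += len;
    }
    return 0;
}
```

In a driver the mock would be replaced by the real `copy_from_user` and the source pointer would be a `__user` pointer; the loop structure stays the same.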
What has been tested and how, request additional testing if necessary
Tested the yolov4 application with preemption enabled; the test passes.
Documentation impact (if any)
NA
Build Failed! :(
This is just a workaround till we find the real root cause, right?
retest this please
Hi @rbramand-xilinx, can you explain why copy_from_user() doesn't work with CONFIG_PREEMPT when the size is larger than 4K? I didn't find any helpful information on the internet or in the Linux source code. Please share your knowledge.
Build Passed!
Hi @mamin506, @maxzhen, @houlz0507, copy_from_user inside the zocl_read_sect function fails for all sections larger than 4K (PAGE_SIZE) bytes when preemption is enabled (AIE_RESOURCES and AIE_METADATA in this case). I also came across some pages that discuss copy_from_user and 4K (page-size) buffers, e.g. https://www.spinics.net/lists/newbies/msg00058.html . So I experimented with copying 4K chunks at a time, copy_from_user then passed, and I made this change. There are not many articles about copy_from_user behavior when preemption is enabled. Also, the user application uses threads, and reading xclbin sections fails only when the xclbin is programmed from the application, not from the command line. Please let me know your thoughts on this.
Can you please elaborate on "copy_from_user inside zocl_read_sect function fails for all sections that have size greater than 4K(PAGE_SIZE) bytes"? What failure are we talking about? I understand that if the user buffer crosses a 4K boundary, copy_from_user may sleep. But that does not necessarily mean we cannot pass a user buffer that crosses a page boundary. Just make sure we are not in atomic context.
copy_from_user returns a nonzero value (the number of bytes it failed to copy), and this happens only when preemption is enabled. It fails only for sections larger than 4KB, so I changed the code to copy 4K chunks at a time. Also, we are not in atomic context while copying.
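Because copy_from_user reports the number of bytes it did not copy, the return value also tells you where the copy stopped, which may help with root-causing. A minimal user-space sketch of that arithmetic follows; `mock_copy_from_user_fault` and `first_uncopied_offset` are hypothetical helpers, and the fault position is injected by hand rather than observed from the real driver.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for copy_from_user() that pretends the source
 * becomes inaccessible after `fault_at` bytes. Like the kernel call,
 * it returns the number of bytes NOT copied. */
static unsigned long mock_copy_from_user_fault(void *to, const void *from,
                                               unsigned long n,
                                               unsigned long fault_at)
{
    unsigned long ok = n < fault_at ? n : fault_at;

    memcpy(to, from, ok);
    return n - ok;
}

/* Given the requested length and the return value, compute the offset
 * of the first byte that was not copied (equals n when all copied). */
static unsigned long first_uncopied_offset(unsigned long n,
                                           unsigned long not_copied)
{
    return n - not_copied;
}
```

Logging this offset for each failing section would show, for example, whether the copy always stops exactly at a page boundary.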
We'd better find out why it fails before we write code to work around it.
Sure @larry9523, I will debug further to find out why it fails and get back to you.
Marking this PR as do-not-merge until it is root caused.
Please raise a new PR after figuring out the reason behind this issue. Closing this PR.