XRT icon indicating copy to clipboard operation
XRT copied to clipboard

Workaround dead-lock in hardware-emulation shutdown

Open keryell opened this issue 3 years ago • 22 comments

Problem solved by the commit

Workaround dead-lock in hardware-emulation shutdown

Bug / issue (if any) fixed, which PR introduced the bug, how it was discovered

Just running some simple SYCL code on xsjsycl41 server.

How problem was solved, alternative solutions (if any) and why they were rejected

This is solved by an outrageous hack.

The user needs to define XRT_PCIE_HW_EMU_FORCE_SHUTDOWN environment variable before running an XRT program.

Risks (if any) associated the changes in the commit

This is probably not to be merged but to be used as a branch by other people hitting this problem while there is a real fix.

What has been tested and how, request additional testing if necessary

This works for us™. :-)

Documentation impact (if any)

keryell avatar Feb 12 '22 03:02 keryell

Can one of the admins verify this patch?

gbuildx avatar Feb 12 '22 03:02 gbuildx

ok to test

sonals avatar Feb 14 '22 05:02 sonals

Build Failed! :(

gbuildx avatar Feb 14 '22 05:02 gbuildx

retest this please

kbhardwa-xilinx avatar Feb 14 '22 19:02 kbhardwa-xilinx

Build Failed! :(

gbuildx avatar Feb 14 '22 21:02 gbuildx

retest this please - infra errors

stsoe avatar Feb 16 '22 20:02 stsoe

Build Passed!

gbuildx avatar Feb 17 '22 02:02 gbuildx

Build Passed!

gbuildx avatar Feb 22 '22 20:02 gbuildx

Rebased on master just before https://github.com/Xilinx/XRT/pull/6297 to avoid a test not working with PS kernel.

keryell avatar Feb 23 '22 00:02 keryell

Build Failed! :(

gbuildx avatar Feb 23 '22 03:02 gbuildx

retest this please.

dayeh-xilinx avatar Feb 23 '22 17:02 dayeh-xilinx

Build Passed!

gbuildx avatar Feb 24 '22 01:02 gbuildx

While I understand the desire here, I think this is a hack around an underlying problem. @akasat @venkatp-xilinx shouldn't you be addressing the hang itself?

Yes Soren, we have a CR for this, and we will handle it as part of the CR fix.

venkatp-xilinx avatar Feb 24 '22 11:02 venkatp-xilinx

Added hemantk-xilinx as reviewer. Basic issue is in the hw_emu model which is being worked upon the hw emulation team. The solution seems very hacky, we do not want to merge the change.

venkatp-xilinx avatar Feb 25 '22 10:02 venkatp-xilinx

The solution seems very hacky, we do not want to merge the change.

I totally agree, as suggested in the PR description. ;-)

keryell avatar Feb 25 '22 19:02 keryell

@keryell , Can we close this PR as this is open from long and no updates.

chvamshi-xilinx avatar Apr 08 '22 03:04 chvamshi-xilinx

Is there any alternative? Is this problem solved otherwise?

keryell avatar Apr 08 '22 05:04 keryell

Is there any alternative? Is this problem solved otherwise?

I am not sure whether this is solved or not. @venkatp-xilinx , Can you please take a look and confirm. @keryell , Can you please file a CR and inform @venkatp-xilinx .

chvamshi-xilinx avatar Apr 08 '22 05:04 chvamshi-xilinx

Is there any alternative? Is this problem solved otherwise?

This is resolved now, hw_emu model corrected the double free memory corruption issue and with that we would not see this issue any more. As a general fix in driver, we are trying various ways to identity the sock aliveness (when device process abruptly crashes) and do a clear exit from the driver. So we actively working on it and will provide proper solution in master branch shortly.

venkatp-xilinx avatar Apr 08 '22 14:04 venkatp-xilinx

Build Failed! :(

gbuildx avatar May 06 '22 18:05 gbuildx

@keryell Per @venkatp-xilinx comment this PR is no longer needed?

stsoe avatar May 06 '22 20:05 stsoe

Build Passed!

gbuildx avatar May 06 '22 22:05 gbuildx

I believe this has been addressed to some degree :-)

stsoe avatar Aug 25 '22 17:08 stsoe

I believe this has been addressed to some degree :-)

Our CI might be wrong then. :-( I have not seen the rewrite of all the XRT-emulation layer either... :-(

keryell avatar Aug 26 '22 09:08 keryell