fcl icon indicating copy to clipboard operation
fcl copied to clipboard

Build problems for several processor architectures

Open lepalom opened this issue 4 years ago • 19 comments

We are trying to incorporate 0.6.1 to Debian and we have found several build problems in some platforms. You can check it here: https://buildd.debian.org/status/package.php?p=fcl&suite=experimental

For instance,

  • In arm64: 92% tests passed, 3 tests failed out of 37 The following tests FAILED: 9 - test_fcl_capsule_capsule (Failed) 32 - test_gjk_libccd-inl_epa (Failed) 34 - test_gjk_libccd-inl_gjk_doSimplex2 (Failed)

  • In i386: 81% tests passed, 7 tests failed out of 37 The following tests FAILED: 9 - test_fcl_capsule_capsule (Failed) 21 - test_fcl_signed_distance (Failed) 30 - test_convex (Failed) 31 - test_capsule (Failed) 32 - test_gjk_libccd-inl_epa (Failed) 35 - test_sphere_box (Failed) 36 - test_sphere_cylinder (Failed)

  • In mips64: Does not make test.

  • In mipsel, Does not build because memory problems.

  • In ppc64el, 92% tests passed, 3 tests failed out of 37 The following tests FAILED: 9 - test_fcl_capsule_capsule (Failed) 32 - test_gjk_libccd-inl_epa (Failed)

  • In s390x, 95% tests passed, 2 tests failed out of 37 The following tests FAILED: 9 - test_fcl_capsule_capsule (Failed) 32 - test_gjk_libccd-inl_epa (Failed)

Please, could you take one eye in the build logs and help us to make run fcl in that platforms?

lepalom avatar Jun 01 '20 21:06 lepalom

Random observation: lots of warnings like this in the logs:

/include/fcl/geometry/bvh/BVH_model-inl.h:141:11: 
warning: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of type 
‘struct fcl::BVNode<fcl::RSS<double> >’ with no trivial copy-assignment; 
use copy-assignment or copy-initialization instead [-Wclass-memaccess]
  141 |     memcpy(bvs, other.bvs, sizeof(BVNode<BV>) * num_bvs);

Not clear whether those might be causing the test failures.

sherm1 avatar Jun 01 '20 21:06 sherm1

Two months till the feature freeze for the next Debian version, would be nice to get 0.6 in!

simonschmeisser avatar Nov 08 '20 21:11 simonschmeisser

@jamiesnape @SeanCurtis-TRI any thoughts on diagnosing these problems?

sherm1 avatar Nov 10 '20 17:11 sherm1

Sorry for the long delay in responding to this. You can consider that delay a reflection of the current level of maintainer's resource availability. In principle, we like the idea of broad platform support, in practice that can be problematic. As such, we propose the following, and invite further insights and arguments from the community.

Currently supported

  • amd64 (currently under test in CI)

Probably will explicitly support

Probably won't explicitly support -- available on Travis-CI but not worth the CI "cost"

  • ppc64el
  • s390x

Almost definitely will not provide formal support

  • i386
  • mips64
  • mipsel

For arm64, we're willing to extend the CI to cover that architecture (and deal with whatever failures arise). We're loathe to claim support for any platform that we don't have legitimate CI support for. We're unsure about the utility of the power pc platforms (even though TravisCI does support them) and are reluctant to force every PR to run through them without some good reasons for explicit support.

This is our first pass on "resolving" this issue and, again, we're open to other views. But the message that I really hope to make clear is that we're going to have to rely on the community's help (which has been great as recent build PRs attest) if we want to extend support across more platforms.

SeanCurtis-TRI avatar Nov 10 '20 19:11 SeanCurtis-TRI

There was historically some utility in running i386 builds to flush out a few obscure bugs that assume 64-bits, but now ARM is mostly at 64-bits, maybe the assumption is fine regardless. Looks (unsurprisingly) like a lot of tolerance issues, but anything mentioning memcpy raises red flags for me with multi-architecture support.

jamiesnape avatar Nov 11 '20 14:11 jamiesnape

FYI. @roehling did a good job patching fcl to be build in several platforms. Check:

https://buildd.debian.org/status/package.php?p=fcl&suite=experimental

lepalom avatar Dec 14 '20 15:12 lepalom

So these three patches? https://sources.debian.org/src/fcl/0.6.1-3/debian/patches/ (the last one being the most relevant)

jamiesnape avatar Dec 17 '20 18:12 jamiesnape

Yes. If you want, please use them.

lepalom avatar Dec 17 '20 18:12 lepalom

They all look good to me. I can PR them if @SeanCurtis-TRI and/or @sherm1 like the idea.

jamiesnape avatar Dec 17 '20 18:12 jamiesnape

Seems like a definite improvement. I'm in favor of it.

SeanCurtis-TRI avatar Dec 17 '20 18:12 SeanCurtis-TRI

Seems like it should be easy enough, so I will put up a PR shortly.

jamiesnape avatar Dec 17 '20 18:12 jamiesnape

https://github.com/flexible-collision-library/fcl/pull/517

jamiesnape avatar Dec 17 '20 18:12 jamiesnape

I'm glad you find my work useful. Still, I want to point out that those patches do not really eliminate the test failures. I have to confess I "fixed" the failing builds by ignoring the failing test results on non-amd64 architectures…

I believe that the test failures are largely due to differences in hardware floating point formats which lead to unexpected loss of precision. You can eliminate all test failures on i386 with -mfpmath=sse -msse2, so the infamous x87 FPU does not mess up intermediate results with gratuitous conversions between double and extended precision. The s390x architecture has a non-IEEE floating point register format as well.

The most interesting test failures are those of arm64, both because of your stated intention to support this architecture, and because it seems to be the common denominator in all test failures, so if you fix the problem there, there is a good chance that the improved numerical stability will also fix the other problematic architectures:

The following tests FAILED:
	  9 - test_fcl_capsule_capsule (Failed)
	 32 - test_gjk_libccd-inl_epa (Failed)
	 34 - test_gjk_libccd-inl_gjk_doSimplex2 (Failed)

For reference, I'll post the outputs of the failing tests here. Maybe someone with a better grasp of the codebase has a good idea on how to tackle those:

[ RUN      ] CapsuleCapsuleSegmentTest/0.OverlappingCenterLines
./test/test_fcl_capsule_capsule.cpp:755: Failure
Value of: this->RunTestConfiguration( CapsuleCapsuleSegmentTest<TypeParam>::kMultipleOverlap)
  Actual: false (Values at (0, 0) exceed tolerance: 11.862063780292655 vs. 12.178490743795633, diff = 0.31642696350297861, tolerance = 2.0097183471152322e-14
m1 =
 11.862063780292655
 13.293042338207796
-2.8160494855694154
m2 =
 12.178490743795633
 13.201144346700222
-1.6102598123986924
delta=
-0.31642696350297861
0.091897991507574162
 -1.2057896731707229)
Expected: true
[  FAILED  ] CapsuleCapsuleSegmentTest/0.OverlappingCenterLines, where TypeParam = double (0 ms)
[ RUN      ] FCL_GJK_EPA.ComputeVisiblePatchColinearNewVertex
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_epa.cpp:731: Failure
Value of: std::unordered_set<ccd_pt_edge_t*>( {&(tet.e(1)), &(tet.e(4)), &(tet.e(5))})
  Actual: { 0xaaaaedb29090, 0xaaaaedb29110, 0xaaaaedb28090 }
Expected: border_edges
Which is: { 0xaaaaedb28110, 0xaaaaedb29110, 0xaaaaedb27320 }
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_epa.cpp:733: Failure
Value of: 3u
  Actual: 3
Expected: visible_faces.size()
Which is: 1
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_epa.cpp:734: Failure
Value of: std::unordered_set<ccd_pt_face_t*>( {&(tet.f(0)), &(tet.f(1)), &(tet.f(3))})
  Actual: { 0xaaaaedb27c10, 0xaaaaedb27560, 0xaaaaedb27500 }
Expected: visible_faces
Which is: { 0xaaaaedb27c10 }
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_epa.cpp:736: Failure
Value of: 3u
  Actual: 3
Expected: internal_edges.size()
Which is: 0
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_epa.cpp:737: Failure
Value of: std::unordered_set<ccd_pt_edge_t*>( {&(tet.e(0)), &(tet.e(2)), &(tet.e(3))})
  Actual: { 0xaaaaedb27320, 0xaaaaedb28110, 0xaaaaedb27c70 }
Expected: internal_edges
Which is: {}
[  FAILED  ] FCL_GJK_EPA.ComputeVisiblePatchColinearNewVertex (1 ms)
[ RUN      ] DoSimplex2Test.OriginInSimplex
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_gjk_doSimplex2.cpp:72: Failure
Value of: 0
Expected: norm_OB_.dot(phat_OB_)
Which is: 7.5711e-18
[  FAILED  ] DoSimplex2Test.OriginInSimplex (0 ms)
[ RUN      ] DoSimplex2Test.NeedMoreComputing
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_gjk_doSimplex2.cpp:72: Failure
Value of: 0
Expected: norm_OB_.dot(phat_OB_)
Which is: 7.5711e-18
[  FAILED  ] DoSimplex2Test.NeedMoreComputing (0 ms)
[ RUN      ] DoSimplex2Test.Region1Boundary1
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_gjk_doSimplex2.cpp:72: Failure
Value of: 0
Expected: norm_OB_.dot(phat_OB_)
Which is: 7.5711e-18
[  FAILED  ] DoSimplex2Test.Region1Boundary1 (5 ms)
[ RUN      ] DoSimplex2Test.Region1Boundary2
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_gjk_doSimplex2.cpp:72: Failure
Value of: 0
Expected: norm_OB_.dot(phat_OB_)
Which is: 7.5711e-18
[  FAILED  ] DoSimplex2Test.Region1Boundary2 (5 ms)
[ RUN      ] DoSimplex2Test.Region1Boundary3
./test/narrowphase/detail/convexity_based_algorithm/test_gjk_libccd-inl_gjk_doSimplex2.cpp:72: Failure
Value of: 0
Expected: norm_OB_.dot(phat_OB_)
Which is: 7.5711e-18
[  FAILED  ] DoSimplex2Test.Region1Boundary3 (4 ms)

roehling avatar Dec 17 '20 19:12 roehling

@roehling Thanks for the elaboration. While it doesn't get us to the targeted support, at the very least, it's a monotonic improvement of the code. :) So, I'll take the little victory today even if we still haven't won the battle.

SeanCurtis-TRI avatar Dec 17 '20 21:12 SeanCurtis-TRI

We can automatically pass -mfpmath=sse -msse2 on i386 only, that is pretty trivial, and won't affect anyone else, but it won't be exercised by CI here, so if anything else i386 breaks it is not going to get caught.

jamiesnape avatar Dec 18 '20 13:12 jamiesnape

Regarding my issues trying to compiling for MSYS2/Mingw64 with gcc 10.2 (https://github.com/flexible-collision-library/fcl/issues/516) , this patch solved many problems . But some dll imports/exports definition problems are remaining

sancelot avatar Dec 19 '20 18:12 sancelot

The test result on my Apple M1 box:

95% tests passed, 2 tests failed out of 41 The following tests FAILED: 21 - test_fcl_signed_distance (Failed) 38 - test_sphere_box (Failed)

We can make the tests pass with a slight increase of tolerances.

mahiuchun avatar Feb 12 '21 04:02 mahiuchun

@mahiuchun We would welcome a PR with the exact tolerance changes.

jamiesnape avatar Feb 12 '21 14:02 jamiesnape

#522

mahiuchun avatar Feb 13 '21 02:02 mahiuchun