iceoryx icon indicating copy to clipboard operation
iceoryx copied to clipboard

icehello example may happen bug when system time rollback

Open 1su1 opened this issue 2 years ago • 11 comments

Required information

Operating system: E.g. Ubuntu 20.04.3 LTS

Compiler version: E.g. gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Observed result or behaviour:

iox-cpp-publisher-helloworld sent value: 8
iox-cpp-publisher-helloworld sent value: 9
iox-cpp-publisher-helloworld sent value: 10
iox-cpp-publisher-helloworld sent value: 11
iox-cpp-publisher-helloworld sent value: 12
iox-cpp-publisher-helloworld sent value: 13
/home/sxy/ros2_galactic/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/posix_wrapper/mutex.cpp:92 { bool iox::posix::mutex::lock() }  :::  [ 22 ]  Success
2022-03-16 15:58:22.422 [ Error ]: ICEORYX error! POPO__CHUNK_LOCKING_ERROR
iox-cpp-publisher-helloworld: /home/sxy/ros2_galactic/src/eclipse-iceoryx/iceoryx/iceoryx_utils/source/error_handling/error_handling.cpp:56: static void iox::ErrorHandler::ReactOnErrorLevel(iox::ErrorLevel, const char*): Assertion `false' failed.
Aborted

image

Expected result or behaviour: When system time rollback happened, all as usual.

iox-cpp-publisher-helloworld sent value: 4
iox-cpp-publisher-helloworld sent value: 5
iox-cpp-publisher-helloworld sent value: 6
iox-cpp-publisher-helloworld sent value: 7
iox-cpp-publisher-helloworld sent value: 8
iox-cpp-publisher-helloworld sent value: 9
iox-cpp-publisher-helloworld sent value: 10
iox-cpp-publisher-helloworld sent value: 11
iox-cpp-publisher-helloworld sent value: 12
iox-cpp-publisher-helloworld sent value: 13
iox-cpp-publisher-helloworld sent value: 14
iox-cpp-publisher-helloworld sent value: 15
iox-cpp-publisher-helloworld sent value: 16
iox-cpp-publisher-helloworld sent value: 17
^C2022-03-16 18:28:50.752 [Verbose]: RouDi cleaned up resources of iox-cpp-publisher-helloworld. Shutting down gracefully.

Conditions where it occurred / Performed steps:

1. iox-toudi
2. ./iox-cpp-subscriber-helloworld
3. iox-cpp-publisher-helloworld
4. timedatectl set-ntp false 
5. timedatectl set-time 15:58:30 (and maybe need "timedatectl set-time 15:58:20"), to make system time rollback

1su1 avatar Mar 16 '22 05:03 1su1

We used iceoryx in TDA4, but in TDA4 sometimes happend system time rollback in few days, and then ice will throw exception. So I try it in x86 Linux, and it also happened, need help.

1su1 avatar Mar 16 '22 06:03 1su1

@1su1

I think we never considered this use case and I think this is something iceoryx does not support. The reason is that we have cyclic task which run with a defined period. Here we use functions which are sleep_until like to solve the problem of jittering or moving periodicity.

Stuff like this can happen:

  1. Time is 11:00 pm, RouDi waits for the heartbeat which has to come every 500ms.
  2. Now we have a time rollback and it is 10:00 pm, our central broker RouDi assumes now that it waits for over an hour for the heartbeat and will kill the application since it seems like it is not responding anymore.

We use internal timers which wait for requests (for instance when you create/destroy ports like the subscriber/publisher) It may happen when you turn back the time while roudi is waiting that the waiting time increases by several ours and from a user perspective it looks like a deadlock.

Iceoryx does not support a change of the time while it is running. When it is working then it is just by accident.

I will investigate this further and keep this as a bug open. But please do not count on a quick fix in the next days!

For the time being I sadly can only recommend to not change the system time while iceoryx is running.

elfenpiff avatar Mar 16 '22 11:03 elfenpiff

Thanks for the analysis, that's enough. I haven't delved into iceoryx at the moment, I just want to know why it crashes in this case now. and then I'll try to look into iceoryx.

1su1 avatar Mar 16 '22 13:03 1su1

I fixed this bug by modified "iceoryx_posh_types.hpp : 254" from using BaseClock_t = std::chrono::steady_clock; to using BaseClock_t = std::chrono::system_clock;

Will it affect other modules?

1su1 avatar Aug 05 '22 07:08 1su1

@1su I don't think this is a viable solution. The system clock has the issue that it can be changed by the user while the steady clock should only monotonically increase

elBoberido avatar Aug 08 '22 17:08 elBoberido

But using this approach does solve the problem

1su1 avatar Aug 11 '22 11:08 1su1

@1su1 it may solve your specific problem but creates another one. The steady clock is not supposed to change and therefore that one is used. I'm not even sure why this problem happens on your side since you wrote that there is a system time rollback. Is this correct or is there be steady clock reset?

A proper fix for this would be to refactor the liveliness detection to not rely on any timestamps (#1361) since this also makes debugging harder. A quick fix for you would be to disable the monitoring and start RouDi with -m off. Is this a viable workaround until we have a proper fix?

elBoberido avatar Aug 11 '22 11:08 elBoberido

@1su1 @elBoberido The system time rollback will cause m_keepAliveTask to work abnormally, because the underlying part of m_keepAliveTask uses sem_timedwait to delay sending keeplive packets.I have now changed to using std::this_thread::sleep_for to delay sending the keeplive package while adapting to thread exit logic. The test results are normal.

yanxiaochun1005 avatar Mar 14 '23 09:03 yanxiaochun1005

inline void PeriodicTask<T>::run() noexcept { do { IOX_DISCARD_RESULT(m_callable()); std::this_thread::sleep_for(std::chrono::milliseconds(m_interval.toMilliseconds())); } while (!m_stop_flag) } m_stop_flag is adapting to thread exit logic

yanxiaochun1005 avatar Mar 14 '23 09:03 yanxiaochun1005

@elfenpiff I've met the same issue, when time rollback on version 2.0.3. I guess this issue may be related to the different clocks used by roudi & helloworld apps. Roudi use steady_clock when checking keepAlive signal, while helloworld apps use PeriodicTask to send keepAlive signal which use TimeSpecReference::Epoch. I suggest that PeriodicTask/Semaphore shall use TimeSpecReference::Monotonic instead.

wjswjq avatar Jul 13 '23 08:07 wjswjq

@wjswjq This is sadly not so easily possible since the posix semaphore does not support monotonic clocks in sem_timedwait(), see: https://pubs.opengroup.org/onlinepubs/9699919799/functions/sem_timedwait.html

I think we have to use here sem_clockwait (gnu libc extension) and sem_timedwait_monotonic (qnx) but this will not play well with all the other platforms :/ ... I will try to think of something.

elfenpiff avatar Jul 13 '23 09:07 elfenpiff