uftrace icon indicating copy to clipboard operation
uftrace copied to clipboard

Random failure in ./runtest 222

Open aPatchyDev opened this issue 1 year ago • 2 comments

Reproduced random failure for ./runtest -vdik 222 on x64 WSL2 Ubuntu 20.04 The info below is from one instance of failure where the external data comment is missing

uftrace replay -f time output

#     TIMESTAMP       FUNCTION
   426883.003490100 | main() {
   426883.003490500 |   a() {
   426883.003490600 |     b() {
   426883.003490700 |       c() {
   426883.003491400 |         getpid();
   426883.003493200 |       } /* c */
   426883.003493500 |     } /* b */
   426883.003493600 |   } /* a */
   426883.003493700 | } /* main */

Expected output

#     TIMESTAMP       FUNCTION
   426883.003490100 | main() {
   426883.003490101 | /* external-data: user message */
   426883.003490500 |   a() {
   426883.003490600 |     b() {
   426883.003490700 |       c() {
   426883.003491400 |         getpid();
   426883.003493200 |       } /* c */
   426883.003493500 |     } /* b */
   426883.003493600 |   } /* a */
   426883.003493700 | } /* main */

extern.dat content: 426883.3490101 user message

  • Editing the file to add 2 leading zeros after the decimal does not solve the problem

Above failure might be due to parsing error when timestamp has leading zero (unverified claim)

Originally posted by @JHH20 in https://github.com/namhyung/uftrace/issues/1063#issuecomment-1206474208

So far, this test seems to fail in 2 different ways:

  • The external-data message is missing entirely
  • The message is misplaced between 2 functions rather than directly below main()

I originally thought it might be related to buffering differences between tty and non-tty, but passing a pseudo-tty in ./runtest.py via pty.openpty did not solve the issue.

aPatchyDev avatar Aug 05 '22 13:08 aPatchyDev

Adding output when external-data message is misplaced between 2 functions.

Test ran as ./runtest.py -dk 222 and the following was the 4th test

#     TIMESTAMP       FUNCTION
    35775.262034200 | __monstartup();
    35775.262035200 | __cxa_atexit();
    35775.262035800 | main() {
    35775.262035800 |   a() {
    35775.262035801 | /* external-data: user message */
    35775.262035900 |     b() {
    35775.262035900 |       c() {
    35775.262036000 |         getpid();
    35775.262036400 |       } /* c */
    35775.262036500 |     } /* b */
    35775.262036600 |   } /* a */
    35775.262036600 | } /* main */

Note that main() and a() have the exact same timestamp, so the current method of adding 1 from main() to create the external-data timestamp does not guarantee the intended ordering.

If runtest.py always sort the lines using stable sorting algorithm, then it would be better to make external-data timestamp identical to main() and insert it in the correct location.

aPatchyDev avatar Aug 06 '22 12:08 aPatchyDev

It looks like a clock precision issue on WSL and nothing much we can do about it. Please fix the problem with the leading 0s.

namhyung avatar Aug 07 '22 20:08 namhyung

The following unit test shows the problem.

diff --git a/utils/utils.c b/utils/utils.c
index aa0ef2f4..66267350 100644
--- a/utils/utils.c
+++ b/utils/utils.c
@@ -1065,4 +1065,21 @@ TEST_CASE(utils_strv)

        return TEST_OK;
 }
+
+TEST_CASE(parse_timestamp)
+{
+       uint64_t ts;
+
+       ts = parse_timestamp("8960070.725168293");
+       TEST_EQ(ts, 8960070725168293);
+
+       ts = parse_timestamp("8959832.082682731");
+       TEST_EQ(ts, 8959832082682731);    // fails here
+
+       ts = parse_timestamp("426883.003490100");
+       TEST_EQ(ts, 426883003490100);     // fails here
+
+       return TEST_OK;
+}
+
 #endif /* UNIT_TEST */

It fails as follows.

$ make TESTARG="-v parse_timestamp" unittest
  TEST     test_unit
Running 1 test cases
======================
Testing parse_timestamp...
test failed at /home/honggyu/work/uftrace/utils/utils.c:1077: ts == 8959832082682731
  ts               = 8959832000000000
  value_2          = 8959832082682731
[001] parse_timestamp               : FAIL

honggyukim avatar Aug 15 '22 07:08 honggyukim

This means parse_timestamp("8959832.082682731") returns incorrect result. It looks there is an error when handling leading 0 in the timestamp.

honggyukim avatar Aug 15 '22 07:08 honggyukim

125 report_range also has the same issue as @jonghyeon-an reported.

$ uftrace replay -f +time
# DURATION     TID        TIMESTAMP       FUNCTION
   1.534 us [ 50740]  8959832.082671765 | __monstartup();
   0.906 us [ 50740]  8959832.082677104 | __cxa_atexit();
            [ 50740]  8959832.082678806 | main() {
            [ 50740]  8959832.082679093 |   a() {
            [ 50740]  8959832.082679299 |     b() {
            [ 50740]  8959832.082679493 |       c() {
   2.119 us [ 50740]  8959832.082679756 |         getpid();
   3.238 us [ 50740]  8959832.082682731 |       } /* c */
   3.877 us [ 50740]  8959832.082683176 |     } /* b */
   4.332 us [ 50740]  8959832.082683425 |   } /* a */
   4.830 us [ 50740]  8959832.082683636 | } /* main */

$ uftrace report -F main -r ~8959832.082682731
  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================

honggyukim avatar Aug 15 '22 07:08 honggyukim

We have to fix the logic in parse_timestamp. https://github.com/namhyung/uftrace/blob/4d6fe0d68e47aec5b70d91cf29732e1027817993/utils/utils.c#L529-L564

honggyukim avatar Aug 18 '22 12:08 honggyukim

parse_timestamp

When I checked the code, it seems that the base should be changed from 0 to 10 in the strtoull function. After the change, the test case you shared was passed.

diff --git a/utils/utils.c b/utils/utils.c
index aa0ef2f4..e7f6cb2a 100644
--- a/utils/utils.c
+++ b/utils/utils.c
@@ -532,12 +532,12 @@ uint64_t parse_timestamp(char *arg)
        uint64_t ts, tmp;
        int len;
 
        tmp = strtoull(arg, &sep, 0);
        ts = tmp * NSEC_PER_SEC;
 
        if (*sep == '.') {
                arg = sep + 1;
-               tmp = strtoull(arg, &sep, 0);
+               tmp = strtoull(arg, &sep, 10);
 
                len = 0;
                while (isdigit(*arg)) {

After changes

$ make TESTARG="-v parse_timestamp" unittest
  TEST     test_unit
Running 1 test cases
======================
Testing parse_timestamp...
[001] parse_timestamp               : PASS
-------------

unit test stats
====================
  1 ran successfully
  0 failed
  0 skipped
  0 signal caught
  0 unknown result

It also passed the 222 and 125 tests several times.

Jonghyun-An avatar Aug 19 '22 03:08 Jonghyun-An

After a few more tests, NG occurs when there are 8 decimal places like 1544510.78979574 in test222. The comparison syntax seems to indicate that 1544510078979574 is correct, not 1544510789795740. When removing the code as below, 8 decimal places pass through the case.

diff --git a/utils/utils.c b/utils/utils.c
index aa0ef2f4..ede7db40 100644
--- a/utils/utils.c
+++ b/utils/utils.c
@@ -537,7 +537,7 @@ uint64_t parse_timestamp(char *arg)
 
        if (*sep == '.') {
                arg = sep + 1;
-               tmp = strtoull(arg, &sep, 0);
+               tmp = strtoull(arg, &sep, 10);
 
                len = 0;
                while (isdigit(*arg)) {
@@ -546,10 +546,10 @@ uint64_t parse_timestamp(char *arg)
                }
 
                /* if resolution is lower than nsec */
-               while (len < 9) {
-                       tmp *= 10;
-                       len++;
-               }
+               // while (len < 9) {
+               //      tmp *= 10;
+               //      len++;
+               // }
 
                /* if resolution is higher than nsec */
                while (len > 9) {

Jonghyun-An avatar Aug 19 '22 05:08 Jonghyun-An

Thanks for the analysis. Please fix it with a test case!

namhyung avatar Aug 19 '22 06:08 namhyung

I suggest also setting the strtoull base to 10 @ L535, which parses the number before the decimal point, for consistency. Octal in a timestamp doesn't make much sense anyways.

https://github.com/namhyung/uftrace/blob/9b05004b3ede81579b3aaa05008a425e8e287fa0/utils/utils.c#L535-L540

aPatchyDev avatar Aug 19 '22 15:08 aPatchyDev

I thought the implementor of the parse_timestamp function coded the argument value of line 535 as 0 to cover multiple bases. Is it right to fix?

Also, in the parse_time function, the base value of strtoull was coded as 0, so it would be nice to discuss whether it is necessary to fix it together.

https://github.com/namhyung/uftrace/blob/9b05004b3ede81579b3aaa05008a425e8e287fa0/utils/utils.c#L473-L484

Jonghyun-An avatar Aug 19 '22 15:08 Jonghyun-An