Investigate issue with tablet_integration_test.go
tablet_integration_test.go is flaky, occasionally failing with the following error:

```
tablet_integration_test.go:135: trace should show only one shard but it showed 0 shards, shards: [] trace:
Tracing session 003a78b03e8111ef8cdda3e1b5278e98 (coordinator: , duration: 0s):
```
We need to figure out where this issue comes from: the driver or Scylla.
The initial guess was that the test is flaky because there is some kind of delay before tracing data becomes available, and that the fix would be to add sleeps between executing the query and reading from the trace buffer. My tests showed that this is not the case: if the trace is empty, as in the example in the description, it stays empty no matter how long we wait. This is how I changed the test to check how much of a delay is needed:
```go
for k := 0; k < 100; k++ {
	i = 0
	for i < 50 {
		i = i + 1
		var pk int
		var ck int
		var v int
		buf := &bytes.Buffer{}
		trace := NewTraceWriter(session, buf)
		err = session.Query(`SELECT pk, ck, v FROM test1.table1 WHERE pk = ?;`, i).WithContext(ctx).Consistency(One).Trace(trace).Scan(&pk, &ck, &v)
		if err != nil {
			t.Fatal(err)
		}
		queriedHosts := 0
		for j := 0; j < 2000; j++ {
			for _, hostAddress := range hostAddresses {
				if strings.Contains(buf.String(), hostAddress) {
					queriedHosts = queriedHosts + 1
				}
			}
			if queriedHosts != 0 {
				if j != 0 {
					fmt.Println("IT IS HERE: ", j)
				}
				break
			}
			time.Sleep(100 * time.Millisecond)
			if j == 1999 {
				fmt.Println("not enough time: ", j)
			}
		}
		if queriedHosts != 1 {
			t.Fatalf("trace should show only one host but it showed %d hosts, hosts: %s trace:\n%s", queriedHosts, hostAddresses, buf.String())
		}
	}
}
```
It fails on every execution:

```
not enough time: 1999
--- FAIL: TestTablets (207.73s)
    tablet_integration_test.go:88: trace should show only one host but it showed 0 hosts, hosts: [192.168.100.12 192.168.100.14 192.168.100.13] trace:
        Tracing session 38c33c904d7b11efb93e0a0ec56fcac0 (coordinator: , duration: 0s):
FAIL
```
@dkropachev @roydahan
I compared the code in gocql to the code in other drivers, and these are my findings:

- the python-driver has a mechanism to retry the session detail fetch, because tracing details are stored asynchronously; if the trace is still not available after `max_wait` seconds, `TraceUnavailable` is raised - https://github.com/scylladb/python-driver/blob/8a4387ae3f36522cc10841c607abb8182c6f8286/cassandra/query.py#L981-L994
- the rust-driver has a `tracing_info_fetch_attempts` field that defines how many times the driver will try to fetch tracing information, with sleeps in between - https://github.com/scylladb/scylla-rust-driver/blob/36d5edd3a06a6808ce19dd594ea4904c8b48c0c2/scylla/src/transport/session.rs#L1411-L1422

In gocql there is no such mechanism, and there is a need for one. Personally I would prefer an approach similar to the rust-driver's. @dkropachev let me know if you would suggest otherwise.
100% there should be retry.
Another problem that needs to be fixed in the same PR: reading the tracing info should be done asynchronously, otherwise enabling tracing will slow down query processing.
Also, I realized that traceWriter is not actually all that useful; it just writes custom-formatted messages into a buffer.
So we should probably abandon traceWriter and stop using it, even in tests.
Instead, we could come up with an API that solves this cleanly and without running tracing subroutines. Something along these lines:
```go
func (t *tracer) Trace(traceID []byte)
func (t *tracer) AllTraceIDs() [][]byte
func (t *tracer) GetCoordinatorTime(traceID []byte) (coordinator string, duration time.Duration, err error)
func (t *tracer) GetActivity(traceID []byte) (timestamp time.Time, activity string, source string, elapsed int, thread string, err error)
```
The idea of how to use it is the following:

```go
trace := NewTraceWriter(session, buf)
err = session.Query(`SELECT pk, ck, v FROM test1.table1 WHERE pk = ?;`, i).WithContext(ctx).Consistency(One).Trace(trace).Scan(&pk, &ck, &v)
if err != nil {
	t.Fatal(err)
}
for _, traceID := range trace.AllTraceIDs() {
	coordinator, d, err := trace.GetCoordinatorTime(traceID)
	.....
	timestamp, activity, source, elapsed, thread, err := trace.GetActivity(traceID)
	.....
}
```
And since we are on the topic of tracing, we could probably also split between the different types of traces: prepared statement, query, batch query.
Regarding what to do next, my suggestion is the following:

- Open a separate issue for the changes we want to make in the tracer and continue the discussion there.
- Close this one as `bug, wontfix`.
done