redpanda icon indicating copy to clipboard operation
redpanda copied to clipboard

"Unexpected EOF for client ID" errors in logs (multiple tests)

Open jcsp opened this issue 2 years ago • 9 comments

Bunch of tests affected by this in the last 24h, all running in clustered ducktape:

FAIL test: OpenBenchmarkTest.test_multiple_topics_omb (1/2 runs) failure at 2022-07-07T10:51:23.073Z: <BadLogLines nodes=ip-172-31-49-192(1) example="ERROR 2022-07-07 07:46:12,798 [shard 0] rpc - server.cc:119 - kafka rpc protocol - Error[applying protocol] remote address: 205.210.31.33:41517 - std::runtime_error (Unexpected EOF for client ID)"> in job https://buildkite.com/redpanda/vtools/builds/2806#0181d77e-2fe1-4594-876a-062eb75b50c0

FAIL test: ManyClientsTest.test_many_clients (1/2 runs) failure at 2022-07-06T10:53:03.664Z: <BadLogLines nodes=ip-172-31-59-231(1) example="ERROR 2022-07-06 09:23:11,802 [shard 1] rpc - server.cc:119 - kafka rpc protocol - Error[applying protocol] remote address: 205.210.31.27:46681 - std::runtime_error (Unexpected EOF for client ID)"> in job https://buildkite.com/redpanda/vtools/builds/2780#0181d253-0b91-4bf5-9c2b-2a1c91031e77

FAIL test: TopicRecreateTest.test_topic_recreation_while_producing.workload=ACKS_ALL.cleanup_policy=compact (1/58 runs) failure at 2022-07-06T10:53:03.664Z: <BadLogLines nodes=ip-172-31-56-85(1) example="ERROR 2022-07-06 10:20:22,681 [shard 0] rpc - server.cc:119 - kafka rpc protocol - Error[applying protocol] remote address: 170.130.187.22:50217 - std::runtime_error (Unexpected EOF for client ID)"> in job https://buildkite.com/redpanda/vtools/builds/2780#0181d253-0b91-4bf5-9c2b-2a1c91031e77

FAIL test: MaintenanceTest.test_maintenance_sticky.use_rpk=True (1/58 runs) failure at 2022-07-06T10:53:03.664Z: <BadLogLines nodes=ip-172-31-60-107(1) example="ERROR 2022-07-06 09:49:33,637 [shard 0] rpc - server.cc:119 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.138:56205 - std::runtime_error (Unexpected EOF for client ID)"> in job https://buildkite.com/redpanda/vtools/builds/2780#0181d253-0b91-4bf5-9c2b-2a1c91031e77

It's not obvious why we're seeing more of these all of a sudden -- the code that generates that exception hasn't changed recently. Previously I see only 3 instances of this in the entire last 30 days.

jcsp avatar Jul 07 '22 15:07 jcsp

@graphcareful did you end up finding something here? i recall a discussion on slack about this.

dotnwat avatar Jul 13 '22 06:07 dotnwat

I've been taking a look at the nightly build runs since @ajfabbri 's commit of vtools/785, but I haven't seen anything. My guess is, it was a rouge connection. 785 might help us track down what machine made that request.

graphcareful avatar Jul 13 '22 15:07 graphcareful

Closing as this hasn't been seen since, will assume rogue connection was the cause of the spurious error logs

graphcareful avatar Jul 26 '22 13:07 graphcareful

More incidences:

FAIL test: ClusterConfigTest.test_restart (1/2 runs) failure at 2022-07-31T11:04:20.402Z: <BadLogLines nodes=ip-172-31-62-254(1) example="ERROR 2022-07-31 06:41:46,763 [shard 0] rpc - server.cc:126 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.143:34111 - std::runtime_error (Unexpected EOF for client ID)"> in job https://buildkite.com/redpanda/vtools/builds/3073#018252da-dd82-4dac-bad9-ea37369ded6e FAIL test: FranzGoVerifiableWithSiTest.test_si_with_timeboxed.segment_size=20971520 (1/2 runs) failure at 2022-07-31T11:04:20.402Z: <BadLogLines nodes=ip-172-31-59-221(1) example="ERROR 2022-07-31 06:53:29,415 [shard 0] rpc - server.cc:126 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.13:46633 - std::runtime_error (Unexpected EOF for client ID)"> in job https://buildkite.com/redpanda/vtools/builds/3073#018252da-dd82-4dac-bad9-ea37369ded6e

I can buy that this is a symptom of a rogue client, but we still need to get to the bottom of it

jcsp avatar Aug 01 '22 12:08 jcsp

Agreed, I'm seeing IP addresses that start with 198.x.x.x in the logs

report.txt:    <BadLogLines nodes=ip-172-31-62-254(1) example="ERROR 2022-07-31 06:41:46,763 [shard 0] rpc - server.cc:126 
- kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.143:34111 - std::runtime_error (Unexpected EOF f
or client ID)">                                                                                                            

However when taking a look at the associated cluster.json in the results root, I do not see any IPs that match. Can we confirm then that this client is not a portion of our deployed infra for CDT?

➜  ducktape cat cluster.json                                 
{                                                            
    "nodes": [                                               
        {                                                    
            "externally_routable_ip": "35.86.201.197",
            "ssh_config": {                                  
                "host": "ip-172-31-59-139",
                "hostname": "172.31.59.139",                 
                "identityfile": "/home/ubuntu/.ssh/id_rsa",
                "port": 22,                                  
                "user": "root",                              
                "password": null                           
            }                                                
        },                                                   
        {                                                    
            "externally_routable_ip": "44.242.173.148",
            "ssh_config": {                                  
                "host": "ip-172-31-51-160",
                "hostname": "172.31.51.160",                 
                "identityfile": "/home/ubuntu/.ssh/id_rsa",
                "port": 22,                                  
                "user": "root",                              
                "password": null                           
            }                                                
        },                                                   
        {                                                    
            "externally_routable_ip": "35.87.222.235",
            "ssh_config": {                                  
                "host": "ip-172-31-55-1",
                "hostname": "172.31.55.1",                   
                "identityfile": "/home/ubuntu/.ssh/id_rsa",
                "port": 22,                                  
                "user": "root",                              
                "password": null                           
            }                                                
        },                                                   
....
etc

graphcareful avatar Aug 01 '22 15:08 graphcareful

Can we confirm then that this client is not a portion of our deployed infra for CDT?

No idea -- it might be interesting if you try and to spin up a clustered ducktape environment and e.g. poke around in netstat to see if you can sniff out any traffic from funny looking IPs, and also check that the nodes aren't listening to the outside world: they're not supposed to be (terraform controls security group rules for these instances) but worth checking.

jcsp avatar Aug 01 '22 15:08 jcsp

https://ci-artifacts.dev.vectorized.cloud/vtools/01827b68-ebc9-4ff8-867f-d9461479bafe/vbuild/ducktape/results/2022-08-08--001/TopicRecreateTest/test_topic_recreation_while_producing/workload=ACKS_1.cleanup_policy=delete/188/

BenPope avatar Aug 08 '22 08:08 BenPope

Slight variation here:

FAIL test: TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=delete (1/29 runs)
  failure at 2022-08-07T08:32:31.679Z: <BadLogLines nodes=ip-172-31-56-167(1) example="ERROR 2022-08-07 06:32:58,874 [shard 0] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 162.142.125.211:36738 - std::runtime_error (Invalid client_id size -27453)">

in job https://buildkite.com/redpanda/vtools/builds/3133#01827642-8678-41b8-a243-c532aa5f80ce

And same variation here:

  failure at 2022-08-08T07:13:23.597Z: <BadLogLines nodes=ip-172-31-60-231(1) example="ERROR 2022-08-08 03:41:32,002 [shard 0] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.3:60085 - std::runtime_error (Unexpected EOF for client ID)">

in job https://buildkite.com/redpanda/vtools/builds/3139#01827b68-ebc9-4ff8-867f-d9461479bafe

And here:

FAIL test: NodeOperationFuzzyTest.test_node_operations.enable_failures=True.num_to_upgrade=3 (1/2 runs)
 failure at 2022-08-08T07:13:23.597Z: <BadLogLines nodes=ip-172-31-48-220(1) example="ERROR 2022-08-08 04:26:36,936 [shard 2] rpc - server.cc:116 - kafka rpc protocol - Error[applying protocol] remote address: 205.210.31.141:52343 - std::runtime_error (Unexpected EOF for client ID)">

in job https://buildkite.com/redpanda/vtools/builds/3139#01827b68-ebc9-4ff8-867f-d9461479bafe

BenPope avatar Aug 08 '22 09:08 BenPope

https://buildkite.com/redpanda/vtools/builds/3139#01827b68-ebc9-4ff8-867f-d9461479bafe

dotnwat avatar Aug 09 '22 04:08 dotnwat

https://buildkite.com/redpanda/vtools/builds/3156#018285b5-c1c2-4484-b724-4dcd4e03d7a3

BenPope avatar Aug 11 '22 09:08 BenPope

https://buildkite.com/redpanda/vtools/builds/3156#018285b5-c1c2-4484-b724-4dcd4e03d7a3

Module: rptest.tests.partition_move_interruption_test
Class:  PartitionMoveInterruption
Method: test_cancelling_partition_move
Arguments:
{
  "recovery": "restart_recovery",
  "replication_factor": 1,
  "unclean_abort": false
}
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-58-85(1) example="ERROR 2022-08-10 05:11:29,018 [shard 3] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 162.142.125.222:54538 - std::runtime_error (Invalid client_id size -29814)">

And

Module: rptest.tests.partition_movement_test
Class:  SIPartitionMovementTest
Method: test_cross_shard
Arguments:
{
  "num_to_upgrade": 2
}
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-52-45(1) example="ERROR 2022-08-10 05:02:05,921 [shard 0] rpc - server.cc:116 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.142:57045 - std::runtime_error (Unexpected EOF for client ID)">

BenPope avatar Aug 11 '22 10:08 BenPope

https://buildkite.com/redpanda/vtools/builds/3183#01828adc-074b-45e5-b224-1245d61997dc

Module: rptest.scale_tests.ht_partition_movement_test
Class:  HighThroughputPartitionMovementTest
Method: test_interrupting_partition_movement_under_load
Arguments:
{
  "replication_factor": 3
}
rptest.services.utils.BadLogLines: <BadLogLines nodes=ip-172-31-63-199(1) example="ERROR 2022-08-11 05:40:47,472 [shard 0] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 198.235.24.141:43031 - std::runtime_error (Unexpected EOF for client ID)">

BenPope avatar Aug 11 '22 12:08 BenPope

https://buildkite.com/redpanda/vtools/builds/3220#01829002-7fe5-4775-aba3-8fa06d20b3d3

Module: rptest.scale_tests.ht_partition_movement_test
Class:  HighThroughputPartitionMovementTest
Method: test_interrupting_partition_movement_under_load
Arguments:
{
  "replication_factor": 1
}

BenPope avatar Aug 12 '22 10:08 BenPope

After doing an IP address lookup on one of the client IPs:

IP Address: 162.142.125.222
Hostname: scanner-25.ch1.censys-scanner.com
Location: Michigan, USA

looks like it is a port scanning tool. I will add some additional logging in redpanda to confirm the data parsed before the failure is inconsistent with the Kafka protocol.

I think devprod (@ivotron ) should chime in here. I see the CDT installation code has all of the nodes able to be publicly accessible from the outside internet (advertise_public_ips: true), probably useful for integration with buildkite and our tooling, but it isn't necessary for all ports to be accessible, like 9092. I think we should only really expose what is necessary to start and obtain results from buildkite, probably only 22 and have the other ports only reachable from other CDT nodes.

graphcareful avatar Aug 12 '22 16:08 graphcareful

I think we should only really expose what is necessary to start and obtain results from buildkite, probably only 22 and have the other ports only reachable from other CDT nodes.

I agree. I think we've ended up in an in-between state where we're going to the trouble of running ducktape remotely in an EC2 instance in order to avoid relying on public access to all the nodes, but nothing was actually put in place to prevent the nodes being publicly accessible.

In fact, 9092 is explicitly allowed in from cluster.tf (most other ports are not)

  ingress {
    from_port   = 9092
    to_port     = 9092
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

Maybe we just remove that cidr_blocks from the 9092 ingress rule.

jcsp avatar Aug 15 '22 09:08 jcsp

Seen again here

FAIL test: HighThroughputPartitionMovementTest.test_interrupting_partition_movement_under_load.replication_factor=3 (1/1 runs) failure at 2022-08-17T07:38:31.224Z: <BadLogLines nodes=ip-172-31-63-238(1) example="ERROR 2022-08-17 06:08:28,327 [shard 0] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 104.206.128.42:62057 - std::runtime_error (Unexpected EOF for client ID, client_id_size: 11825, header: {key:12064, version:18516, correlation_id:1414541105, client_id:nullopt, number_of_tagged_fields: 0, tags_size_bytes:0})"> in job https://buildkite.com/redpanda/vtools/builds/3271#0182a9c3-a7aa-4850-a6b7-65bee8152d80

andrwng avatar Aug 17 '22 18:08 andrwng

Maybe we just remove that cidr_blocks from the 9092 ingress rule.

Will this affect our ability to run the ducktape runner from our laptops?

dotnwat avatar Aug 19 '22 17:08 dotnwat

Maybe we just remove that cidr_blocks from the 9092 ingress rule.

Will this affect our ability to run the ducktape runner from our laptops?

Yes, if this was working at all currently? I think most of the other opened ports were not exposed to the open internet already, so I'd be surprised if all of the tests were working if you ran ducktape from your laptop.

I don't love that we run ducktape from a dedicated VM, but I think really supporting running from a remote machine will require a bit of thought, possibly defaulting to mTLS for kafka connections and/or refactoring all the places tests currently expect to speak port 9092 directly to nodes under test (rather than using a producer/consumer service)

jcsp avatar Aug 20 '22 11:08 jcsp

Another instance: https://buildkite.com/redpanda/vtools/builds/3339#0182c381-f5d0-413b-a501-4d7eae01ff51

FAIL test: TestMirrorMakerService.test_consumer_group_mirroring.source_type=kafka (1/17 runs)
  failure at 2022-08-22T07:24:31.179Z: <BadLogLines nodes=ip-172-31-54-137(1) example="ERROR 2022-08-22 03:41:15,042 [shard 3] rpc - server.cc:127 - kafka rpc protocol - Error[applying protocol] remote address: 205.210.31.2:63489 - std::runtime_error (Unexpected EOF for client ID, client_id_size: 11825, header: {key:12064, version:18516, correlation_id:1414541105, client_id:nullopt, number_of_tagged_fields: 0, tags_size_bytes:0})">
      in job https://buildkite.com/redpanda/vtools/builds/3339#0182c381-f5d0-413b-a501-4d7eae01ff51

bharathv avatar Aug 22 '22 21:08 bharathv

@graphcareful this is still assigned to you: are you working on closing the ports?

jcsp avatar Aug 23 '22 08:08 jcsp

@jcsp i've actually been trying to get in touch with the author(s) of ducky.py or CDT to get more information. I'll ask around on devprod channels

graphcareful avatar Aug 23 '22 14:08 graphcareful

Two more occurrences: https://buildkite.com/redpanda/vtools/builds/3386#0182cdd0-bf0c-491d-8f68-f302f435c994

bharathv avatar Aug 25 '22 05:08 bharathv

Leaving an update here, so i've been running a few tests from my working PR here https://github.com/redpanda-data/vtools/pull/900

The last run is here: https://buildkite.com/redpanda/vtools/builds/3408#0182d532-5135-4032-9a61-4fa5640f4a1a

With the last batch of changes it looks like all tests are passing with the exceptions of the ones that use the kgo verifier.

FAIL: TimeoutError('Timed out waiting for status endpoint KgoVerifierProducer-0-140154481953040 to be available')

[DEBUG - 2022-08-25 15:55:05,324 - kgo_verifier_services - is_ready - lineno:226]: Status endpoint KgoVerifierProducer-0-140154681963376 not ready: HTTPConnectionPool(host='ip-172-31-60-73', port=8080): Max retries exceeded with url: /status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f78421b3070>: Failed to establish a new connection: [Errno 111] Connection refused'))

I'm thinking maybe kgo-verifier isn't ready because it has been blocked due to the closure of 9092 somehow?

EDIT: After seeing some of the recent modifications to the kgo-verifier class and how the dependency itself was recently updated I think I just needed to rebase my vtools branch against dev. Re-running new test now

graphcareful avatar Aug 26 '22 00:08 graphcareful

Sorry, those "Status endpoint unavailable" ones are my fault, I forgot to do this: https://github.com/redpanda-data/vtools/pull/910

jcsp avatar Aug 26 '22 13:08 jcsp

Looks like it worked:

SESSION REPORT (ALL TESTS)
  | ducktape version: 0.8.8
  | session_id:       2022-08-26--001
  | run time:         74 minutes 53.254 seconds
  | tests run:        506
  | passed:           451
  | failed:           0
  | ignored:          0
  | opassed:          55
  | ofailed:          0

graphcareful avatar Aug 26 '22 13:08 graphcareful

Manually closing this issue as resolved due to merge of: https://github.com/redpanda-data/vtools/pull/900 please re-open if issue is still observed

graphcareful avatar Aug 26 '22 13:08 graphcareful