
No retry on unwritten events to TCP-output

Open · semyonkozlov opened this issue 3 months ago · 4 comments

Setup

One openwec daemon running, receiving events from a Windows machine. Config:

[[collectors]]
listen_address = "0.0.0.0"
listen_port = 5985

[collectors.authentication]
type = "Tls"
ca_certificate = "/mnt/persist/openwec-certs/ca-cert.pem"
server_certificate = "/mnt/persist/openwec-certs/server-cert.pem"
server_private_key = "/mnt/persist/openwec-certs/server-key.pem"

[logging]
verbosity = "info"
server_logs = "stdout"
access_logs = "stdout"

[database]
type = "SQLite"
path = "/var/lib/openwec/db/openwec.sqlite"

[monitoring]
listen_address = "0.0.0.0"
listen_port = 9090

One active subscription:

Subscription test-tcp-json
        UUID: b199461f-1394-4e90-b03a-8b14d7e2d93d
        Internal version: 06bc69ad-ae36-480f-a75c-7f30e4d8524a
        Public version: d52e481e-723c-bfe8-d52e-481e723cbfe8
        Revision: Not configured
        URI: Not configured
        Heartbeat interval: 3600s
        Connection retry count: 5
        Connection retry interval: 60s
        Max time without heartbeat/events: 5s
        Max events in a batch: Not configured
        Max envelope size: 512000 bytes
        Read existing events: false
        Content format: Raw
        Ignore channel error: true
        Locale: Not configured
        Data Locale: Not configured
        Principal filter: Not configured
        Outputs:
        - 0: Enabled: true, Format: json, Driver: Tcp(TcpConfiguration { host: "localhost", port: 5000, tls_enabled: false, tls_certificate_authorities: [], tls_certificate: None, tls_key: None })
        Enabled: true
        Event filter query:

<QueryList><Query Id="0" Path="System"><Select Path="System">*</Select></Query></QueryList>

Receiving events on localhost with socat:

socat TCP-LISTEN:5000,reuseaddr,fork -

Problem

Events sent while the socat server is down are lost.

How to reproduce

  1. Run socat
  2. Receive some events
  3. Ctrl+C socat
  4. Wait for Broken pipe message in openwec logs
WARN server::logic - Failed to process output and send event: Failed to write event to output format: Json, driver: Tcp(TcpConfiguration { host: "localhost", port: 5000, tls_enabled: false, tls_certificate_authorities: [], tls_certificate: None, tls_key: None })
Caused by:
     Failed to write in TCP connection (localhost:5000): Broken pipe (os error 32)
  5. Run socat again
  6. Receive some events, confirm the gap
$ socat TCP-LISTEN:5000,reuseaddr,fork - | jq | grep EventRecordID
    "EventRecordID": 40886,
    "EventRecordID": 40887,
    "EventRecordID": 40888,
    "EventRecordID": 40889,
    "EventRecordID": 40890,
    "EventRecordID": 40891,
    "EventRecordID": 40892,
    "EventRecordID": 40893,
^C
$ socat TCP-LISTEN:5000,reuseaddr,fork - | jq | grep EventRecordID
    "EventRecordID": 40902,
    "EventRecordID": 40903,
    "EventRecordID": 40904,
    "EventRecordID": 40905,
    "EventRecordID": 40906,

semyonkozlov · Sep 19 '25, 12:09

Hi! Thanks for bringing up the issue.

When an output fails, OpenWEC returns status code 500 to clients and discards the newly received bookmark. I believe each client then drops the subscription and tries to create a new one. However, we know (#291) that there is an issue with the TLS authentication method that makes clients ignore the bookmark sent by the server during subscription creation, resulting in event loss.
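The intended server-side rule can be sketched roughly as follows (hypothetical Python with stub `FlakyOutput` and `Db` classes, not OpenWEC's actual Rust code): the new bookmark is only persisted after the output write succeeds, so a failed write leaves the old bookmark in place and the client re-sends the same events on its next retry.

```python
# Hedged sketch (not OpenWEC source): persist the bookmark only after
# the output accepts the events; on failure, return HTTP 500 so the
# client retries from the last persisted position.

class FlakyOutput:
    def __init__(self):
        self.fail = True
        self.written = []
    def write(self, events):
        if self.fail:
            raise OSError("Broken pipe")
        self.written.extend(events)

class Db:
    def __init__(self):
        self.bookmark = "old"
    def save_bookmark(self, bookmark):
        self.bookmark = bookmark

def handle_batch(events, new_bookmark, output, db):
    try:
        output.write(events)
    except OSError:
        # Output failed: do NOT persist new_bookmark. The client sees
        # HTTP 500 and will re-send the same events later.
        return 500
    db.save_bookmark(new_bookmark)  # persisted only after a successful write
    return 200

output, db = FlakyOutput(), Db()
assert handle_batch([42345], "bm-42345", output, db) == 500
assert db.bookmark == "old"   # unchanged: event 42345 will be re-sent
output.fail = False
assert handle_batch([42345], "bm-42345", output, db) == 200
assert db.bookmark == "bm-42345"
```

This matches the retry pattern visible later in the trace, where event 42345 is re-sent every retry interval until the output write finally succeeds.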

To confirm this theory, could you replay the scenario with the openwec log level set to trace?

vruello · Sep 19 '25, 15:09

Depending on how socat handles SIGINT, it is also possible to lose a few messages (the content of the kernel buffer) on plain TCP due to the lack of application-level acknowledgment.
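This can be illustrated with a small self-contained sketch (plain Python sockets, nothing OpenWEC-specific): after the peer has closed its end, a `send()` can still succeed, because success only means the bytes reached the local kernel's send buffer, not that any application read them.

```python
import socket
import time

# Illustration only: a TCP send() succeeding does not mean delivery.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]

client = socket.create_connection(("127.0.0.1", port))
server_side, _ = listener.accept()

# The "receiver" goes away, like Ctrl+C'ing socat.
server_side.close()
time.sleep(0.1)  # let the FIN reach the client

# This send() still succeeds: the 25 bytes land in the client's kernel
# send buffer even though no application will ever read them.
sent = client.send(b'{"EventRecordID": 40894}\n')
print("send() reported", sent, "bytes written")

client.close()
listener.close()
```

Only a later write on the same connection observes the failure (typically a broken pipe or connection reset), which is why the sender sees the error one batch too late.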

MrAnno · Sep 19 '25, 19:09

> To confirm this theory, could you replay the scenario with the openwec log level set to trace?

Hi!

$ socat TCP-LISTEN:6789,reuseaddr,fork - | jq | grep EventRecordID
    "EventRecordID": 42333,
    "EventRecordID": 42334,
    "EventRecordID": 42335,
    "EventRecordID": 42336,
    "EventRecordID": 42337,
^C
$ socat TCP-LISTEN:6789,reuseaddr,fork - | jq | grep EventRecordID
    "EventRecordID": 42345,
    "EventRecordID": 42346,
    "EventRecordID": 42347,
    "EventRecordID": 42348,
    "EventRecordID": 42349,
    "EventRecordID": 42350,
    "EventRecordID": 42351,
    "EventRecordID": 42352,

openwec.log

semyonkozlov · Sep 26 '25, 14:09

The trace confirms that openwec correctly received the missing events, so my previous assumption was incorrect.

  • 2025-09-26T14:13:46.220951774+00:00: events 42333 -> 42337 have been received
  • 2025-09-26T14:15:22.956897092+00:00: events 42338 -> 42344 have been received
  • 2025-09-26T14:16:24.184468448+00:00: event 42345 has been received, but the output fails (Failed to write in TCP connection (localhost:6789): Broken pipe (os error 32)), so openwec sends an error to the client
  • 2025-09-26T14:17:24.230420210+00:00: event 42345 has been received (again), but the output fails (Failed to connect to localhost:6789: Failed to establish TCP connection), so openwec sends an error back to the client
  • 2025-09-26T14:18:24.243395486+00:00: same
  • 2025-09-26T14:19:24.249624981+00:00: event 42345 has been received (again), and the output worked as expected
  • 2025-09-26T14:19:24.262068876+00:00: events 42346 -> 42355 have been received

I guess you stopped socat between 14:13 and 14:15.

The missing events have been dropped at some point. OpenWEC writes directly to the TcpStream, which is a thin wrapper over write() without any buffering (so there is no need to call flush()). The events have therefore most likely reached the sender's kernel buffer; after that, everything depends on the kernel TCP stack. Currently, openwec has no way to guarantee that events placed in the kernel send buffer have actually been received by the application on the other side. I don't know if this is possible without an application-level acknowledgment system, which is beyond the scope of the TCP driver.
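For illustration only, here is what a minimal application-level acknowledgment could look like (hypothetical Python, not an OpenWEC feature): the sender advances its delivered position only after the receiver explicitly confirms each event, which is exactly the guarantee the plain TCP driver cannot provide on its own.

```python
import socket
import threading

# Hypothetical sketch: an event only counts as delivered (safe to
# advance the "bookmark") once the receiver explicitly acknowledges it.
ACK = b"OK\n"

def receiver(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    with conn:
        for line in conn.makefile("rb"):
            # Process the event, then confirm receipt to the sender.
            conn.sendall(ACK)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=receiver, args=(listener,), daemon=True).start()

sender = socket.create_connection(listener.getsockname())
delivered = []
for event_id in (42338, 42339, 42340):
    sender.sendall(b'{"EventRecordID": %d}\n' % event_id)
    if sender.recv(len(ACK)) == ACK:  # block until the peer confirms
        delivered.append(event_id)    # only now advance the position
print("confirmed delivered:", delivered)
sender.close()
```

With a scheme like this, events written into a dead connection's kernel buffer would simply never be acknowledged and could be re-sent after reconnecting; a queuing system such as Kafka provides this kind of acknowledgment out of the box.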

If you need more guarantees (which you should), I strongly advise using a more robust queuing system such as Kafka.

vruello · Sep 26 '25, 18:09