loki icon indicating copy to clipboard operation
loki copied to clipboard

JSON unmarshalling of labels fails to parse boolean values

Open hypesystem opened this issue 2 years ago • 13 comments

Describe the bug From version 2.9.0, JSON payloads sent to the HTTP push endpoint fail when there are boolean values in the labels section (under the field stream).

To Reproduce

Given a locally running loki instance on port 3100, execute the following bash script:

export data='{
    "streams": [
        {
            "stream": {
                "namespace": "hey",
                "has_errors": true
            }
        }
    ]
}'

curl -v http://localhost:3100/loki/api/v1/push -H "Content-Type: application/json" --data "$data"

The script outputs:

*    Trying 127.0.0.1:3100...
* Connected to localhost (127.0.0.1) port 3101 (#0)
> POST /loki/api/v1/push HTTP/1.1
> Host: localhost:3100
> User-Agent: curl/7.81.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 155
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Thu, 28 Sep 2023 13:17:42 GMT
< Content-Length: 264
< 
loghttp.PushRequest.Streams: []loghttp.LogProtoStream: unmarshalerDecoder: Value is string, but can't find closing '"' symbol, error found in #10 byte of ...|
        }
    ]
}|..., bigger context ...|        "has_errors": true
            }
        }
    ]
}|...
* Connection #0 to host localhost left intact

I used a simple local docker-compose to spin up Loki instances, and confirmed that the issue does not exist in 2.8.5 but exists in both 2.9.0 and 2.9.1:

version: "3"

networks:
  loki:

services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - loki

Expected behavior In 2.8.5 and previous versions the label is accepted (response code 204 No Content). I am not entirely sure what happens to boolean values or how they are included in the data...

Environment:

  • Infrastructure: docker-compose on laptop
  • Deployment tool: N/A

hypesystem avatar Sep 28 '23 13:09 hypesystem

On first sight it looks like changes in pgk/loghttp/entry.go are responsible for this. Here the individual keys and values are tried being parsed as string, so any other datatype as value should fail.

I will take a shot at this later today. :)

aurifi avatar Oct 18 '23 04:10 aurifi

Sorry, my first sight was incorrect. :D Your issue should actually already been fixed with this commit:

https://github.com/grafana/loki/commit/5ab9515020658860053590c9e38e0262be78a9b1

However this is not part of v2.9.2

aurifi avatar Oct 19 '23 17:10 aurifi

Ah that looks like it would do it, yes, @szczepad !

How do I track which release it goes out in? I'll test as soon as it becomes available in a release :smile:

hypesystem avatar Oct 24 '23 07:10 hypesystem

@szczepad fwiw this still seems to be a bug in 2.9.6 - but it works in 3.0.0 :smile:

I'd love to get it out in 2.x so we can get the fix without having to upgrade major version :smile:

hypesystem avatar Apr 09 '24 08:04 hypesystem

This seems like the opposite of what you'd probably want btw? 3.0.0 seems like the obvious time to get rid of non-string values in labels?

hypesystem avatar Apr 09 '24 13:04 hypesystem

hey sorry to bother you , this bug is ok in 2.9.7 :) we don't have log since this error .... thanks for all !

TimotheDavid avatar Apr 10 '24 14:04 TimotheDavid

I am still seeing the error in 2.9.7.

docker-compose.yml:

version: "3"

networks:
  loki:

services:
  loki-2.9.7:
    image: grafana/loki:2.9.7
    ports:
      - "3199:3100"
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - loki

curl.sh

export data='{
    "streams": [
        {
            "stream": {
                "namespace": "hey",
                "has_errors": true
            }
        }
    ]
}'

echo -e '\nLOKI 2.9.7 TEST BOOLEAN PARSE\n'
curl -v -X POST http://localhost:3199/loki/api/v1/push -H "Content-Type: application/json" --data "$data"

Running my test:

$ ./curl.sh
LOKI 2.9.7 TEST BOOLEAN PARSE

Note: Unnecessary use of -X or --request, POST is already inferred.
*   Trying 127.0.0.1:3199...
* Connected to localhost (127.0.0.1) port 3199 (#0)
> POST /loki/api/v1/push HTTP/1.1
> Host: localhost:3199
> User-Agent: curl/7.81.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 155
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 400 Bad Request
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Wed, 10 Apr 2024 14:42:14 GMT
< Content-Length: 264
< 
loghttp.PushRequest.Streams: []loghttp.LogProtoStream: unmarshalerDecoder: Value is string, but can't find closing '"' symbol, error found in #10 byte of ...|
        }
    ]
}|..., bigger context ...|        "has_errors": true
            }
        }
    ]
}|...
* Connection #0 to host localhost left intact

hypesystem avatar Apr 10 '24 14:04 hypesystem

thanks for all :) ok so we need to go to 3.0 ... I will begin to move :/

TimotheDavid avatar Apr 10 '24 14:04 TimotheDavid

Same issue here with Loki 3.0.0 and Logging-Operator 4.9.0.

R-Studio avatar Aug 22 '24 06:08 R-Studio

@R-Studio I checked 3.0.0 and didn't find the error there, c.f. https://github.com/grafana/loki/issues/10738#issuecomment-2044384294 - can you share a minimal reproduction of your setup to see if it is the same issue?

hypesystem avatar Aug 22 '24 06:08 hypesystem

So I have installed Logging-Operator version 4.9.0 with following ClusterFlow & ClusterOutput (Helm values):

...
  clusterFlows:
    - name: all-logs
      spec:
        filters:
          - tag_normaliser: {}
        match:
          - exclude: 
              labels:
                app.kubernetes.io/name: event-tailer
                app.kubernetes.io/instance: audit-host-tailer
          - select: {}
        globalOutputRefs: 
          - loki-all-output
  clusterOutputs:
    - name: loki-all-output
      spec:
        loki:
          url: http://loki-gateway.loki.svc.cluster.local
          line_format: key_value
          configure_kubernetes_labels: true
          extract_kubernetes_labels: true
          include_thread_label: false
          drop_single_key: true
          remove_keys: 
            - logtag
            - stream
            - time
          buffer:
            disable_chunk_backup: true
            flush_thread_count: 8
            flush_mode: immediate
            timekey: 1m
            timekey_wait: 5s
            retry_forever: false
            retry_timeout: 24h
...

And I get the following error in the Fluentd pods of the Logging-Operator:

2024-08-22 12:12:39 +0000 [warn]: #0 [clusterflow:logging-operator:all-logs:clusteroutput:logging-operator:loki-all-output] failed to write post to http://loki-gateway.loki.svc.cluster.local/loki/api/v1/push (400 Bad Request loghttp.PushRequest.Streams: []loghttp.LogProtoStream: unmarshalerDecoder: Value is string, but can't find closing '"' symbol, error found in #10 byte of ...|eged"}}]]}]}|..., bigger context ...|-security.kubernetes.io/enforce":"privileged"}}]]}]}|...
)

We have Loki installed with the Helm chart version 6.6.3.

R-Studio avatar Aug 22 '24 12:08 R-Studio

I upgrade Loki to the newest Helm chart 6.10.0 (no difference):

message="level=error ts=2024-08-22T12:24:40.459112136Z caller=manager.go:49 component=distributor path=write msg=\"write operation failed\" details=\"couldn't parse push request: loghttp.PushRequest.Streams: []loghttp.LogProtoStream: unmarshalerDecoder: Value is string, but can't find closing '\\\"' symbol, error found in #10 byte of ...|\\\\\\\"}}\\\\\\\"\\\"]]}]}|..., bigger context ...|ernetes.io/enforce\\\\\\\\\\\\\\\"=>\\\\\\\\\\\\\\\"privileged\\\\\\\\\\\\\\\"}}\\\\\\\"\\\"]]}]}|...\" org_id=fake" kubernetes_namespace="{\"name\"=>\"loki\", \"labels\"=>{\"kubernetes.io/metadata.name\"=>\"loki\"}}"

R-Studio avatar Aug 22 '24 12:08 R-Studio

I found a solution/workaround by removing the key kubernetes_namespace. So my new ClusterOutput looks like this:

...
  clusterOutputs:
    - name: loki-all-output
      spec:
        loki:
          url: http://loki-gateway.loki.svc.cluster.local
          line_format: key_value
          configure_kubernetes_labels: true
          extract_kubernetes_labels: true
          include_thread_label: false
          drop_single_key: true
          remove_keys: 
            - logtag
            - stream
            - time
            - kubernetes_namespace   # This fixes namespace labels extraction (https://github.com/grafana/loki/issues/10738)
          buffer:
            disable_chunk_backup: true
            flush_thread_count: 8
            flush_mode: immediate
            timekey: 1m
            timekey_wait: 5s
            retry_forever: false
            retry_timeout: 24h

R-Studio avatar Aug 22 '24 13:08 R-Studio

We are experiencing the same behavior as described, workaround with kubernetes_namespace helped as well. Installed Helm-chart versions are:

  • Logging-operator: 5.1.1
  • Loki: 6.25.1

viktordaniel avatar Feb 12 '25 15:02 viktordaniel

We have been seeing an almost identical error using recent versions of Logging Operator and Loki (v3.4.2).

level=error ts=2025-05-14T11:48:03.623136385Z caller=manager.go:50 component=distributor path=write msg="write operation failed" details="couldn't parse push request: loghttp.PushRequest.Streams: []loghttp.LogProtoStream: unmarshalerDecoder: Value is string, but can't find closing '\"' symbol, error found in #10 byte of ...|3\",null]]}]}|..., bigger context ...|06\",\"(no headers)\"],[\"1747223280349181203\",null]]}]}|..." org_id=fake

Since we never see the entire log message it is very hard for us to pinpoint the exact root cause or any shared traits between the messages. We do however see an increased occurrences of null values in the short bits of the log message.

We have adjusted our ClusterFlow to drop empty messages and now we have not seen any unmarshalerDecoder errors for almost an hour and they used to be a steady stream of them.

  filters:
    # drop empty messages
    - grep:
        exclude:
          - key: message
            pattern: "/^\\s*$/"
Image

Starefossen avatar May 14 '25 12:05 Starefossen

I think we're hitting the same bug, but not finding any error messages, so not 100% sure. But here goes:

We're using kube-logging as well, have a basic flow that parses a message as json.

On the output, we try to set .spec.loki.labels.blocked: $.transaction.is_interrupted, where $.transaction.is_interrupted is a boolean value. Logs end up in Loki as expected, except that the blocked label is missing. The $.transaction.is_interrupted field is no longer in the record, so it has been parsed.

We currently circumvented this by using a record modifier to turn the boolean into a string. Doing this, the blocked label does show up as expected. But it's kind of an ugly hack. IMHO, if Loki can't take non-string values for labels, the fluentd plugin should normalize labels into strings.

dmaes avatar Aug 25 '25 14:08 dmaes