logstash-codec-json Optionally support handling of \x escape codes

Some clever ways of generating JSON are observed in nginx and apache httpd configs by using the log formatting function of these webservers to write json instead of plain text. The way this occurs is to use the normal log formatter functions of these programs.

However, because nginx and apache try to log readable text, any non-printable text will be visible as a two-digit, hex-encoded escapes, like \x09

We already make concessions for "invalid json" due to often-misconfigured nxlog json (via charset settings), so I think we can make another concession here for \xNN codes.

Proposal:

input {
  redis {
    codec => json { escaped_hex_codes => true }
  }
}

New setting, escaped_hex_codes, when true, would automatically process \xNN text before passing to the JSON parser.

Jan 15 '15 19:01 jordansissel

I am open to input on a better setting name than escaped_hex_codes

Jan 15 '15 19:01 jordansissel

Example nginx log_format that can potentially create json that makes parsing barf:

log_format json '{"time":"$time_iso8601",'
                   '"host":"$http_host",'
                   '"remote":"$remote_addr",'
                   '"origin":"$http_x_forwarded_for",'
                   '"request":"$request",'
                   '"code":$status,'
                   '"size":$body_bytes_sent,'
                   '"referrer":"$http_referer",'
                   '"agent":"$http_user_agent"}';

Usually in the $http_user_agent variable.

Jan 15 '15 19:01 tylerjl

Example output that causes exception:

{
  ...[snip]...
  "agent":"User-Agent\x09Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
}

Jan 15 '15 19:01 tylerjl

Another sample nginx log message with escaped double quotes. This is generated by logging the body of a JSON POST:

{"time":"2015-01-15T19:29:42+00:00","host":"www.sdlkjsdlkj.com","remote":"10.20.30.40","origin":"12.34.56.78","request":"POST / HTTP/1.1","code":200,"size":0,"referrer":"-","agent":"curl/7.30.0","body":"{\x22foo\x22:\x22bar\x22}"}

Jan 15 '15 19:01 kurtado

Has there been any advancement or workarounds for this?

Sep 30 '15 22:09 srizzling

just tried to add some app related things to nginx access log, to avoid uuid joining between 2 logs, and found this =( any plans making it real? Same as @kurtado we got one of fields json, and how i think we'll use ruby filter to unescape it.

Jan 05 '16 10:01 burdandrei

so guys find a solution yet? My head hurts now ~_~!

example text: {\x22name\x22:\x22tom\x5Ccat\x22} If the results obtained should be run properly this: {"name":"tom\cat"} Directly in the ruby runtime it is possible, but in an environment logstash!

Jan 18 '16 02:01 bravecharge

and now, I use this way to solve: ruby { code => 'str=event["request_body"]; str=str.gsub("\x22",""").gsub("\x5C", ""); event["request_body"]=str;' }

Jan 18 '16 05:01 bravecharge

ruby {
  code => "a = eval(%Q{\"#{event['escaped_parsed_request'][1..-2]}\"}); event['unescaped_parsed_request'] = (a == '-') ? '{}' : a"
}

Jan 18 '16 09:01 burdandrei

@jordansissel is it the same codec that is used in json filter? cause json filter is failing for me the same way =(

Feb 24 '16 15:02 burdandrei

I will work on this soon. I think we can come up with a workable solution; will report back soon.

Feb 24 '16 17:02 jordansissel

Great thanks, If there's anything particular you need help with, I'd like to volunteer to help (I mean this task, not yellow leafs on your back yard :grin: )

Feb 25 '16 08:02 burdandrei

@burdandrei haha, thank you for the offer of help! I'm hoping we can make a patch of some kind available soon that we'd love help testing. Stay tuned. :)

Mar 03 '16 18:03 jordansissel

Heres a ruby filter based solution

        ruby {
                code => "event['message'] = event['message'].scan(/(?:\\x([a-f0-9][a-f0-9]))|(.)/i).map { |x| if x[0] == '22' then '\\"' elsif x[0] == '27' then '\\\'' elsif (x[0] == '5C' || x[0] == '5c') then '\\\\' else x[1] || x[0].hex.chr end}.join"
        }
        json {
                source => "message"
#               charset => "UTF-8"
        }

Probably not 100% UTF8/MB safe.

Apr 14 '16 09:04 splitice

Thanks to @splitice for the ruby code filter. I have replaced my GSub filter with the ruby one and I am no longer receiving the JSON decode errors in my log file, and access to resources now correctly appear in Kibana.

May 06 '16 01:05 ruddj

👍

Jun 01 '16 06:06 jonathanpmartins

@splitice I'm trying to use your solution, but I get:

{:timestamp=>"2016-09-29T10:57:10.380000+0000", :message=>"Ruby exception occurred: undefined method `scan' for nil:NilClass", :level=>:error}

Sep 29 '16 10:09 endlesslight0

I had to change input codec from json to plain. @yalok, I think this is what you missed.

I also had to tweak the ruby code to avoid saturating all 24 cores:

filter {
  ruby {
    code => "if event['message'].include?('\x') then event['message'] = event['message'].scan(/(?:\\x([a-f0-9][a-f0-9]))|(.)/i).map { |x| if x[0] == '22' then '\\"' elsif x[0] == '27' then '\\\'' elsif (x[0] == '5C' || x[0] == '5c') then '\\\\' elsif x[0] == '0A' then '\\n' else x[1] || x[0].hex.chr end}.join end"
  }
  json {
    source => "message"
  }
  mutate {
    remove_field => ["message"]
  }
}

This adds \n encoding and checking for \x before doing any changes. Otherwise it's what @splitice posted.

Dec 10 '16 00:12 bobrik

As a further addendum to generalize this, you could rewrite the scan logic as:

event['message'].scan(/(?:\\x([a-f0-9]{2}))|(.)/i).map { |x| x[0] ? [x[0].to_i(16)].pack('U'): x[1] }.join

Dec 14 '16 20:12 kreynolds

Can confirm we are still using a variant of the code I posted way back in April. We loose a small percentage of messages due to invalid sequences unfortunately,

I'll probably have a go about fixing that in 2017 unless someone with more time than I goes about fixing the more fundamental character decoding issue.

Dec 14 '16 22:12 splitice

@splitice @yalok

We got the same error now.

=_=

        ruby {
                code => "event['message'] = event['message'].scan(/(?:\\x([a-f0-9][a-f0-9]))|(.)/i).map { |x| if x[0] == '22' then '\\"' elsif x[0] == '27' then '\\\'' elsif (x[0] == '5C' || x[0] == '5c') then '\\\\' else x[1] || x[0].hex.chr end}.join"
        }
        json {
                source => "message"
#               charset => "UTF-8"
        }

Ruby exception occurred: undefined method `scan' for nil:NilClass {:level=>:error}

Dec 16 '16 10:12 vus520

Well the message is telling you message is null. I've never had that, and if you are feeding in a source with message that can't happen. It's probably missconf of your input.

Dec 16 '16 10:12 splitice

Maybe the best way is url decode before json decode.

Dec 16 '16 12:12 vus520

@vus520 urlencoding is entirely different format. %20 etc.

Dec 17 '16 03:12 splitice

@splitice

how about replace \x to % first then urldecode .

Dec 17 '16 05:12 vus520

that would result strings containing %[num] being incorrectly translated.

Dec 17 '16 05:12 splitice

Good news for nginx users: log_format supports json encoding starting from nginx 1.11.8:

http://nginx.org/en/docs/http/ngx_http_log_module.html#log_format

Jan 01 '17 22:01 bobrik

Has anyone done any compatibility testing with strange encoding or real-world data?

I noticed there isnt any json support for nginx's error_log yet

Jan 04 '17 12:01 splitice

@splitice

try this one.

filter{

    ruby {
        code => " if event['message']
            event['message'] = event['message'].gsub('\x22','').gsub('\x9B','').gsub('\x08','').gsub('\x09','').gsub('\x','')
            end
        "
    }

    json{
        source => "message"
    }
}

Not the best way, replace one by one ....

Jan 12 '17 10:01 vus520

    `code =>  'event["request_body"] = event["request_body"].gsub("\\x22","\"").gsub("\\x5C","/")'`

test alot. this work..

Apr 25 '17 06:04 jinleileiking

logstash-codec-json logstash-codec-json copied to clipboard

Optionally support handling of \x escape codes

logstash-codec-json
logstash-codec-json copied to clipboard