logstash-codec-json icon indicating copy to clipboard operation
logstash-codec-json copied to clipboard

Optionally support handling of \x escape codes

Open jordansissel opened this issue 10 years ago • 36 comments

Some clever ways of generating JSON are observed in nginx and apache httpd configs by using the log formatting function of these webservers to write json instead of plain text. The way this occurs is to use the normal log formatter functions of these programs.

However, because nginx and apache try to log readable text, any non-printable text will be visible as a two-digit, hex-encoded escapes, like \x09

We already make concessions for "invalid json" due to often-misconfigured nxlog json (via charset settings), so I think we can make another concession here for \xNN codes.

Proposal:

input {
  redis {
    codec => json { escaped_hex_codes => true }
  }
}

New setting, escaped_hex_codes, when true, would automatically process \xNN text before passing to the JSON parser.

jordansissel avatar Jan 15 '15 19:01 jordansissel

I am open to input on a better setting name than escaped_hex_codes

jordansissel avatar Jan 15 '15 19:01 jordansissel

Example nginx log_format that can potentially create json that makes parsing barf:

log_format json '{"time":"$time_iso8601",'
                   '"host":"$http_host",'
                   '"remote":"$remote_addr",'
                   '"origin":"$http_x_forwarded_for",'
                   '"request":"$request",'
                   '"code":$status,'
                   '"size":$body_bytes_sent,'
                   '"referrer":"$http_referer",'
                   '"agent":"$http_user_agent"}';

Usually in the $http_user_agent variable.

tylerjl avatar Jan 15 '15 19:01 tylerjl

Example output that causes exception:

{
  ...[snip]...
  "agent":"User-Agent\x09Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
}

tylerjl avatar Jan 15 '15 19:01 tylerjl

Another sample nginx log message with escaped double quotes. This is generated by logging the body of a JSON POST:

{"time":"2015-01-15T19:29:42+00:00","host":"www.sdlkjsdlkj.com","remote":"10.20.30.40","origin":"12.34.56.78","request":"POST / HTTP/1.1","code":200,"size":0,"referrer":"-","agent":"curl/7.30.0","body":"{\x22foo\x22:\x22bar\x22}"}

kurtado avatar Jan 15 '15 19:01 kurtado

Has there been any advancement or workarounds for this?

srizzling avatar Sep 30 '15 22:09 srizzling

just tried to add some app related things to nginx access log, to avoid uuid joining between 2 logs, and found this =( any plans making it real? Same as @kurtado we got one of fields json, and how i think we'll use ruby filter to unescape it.

burdandrei avatar Jan 05 '16 10:01 burdandrei

so guys find a solution yet? My head hurts now ~_~!

example text: {\x22name\x22:\x22tom\x5Ccat\x22} If the results obtained should be run properly this: {"name":"tom\cat"} Directly in the ruby runtime it is possible, but in an environment logstash!

bravecharge avatar Jan 18 '16 02:01 bravecharge

and now, I use this way to solve: ruby { code => 'str=event["request_body"]; str=str.gsub("\x22",""").gsub("\x5C", ""); event["request_body"]=str;' }

bravecharge avatar Jan 18 '16 05:01 bravecharge

ruby {
  code => "a = eval(%Q{\"#{event['escaped_parsed_request'][1..-2]}\"}); event['unescaped_parsed_request'] = (a == '-') ? '{}' : a"
}

burdandrei avatar Jan 18 '16 09:01 burdandrei

@jordansissel is it the same codec that is used in json filter? cause json filter is failing for me the same way =(

burdandrei avatar Feb 24 '16 15:02 burdandrei

I will work on this soon. I think we can come up with a workable solution; will report back soon.

jordansissel avatar Feb 24 '16 17:02 jordansissel

Great thanks, If there's anything particular you need help with, I'd like to volunteer to help (I mean this task, not yellow leafs on your back yard :grin: )

burdandrei avatar Feb 25 '16 08:02 burdandrei

@burdandrei haha, thank you for the offer of help! I'm hoping we can make a patch of some kind available soon that we'd love help testing. Stay tuned. :)

jordansissel avatar Mar 03 '16 18:03 jordansissel

Heres a ruby filter based solution

        ruby {
                code => "event['message'] = event['message'].scan(/(?:\\x([a-f0-9][a-f0-9]))|(.)/i).map { |x| if x[0] == '22' then '\\"' elsif x[0] == '27' then '\\\'' elsif (x[0] == '5C' || x[0] == '5c') then '\\\\' else x[1] || x[0].hex.chr end}.join"
        }
        json {
                source => "message"
#               charset => "UTF-8"
        }

Probably not 100% UTF8/MB safe.

splitice avatar Apr 14 '16 09:04 splitice

Thanks to @splitice for the ruby code filter. I have replaced my GSub filter with the ruby one and I am no longer receiving the JSON decode errors in my log file, and access to resources now correctly appear in Kibana.

ruddj avatar May 06 '16 01:05 ruddj

👍

jonathanpmartins avatar Jun 01 '16 06:06 jonathanpmartins

@splitice I'm trying to use your solution, but I get:

{:timestamp=>"2016-09-29T10:57:10.380000+0000", :message=>"Ruby exception occurred: undefined method `scan' for nil:NilClass", :level=>:error}

endlesslight0 avatar Sep 29 '16 10:09 endlesslight0

I had to change input codec from json to plain. @yalok, I think this is what you missed.

I also had to tweak the ruby code to avoid saturating all 24 cores:

filter {
  ruby {
    code => "if event['message'].include?('\x') then event['message'] = event['message'].scan(/(?:\\x([a-f0-9][a-f0-9]))|(.)/i).map { |x| if x[0] == '22' then '\\"' elsif x[0] == '27' then '\\\'' elsif (x[0] == '5C' || x[0] == '5c') then '\\\\' elsif x[0] == '0A' then '\\n' else x[1] || x[0].hex.chr end}.join end"
  }
  json {
    source => "message"
  }
  mutate {
    remove_field => ["message"]
  }
}

This adds \n encoding and checking for \x before doing any changes. Otherwise it's what @splitice posted.

bobrik avatar Dec 10 '16 00:12 bobrik

As a further addendum to generalize this, you could rewrite the scan logic as:

event['message'].scan(/(?:\\x([a-f0-9]{2}))|(.)/i).map { |x| x[0] ? [x[0].to_i(16)].pack('U'): x[1] }.join

kreynolds avatar Dec 14 '16 20:12 kreynolds

Can confirm we are still using a variant of the code I posted way back in April. We loose a small percentage of messages due to invalid sequences unfortunately,

I'll probably have a go about fixing that in 2017 unless someone with more time than I goes about fixing the more fundamental character decoding issue.

splitice avatar Dec 14 '16 22:12 splitice

@splitice @yalok

We got the same error now.

=_=

        ruby {
                code => "event['message'] = event['message'].scan(/(?:\\x([a-f0-9][a-f0-9]))|(.)/i).map { |x| if x[0] == '22' then '\\"' elsif x[0] == '27' then '\\\'' elsif (x[0] == '5C' || x[0] == '5c') then '\\\\' else x[1] || x[0].hex.chr end}.join"
        }
        json {
                source => "message"
#               charset => "UTF-8"
        }
Ruby exception occurred: undefined method `scan' for nil:NilClass {:level=>:error}

vus520 avatar Dec 16 '16 10:12 vus520

Well the message is telling you message is null. I've never had that, and if you are feeding in a source with message that can't happen. It's probably missconf of your input.

splitice avatar Dec 16 '16 10:12 splitice

Maybe the best way is url decode before json decode.

vus520 avatar Dec 16 '16 12:12 vus520

@vus520 urlencoding is entirely different format. %20 etc.

splitice avatar Dec 17 '16 03:12 splitice

@splitice

how about replace \x to % first then urldecode .

vus520 avatar Dec 17 '16 05:12 vus520

that would result strings containing %[num] being incorrectly translated.

splitice avatar Dec 17 '16 05:12 splitice

Good news for nginx users: log_format supports json encoding starting from nginx 1.11.8:

  • http://nginx.org/en/docs/http/ngx_http_log_module.html#log_format

bobrik avatar Jan 01 '17 22:01 bobrik

Has anyone done any compatibility testing with strange encoding or real-world data?

I noticed there isnt any json support for nginx's error_log yet

splitice avatar Jan 04 '17 12:01 splitice

@splitice

try this one.

filter{

    ruby {
        code => " if event['message']
            event['message'] = event['message'].gsub('\x22','').gsub('\x9B','').gsub('\x08','').gsub('\x09','').gsub('\x','')
            end
        "
    }

    json{
        source => "message"
    }
}

Not the best way, replace one by one ....

vus520 avatar Jan 12 '17 10:01 vus520

    `code =>  'event["request_body"] = event["request_body"].gsub("\\x22","\"").gsub("\\x5C","/")'`

test alot. this work..

jinleileiking avatar Apr 25 '17 06:04 jinleileiking