
Create a custom tokenizer that is more log friendly

Open fulmicoton opened this issue 3 years ago • 2 comments

For logs we want a nicer tokenizer.

We probably should not split on `_` or `-`, nor on `.` when it appears inside a number.
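A minimal sketch of that rule, written in Python with a regular expression purely for illustration (this is not Quickwit or tantivy code, and the names `TOKEN_RE` and `tokenize` are hypothetical). It assumes ASCII alphanumerics only, keeps `_` and `-` when they sit between alphanumeric runs, and keeps `.` only when it has a digit on both sides:

```python
import re
from typing import List

# Hypothetical log-friendly token pattern (an assumption, not an existing API):
# an alphanumeric run, optionally extended by "_" or "-" followed by another
# alphanumeric run, or by "." when the dot sits between two digits.
TOKEN_RE = re.compile(
    r"""
    [A-Za-z0-9]+
    (?:
        [_-][A-Za-z0-9]+                    # do not split on "_" or "-"
      | (?<=[0-9])\.(?=[0-9])[A-Za-z0-9]+   # do not split on "." within a number
    )*
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> List[str]:
    """Return the tokens found in `text`, in order of appearance."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("event_id 1331901000.000000 /DEASLog02.nsf Mozilla/5.0"))
    # ['event_id', '1331901000.000000', 'DEASLog02', 'nsf', 'Mozilla', '5.0']
```

With this rule a dotted quad such as an IPv4 address also stays in one piece, because every dot in it is between digits. A dot between a digit and a letter (as in `DEASLog02.nsf`) still splits.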

fulmicoton avatar Feb 23 '22 12:02 fulmicoton

Here is an example of a log:

{
  "event_id": "123e4567-e89b-12d3-a456-426614174000",
  "payload": "1331901000.000000    CHEt7z3AzG4gyCNgci    192.168.202.79    50465    192.168.229.251    80    1    HEAD 192.168.229.251    /DEASLog02.nsf    -    Mozilla/5.0 (compatible; Nmap Scripting Engine; http://nmap.org/book/nse.html)    0    0    404    Not Found    -    -    -    (empty)    -    -    -    -    -    -    -"
}

We could potentially isolate IP addresses like 192.168.202.79.
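One possible way to isolate such addresses, sketched in Python for illustration only (the names `IPV4_CANDIDATE_RE` and `extract_ipv4` are hypothetical, and only IPv4 is handled): find dotted-quad candidates with a regular expression, then let the standard `ipaddress` module reject anything that is not a valid address.

```python
import ipaddress
import re
from typing import List

# Candidate dotted quads: four groups of 1 to 3 digits. The lookarounds stop
# the pattern from matching inside a longer number or a longer dotted sequence,
# while still allowing a sentence-ending period right after the address.
IPV4_CANDIDATE_RE = re.compile(
    r"(?<![\w.])\d{1,3}(?:\.\d{1,3}){3}(?!\w)(?!\.\d)"
)


def extract_ipv4(text: str) -> List[str]:
    """Return the valid IPv4 addresses found in `text`, in order of appearance."""
    found = []
    for candidate in IPV4_CANDIDATE_RE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # raises ValueError if out of range
        except ValueError:
            continue
        found.append(candidate)
    return found


if __name__ == "__main__":
    payload = "1331901000.000000    CHEt7z3AzG4gyCNgci    192.168.202.79    50465    192.168.229.251    80"
    print(extract_ipv4(payload))
    # ['192.168.202.79', '192.168.229.251']
```

A four-part version number such as `1.2.3.4` would also be reported as an address; telling the two apart needs context that this sketch does not use.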

fmassot avatar May 20 '22 19:05 fmassot

Another example of a log taken from https://github.com/elastic/rally-tracks/tree/master/http_logs

{"message" : "211.11.9.0 - - [1998-06-21T15:00:01-05:00] \"GET /english/index.html HTTP/1.0\" 304 0"}

Or https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs

54.36.149.41 - - [22/Jan/2019:03:56:14 +0330] "GET /filter/27|13%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,27|%DA%A9%D9%85%D8%AA%D8%B1%20%D8%A7%D8%B2%205%20%D9%85%DA%AF%D8%A7%D9%BE%DB%8C%DA%A9%D8%B3%D9%84,p53 HTTP/1.1" 200 30577 "-" "Mozilla/5.0 (compatible; AhrefsBot/6.1; +http://ahrefs.com/robot/)" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/60844/productModel/200x200 HTTP/1.1" 200 5667 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
31.56.96.51 - - [22/Jan/2019:03:56:16 +0330] "GET /image/61474/productModel/200x200 HTTP/1.1" 200 5379 "https://www.zanbil.ir/m/filter/b113" "Mozilla/5.0 (Linux; Android 6.0; ALE-L21 Build/HuaweiALE-L21) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.158 Mobile Safari/537.36" "-"
40.77.167.129 - - [22/Jan/2019:03:56:17 +0330] "GET /image/14925/productModel/100x100 HTTP/1.1" 200 1696 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" "-"
91.99.72.15 - - [22/Jan/2019:03:56:17 +0330] "GET /product/31893/62100/%D8%B3%D8%B4%D9%88%D8%A7%D8%B1-%D8%AE%D8%A7%D9%86%DA%AF%DB%8C-%D9%BE%D8%B1%D9%86%D8%B3%D9%84%DB%8C-%D9%85%D8%AF%D9%84-PR257AT HTTP/1.1" 200 41483 "-" "Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0)Gecko/16.0 Firefox/16.0" "-"

fmassot avatar Sep 05 '22 13:09 fmassot

@fmassot have we shipped anything for this?

fulmicoton avatar Nov 08 '22 07:11 fulmicoton

@fulmicoton: nothing for now, but we will need something for happy-plazza. Something simple like "don't split on identifiers and floats" could do the job. I will do some testing at the end of the week.

fmassot avatar Nov 08 '22 07:11 fmassot