mtail icon indicating copy to clipboard operation
mtail copied to clipboard

Runtime errors should be more visible to users

Open jakirpatel opened this issue 5 years ago • 14 comments

Details : Problem : Metrics are not getting exposed.

Mtail version: mtail version v3.0.0-rc16 git revision 77eeba903db0ffd90e29c1ee778fdefb0112bfab go version go1.11

Filesystem: NFS

Logs from the mtail:

I0325 01:19:17.267495   14907 file.go:171] /home/coherence_app_log/jmxreports/2019032510-memory-status.txt: file type does not support deadline
I0325 01:19:17.267802   14907 tail.go:246] Tailing /home/coherence_app_log/jmxreports/2019032510-memory-status.txt
I0325 01:19:17.267993   14907 file.go:112] open succeeded /home/coherence_app_log/jmxreports/2019032510-memory-status.txt
I0325 01:19:17.268001   14907 tail.go:236] Adding a file watch on "/home/coherence_app_log/jmxreports/2019032510-memory-status.txt"
I0325 01:19:17.268005   14907 log_watcher.go:277] Resolved path for lookup "/home/coherence_app_log/jmxreports/2019032510-memory-status.txt"
I0325 01:19:17.590447   14907 log_watcher.go:174] No modtime change for /home/coherence_app_log/jmxreports/2019032510-memory-status.txt, no send
I0325 01:19:17.820442   14907 file.go:171] /home/coherence_app_log/jmxreports/2019032510-memory-status.txt: file type does not support deadline

Log type:

1962	?? 3 25 10:16:32 JST 2019	597277688	246	ParNew	2243	1.0	58409	18.0	597137142	19.0	4093640704	4294967296	4093640704	1635710024
1962	?? 3 25 10:16:32 JST 2019	597273217	247	ParNew	2237	2.0	58401	37.0	597176319	17.0	4093640704	4294967296	4093640704	1252907928
1962	?? 3 25 10:16:32 JST 2019	597270204	248	ParNew	2242	1.0	58623	15.0	597112242	15.0	4093640704	4294967296	4093640704	1715023800
1962	?? 3 25 10:16:32 JST 2019	597267193	249	ParNew	2232	2.0	58527	37.0	597159391	18.0	4093640704	4294967296	4093640704	1267338656
1962	?? 3 25 10:16:32 JST 2019	597262215	250	ParNew	2228	1.0	58006	19.0	597092171	19.0	4093640704	4294967296	4093640704	1730664056
1962	?? 3 25 10:16:32 JST 2019	597258907	251	ParNew	2233	1.0	57737	23.0	597098542	23.0	4093640704	4294967296	4093640704	1639189488
1962	?? 3 25 10:16:32 JST 2019	597255899	252	ParNew	2242	1.0	58491	17.0	597095164	16.0	4093640704	4294967296	4093640704	1697487040

Mtail program:

counter total_heap_committed by node_id, gc_name
counter total_heap_used by node_id, gc_name
counter total_heap_max by node_id, gc_name
counter total_heap_init by node_id, gc_name

def coherenceMemoryStatus {
    /[\d]+\s[\?]+\s[\d]+\s[\d]+\s[\d]+:[\d]+:[\d]+\s[A-Za-z]+\s[\d]+\s[\d]+\s(?P<node_id>[\d]+)\s(?P<gc_name>[A-Za-z]+)\s[\d]+\s[\d+\.\d+]+\s[\d]+\s[\d+\.\d+]+\s+[\d]+\s[\d+\.\d+]+\s(?P<heap_committed>[\d]+)\s(?P<heap_init>[\d]+)\s(?P<heap_max>[\d]+)\s(?P<heap_used>[\d]+)/ {
        next
    }
}

@coherenceMemoryStatus {
    total_heap_committed[$node_id][$gc_name] += $heap_committed
    total_heap_used[$node_id][$gc_name] += $heap_used
    total_heap_max[$node_id][$gc_name] += $heap_max
    total_heap_init[$node_id][$gc_name] += $heap_init
}

Some debug steps I used:

I ran the simple line count program with the same file using --one_shot parameter. It given me the exact result.

Also locally I was able to parse the file and was able to get the result. But with NFS filesystem seems I was not able to get the similar result.

jakirpatel avatar Mar 25 '19 01:03 jakirpatel

rc16 seems pretty old. Can you run rc29 (just released today) and see if you can reproduce the problem?

Also, please give the full commandline you used (one of the first log messages in the mtail INFO log.)

jaqx0r avatar Mar 25 '19 02:03 jaqx0r

@jaqx0r Still not getting the metrics. Version: Build: mtail version v3.0.0-rc29 git revision b5aa95d10fb3af7b843e3814bc65aedd983425d9 go version go1.11.6 go arch amd64 go os linux

The command line:

I0325 11:23:58.965017   32679 main.go:95] Commandline: ["/home/weblogic/test/mtail" "-poll_interval" "1s" "-port" "3102" "-progs" "/home/weblogic/test/old.mtail" "-logs" "/home/weblogic/test/memory-status.txt" "-logtostderr"]

There are no errors while reading. Even I can see X number of lines read. But with the metrics I cant see any counter incremented.

Some more interesting result I got.

Commandline: ["/home/weblogic/test/mtail" "-poll_interval" "1s" "-port" "3102" "-progs" "/home/weblogic/test/old.mtail" "-logs" "/home/weblogic/test/memory-status.txt" "-logtostderr" "-v" "2"]

While logs get written I got following log from mtail:

I0325 11:43:09.947934   21234 log_watcher.go:145] Update fi
I0325 11:43:09.947940   21234 log_watcher.go:130] stat
I0325 11:43:09.949704   21234 log_watcher.go:170] sending update for /home/weblogic/test/memory-status.txt
I0325 11:43:09.949731   21234 log_watcher.go:174] No modtime change for /home/weblogic/test/mtail, no send
I0325 11:43:09.949739   21234 log_watcher.go:174] No modtime change for /home/weblogic/test/old.mtail, no send
I0325 11:43:09.949744   21234 log_watcher.go:145] Update fi
I0325 11:43:09.949749   21234 log_watcher.go:130] stat
I0325 11:43:09.949766   21234 log_watcher.go:145] Update fi
I0325 11:43:09.949778   21234 tail.go:284] Event type watcher.Event{Op:2, Pathname:"/home/weblogic/test/memory-status.txt"}
I0325 11:43:09.949799   21234 tail.go:188] handleLogUpdate /home/weblogic/test/memory-status.txt
I0325 11:43:09.949819   21234 file.go:139] Path /home/weblogic/test/memory-status.txt already being watched, and inode not changed.
I0325 11:43:09.949825   21234 file.go:143] doing the normal read
I0325 11:43:09.949843   21234 file.go:171] /home/weblogic/test/memory-status.txt: file type does not support deadline
I0325 11:43:09.951858   21234 file.go:174] Read count 4096 err <nil>
I0325 11:43:09.951906   21234 file.go:221] Line sent
I0325 11:43:09.951932   21234 file.go:221] Line sent
I0325 11:43:09.952095   21234 file.go:221] Line sent
I0325 11:43:09.952110   21234 file.go:221] Line sent
I0325 11:43:09.952122   21234 file.go:221] Line sent
I0325 11:43:09.952209   21234 file.go:221] Line sent
I0325 11:43:09.952235   21234 file.go:221] Line sent
I0325 11:43:09.952252   21234 file.go:221] Line sent
I0325 11:43:09.952338   21234 file.go:221] Line sent
I0325 11:43:09.952363   21234 file.go:221] Line sent
I0325 11:43:09.952379   21234 file.go:221] Line sent

I forgot to mention that /home/weblogic/test/memory-status.txt is a symlink to other file.

After looking into the problem more close seems it maybe the issue of NFS file system. ?? I think enabling the soft mount option will help on this.

For NFS mount I am using following options:

(rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=100.66.81.33,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=x.x.x.x)

Some of references I got : https://www.reddit.com/r/golang/comments/8npwin/how_to_implement_timeout_in_file_reads/

jakirpatel avatar Mar 25 '19 02:03 jakirpatel

@jaqx0r

I have applied the soft option while mounting the NFS. But seems the problem of reading the files still persist. Is there any recommendation that how the NFS filesystem should be mounted to read the files through mtail ?

Your inputs will be helpful.

jakirpatel avatar Mar 25 '19 08:03 jakirpatel

I don't think I can at this time, it's not clear to me yet what the problem is. My guess so far is that the files are not being read after polling. I don't believe at this point that changing NFS mount options will fix it.

On Mon., 25 Mar. 2019, 19:01 Jakir Patel, [email protected] wrote:

@jaqx0r https://github.com/jaqx0r

I have applied the soft option while mounting the NFS. But seems the problem of reading the files still persist. Is there any recommendation that how the NFS filesystem should be mounted to read the files through mtail ?

Your inputs will be helpful.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/221#issuecomment-476091632, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5b-zLv5c6N06BRgh4QLpc92TjSs8Znks5vaIJlgaJpZM4cF6z- .

jaqx0r avatar Mar 26 '19 00:03 jaqx0r

@jaqx0r Might be regenerating the problem will be more good.

jakirpatel avatar Mar 26 '19 02:03 jakirpatel

@jaqx0r

Seems it was the regex issue. For somehow the regex was not matching. Please resolve it. Thank you for your support.

There was japanese character in the log timestamp. However with the regex on my local machine I was able to see it as ?

I think I could have tested the regex before direct with the logfile.

jakirpatel avatar Mar 26 '19 09:03 jakirpatel

Um.. what regex issue?

On Tue, 26 Mar 2019 at 20:13, Jakir Patel [email protected] wrote:

@jaqx0r https://github.com/jaqx0r

Seems it was the regex issue. For somehow the regex was not matching. Please resolve it. Thank you for your support.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/221#issuecomment-476536382, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5b-7yYV-pUN-cFGWpzvukrDPCL8ezYks5vaeS8gaJpZM4cF6z- .

jaqx0r avatar Mar 27 '19 02:03 jaqx0r

@jaqx0r

Basically my regex was not matching with log format. because of that it was not passing from regex block. That was reason my all counters never incremented. However when I have added simple line count counter it got incremented (with no regex).

What I think mtail should provide the info of NO MATCHING lines with regex as well. Some counter maybe so at first place we ll get to know how many lines get matched with regex.

But I observed one thing in the logs I can still see this,


I0325 11:43:09.949843   21234 file.go:171] /home/weblogic/test/memory-status.txt: file type does not support deadline
I0325 11:43:09.951858   21234 file.go:174] Read count 4096 err <nil>

jakirpatel avatar Mar 27 '19 03:03 jakirpatel

https://github.com/google/mtail/blob/master/docs/Testing.md describes one way to test your programs on snippets of logs. It does require making changes to the code to set it up, however but lets you set expectations on what the counters should be after processing a piece of log with your program.

In the mtail language, the otherwise keyword can be used to catch logs that "would otherwise not match" anything in the same block: https://github.com/google/mtail/blob/master/docs/Language.md#otherwise-clauses (the else keyword is similar) You could use otherwise to increment your own counter when no lines match.

In the olden days, mtail did have a special exported counter and a ring buffer per program for debugging log lines that didn't match. The ring buffer turned out to be expensive, and we'd made the system spammy to encourage people to try to handle every single log line for coverage, but that caused programs to become more complex and computationally expensive, so I didn't include it in the open sourcing of mtail. But it is easy to add a counter back for counting the number of times each program didn't take any action at all on a line, so I will do that.

Your log messages look like a red herring here. You've said that you're mounting over NFS, and I am certain that NFS doesn't support a read timeout. You get the following 'Read count' indicating that the read was successful and no error, so it's all working as intended as far as I can tell, although perhaps too noisy.

Going back to your program, after:

@coherenceMemoryStatus {
    total_heap_committed[$node_id][$gc_name] += $heap_committed
    total_heap_used[$node_id][$gc_name] += $heap_used
    total_heap_max[$node_id][$gc_name] += $heap_max
    total_heap_init[$node_id][$gc_name] += $heap_init
}

append these lines

otherwise {
  unmatched_lines++
}

for debugging.

I might also add a flag to enable logging of unmatched lines in test mode, but that will be very noisy in production.

jaqx0r avatar Mar 27 '19 05:03 jaqx0r

I got the following error with the newer version not sure what happened. It says the runtime error:

I0402 08:20:25.573032    9751 main.go:94] mtail version v3.0.0-rc29 git revision b5aa95d10fb3af7b843e3814bc65aedd983425d9 go version go1.11.6 go arch amd64 go os linux
I0402 08:20:25.573058    9751 main.go:95] Commandline: ["/home/gecope/mtail/mtail" "-poll_interval" "1s" "-port" "3902" "-progs" "progs/jmxCacheSize.mtail" "-logs" "/home/gecope/mtail-cron/bsktkvs/cache-size.txt" "-logs" "/home/gecope/mtail-cron/bsktkvsA/cache-size.txt" "-logtostderr" "-v" "2"]
I0402 08:24:59.574501    9751 log_watcher.go:145] Update fi
I0402 08:24:59.574848    9751 file.go:174] Read count 669 err <nil>
I0402 08:24:59.574865    9751 file.go:221] Line sent
I0402 08:24:59.574874    9751 file.go:221] Line sent
I0402 08:24:59.574935    9751 file.go:221] Line sent
I0402 08:24:59.574945    9751 file.go:221] Line sent
I0402 08:24:59.574950    9751 file.go:221] Line sent
I0402 08:24:59.574993    9751 vm.go:90] jmxCacheSize.mtail: Runtime error: panic in thread &vm.thread{pc:24, matched:false, matches:map[int][]string{0:[]string{"/home/gecope/mtail-cron/bsktkvsA/cache-size.txt", "bsktkvsA"}, 1:[]string{"1126\t�� 4 02 17:24:32 JST 2019\tOrderIdDistributedCache\torder-id\t1458924.0\t1458924.0\t1.3913383483886719\t1.0", "OrderIdDistributedCache", "order-id", "1458924.0", "1458924.0", "1.3913383483886719", "1.0"}}, time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}, stack:[]interface {}{true, (*datum.StringDatum)(0xc0002cf740)}} at instr "{cat <nil>}": interface conversion: interface {} is *datum.StringDatum, not string
I0402 08:24:59.575078    9751 vm.go:91] VM stack:
goroutine 42 [running]:
runtime/debug.Stack(0xc00047a190, 0x47, 0xc00047f670)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/google/mtail/internal/vm.(*VM).errorf(0xc00029c280, 0x94512c, 0x23, 0xc00047f670, 0x3, 0x3)
	/go/src/github.com/google/mtail/internal/vm/vm.go:91 +0x113
github.com/google/mtail/internal/vm.(*VM).execute.func1(0xc00029c280, 0xc0000c4cd0, 0x26, 0x0, 0x0)
	/go/src/github.com/google/mtail/internal/vm/vm.go:296 +0x142
panic(0x8ad3a0, 0xc0002cf7d0)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/google/mtail/internal/vm.(*VM).execute(0xc00029c280, 0xc0000c4cd0, 0x26, 0x0, 0x0)
	/go/src/github.com/google/mtail/internal/vm/vm.go:720 +0x5ded
github.com/google/mtail/internal/vm.(*VM).processLine(0xc00029c280, 0xc00055c120)
	/go/src/github.com/google/mtail/internal/vm/vm.go:748 +0x18c
github.com/google/mtail/internal/vm.(*VM).Run(0xc00029c280, 0x6a6d7843, 0xc0001ae420, 0xc0001ae480, 0xc0001ae4e0)
	/go/src/github.com/google/mtail/internal/vm/vm.go:767 +0xf5
created by github.com/google/mtail/internal/vm.(*Loader).CompileAndRun
	/go/src/github.com/google/mtail/internal/vm/loader.go:241 +0x70d
I0402 08:24:59.575097    9751 vm.go:92] Dumping vm state
I0402 08:24:59.575101    9751 vm.go:93] Name: jmxCacheSize.mtail
I0402 08:24:59.575104    9751 vm.go:94] Input: &logline.LogLine{Filename:"/home/gecope/mtail-cron/bsktkvsA/cache-size.txt", Line:"1126\t�� 4 02 17:24:32 JST 2019\tOrderIdDistributedCache\torder-id\t1458924.0\t1458924.0\t1.3913383483886719\t1.0"}
I0402 08:24:59.575115    9751 vm.go:95] Thread:
I0402 08:24:59.575118    9751 vm.go:96]  PC 23
I0402 08:24:59.575121    9751 vm.go:97]  Matched false
I0402 08:24:59.575125    9751 vm.go:98]  Matches map[0:[/home/gecope/mtail-cron/bsktkvsA/cache-size.txt bsktkvsA] 1:[1126	�� 4 02 17:24:32 JST 2019	OrderIdDistributedCache	order-id	1458924.0	1458924.0	1.3913383483886719	1.0 OrderIdDistributedCache order-id 1458924.0 1458924.0 1.3913383483886719 1.0]]
I0402 08:24:59.575135    9751 vm.go:99]  Timestamp 0001-01-01 00:00:00 +0000 UTC
I0402 08:24:59.575145    9751 vm.go:100]  Stack [true ""@1554193499574966437]
I0402 08:24:59.575277    9751 vm.go:101] Prog: jmxCacheSize.mtail
Metrics
        0 Metric: name=total_cache_size program=jmxCacheSize.mtail kind=Counter type=String hidden=false keys=[service cache_name kvs] labelvalues=[LabelValue: [OrderIdDistributedCache order-id bsktkvsA] ""@1554193499574966437] source=jmxCacheSize.mtail:1:9-24 buckets=[]
        1 Metric: name=total_memory_bytes program=jmxCacheSize.mtail kind=Counter type=String hidden=false keys=[service cache_name kvs] labelvalues=[] source=jmxCacheSize.mtail:2:9-26 buckets=[]
        2 Metric: name=total_memory_mb program=jmxCacheSize.mtail kind=Counter type=String hidden=false keys=[service cache_name kvs] labelvalues=[] source=jmxCacheSize.mtail:3:9-23 buckets=[]
Regexps
        0 /[^\t]+/mtail-cron/(?P<kvs>[a-zA-Z]+)[^\t]+/
        1 /\d+\t[^\t]+\t(?P<service>[^\t]+)\t(?P<cache_name>[^\t]+)\t(?P<cache_size>\d.+)\t(?P<memory_bytes>\d.+)\t(?P<memory_mb>\d.+)\t(?P<average_object_size>\d.+)/
Strings
 disasm  l          op  opnd
         0 getfilename     0
         1      smatch     0
         2       match     1
         3         jnm    66
         4  setmatched false
         5        push     1
         6      capref     1
         7        push     1
         8      capref     2
         9        push     0
        10      capref     1
        11       mload     0
        12       dload     3
        13        push     1
        14      capref     1
        15        push     1
        16      capref     2
        17        push     0
        18      capref     1
        19       mload     0
        20       dload     3
        21        push     1
        22      capref     3
        23         cat <nil>
        24        sset <nil>
        25        push     1
        26      capref     1
        27        push     1
        28      capref     2
        29        push     0
        30      capref     1
        31       mload     1
        32       dload     3
        33        push     1
        34      capref     1
        35        push     1
        36      capref     2
        37        push     0
        38      capref     1
        39       mload     1
        40       dload     3
        41        push     1
        42      capref     4
        43         cat <nil>
        44        sset <nil>
        45        push     1
        46      capref     1
        47        push     1
        48      capref     2
        49        push     0
        50      capref     1
        51       mload     2
        52       dload     3
        53        push     1
        54      capref     1
        55        push     1
        56      capref     2
        57        push     0
        58      capref     1
        59       mload     2
        60       dload     3
        61        push     1
        62      capref     5
        63         cat <nil>
        64        sset <nil>
        65  setmatched  true
I0402 08:24:59.575335    9751 file.go:221] Line sent
I0402 08:24:59.575340    9751 file.go:221] Line sent
I0402 08:24:59.575345    9751 file.go:171] /home/gecope/mtail-cron/bsktkvsA/cache-size.txt: file type does not support deadline
I0402 08:24:59.575352    9751 file.go:174] Read count 0 err EOF

jakirpatel avatar Apr 02 '19 08:04 jakirpatel

Hmm figured it out. Seems I was using counter. But it was double. Now seems ok.

jakirpatel avatar Apr 03 '19 04:04 jakirpatel

I'm certain there's still a bug but your last update gives me some idea of whta it was.

I think that traceback for the runtime error can be improved greatly because right now you need a good understanding of the mtail VM internals to debug it.

On Wed, 3 Apr 2019 at 15:36, Jakir Patel [email protected] wrote:

Hmm figured it out. Seems I was using counter. But it was double. Now seems ok.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/221#issuecomment-479336559, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5b-2S6d9z2s5pfhmiZTN7rbUj67Rrsks5vdC-0gaJpZM4cF6z- .

jaqx0r avatar Apr 05 '19 04:04 jaqx0r

@jaqx0r

The runtime error should be more clear I think so. For me it took time to debug with the messages.

jakirpatel avatar Apr 11 '19 02:04 jakirpatel

Is there any recommendation that how the NFS filesystem should be mounted to read the files through mtail ? --@jakirpatel

Log shipping. By introducing NFS, you are introducing a layer of complexity which is known to be complex and problematic for streaming.

jnovack avatar Sep 09 '19 20:09 jnovack