ocaml-dns icon indicating copy to clipboard operation
ocaml-dns copied to clipboard

`Out_of_memory` with `Dns_resolver_mirage`

Open dinosaure opened this issue 2 years ago • 10 comments

After ~2 weeks, I systematically get an Out_of_memory with this trace:

console 2022-03-02T13:17:44-00:00: Fatal error: exception Out of memory
console 2022-03-02T13:17:44-00:00: Raised by primitive operation at Stdlib__Hashtbl.resize in file "hashtbl.ml", line 148, characters 16-38
console 2022-03-02T13:17:44-00:00: Called from Dns_resolver_mirage.Make.resolver.maybe_tcp.(fun) in file "duniverse/ocaml-dns/mirage/resolver/dns_resolver_mirage.ml", line 85, characters 8-53
console 2022-03-02T13:17:44-00:00: Called from Lwt.Miscellaneous.apply in file "duniverse/lwt/src/core/lwt.ml", line 3074, characters 22-25
console 2022-03-02T13:17:44-00:00: Solo5: solo5_exit(2) called

It seems about an internal Hashtbl. The unikernel is launched with 32MB. I will try to introspect further to understand where is the error.

dinosaure avatar Mar 02 '22 13:03 dinosaure

Do you know what version of ocaml-dns was used?

reynir avatar Mar 02 '22 15:03 reynir

The version is 6.1.4.

dinosaure avatar Mar 02 '22 15:03 dinosaure

Thanks for your report. I've no insight which Hashtbl this may be... here in DNS we're using mostly Map data structures. Are you able to reproduce this behaviour in a shorter period of time by doing a lot of requests? Is the resolver serving UDP or TCP or both protocols? Is there a ~size: argument provided to Dns_resolver.create (or is the default of 10000 used)?

hannesm avatar Mar 03 '22 11:03 hannesm

Currently, my unikernel is available here: https://github.com/mirage/dns-resolver I will relaunch it with -l debug and give you a more precise trace then :+1:.

dinosaure avatar Mar 03 '22 11:03 dinosaure

That would be good -- though I'd expect more useful data if you add some metrics (or periodic log messages) about memory consumption and cache size (since the LRU of the cache is not measured in actual used bytes, but items (where the weight function is the size of the resource record map (and each resource record may be different in size)).

hannesm avatar Mar 03 '22 11:03 hannesm

The reason I asked is because I can't make sense of the line number Dns_resolver_mirage.Make.resolver.maybe_tcp.(fun) in file "duniverse/ocaml-dns/mirage/resolver/dns_resolver_mirage.ml", line 85, characters 8-53 with any of the recent releases so I was curious if you were using some other version :-)

reynir avatar Mar 03 '22 11:03 reynir

I opened #299 that should help to figure out whether the cache is causing (as created) the OOM. Is there a way to have a over-approximation of the memory used by an OCaml value (AFAICT not, esp. not if using bigarrays) -- Obj.size isn't the thing to look for (Obj.size (Obj.magic (Cstruct.create 1024)) is 3).

hannesm avatar Mar 03 '22 12:03 hannesm

Obj.size (Obj.magic (Cstruct.create 1024)) is 3

It's probably due to that Obj.size does not count malloc-ed objects.

dinosaure avatar Mar 03 '22 12:03 dinosaure

This is my last logs just before the Out_of_memory with -l debug:

console 2022-03-12T13:21:44-00:00: EDNS EDNS rcode 0 version 0 dnssec_ok false payload_size 4096 extensions  TSIG no
console 2022-03-12T13:21:44-00:00: 2022-03-12 13:21:34 -00:00: DBG [application] scrubbing (bailiwick facebook.com) data header A60B (response) operation Query rcode 
console 2022-03-12T13:21:44-00:00: no error flags: authoritative question graph.facebook.com A?
console 2022-03-12T13:21:44-00:00: data answer graph.facebook.com.	3600	CNAME	api.facebook.com. authority 
console 2022-03-12T13:21:44-00:00: additional 
console 2022-03-12T13:21:44-00:00: EDNS EDNS rcode 0 version 0 dnssec_ok false payload_size 4096 extensions  TSIG no
console 2022-03-12T13:21:44-00:00: 2022-03-12 13:21:34 -00:00: DBG [application] Dns_cache.set CNAME graph.facebook.com entry CNAME
console 2022-03-12T13:21:44-00:00: 2022-03-12 13:21:34 -00:00: DBG [application] set: graph.facebook.com (CNAME) found rank authoritative answer data insert rank authoritative answer data: 0
console 2022-03-12T13:21:44-00:00: 2022-03-12 13:21:34 -00:00: DBG [application] now querying graph.facebook.com (A)
console 2022-03-12T13:21:44-00:00: 2022-03-12 13:21:34 -00:00: DBG [application] alias while looking up graph.facebook.com (A)
console 2022-03-12T13:21:44-00:00: 2022-03-12 13:21:34 -00:00: DBG [application] follow_cname: cache miss, need to query api.facebook.com
console 2022-03-12T13:21:44-00:00: 2022-03-12 13:21:34 -00:00: DBG [application] go api.facebook.com
console 2022-03-12T13:21:44-00:00: 2022-03-12 13:21:34 -00:00: DBG [application] resolve returned zone facebook.com query api.facebook.com (A), ip 185.89.219.12
console 2022-03-12T13:21:44-00:00: 2022-03-12 13:21:34 -00:00: DBG [application] have to query (zone facebook.com) api.facebook.com (A) using ip 185.89.219.12
console 2022-03-12T13:21:44-00:00: 2022-03-12 13:21:34 -00:00: DBG [application] maybe_query: query 185.89.219.12 api.facebook.com (A)
console 2022-03-12T13:21:44-00:00: Fatal error: exception Out of memory
console 2022-03-12T13:21:44-00:00: Raised by primitive operation at Stdlib__Hashtbl.resize in file "hashtbl.ml", line 148, characters 16-38
console 2022-03-12T13:21:44-00:00: Called from Dns_resolver_mirage.Make.resolver.maybe_tcp.(fun) in file "duniverse/ocaml-dns/mirage/resolver/dns_resolver_mirage.ml", line 85, characters 8-53
console 2022-03-12T13:21:44-00:00: Called from Lwt.Miscellaneous.apply in file "duniverse/lwt/src/core/lwt.ml", line 3074, characters 22-25
console 2022-03-12T13:21:44-00:00: Solo5: solo5_exit(2) called

dinosaure avatar Mar 12 '22 13:03 dinosaure

Not sure if that can help as it seems to take a long time to trigger the memory fault, but I tried to print memory usage, my modifications can be seen https://github.com/palainp/dns-resolver/tree/print-mem-usage.

It displays every 10" 3 memory sizes: the memory size available above top of the heap (including the stack) this can be used by malloc to growth the ocaml memory, the real free memory (above the heap (including the stack) + free memory inside the heap), and total memory available (the --mem arg minus the kernel size). Normally the first should be less than or equal to the second (which indicates that there is some unused memory in the heap, something that may have been taken and may be freed by the GC at a point, malloc will use this before increasing the heap size):

$ opam pin
mirage-solo5.0.7.0          git  git+https://github.com/palainp/mirage-solo5#footprint-alt
ocaml-freestanding.0.7.0    git  git+https://github.com/palainp/ocaml-freestanding#footprint-alt
$ solo5-spt --net:service=tap100 --mem=32 dist/resolver.spt -l debug
            |      ___|
  __|  _ \  |  _ \ __ \
\__ \ (   | | (   |  ) |
____/\___/ _|\___/____/
Solo5: Bindings version v0.7.1
Solo5: Memory map: 32 MB addressable:
Solo5:   reserved @ (0x0 - 0xfffff)
Solo5:       text @ (0x100000 - 0x291fff)
Solo5:     rodata @ (0x292000 - 0x2e8fff)
Solo5:       data @ (0x2e9000 - 0x436fff)
Solo5:       heap >= 0x437000 < stack < 0x2000000
2022-03-25 16:06:04 -00:00: INF [netif] Plugging into service with mac 66:53:d1:de:e0:ad mtu 1500
2022-03-25 16:06:04 -00:00: INF [ethernet] Connected Ethernet interface 66:53:d1:de:e0:ad
2022-03-25 16:06:04 -00:00: INF [ARP] Sending gratuitous ARP for 10.0.0.2 (66:53:d1:de:e0:ad)
2022-03-25 16:06:04 -00:00: INF [ipv6] IP6: Starting
2022-03-25 16:06:04 -00:00: DBG [ndpc6] ND6: Sending RS
...
2022-03-25 16:06:08 -00:00: INF [ipv4] IP.output: no route to 202.12.27.33 (no default gateway is configured)
2022-03-25 16:06:08 -00:00: WRN [ipv4] Write to 202.12.27.33 would require an external route, which was not provided
2022-03-25 16:06:14 -00:00: INF [memory] Memory usage: Above top_heap is 23MiB / free memory 23MiB / total memory 28MiB
2022-03-25 16:06:14 -00:00: INF [ipv4] IP.output: no route to 202.12.27.33 (no default gateway is configured)
2022-03-25 16:06:14 -00:00: WRN [ipv4] Write to 202.12.27.33 would require an external route, which was not provided
2022-03-25 16:06:24 -00:00: INF [memory] Memory usage: Above top_heap is 23MiB / free memory 23MiB / total memory 28MiB
2022-03-25 16:06:26 -00:00: INF [ipv4] IP.output: no route to 202.12.27.33 (no default gateway is configured)
2022-03-25 16:06:26 -00:00: WRN [ipv4] Write to 202.12.27.33 would require an external route, which was not provided
2022-03-25 16:06:34 -00:00: INF [memory] Memory usage: Above top_heap is 22MiB / free memory 23MiB / total memory 28MiB

palainp avatar Mar 25 '22 16:03 palainp