
IPv6 address representation in WARC-IP-Address field

Open sebastian-nagel opened this issue 1 year ago • 4 comments

This question is about IPv6 address representation in WARC captures.

I'd be in favor of the format specified in RFC 5952. However, the WARC standard refers to RFC 4291 and says nothing about RFCs that have since been superseded or updated by another RFC. Are there any recommendations?

sebastian-nagel avatar Nov 20 '24 21:11 sebastian-nagel
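
To make the difference concrete, here is a minimal Python sketch of how one address looks in the fully written-out RFC 4291 form and in the shortened RFC 5952 form. It assumes that the standard `ipaddress` module's compressed output matches RFC 5952 for ordinary addresses (lowercase hex, no leading zeros, longest zero run collapsed to `::`). The helper name `warc_ip_address` is hypothetical, and the address comes from the 2001:db8::/32 documentation prefix.

```python
import ipaddress


def warc_ip_address(value: str) -> str:
    """Build a WARC-IP-Address header line (hypothetical helper).

    Assumption: the compressed form produced by the standard library
    corresponds to the RFC 5952 representation for plain IPv6 addresses.
    IPv4 addresses pass through in dotted-quad form.
    """
    addr = ipaddress.ip_address(value)
    return f"WARC-IP-Address: {addr.compressed}"


# Fully written-out form, also allowed by RFC 4291
full_form = "2001:0DB8:0000:0000:0000:0000:0000:0001"
print(warc_ip_address(full_form))
```

Both spellings denote the same address, so the choice only affects how easily records can be compared or searched as text.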

Your suggestion seems sensible. I've added it as a community recommendation.

ato avatar Nov 21 '24 00:11 ato

I'd guess that Browsertrix or other browser-based tools already generate IPv6 traffic -- what do they do with these addresses? And what about wget?

wumpus avatar Nov 21 '24 00:11 wumpus

wget calls inet_ntop, which POSIX seems to require only to produce "a text string suitable for presentation". It looks like the glibc and musl implementations would produce the canonical form.

The current version of browsertrix-crawler doesn't emit WARC-IP-Address. An older version I had lying around seemed to produce the canonical form.

Heritrix doesn't support IPv6. jwarc currently relies on Java's default, which is the expanded old "preferred" form.

ato avatar Nov 21 '24 02:11 ato

Thanks for the clarification!

I can confirm:

  • wget with glibc on Ubuntu 24.04 produces the canonical form
  • for Java (tested versions 11, 17, 21) a custom library is required to write the canonical form, for example Guava's InetAddresses.toAddrString()

sebastian-nagel avatar Nov 21 '24 12:11 sebastian-nagel
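
For anyone who wants to check the values written by other tools, a rough Python sketch follows. It tests whether a string is already in the shortened form by round-tripping it through the standard `ipaddress` module. The function name `is_rfc5952_form` is hypothetical, and the check rests on the assumption that `IPv6Address.compressed` agrees with RFC 5952. It deliberately ignores special cases such as IPv4-mapped addresses and zone identifiers.

```python
import ipaddress


def is_rfc5952_form(text: str) -> bool:
    """Return True if text round-trips unchanged through the compressed form.

    Assumption: the standard library's compressed output is treated as the
    canonical representation; embedded IPv4 notation is not considered.
    """
    try:
        addr = ipaddress.IPv6Address(text)
    except ValueError:
        # Not a parseable IPv6 address at all
        return False
    return text == addr.compressed


# Expanded style with every group written out, as described for Java's default
expanded = "2001:db8:0:0:0:0:0:1"
print(is_rfc5952_form(expanded))
print(ipaddress.IPv6Address(expanded).compressed)
```

A check like this could be run over the WARC-IP-Address values of existing captures to see which writers would need a change.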