IPv6 address representation in WARC-IP-Address field
This question is about IPv6 address representation in WARC captures.
- https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-ip-address refers to RFC4291, and
- https://www.rfc-editor.org/rfc/rfc4291.html#section-2.2 says that the form (x:x:x:x:x:x:x:x) is the "preferred" one.

However,
- https://datatracker.ietf.org/doc/html/rfc5952 "updates RFC4291" and "defines a canonical textual representation format", recommending the maximally shortened presentation:
  - no leading zeros, making use of :: notation, lowercase, and further detailed format specifications.
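To make the difference concrete, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `to_rfc5952` and the sample addresses (from the 2001:db8::/32 documentation prefix) are my own choices for illustration. As far as I can tell, the `compressed` property follows the RFC5952 rules for ordinary addresses (lowercase, no leading zeros, `::` for the first longest run of two or more zero fields), but its handling of IPv4-mapped addresses may differ between Python versions, so treat this as an approximation rather than a reference implementation.

```python
import ipaddress


def to_rfc5952(text: str) -> str:
    """Return the shortened, lowercase form of an IPv6 address string.

    Relies on ipaddress.IPv6Address.compressed, which is assumed here to
    match RFC5952 for ordinary (non IPv4-mapped) addresses.
    """
    return ipaddress.IPv6Address(text).compressed


def to_full_form(text: str) -> str:
    """Return all eight 16-bit fields, each padded to four hex digits."""
    return ipaddress.IPv6Address(text).exploded


if __name__ == "__main__":
    sample = "2001:0DB8:0:0:0:0:0:1"
    print(to_full_form(sample))  # eight fields, x:x:x:x:x:x:x:x style
    print(to_rfc5952(sample))    # shortened form with ::
```

Both outputs denote the same address; only the textual form written to WARC-IP-Address would differ.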
I'd be in favor of the format specified in RFC5952. But the WARC standard refers to RFC4291 and does not say anything about RFCs superseded or updated by another RFC. Are there any recommendations?
Your suggestion seems sensible. I've added it as a community recommendation.
I'd guess that Browsertrix or other browser-based tools already generate IPv6 traffic -- what do they do with these addresses? And what about wget?
wget calls inet_ntop, which POSIX seems to require only to produce "a text string suitable for presentation". It looks like the glibc and musl implementations would produce the canonical form.
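A quick way to see what the local C library does is to round-trip an address through `inet_pton` and `inet_ntop`. The sketch below uses Python's `socket` module, which on Unix-like systems is assumed to delegate to the platform's `inet_ntop`; the function name `libc_presentation` is hypothetical, and the result depends on the platform rather than on any standard guarantee.

```python
import socket


def libc_presentation(text: str) -> str:
    """Parse an IPv6 address and format it again via the platform's inet_ntop.

    The exact output is platform dependent; POSIX does not mandate a
    particular textual form.
    """
    packed = socket.inet_pton(socket.AF_INET6, text)  # 16 raw bytes
    return socket.inet_ntop(socket.AF_INET6, packed)


if __name__ == "__main__":
    print(libc_presentation("2001:0DB8:0:0:0:0:0:1"))
```

On a glibc or musl system I would expect this to print the shortened lowercase form, but that is an observation about those implementations, not something POSIX promises.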
The current version of browsertrix-crawler doesn't emit WARC-IP-Address. An older version I had lying around seemed to produce the canonical form.
Heritrix doesn't support IPv6. jwarc currently relies on Java's default which is the expanded old "preferred" form.
Thanks for the clarification!
I can confirm:
- wget with glibc on Ubuntu 24.04 produces the canonical form
- for Java (tested versions 11, 17, 21) a custom library is required to write the canonical form, for example Guava's InetAddresses.toAddrString()
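For anyone who wants to repeat this kind of check on existing captures, here is a small sketch in Python that tests whether a WARC-IP-Address value is already in the shortened form. The function name `is_canonical_ipv6` is made up for this example, and it treats Python's `compressed` output as a stand-in for the RFC5952 form, which is an assumption that may not hold for IPv4-mapped addresses on every Python version.

```python
import ipaddress


def is_canonical_ipv6(value: str) -> bool:
    """Return True if value is an IPv6 address written in shortened form.

    Returns False for strings that are not IPv6 addresses at all
    (including IPv4 addresses).
    """
    try:
        addr = ipaddress.IPv6Address(value)
    except ValueError:
        return False
    # Compare the text as written with the form Python would emit.
    return value == addr.compressed


if __name__ == "__main__":
    for candidate in ("2001:db8::1", "2001:db8:0:0:0:0:0:1", "2001:DB8::1"):
        print(candidate, is_canonical_ipv6(candidate))
```

The second candidate is the expanded style that Java emits by default, and the third differs only in letter case; both are valid addresses but would be reported as not canonical.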