Wm_util.Date?
while reviewing the diff between 0.5 and 0.6, I couldn't resist to lookup the specification of HTTP 1.1 and especially the Date header (see (obsoleted) RFC 2616 Sec 3.3 and RFC 7231 Section 7.1.1.1) as references):
An HTTP-date value represents time as an instance of Coordinated Universal Time (UTC).-- why does the parser in Wm_util include (i.e. are there clients out there who actually set a different timezone which webmachine needs to support - and what is the reason to hardcode the 8 timezones below)?
| "" | "Z" | "GMT" | "UTC" | "UT" -> 0
| "PST" -> -480
| "MST" | "PDT" -> -420
| "CST" | "MDT" -> -360
| "EST" | "CDT" -> -300
| "EDT" -> -240
| s -> Scanf.sscanf s "%c%02d%_[:]%02d" (fun sign hour min ->
min + hour * (if sign = '-' then -60 else 60))
HTTP-date is case sensitive.this doesn't seem to be taken care of in webmachine at all- I fail to understand why
yearis bound toif year < 50 then 2000 + year else if year < 1000 then year + 1000after%4dwas used insscanf(which parses a 4 digit number)
The function is named parse_rfc1123_date (which may actually parse a full RFC 1123 timestamp), but in HTTP the HTTP-Date is slightly different (a subset thereof, plus allowing other formats) according to the RFCs mentioned above.
Using sscanf is dangerous: there can be characters at the end of the input which are not matched by the expression, thus a date such as Mon, 10 Mar 1994 10:20:30 a0040abcdef is accepted by webmachine.
Should we revise the implementation to (a) include some test cases and (b) be more strict about its input? It is not clear to me whether there are still clients out there (which we would like to talk to) that don't use IMF-fixdate, but RFC 850 (Sunday, 06-Nov-94 08:49:37 GMT) or asctime () (Sun Nov 6 08:49:37 1994), which are both required by 7231. If this is worthwhile, I suspect using angstrom for the parser would be more convenient than the OCaml stdlib utilities. or is a dependency onto angstrom in webmachine intentionally avoided?
/cc @ansiwen @seliopou
I only looked into RFC 1123 and never verified if this is actually the correct RFC for HTTP headers. And RFC 1123 Section 5.2.14 states:
There is a strong trend towards the use of numeric timezone
indicators, and implementations SHOULD use numeric timezones
instead of timezone names. However, all implementations MUST
accept either notation. If timezone names are used, they MUST
be exactly as defined in RFC-822.
and RFC 822 Section 5.1. states:
zone = "UT" / "GMT" ; Universal Time
; North American : UT
/ "EST" / "EDT" ; Eastern: - 5/ - 4
/ "CST" / "CDT" ; Central: - 6/ - 5
/ "MST" / "MDT" ; Mountain: - 7/ - 6
/ "PST" / "PDT" ; Pacific: - 8/ - 7
/ 1ALPHA ; Military: Z = UT;
; A:-1; (J not used)
; M:-12; N:+1; Y:+12
/ ( ("+" / "-") 4DIGIT ) ; Local differential
; hours+min. (HHMM)
You are right, the year calculations are useless, because it's always parsed with 4 digits. I didn't spend too much time on that parser, I just wanted something independent of CalendarLib that is at least not worse than the old parser. So I avoided to read RFCs as much as possible, which is not a joy for me anyway. So if we need a different parser, and not RFC1123, feel free to replace it. So we should implement an RFC7231 parser instead? Regarding the accepted input I would go with Postel's law and be liberal about it as long as it doesn't impair security. But also here I don't have a strong opinion.
FWIW I haven't been able to find any examples of clients that send timezones alternative to "GMT", and I believe that doing so is wrong.
I checked apache's httpd, and nginx, and they seem to agree:
-
nginxuses this: https://github.com/nginx/nginx/blob/b900cc28fcbb4cf5a32ab62f80b59292e1c85b4b/src/core/ngx_parse_time.c ie they don't pay attention to the timezone specifications from rfc850/822 -
ap_recent_rfc822_date: https://github.com/apache/httpd/blob/0d487913fcf433ac05526330f462a99d29dc9b25/server/util_time.c#L303-L305 -
apr_date_parse_http(which seems to be the main function): https://github.com/apache/apr-util/blob/1.7.x/include/apr_date.h#L73-L98- the documentation for the function implementation is a bit more thorough: https://github.com/apache/apr-util/blob/1.7.x/misc/apr_date.c#L136-L137
- note that they also note that the timezone is ignored, for the "sake of handling Netscapeness", but the timezone configured is
0(ie GMT, at the bottom of the function).
-
I
curllocally, and it seems to send "GMT" when it does send a date.
to address @hannesm's points:
-
Ignoring output after the timestamp seems to be what both nginx and apache does (see apache's use of
*in theapr_date_checkmask(date, "## @$$ #### ##:##:## *")calls), so this is compliant with that. Specifying anything but GMT seems to be illegal, so the question is if you want to penalize illegal clients by failing the request, or silently use their broken date header. Given that the Date header is not extremely important for anything, and is mostly useful for theIf-headers, I don't think ignoring broken clients is a problem here. -
I can't think of any HTTP/1.1 headers names that aren't case-insensitive, so if webmachine doesn't handle that, it seems to be a bug. It sure is annoying since it increases time spent in the parser, but that's what you get for using a crappy protocol.
-
The
50 yearsthing: RFC7231 (HTTP/1.1) says:Recipients of a timestamp value in rfc850-date format, which uses a two-digit year, MUST interpret a timestamp that appears to be more than 50 years in the future as representing the most recent year in the past that had the same last two digits.
- It follows that whoever wrote the spec considered 2-digit years legal, and the implementation
if year < 50 then 2000 + year else if year < 1000 then year + 1000is wrong. Take the examples17;40;70. 17:2017is not more than 50 years after2018, so it should be2017.40:2040is not more than 50 years after2018, so it should be2040.51:2051is not more than 50 years after2018, so it should be2051.70:2070is more than 50 years after2018, so it should be1970. With the implementation cited above applied two 2-digit years those would be:17:201740:204051:105170:1070- this is a ridiculous year in the context of purporting the be the time of generation of the HTTP request. I believe the logic should be:
(* handle two-digit years: *) if year < 0 then failwith "no negative years"; if year <= (current_year mod 100) + 50 then year = (current_year - current_year mod 100) + year else year = (current_year - 100 - current_year mod 100) + yearFurthermore, as @hannesm points out it doesn't make sense to use this logic on input from a four-digit year, since
0017is a valid four-digit year, and while nonsense in a real-world context, it could very well be used to specify HTTP-requests prophecized by Jesus. It doesn't make sense to apply it to four-digit years either, consider0999turning into1999and1000turning into1000; that's almost a thousand years off. But of course this is a bit pedantic; I couldn't find any clients that actually send two-digit years so I'd be inclined to not support that. - It follows that whoever wrote the spec considered 2-digit years legal, and the implementation
@cfcs @hannesm
-
Regarding the time-zones: I agree, they are not required (they don't hurt either), we can safely dump them.
-
IMF-Date months are case-sensitive, this parser is case-sensitive, I don't see the problem here.
-
Regarding the year, there are two things to make clear:
-
With Scanf a %4d reads an integer having at most 4 decimal digits, so it would still match two digit integers. So the calculations would still make sense in this case.
-
The calculations don't follow RFC850, but as the commit message mentioned RFC 1123 and RFC 5322, and there they have - surprisingly :roll_eyes: - a different understanding: RFC 5322 Sec. 4.3 says:
Where a two or three digit year occurs in a date, the year is to be interpreted as follows: If a two digit year is encountered whose value is between 00 and 49, the year is interpreted by adding 2000, ending up with a value between 2000 and 2049. If a two digit year is encountered with a value between 50 and 99, or any three digit year is encountered, the year is interpreted by adding 1900.
So I boldly claim, my calculations are correct. :smirk:
-
-
I think we are over-engineering a bit here, since this is meant to be not a validator but a simple parser that is able to understand the most common format, which appears to be called IMF-fixdate. And it does that already now. However, because I also like lean code, I simplified the parser a bit, assuming IMF-fixdate input, and made the scanf format a bit stricter. See PR #92.
-
Timezones: Nice!
-
IMF-Date-stuff: I had completely misunderstood, I thought we were talking about the header name (
daTe:vsDate:etc), sorry! -
Scanf: We lose the ability to tell 0017 from 2017, but that's fine since 0017 would not be legal to generate anyway.
-
Calculations:
- Well it should follow RFC7231 from 2014, but oh well, this doesn't really seem to be used anyway, so I don't think it matters.
- Your implementation of RFC5322's looks correct to me! :)
-
I think you PR looks good :)
ACKing discussion. I haven't had a chance to read it thoroughly but I will get to it this weekend.