ocaml-webmachine Wm

while reviewing the diff between 0.5 and 0.6, I couldn't resist to lookup the specification of HTTP 1.1 and especially the Date header (see (obsoleted) RFC 2616 Sec 3.3 and RFC 7231 Section 7.1.1.1) as references):

An HTTP-date value represents time as an instance of Coordinated Universal Time (UTC). -- why does the parser in Wm_util include (i.e. are there clients out there who actually set a different timezone which webmachine needs to support - and what is the reason to hardcode the 8 timezones below)?

          | "" | "Z" | "GMT" | "UTC" | "UT" -> 0
          | "PST" -> -480
          | "MST" | "PDT" -> -420
          | "CST" | "MDT" -> -360
          | "EST" | "CDT" -> -300
          | "EDT" -> -240
          | s -> Scanf.sscanf s "%c%02d%_[:]%02d" (fun sign hour min ->
              min + hour * (if sign = '-' then -60 else 60))

HTTP-date is case sensitive. this doesn't seem to be taken care of in webmachine at all
I fail to understand why year is bound to if year < 50 then 2000 + year else if year < 1000 then year + 1000 after %4d was used in sscanf (which parses a 4 digit number)

The function is named parse_rfc1123_date (which may actually parse a full RFC 1123 timestamp), but in HTTP the HTTP-Date is slightly different (a subset thereof, plus allowing other formats) according to the RFCs mentioned above.

Using sscanf is dangerous: there can be characters at the end of the input which are not matched by the expression, thus a date such as Mon, 10 Mar 1994 10:20:30 a0040abcdef is accepted by webmachine.

Should we revise the implementation to (a) include some test cases and (b) be more strict about its input? It is not clear to me whether there are still clients out there (which we would like to talk to) that don't use IMF-fixdate, but RFC 850 (Sunday, 06-Nov-94 08:49:37 GMT) or asctime () (Sun Nov 6 08:49:37 1994), which are both required by 7231. If this is worthwhile, I suspect using angstrom for the parser would be more convenient than the OCaml stdlib utilities. or is a dependency onto angstrom in webmachine intentionally avoided?

/cc @ansiwen @seliopou

May 23 '18 11:05 hannesm

I only looked into RFC 1123 and never verified if this is actually the correct RFC for HTTP headers. And RFC 1123 Section 5.2.14 states:

There is a strong trend towards the use of numeric timezone
indicators, and implementations SHOULD use numeric timezones
instead of timezone names.  However, all implementations MUST
accept either notation.  If timezone names are used, they MUST
be exactly as defined in RFC-822.

and RFC 822 Section 5.1. states:

zone        =  "UT"  / "GMT"                ; Universal Time
                                            ; North American : UT
            /  "EST" / "EDT"                ;  Eastern:  - 5/ - 4
            /  "CST" / "CDT"                ;  Central:  - 6/ - 5
            /  "MST" / "MDT"                ;  Mountain: - 7/ - 6
            /  "PST" / "PDT"                ;  Pacific:  - 8/ - 7
            /  1ALPHA                       ; Military: Z = UT;
                                            ;  A:-1; (J not used)
                                            ;  M:-12; N:+1; Y:+12
            / ( ("+" / "-") 4DIGIT )        ; Local differential
                                            ;  hours+min. (HHMM)

You are right, the year calculations are useless, because it's always parsed with 4 digits. I didn't spend too much time on that parser, I just wanted something independent of CalendarLib that is at least not worse than the old parser. So I avoided to read RFCs as much as possible, which is not a joy for me anyway. So if we need a different parser, and not RFC1123, feel free to replace it. So we should implement an RFC7231 parser instead? Regarding the accepted input I would go with Postel's law and be liberal about it as long as it doesn't impair security. But also here I don't have a strong opinion.

May 23 '18 14:05 ansiwen

FWIW I haven't been able to find any examples of clients that send timezones alternative to "GMT", and I believe that doing so is wrong.

I checked apache's httpd, and nginx, and they seem to agree:

nginx uses this: https://github.com/nginx/nginx/blob/b900cc28fcbb4cf5a32ab62f80b59292e1c85b4b/src/core/ngx_parse_time.c ie they don't pay attention to the timezone specifications from rfc850/822
ap_recent_rfc822_date: https://github.com/apache/httpd/blob/0d487913fcf433ac05526330f462a99d29dc9b25/server/util_time.c#L303-L305
apr_date_parse_http (which seems to be the main function): https://github.com/apache/apr-util/blob/1.7.x/include/apr_date.h#L73-L98
- the documentation for the function implementation is a bit more thorough: https://github.com/apache/apr-util/blob/1.7.x/misc/apr_date.c#L136-L137
- note that they also note that the timezone is ignored, for the "sake of handling Netscapeness", but the timezone configured is 0 (ie GMT, at the bottom of the function).
I curl locally, and it seems to send "GMT" when it does send a date.

May 24 '18 08:05 cfcs

to address @hannesm's points:

Ignoring output after the timestamp seems to be what both nginx and apache does (see apache's use of * in the apr_date_checkmask(date, "## @$$ #### ##:##:## *") calls), so this is compliant with that. Specifying anything but GMT seems to be illegal, so the question is if you want to penalize illegal clients by failing the request, or silently use their broken date header. Given that the Date header is not extremely important for anything, and is mostly useful for the If- headers, I don't think ignoring broken clients is a problem here.
I can't think of any HTTP/1.1 headers names that aren't case-insensitive, so if webmachine doesn't handle that, it seems to be a bug. It sure is annoying since it increases time spent in the parser, but that's what you get for using a crappy protocol.
The 50 years thing: RFC7231 (HTTP/1.1) says:

Recipients of a timestamp value in rfc850-date format, which uses a two-digit year, MUST interpret a timestamp that appears to be more than 50 years in the future as representing the most recent year in the past that had the same last two digits.
- It follows that whoever wrote the spec considered 2-digit years legal, and the implementation if year < 50 then 2000 + year else if year < 1000 then year + 1000 is wrong. Take the examples 17 ; 40; 70.
- 17: 2017 is not more than 50 years after 2018, so it should be 2017.
- 40: 2040 is not more than 50 years after 2018, so it should be 2040.
- 51: 2051 is not more than 50 years after 2018, so it should be 2051.
- 70: 2070 is more than 50 years after 2018, so it should be 1970. With the implementation cited above applied two 2-digit years those would be:
- 17: 2017
- 40: 2040
- 51: 1051
- 70: 1070 - this is a ridiculous year in the context of purporting the be the time of generation of the HTTP request. I believe the logic should be:
```
 (* handle two-digit years: *)
 if year < 0 then failwith "no negative years";
 if year <= (current_year mod 100) + 50 then
    year = (current_year - current_year mod 100) + year
 else
    year = (current_year - 100 - current_year mod 100) + year
```
Furthermore, as @hannesm points out it doesn't make sense to use this logic on input from a four-digit year, since 0017 is a valid four-digit year, and while nonsense in a real-world context, it could very well be used to specify HTTP-requests prophecized by Jesus. It doesn't make sense to apply it to four-digit years either, consider 0999 turning into 1999 and 1000 turning into 1000; that's almost a thousand years off. But of course this is a bit pedantic; I couldn't find any clients that actually send two-digit years so I'd be inclined to not support that.

May 24 '18 09:05 cfcs

@cfcs @hannesm

Regarding the time-zones: I agree, they are not required (they don't hurt either), we can safely dump them.
IMF-Date months are case-sensitive, this parser is case-sensitive, I don't see the problem here.
Regarding the year, there are two things to make clear:
- With Scanf a %4d reads an integer having at most 4 decimal digits, so it would still match two digit integers. So the calculations would still make sense in this case.
- The calculations don't follow RFC850, but as the commit message mentioned RFC 1123 and RFC 5322, and there they have - surprisingly :roll_eyes: - a different understanding: RFC 5322 Sec. 4.3 says:
  
  Where a two or three digit year occurs in a date, the year is to be interpreted as follows: If a two digit year is encountered whose value is between 00 and 49, the year is interpreted by adding 2000, ending up with a value between 2000 and 2049. If a two digit year is encountered with a value between 50 and 99, or any three digit year is encountered, the year is interpreted by adding 1900.
  
  So I boldly claim, my calculations are correct. :smirk:
I think we are over-engineering a bit here, since this is meant to be not a validator but a simple parser that is able to understand the most common format, which appears to be called IMF-fixdate. And it does that already now. However, because I also like lean code, I simplified the parser a bit, assuming IMF-fixdate input, and made the scanf format a bit stricter. See PR #92.

May 24 '18 10:05 ansiwen

Timezones: Nice!
IMF-Date-stuff: I had completely misunderstood, I thought we were talking about the header name (daTe: vs Date: etc), sorry!
Scanf: We lose the ability to tell 0017 from 2017, but that's fine since 0017 would not be legal to generate anyway.
Calculations:
- Well it should follow RFC7231 from 2014, but oh well, this doesn't really seem to be used anyway, so I don't think it matters.
- Your implementation of RFC5322's looks correct to me! :)
I think you PR looks good :)

May 24 '18 15:05 cfcs

ACKing discussion. I haven't had a chance to read it thoroughly but I will get to it this weekend.

May 31 '18 12:05 seliopou

Wm_util.Date?