grammars-v4
[bug] antlr4: dealing with the HTTP protocol
Hi, I am using ANTLR4 to parse the HTTP protocol according to the RFC. When I define these rules:

OWS
: (SP | HTAB)*
;
SP
: ' '
;
HTAB
: '\t'
;
I get the error mismatched input ' ' expecting OWS in the IDEA plugin.
Does anyone have a good idea how to solve this problem? Thanks a lot.
The full grammar is as follows:
grammar test;
/*
HTTP-message =
start‑line
*( header‑field CRLF )
CRLF
[ message‑body ]
*/
http_message
: start_line (header_field CRLF)* CRLF message_body?
;
/*
start-line =
request‑line / status‑line
*/
start_line
: request_line
;
/*
request-line =
method SP request‑target SP HTTP‑version CRLF
*/
request_line
: method SP request_target SP http_version CRLF
;
/*
method =
token
; "GET"
; → RFC 7231 – Section 4.3.1
; "HEAD"
; → RFC 7231 – Section 4.3.2
; "POST"
; → RFC 7231 – Section 4.3.3
; "PUT"
; → RFC 7231 – Section 4.3.4
; "DELETE"
; → RFC 7231 – Section 4.3.5
; "CONNECT"
; → RFC 7231 – Section 4.3.6
; "OPTIONS"
; → RFC 7231 – Section 4.3.7
; "TRACE"
; → RFC 7231 – Section 4.3.8
*/
method
: 'GET'
| 'HEAD'
| 'POST'
| 'PUT'
| 'DELETE'
| 'CONNECT'
| 'OPTIONS'
| 'TRACE'
;
/*
SP =
%x20
; space
*/
SP
: ' '
;
/*
request-target =
origin-form / absolute-form / authority-form / asterisk-form
*/
request_target
: origin_form
;
/*
origin-form =
absolute-path [ "?" query ]
*/
origin_form
: absolute_path ('?' query)?
;
/*
absolute-path =
1*( "/" segment )
*/
absolute_path
: ('/' segment)+
;
/*
segment =
*pchar
*/
segment
: pchar*
;
/*
pchar =
unreserved / pct‑encoded / sub‑delims / ":" / "@"
*/
pchar
: unreserved | pct_encoded | sub_delims | ':' | '@'
;
/*
unreserved =
ALPHA / DIGIT / "-" / "." / "_" / "~"
*/
unreserved
: ALPHA | DIGIT | '-' | '.' | '_' | '~'
;
/*
ALPHA =
%x41‑5A / %x61‑7A
; A‑Z / a‑z
*/
ALPHA
: [A-Za-z]
;
/*
DIGIT =
%x30‑39
; 0-9
*/
DIGIT
: [0-9]
;
/*
pct-encoded =
"%" HEXDIG HEXDIG
*/
pct_encoded
: '%' HEXDIG HEXDIG
;
/*
HEXDIG =
DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
*/
HEXDIG
: DIGIT | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
;
/*
sub-delims =
"!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
*/
sub_delims
: '!' | '$' | '&' | '\'' | '(' | ')' | '*' | '+' | ',' | ';' | '='
;
/*
query =
*( pchar / "/" / "?" )
*/
query
: (pchar | '/' | '?')*
;
/*
HTTP-version =
HTTP-name '/' DIGIT "." DIGIT
*/
http_version
: http_name DIGIT '.' DIGIT
;
/*
HTTP-name =
%x48.54.54.50
; "HTTP", case-sensitive
*/
http_name
: 'HTTP/'
;
/*
CRLF =
CR LF
; Internet standard newline
*/
CRLF
: '\n'
;
/*
header-field =
field-name ":" OWS field-value OWS
*/
header_field
: field_name ':' OWS field_value OWS
;
/*
field-name =
token
*/
field_name
: token
;
/*
token
*/
token
: tchar+
;
/*
tchar =
"!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
*/
tchar
: '!' | '#' | '$' | '%' | '&' | '\'' | '*' | '+' | '-' | '.' | '^' | '_' | '`' | '|' | '~' | DIGIT | ALPHA
;
/*
OWS =
*( SP / HTAB )
; optional whitespace
*/
OWS
: (SP | HTAB)*
;
/*
HTAB =
%x09
; horizontal tab
*/
HTAB
: '\t'
;
/*
field-value =
*( field-content / obs-fold )
*/
field_value
: (field_content | obs_fold)*
;
/*
field-content =
field-vchar [ 1*( SP / HTAB ) field-vchar ]
*/
field_content
: field_vchar ((SP | HTAB)+ field_vchar)?
;
/*
field-vchar =
VCHAR / obs-text
*/
field_vchar
: VCHAR
| obs_text
;
/*
VCHAR =
%x21-7E
; visible (printing) characters
*/
VCHAR
: [\u0021-\u007e]
;
/*
obs-text =
%x80-FF
*/
obs_text
: OBS_TEXT
;
OBS_TEXT
: [\u0080-\u00ff]
;
/*
obs-fold =
CRLF 1*( SP / HTAB ) ; see RFC 7230 – Section 3.2.4
*/
obs_fold
: CRLF (SP | HTAB)+
;
/*
message-body =
*OCTET
*/
message_body
: OCTET*
;
/*
OCTET =
%x00-FF
; 8 bits of data
*/
OCTET
: [\u0000-0x00ff]
;
Hello. Could you give me some test data, which both of us can make tests on?
Sure. For example
POST /url?sa=t&source=web&rct=j&url=https://zh.wikipedia.org/zh-hans/111&ved=2ahUKEwjhwLuRtbjiAhUPRK0KHRSjDpwQFjAKegQIAxAB HTTP/1.1
Host: www.google.com.hk
Connection: close
Content-Length: 4
Ping-From: https://www.google.com.hk/search?safe=strict&ei=gx3qXOKuJ4a8tgX-ypWIDA&q=111&oq=111&gs_l=psy-ab.3..0l10.15337.16373..16590...0.0..0.783.890.0j1j6-1......0....1..gws-wiz.....0.hUqCCrrBI9s
Origin: https://www.google.com.hk
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Ping-To: https://zh.wikipedia.org/zh-hans/111
Content-Type: text/ping
Accept: */*
X-Client-Data: CIi2yQEIorbJAQjBtskBCKmdygEIqKPKAQjwpMoBCLGnygEI4qjKAQjxqcoBCK+sygEYz6rKAQ==
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cookie: NID=184=VqX86iUz6p-H_b2qbuogwjkmsk096DB-48jilOI9Pquzq8WT-aRbKsaH8UnMfvF9uHtuUtHhnJ7Z3F74bcpMNstJ5ADYV_tv09sXOJiwf3Yu-xsZ1E588v2tX6zA-J4K6c1t6t_PQP3jvtbVSdqw_YJqgU1elwvqkjzj0kBbk0I; 1P_JAR=2019-05-26-05; DV=42xzl48Lt5gpEFuauBIUhN0LQjoor5YtIbbBr4x5AQIAAAA
PING
OWS is optional whitespace
Perfect. I will begin work on this tomorrow as I am away from my computer today.
Cool. ;)
First issue I see is the octet definition, which I reworked as
OCTET
: '\u0000' .. '\u00ff'
;
Second issue is that OWS can match the empty string which is a warning that must be taken care of.
Third is 'rule http_message contains an optional block with at least one alternative that can match an empty string'. The culprit was message_body, which is optional and can also contain no octets. I see two possibilities: 1) drop the ? on message_body, or 2) define message_body as OCTET+. If everything is okay with that, I will look at the grammar and make a PR for it to be added to the repository.
It is not easy to display the tree in grun as every character is a node, which I think is not cool, so I will deviate from the RFC and the OWS issue is still present...
Thank you for the reply. I forgot to mention that a grammar for the whole HTTP protocol should also cover the following points:
1. The start-line in the RFC also includes status-line, which is used for responses.
2. The request-target also includes absolute-form (for a request to a proxy, other than a CONNECT or server-wide OPTIONS request), authority-form (only used for CONNECT requests), and asterisk-form (only used for a server-wide OPTIONS request).
"Second issue is that OWS can match the empty string which is a warning that must be taken care of."
Yes, OWS is defined as optional whitespace.
I am sorry, but I do not have the time today to fully implement the http protocol grammar as I see it is big, but I will fix the current issue. One of the biggest subproblems is that the lexer tokens are overlapping and maybe lexer modes will need to be added.
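To illustrate the overlap mentioned above: ANTLR 4's lexer takes the longest match at each position and, on a tie, the rule declared first. The Python sketch below only simulates that selection for a subset of the posted lexer rules. The `RULES` table and the `tokenize` helper are illustrative names, not part of ANTLR, and implicit literal tokens such as ':' are left out.

```python
import re

# Subset of the posted lexer rules, in declaration order (illustrative only).
RULES = [
    ("SP", re.compile(r" ")),
    ("ALPHA", re.compile(r"[A-Za-z]")),
    ("DIGIT", re.compile(r"[0-9]")),
    ("HTAB", re.compile(r"\t")),
    ("VCHAR", re.compile(r"[\x21-\x7e]")),
]

def tokenize(text):
    """Return token names using longest match, ties going to the earlier rule."""
    names = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strict '>' means a later rule only wins with a longer match.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        names.append(best_name)
        pos = best_end
    return names

print(tokenize("a1<"))  # ['ALPHA', 'DIGIT', 'VCHAR']
```

Under these assumptions a parser rule such as field_vchar : VCHAR | obs_text would not accept the letter a, because the lexer has already labelled it ALPHA before the parser sees it.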
"Yes, OWS is defined as optional whitespace."
I understand that, but I do not think a token that matches the empty string is the way to meet this requirement. It is better to make OWS an optional element that can consume any amount of whitespace.
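One way to read this suggestion is to keep SP and HTAB as the only whitespace tokens and to express "zero or more" at the parser level, where matching nothing is harmless. The sketch below is a hand-written Python stand-in for such a parser rule, not ANTLR output. The toy header shape ('H:' followed by 'a') and the name `parse_header_field` are assumptions made for illustration.

```python
WHITESPACE = {" ": "SP", "\t": "HTAB"}

def parse_header_field(text):
    """Parse 'H:' ows 'a' ows and return (leading, trailing) whitespace counts."""
    # Lexing: whitespace becomes SP/HTAB, any other character is its own token.
    tokens = [WHITESPACE.get(ch, ch) for ch in text]
    pos = 0

    def expect(token):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != token:
            raise SyntaxError(f"expected {token!r} at token {pos}")
        pos += 1

    def ows():
        # Parser-level (SP | HTAB)*: consuming nothing is allowed here.
        nonlocal pos
        start = pos
        while pos < len(tokens) and tokens[pos] in ("SP", "HTAB"):
            pos += 1
        return pos - start

    expect("H")
    expect(":")
    leading = ows()
    expect("a")
    trailing = ows()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at {pos}")
    return leading, trailing

print(parse_header_field("H: a"))  # (1, 0)
print(parse_header_field("H:a"))   # (0, 0)
```

Because the lexer never has to decide between SP and OWS here, the same space character can serve both as the mandatory SP in the request line and as optional whitespace in a header.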
"Third is 'rule http_message contains an optional block with at least one alternative that can match an empty string'. The culprit was message_body, which is optional and can also contain no octets. I see two possibilities: 1) drop the ? on message_body, or 2) define message_body as OCTET+. If everything is okay with that, I will look at the grammar and make a PR for it to be added to the repository."
Thank you for pointing out the problems. I'm testing it again.
No problem! :) Glad to help!
@Marti2203 Could you consider submitting a PR to add your HTTP grammar to grammars-v4?
If @0x9k is sure that this is the full grammar in his example, then I will be glad to.
Thank you for your work. I'm pretty sure, so let's submit a PR and add this grammar to grammars-v4. ;)
Hello all, I'm trying to use this HTTP grammar, but I'm running into what looks like the same issue. I'm using the grammar that was committed to this repo in #1446 together with the test data in the repo, and I get these errors:
line 2:5 extraneous input ' ' expecting {ALPHA, DIGIT, '\n', OWS, VCHAR, OBS_TEXT}
line 2:9 missing '\n' at '.'
line 2:20 missing ':' at '\n'
line 3:0 missing {' ', '\t'} at 'C'
line 3:10 mismatched input ':' expecting {' ', ALPHA, DIGIT, '\n', OWS, '\t', VCHAR, OBS_TEXT}
Any ideas what I'm doing wrong?
Well, it's possible it's a grammar bug. Can you provide your input file?
The input is:
POST / HTTP/1.1
Host: www.google.com
Connection: close
Content-Length: 4
Ping-From: https://www.google.com.hk/search?safe=strict&ei=gx3qXOKuJ4a8tgX-ypWIDA&q=111&oq=111&gs_l=psy-ab.3..0l10.15337.16373..16590...0.0..0.783.890.0j1j6-1......0....1..gws-wiz.....0.hUqCCrrBI9s
Origin: https://www.google.com.hk
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Ping-To: https://zh.wikipedia.org/zh-hans/111
Content-Type: text/ping
Accept: */*
X-Client-Data: CIi2yQEIorbJAQjBtskBCKmdygEIqKPKAQjwpMoBCLGnygEI4qjKAQjxqcoBCK+sygEYz6rKAQ==
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cookie: NID=184=VqX86iUz6p-H_b2qbuogwjkmsk096DB-48jilOI9Pquzq8WT-aRbKsaH8UnMfvF9uHtuUtHhnJ7Z3F74bcpMNstJ5ADYV_tv09sXOJiwf3Yu-xsZ1E588v2tX6zA-J4K6c1t6t_PQP3jvtbVSdqw_YJqgU1elwvqkjzj0kBbk0I; 1P_JAR=2019-05-26-05; DV=42xzl48Lt5gpEFuauBIUhN0LQjoor5YtIbbBr4x5AQIAAAA
PING
Also, I just realised, the message_body is commented out inside the repo, which I've uncommented, so just for good measure, this is the grammar I've got:
grammar http;
/*
HTTP-message = start‑line ( header‑field CRLF ) CRLF message‑body
*/
http_message: start_line (header_field CRLF)* CRLF message_body ;
/*
start-line = request‑line / status‑line
*/
start_line: request_line;
/*
request-line = method SP request‑target SP HTTP‑version CRLF
*/
request_line: method SP request_target SP http_version CRLF;
/*
method =
token
; "GET" → RFC 7231 – Section 4.3.1
; "HEAD" → RFC 7231 – Section 4.3.2
; "POST" → RFC 7231 – Section 4.3.3
; "PUT" → RFC 7231 – Section 4.3.4
; "DELETE" → RFC 7231 – Section 4.3.5
; "CONNECT" → RFC 7231 – Section 4.3.6
; "OPTIONS" → RFC 7231 – Section 4.3.7
; "TRACE" → RFC 7231 – Section 4.3.8
*/
method:
'GET'
| 'HEAD'
| 'POST'
| 'PUT'
| 'DELETE'
| 'CONNECT'
| 'OPTIONS'
| 'TRACE';
/*
request-target = origin-form / absolute-form / authority-form / asterisk-form
*/
request_target: origin_form;
/*
origin-form = absolute-path [ "?" query ]
*/
origin_form: absolute_path (QuestionMark query)?;
/*
absolute-path = 1*( "/" segment )
*/
absolute_path: (Slash segment)+;
/*
segment = pchar
*/
segment: pchar*;
/*
query = ( pchar / "/" / "?" )
*/
query: (pchar | Slash | QuestionMark)*;
/*
HTTP-version = HTTP-name '/' DIGIT "." DIGIT
*/
http_version: http_name DIGIT Dot DIGIT;
/*
HTTP-name = %x48.54.54.50 ; "HTTP", case-sensitive
*/
http_name: 'HTTP/';
/*
header-field = field-name ":" OWS field-value OWS
*/
header_field: field_name Colon OWS* field_value OWS*;
/*
field-name = token
*/
field_name: token;
/*
token
*/
token: tchar+;
/*
field-value = ( field-content / obs-fold )
*/
field_value: (field_content | obs_fold)+;
/*
field-content = field-vchar [ 1*( SP / HTAB ) field-vchar ]
*/
field_content: field_vchar ((SP | HTAB)+ field_vchar)?;
/*
field-vchar = VCHAR / obs-text
*/
field_vchar: vCHAR | obs_text;
/*
obs-text = %x80-FF
*/
obs_text: OBS_TEXT;
/*
obs-fold = CRLF 1*( SP / HTAB ) ; see RFC 7230 – Section 3.2.4
*/
obs_fold: CRLF (SP | HTAB)+;
/*
message-body = OCTET
*/
message_body: OCTET*;
/*
SP = %x20 ; space
*/
SP: ' ';
/*
pchar = unreserved / pct‑encoded / sub‑delims / ":" / "@"
*/
pchar: unreserved | Pct_encoded | sub_delims | Colon | At;
/*
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
*/
unreserved: ALPHA | DIGIT | Minus | Dot | Underscore | Tilde;
/*
ALPHA = %x41‑5A / %x61‑7A ; A‑Z / a‑z
*/
ALPHA: [A-Za-z];
/*
DIGIT = %x30‑39 ; 0-9
*/
DIGIT: [0-9];
/*
pct-encoded = "%" HEXDIG HEXDIG
*/
Pct_encoded: Percent HEXDIG HEXDIG;
/*
HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
*/
HEXDIG: DIGIT | 'A' | 'B' | 'C' | 'D' | 'E' | 'F';
/*
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
*/
sub_delims:
ExclamationMark
| DollarSign
| Ampersand
| SQuote
| LColumn
| RColumn
| Star
| Plus
| SemiColon
| Period
| Equals;
LColumn : '(';
RColumn : ')';
SemiColon : ';';
Equals : '=';
Period : ',';
/*
CRLF = CR LF ; Internet standard newline
*/
CRLF: '\n';
/*
tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." / "^" / "_" / "`" /
"|" / "~" / DIGIT / ALPHA
*/
tchar:
ExclamationMark
| DollarSign
| Hashtag
| Percent
| Ampersand
| SQuote
| Star
| Plus
| Minus
| Dot
| Caret
| Underscore
| BackQuote
| VBar
| Tilde
| DIGIT
| ALPHA;
Minus : '-';
Dot : '.';
Underscore : '_';
Tilde : '~';
QuestionMark : '?';
Slash : '/';
ExclamationMark : '!';
Colon : ':';
At : '@';
DollarSign : '$';
Hashtag : '#';
Ampersand : '&';
Percent : '%';
SQuote : '\'';
Star : '*';
Plus : '+';
Caret : '^';
BackQuote : '`';
VBar : '|';
/*
OWS = ( SP / HTAB ) ; optional whitespace
*/
OWS: SP | HTAB;
/*
HTAB = %x09 ; horizontal tab
*/
HTAB: '\t';
/*
VCHAR = %x21-7E ; visible (printing) characters
*/
vCHAR: ALPHA | DIGIT | VCHAR;
VCHAR:
ExclamationMark
| '"'
| Hashtag
| DollarSign
| Percent
| Ampersand
| SQuote
| LColumn
| RColumn
| RColumn
| Star
| Plus
| Period
| Minus
| Dot
| Slash
| Colon
| SemiColon
| '<'
| Equals
| '>'
| QuestionMark
| At
| '['
| '\\'
| Caret
| Underscore
| ']'
| BackQuote
| '{'
| '}'
| VBar
| Tilde;
OBS_TEXT: '\u0080' .. '\u00ff' ;
/*
OCTET = %x00-FF ; 8 bits of data
*/
OCTET: '\u0000' .. '\u00ff' ;
Is anyone able to help at all?
So I've narrowed it down to the following:
grammar http;
header_field: 'H:' OWS* 'a' OWS*;
SP: ' ';
OWS: SP | HTAB;
HTAB: '\t';
with the following input:
H: a
which gives me the following error:
line 1:2 extraneous input ' ' expecting {'a', OWS}
And the following parse tree:

Oddly, when I move the definition of SP to below the definition of OWS, it works fine. Though doing the same in the real http.g4 grammar file makes things worse. Removing the space in the input also works, but this is not ideal.
Any help would be much appreciated.
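A likely explanation for the reordering effect: ANTLR 4's lexer chooses the longest match, and when two rules match the same length, the one declared first wins. A single space matches both SP and OWS with length one, so the declaration order decides which token the parser receives. The Python sketch below simulates only that tie-break; `first_token` and the two rule lists are illustrative names, not ANTLR API.

```python
import re

def first_token(rules, text):
    """Name of the rule chosen at the start of text: longest match, then first declared."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        match = re.match(pattern, text)
        # Strict '>' keeps the earlier rule when lengths are equal.
        if match and len(match.group()) > best_len:
            best_name, best_len = name, len(match.group())
    return best_name

# Order from the reduced grammar: SP is declared before OWS.
sp_first = [("SP", r" "), ("OWS", r"[ \t]"), ("HTAB", r"\t")]
# Same rules with SP moved below OWS.
ows_first = [("OWS", r"[ \t]"), ("SP", r" "), ("HTAB", r"\t")]

print(first_token(sp_first, " a"))   # SP
print(first_token(ows_first, " a"))  # OWS
```

With SP first, the parser rule 'H:' OWS* 'a' OWS* is handed an SP token it has no place for, which matches the reported "extraneous input ' '". In the full grammar the request line needs that same space to be SP, which may be why simply moving the rule makes things worse there.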
I've been working on this here if anyone wants to jump in.
https://github.com/teverett/grammars-v4/tree/http
I don't know what I'm doing wrong, but that branch gives me the following errors:
line 1:5 mismatched input '/url?sa=t&source=web&rct=j&url=https://zh.wikipedia.org/zh-hans/111&ved=2ahUKEwjhwLuRtbjiAhUPRK0KHRSjDpwQFjAKegQIAxAB' expecting '/'
line 1:123 mismatched input 'HTTP/1.1' expecting 'HTTP/'
line 2:0 mismatched input 'Host:' expecting {'\n', TCHAR}
I've not changed anything and using the test case in that branch as well.
@hoshsadiq it's a work in progress.
@teverett I noticed the branch is gone without having been merged. Were you ever able to finish it?
Hello @teverett, first, thank you for the effort in providing a grammar for HTTP protocol requests. I seem to be hitting the same bugs as the other users in this thread, so I was wondering whether you had a chance to address them or whether the project is on hold. Thank you for your time!