grammars-v4

[bug]antlr4 deal with http protocol

Open 0x9k opened this issue 6 years ago • 27 comments

Hi, I am using ANTLR4 to parse the HTTP protocol according to the RFC. When I define this grammar

OWS
 :   (SP | HTAB)*
 ;

SP
 :   ' '
 ;

HTAB
 :   '\t'
 ;

I get the error mismatched input ' ' expecting OWS in the IDEA plugin.

Does anyone have a good idea how to solve this problem? Thanks a lot.

The full grammar is as follows:

grammar test;


/*
    HTTP-message =
    start‑line
    *( header‑field  CRLF )
    CRLF
    [ message‑body ]
*/
http_message
    :   start_line (header_field CRLF)* CRLF message_body?
    ;


/*
    start-line =
    request‑line /  status‑line
*/
start_line
    :   request_line
    ;


/*
    request-line =
    method  SP  request‑target  SP  HTTP‑version  CRLF
*/
request_line
    :   method SP request_target SP http_version CRLF
    ;


/*
    method =
    token
    ; "GET"
    ; → RFC 7231 – Section 4.3.1
    ; "HEAD"
    ; → RFC 7231 – Section 4.3.2
    ; "POST"
    ; → RFC 7231 – Section 4.3.3
    ; "PUT"
    ; → RFC 7231 – Section 4.3.4
    ; "DELETE"
    ; → RFC 7231 – Section 4.3.5
    ; "CONNECT"
    ; → RFC 7231 – Section 4.3.6
    ; "OPTIONS"
    ; → RFC 7231 – Section 4.3.7
    ; "TRACE"
    ; → RFC 7231 – Section 4.3.8
*/
method
    :   'GET'
    |   'HEAD'
    |   'POST'
    |   'PUT'
    |   'DELETE'
    |   'CONNECT'
    |   'OPTIONS'
    |   'TRACE'
    ;


/*
    SP =
    %x20
    ; space
*/
SP
    :   ' '
    ;



/*
    request-target =
    origin-form /  absolute-form /  authority-form /  asterisk-form
*/
request_target
    :   origin_form
    ;


/*
    origin-form =
    absolute-path  [ "?"  query ]
*/
origin_form
    :   absolute_path ('?' query)?
    ;


/*
    absolute-path =
    1*(  "/"  segment )
*/
absolute_path
    :   ('/' segment)+
    ;


/*
    segment =
    *pchar
*/
segment
    :   pchar*
    ;


/*
    pchar =
    unreserved /  pct‑encoded /  sub‑delims /  ":" /  "@"
*/
pchar
    :   unreserved  |   pct_encoded |   sub_delims  |   ':' |   '@'
    ;


/*
    unreserved =
    ALPHA /  DIGIT /  "-" /  "." /  "_" /  "~"
*/
unreserved
    :   ALPHA   |   DIGIT   |   '-' |   '.' |   '_' |   '~'
    ;



/*
    ALPHA =
    %x41‑5A /  %x61‑7A
    ; A‑Z  /  a‑z
*/
ALPHA
    :   [A-Za-z]
    ;


/*
    DIGIT =
    %x30‑39
    ; 0-9
*/
DIGIT
    :   [0-9]
    ;


/*
    pct-encoded =
    "%"  HEXDIG  HEXDIG
*/
pct_encoded
    :   '%' HEXDIG HEXDIG
    ;


/*
    HEXDIG =
    DIGIT /  "A" /  "B" /  "C" /  "D" /  "E" /  "F"
*/
HEXDIG
    :   DIGIT   |   'A' |   'B' |   'C' |   'D' |   'E' |   'F'
    ;



/*
    sub-delims =
    "!" /  "$" /  "&" /  "'" /  "(" /  ")" /  "*" /  "+" /  "," /  ";" /  "="
*/
sub_delims
    :   '!' |   '$' |   '&' |   '\''    |   '(' |   ')' |   '*' |   '+' |   ',' |   ';' |   '='
    ;



/*
    query =
    *(  pchar /  "/" /  "?" )
*/
query
    :   (pchar | '/' | '?')*
    ;




/*
    HTTP-version =
    HTTP-name '/'  DIGIT  "."  DIGIT
*/
http_version
    :   http_name DIGIT '.' DIGIT
    ;


/*
    HTTP-name =
    %x48.54.54.50
    ; "HTTP", case-sensitive
*/
http_name
    :   'HTTP/'
    ;



/*
    CRLF =
    CR  LF
    ; Internet standard newline
*/
CRLF
    :   '\n'
    ;



/*
    header-field =
    field-name  ":"  OWS  field-value  OWS 
*/
header_field
    :   field_name ':' OWS field_value OWS
    ;



/*
    field-name =
    token
*/
field_name
    :   token
    ;


/*
    token
*/
token
    :   tchar+
    ;


/*
    tchar =
    "!" /  "#" /  "$" /  "%" /  "&" /  "'" /  "*" /  "+" /  "-" /  "." /  "^" /  "_" /  "`" /  "|" /  "~" /  DIGIT /  ALPHA
*/
tchar
    :   '!' |   '#' |   '$' |   '%' |   '&' |   '\''    |   '*' |   '+' |   '-' |   '.' |   '^' |   '_' |   '`' |   '|' |   '~' |   DIGIT   |   ALPHA
    ;



/*
    OWS =
    *( SP /  HTAB )
    ; optional whitespace
*/
OWS
    :   (SP | HTAB)*
    ;

/*
    HTAB =
    %x09
    ; horizontal tab
*/
HTAB
    :   '\t'
    ;


/*
    field-value =
    *( field-content /  obs-fold )
*/
field_value
    :   (field_content | obs_fold)*
    ;


/*
    field-content =
    field-vchar  [ 1*( SP  /  HTAB )  field-vchar ]
*/
field_content
    :   field_vchar ((SP | HTAB)+ field_vchar)?
    ;



/*
    field-vchar =
    VCHAR /  obs-text
*/
field_vchar
    :   VCHAR
    |   obs_text
    ;


/*
    VCHAR =
    %x21-7E
    ; visible (printing) characters
*/
VCHAR
    :   [\u0021-\u007e]
    ;


/*
    obs-text =
    %x80-FF
*/
obs_text
    :   OBS_TEXT
    ;
OBS_TEXT
    :   [\u0080-\u00ff]
    ;


/*
    obs-fold =
    CRLF  1*( SP /  HTAB )     ; see  RFC 7230 – Section 3.2.4
*/
obs_fold
    :   CRLF (SP | HTAB)+
    ;


/*
    message-body =
    *OCTET
*/
message_body
    :   OCTET*
    ;


/*
    OCTET =
    %x00-FF
    ; 8 bits of data
*/
OCTET
    :   [\u0000-0x00ff]
    ;

0x9k avatar May 16 '19 12:05 0x9k

Hello. Could you give me some test data that both of us can test against?

Marti2203 avatar May 25 '19 15:05 Marti2203

> Hello. Could you give me some test data that both of us can test against?

Sure. For example:

POST /url?sa=t&source=web&rct=j&url=https://zh.wikipedia.org/zh-hans/111&ved=2ahUKEwjhwLuRtbjiAhUPRK0KHRSjDpwQFjAKegQIAxAB HTTP/1.1
Host: www.google.com.hk
Connection: close
Content-Length: 4
Ping-From: https://www.google.com.hk/search?safe=strict&ei=gx3qXOKuJ4a8tgX-ypWIDA&q=111&oq=111&gs_l=psy-ab.3..0l10.15337.16373..16590...0.0..0.783.890.0j1j6-1......0....1..gws-wiz.....0.hUqCCrrBI9s
Origin: https://www.google.com.hk
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Ping-To: https://zh.wikipedia.org/zh-hans/111
Content-Type: text/ping
Accept: */*
X-Client-Data: CIi2yQEIorbJAQjBtskBCKmdygEIqKPKAQjwpMoBCLGnygEI4qjKAQjxqcoBCK+sygEYz6rKAQ==
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cookie: NID=184=VqX86iUz6p-H_b2qbuogwjkmsk096DB-48jilOI9Pquzq8WT-aRbKsaH8UnMfvF9uHtuUtHhnJ7Z3F74bcpMNstJ5ADYV_tv09sXOJiwf3Yu-xsZ1E588v2tX6zA-J4K6c1t6t_PQP3jvtbVSdqw_YJqgU1elwvqkjzj0kBbk0I; 1P_JAR=2019-05-26-05; DV=42xzl48Lt5gpEFuauBIUhN0LQjoor5YtIbbBr4x5AQIAAAA

PING

OWS is optional whitespace

0x9k avatar May 26 '19 05:05 0x9k

Perfect. I will begin work on this tomorrow as I am away from my computer today.

Marti2203 avatar May 26 '19 05:05 Marti2203

> Perfect. I will begin work on this tomorrow as I am away from my computer today.

Cool. ;)

0x9k avatar May 26 '19 05:05 0x9k

The first issue I see is the OCTET definition, which I reworked as:

OCTET
    :   '\u0000' .. '\u00ff'
    ;
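
A minimal sketch of why the original character set is a problem, using Python's re module as a stand-in. It assumes ANTLR reads [\u0000-0x00ff] the way a regex character class would: 0x00ff is not an escape, so the set becomes the range \u0000 to '0' plus the literal characters x, 0 and f. The names broken and fixed are illustrative only.

```python
import re

# Assumption: a regex character class approximates an ANTLR lexer set here.
# "0x00ff" is not an escape, so this reads as: range \u0000..'0', then x, 0, 0, f, f.
broken = re.compile(r"[\u0000-0x00ff]")

# The intended set: every code point from \u0000 to \u00ff.
fixed = re.compile(r"[\u0000-\u00ff]")

for ch in ["\t", "0", "A", "f", "x", "\u00e9"]:
    in_broken = broken.fullmatch(ch) is not None
    in_fixed = fixed.fullmatch(ch) is not None
    print(f"{ch!r}: broken={in_broken} fixed={in_fixed}")
```

Under that assumption, ordinary letters such as 'A' fall outside the original set, which is one reason to prefer an explicit range with two proper escapes.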

Marti2203 avatar May 27 '19 09:05 Marti2203

The second issue is that OWS can match the empty string, which triggers a warning that must be taken care of.

Marti2203 avatar May 27 '19 09:05 Marti2203

The third issue is the warning 'rule http_message contains an optional block with at least one alternative that can match an empty string'. The culprit is message_body, which is optional and can itself match zero octets. I see two possibilities: 1) drop the ? on message_body, or 2) define message_body as OCTET+. If everything else is okay, I will go over the grammar and make a PR to add it to the repository.

Marti2203 avatar May 27 '19 09:05 Marti2203

It is not easy to display the tree in grun because every character becomes its own node, which I think is not ideal, so I will deviate from the RFC. The OWS issue is still present...

Marti2203 avatar May 27 '19 10:05 Marti2203

> It is not easy to display the tree in grun because every character becomes its own node, which I think is not ideal, so I will deviate from the RFC. The OWS issue is still present...

Thank you for the reply. I forgot to mention that a complete HTTP grammar should also cover the following points.

1. start-line in the RFC also includes status-line, which is used for responses.

2. request-target also includes absolute-form (used for requests to a proxy, other than a CONNECT or server-wide OPTIONS request), authority-form (only used for CONNECT requests), and asterisk-form (only used for a server-wide OPTIONS request).

RFC addr

0x9k avatar May 27 '19 11:05 0x9k

> The second issue is that OWS can match the empty string, which triggers a warning that must be taken care of.

Yeah, OWS is defined as optional whitespace.

RFC addr1

RFC add2

0x9k avatar May 27 '19 11:05 0x9k

I am sorry, but I do not have the time today to fully implement the HTTP protocol grammar, as I see it is big, but I will fix the current issue. One of the biggest subproblems is that the lexer tokens overlap, so lexer modes may need to be added.
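
To make the overlap concrete, here is a rough Python sketch that asks, for each single character, which rule of the first grammar would win. It assumes the usual ANTLR tie-break (among equally long matches, the rule declared first wins) and that implicit literals from parser rules rank ahead of named lexer rules. It ignores multi-character tokens such as 'GET', 'HTTP/' and longer OWS runs. The names RULES, winner and reachable are hypothetical helpers, not part of any ANTLR API.

```python
import re

# Single-character view of the first grammar, in declaration order.
RULES = [
    ("LITERAL", r"[?/:@\-._~%!$&'()*+,;=#^`|]"),  # implicit tokens (assumed first)
    ("SP", r" "),
    ("ALPHA", r"[A-Za-z]"),
    ("DIGIT", r"[0-9]"),
    ("HEXDIG", r"[0-9A-F]"),
    ("CRLF", r"\n"),
    ("OWS", r"[ \t]"),
    ("HTAB", r"\t"),
    ("VCHAR", r"[\x21-\x7e]"),
    ("OBS_TEXT", r"[\x80-\xff]"),
    ("OCTET", r"[\x00-\xff]"),
]

def winner(ch):
    """Return the first rule that matches the single character ch."""
    for name, pattern in RULES:
        if re.fullmatch(pattern, ch):
            return name
    return None

def reachable(name):
    """Characters in 0..255 for which the named rule actually wins."""
    return [chr(c) for c in range(256) if winner(chr(c)) == name]

print([winner(c) for c in "PING"])     # body bytes come out as ALPHA, not OCTET
print(reachable("HEXDIG"))             # shadowed by DIGIT and ALPHA
print(reachable("VCHAR"))              # only what earlier rules left over
print(len(reachable("OCTET")))         # mostly control characters remain
```

In this simplified model HEXDIG and HTAB can never be produced, and a body such as PING is tokenized as ALPHA characters, so a parser rule written as OCTET* would not see the tokens it expects.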

Marti2203 avatar May 27 '19 11:05 Marti2203

> > The second issue is that OWS can match the empty string, which triggers a warning that must be taken care of.
>
> Yeah, OWS is defined as optional whitespace.
>
> RFC addr1
>
> RFC add2

I understand that, but I do not think matching the empty string is the way to meet this requirement. It is better to make OWS an optional element that can match any amount of whitespace.
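
The difference between the two approaches can be sketched in Python with regular expressions standing in for the lexer rules. This is only an analogy under stated assumptions: OWS_STAR mirrors the original (SP | HTAB)* definition, while OWS_PLUS and skip_optional_ws are hypothetical names for a token that needs at least one character and is treated as optional by the caller.

```python
import re

OWS_STAR = re.compile(r"[ \t]*")  # mirrors OWS : (SP | HTAB)* and can match nothing
OWS_PLUS = re.compile(r"[ \t]+")  # hypothetical variant: one or more characters

def skip_optional_ws(text, pos):
    """Consume whitespace if present; otherwise stay where we are."""
    m = OWS_PLUS.match(text, pos)
    return m.end() if m else pos

header = "Host:  \twww.example.com"

# The star version "succeeds" at any offset without consuming input.
print(repr(OWS_STAR.match(header, 0).group()))

# The plus version only produces a match when whitespace is really there,
# and the optionality lives in the caller instead of in the token.
after_colon = header.index(":") + 1
value_start = skip_optional_ws(header, after_colon)
print(header[value_start:])
```

A token that can match zero characters never forces the tokenizer to advance, which is why the optionality is better expressed where the token is used.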

Marti2203 avatar May 27 '19 11:05 Marti2203

> The third issue is the warning 'rule http_message contains an optional block with at least one alternative that can match an empty string'. The culprit is message_body, which is optional and can itself match zero octets. I see two possibilities: 1) drop the ? on message_body, or 2) define message_body as OCTET+. If everything else is okay, I will go over the grammar and make a PR to add it to the repository.

Thank you for pointing out the problems. I'll test it again.

0x9k avatar May 27 '19 11:05 0x9k

No problem! :) Glad to help!

Marti2203 avatar May 27 '19 11:05 Marti2203

@Marti2203 Could you consider submitting a PR to add your HTTP grammar to grammars-v4?

teverett avatar Jun 18 '19 14:06 teverett

If @0x9k is sure that this is the full grammar in his example, then I will be glad to.

Marti2203 avatar Jun 18 '19 14:06 Marti2203

> If @0x9k is sure that this is the full grammar in his example, then I will be glad to.

Thank you for your work. I'm pretty sure, so let's submit a PR and add this grammar to grammars-v4. ;)

0x9k avatar Jun 19 '19 02:06 0x9k

Hello all, I'm trying to use this HTTP grammar, but I'm running into the same issue. I'm using the grammar that was committed to this repo in #1446 together with the test data in the repo and I get these errors:

line 2:5 extraneous input ' ' expecting {ALPHA, DIGIT, '\n', OWS, VCHAR, OBS_TEXT}
line 2:9 missing '\n' at '.'
line 2:20 missing ':' at '\n'
line 3:0 missing {' ', '\t'} at 'C'
line 3:10 mismatched input ':' expecting {' ', ALPHA, DIGIT, '\n', OWS, '\t', VCHAR, OBS_TEXT}

Any ideas what I'm doing wrong?

hoshsadiq avatar Feb 15 '20 15:02 hoshsadiq

Well, it's possible it's a grammar bug. Can you provide your input file?

teverett avatar Feb 15 '20 16:02 teverett

The input is:

POST / HTTP/1.1
Host: www.google.com
Connection: close
Content-Length: 4
Ping-From: https://www.google.com.hk/search?safe=strict&ei=gx3qXOKuJ4a8tgX-ypWIDA&q=111&oq=111&gs_l=psy-ab.3..0l10.15337.16373..16590...0.0..0.783.890.0j1j6-1......0....1..gws-wiz.....0.hUqCCrrBI9s
Origin: https://www.google.com.hk
Cache-Control: max-age=0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
Ping-To: https://zh.wikipedia.org/zh-hans/111
Content-Type: text/ping
Accept: */*
X-Client-Data: CIi2yQEIorbJAQjBtskBCKmdygEIqKPKAQjwpMoBCLGnygEI4qjKAQjxqcoBCK+sygEYz6rKAQ==
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Cookie: NID=184=VqX86iUz6p-H_b2qbuogwjkmsk096DB-48jilOI9Pquzq8WT-aRbKsaH8UnMfvF9uHtuUtHhnJ7Z3F74bcpMNstJ5ADYV_tv09sXOJiwf3Yu-xsZ1E588v2tX6zA-J4K6c1t6t_PQP3jvtbVSdqw_YJqgU1elwvqkjzj0kBbk0I; 1P_JAR=2019-05-26-05; DV=42xzl48Lt5gpEFuauBIUhN0LQjoor5YtIbbBr4x5AQIAAAA

PING

Also, I just realised, the message_body is commented out inside the repo, which I've uncommented, so just for good measure, this is the grammar I've got:

grammar http;

/*
 HTTP-message = start‑line ( header‑field  CRLF ) CRLF message‑body
 */
http_message: start_line (header_field CRLF)* CRLF message_body ;

/*
 start-line = request‑line / status‑line
 */
start_line: request_line;

/*
 request-line = method  SP  request‑target  SP  HTTP‑version  CRLF
 */
request_line: method SP request_target SP http_version CRLF;

/*
 method =
    token
    ; "GET"     → RFC 7231 – Section 4.3.1
    ; "HEAD"    → RFC 7231 – Section 4.3.2
    ; "POST"    → RFC 7231 – Section 4.3.3
    ; "PUT"     → RFC 7231 – Section 4.3.4
    ; "DELETE"  → RFC 7231 – Section 4.3.5
    ; "CONNECT" → RFC 7231 – Section 4.3.6
    ; "OPTIONS" → RFC 7231 – Section 4.3.7
    ; "TRACE"   → RFC 7231 – Section 4.3.8
 */
method:
	'GET'
	| 'HEAD'
	| 'POST'
	| 'PUT'
	| 'DELETE'
	| 'CONNECT'
	| 'OPTIONS'
	| 'TRACE';

/*
 request-target = origin-form / absolute-form / authority-form / asterisk-form
 */
request_target: origin_form;

/*
 origin-form = absolute-path  [ "?"  query ]
 */
origin_form: absolute_path (QuestionMark query)?;

/*
 absolute-path = 1*( "/"  segment )
 */
absolute_path: (Slash segment)+;

/*
 segment = pchar
 */
segment: pchar*;

/*
 query = ( pchar /  "/" /  "?" )
 */
query: (pchar | Slash | QuestionMark)*;

/*
 HTTP-version = HTTP-name '/' DIGIT  "."  DIGIT
 */
http_version: http_name DIGIT Dot DIGIT;

/*
 HTTP-name = %x48.54.54.50 ; "HTTP", case-sensitive
 */
http_name: 'HTTP/';


/*
 header-field = field-name  ":"  OWS  field-value  OWS 
 */
header_field: field_name Colon OWS* field_value OWS*;

/*
 field-name = token
 */
field_name: token;

/*
 token
 */
token: tchar+;
/*
 field-value = ( field-content / obs-fold )
 */
field_value: (field_content | obs_fold)+;

/*
 field-content = field-vchar [ 1*( SP / HTAB )  field-vchar ]
 */
field_content: field_vchar ((SP | HTAB)+ field_vchar)?;

/*
 field-vchar = VCHAR / obs-text
 */
field_vchar: vCHAR | obs_text;
/*
 obs-text = %x80-FF
 */
obs_text: OBS_TEXT;
/*
 obs-fold = CRLF  1*( SP / HTAB ) ; see RFC 7230 – Section 3.2.4
 */
obs_fold: CRLF (SP | HTAB)+;

/*
 message-body = OCTET
 */
message_body: OCTET*;


/*
 SP = %x20 ; space
 */
SP: ' ';
/*
 pchar = unreserved / pct‑encoded / sub‑delims / ":" / "@"
 */
pchar: unreserved | Pct_encoded | sub_delims | Colon | At;

/*
 unreserved = ALPHA /  DIGIT /  "-" /  "." /  "_" /  "~"
 */
unreserved: ALPHA | DIGIT | Minus | Dot | Underscore | Tilde;

/*
 ALPHA = %x41‑5A /  %x61‑7A ; A‑Z / a‑z
 */
ALPHA: [A-Za-z];

/*
 DIGIT = %x30‑39 ; 0-9
 */
DIGIT: [0-9];

/*
 pct-encoded = "%"  HEXDIG  HEXDIG
 */
Pct_encoded: Percent HEXDIG HEXDIG;

/*
 HEXDIG = DIGIT /  "A" /  "B" /  "C" /  "D" /  "E" /  "F"
 */
HEXDIG: DIGIT | 'A' | 'B' | 'C' | 'D' | 'E' | 'F';

/*
 sub-delims = "!" /  "$" /  "&" /  "'" /  "(" /  ")" /  "*" /  "+" /  "," /  ";" /  "="
 */
sub_delims:
	ExclamationMark
	| DollarSign
	| Ampersand
	| SQuote
	| LColumn
	| RColumn
	| Star
	| Plus
	| SemiColon
	| Period
	| Equals;

LColumn     : '(';
RColumn     : ')';
SemiColon   : ';';
Equals      : '=';
Period      : ',';

/*
 CRLF = CR  LF ; Internet standard newline
 */
CRLF: '\n';

/*
 tchar = "!" /  "#" /  "$" /  "%" /  "&" /  "'" /  "*" /  "+" /  "-" /  "." /  "^" /  "_" /  "`" / 
 "|" /  "~" /  DIGIT /  ALPHA
 */
tchar:
	  ExclamationMark
	| DollarSign
	| Hashtag
	| Percent
	| Ampersand
	| SQuote
	| Star
	| Plus
    | Minus
	| Dot
	| Caret
    | Underscore
	| BackQuote
	| VBar
	| Tilde
	| DIGIT
	| ALPHA;

Minus           : '-';
Dot             : '.';
Underscore      : '_';
Tilde           : '~';
QuestionMark    : '?';
Slash           : '/';
ExclamationMark : '!';
Colon           : ':';
At              : '@';
DollarSign      : '$';
Hashtag         : '#';
Ampersand       : '&';
Percent         : '%';
SQuote          : '\'';
Star            : '*';
Plus            : '+';
Caret           : '^';
BackQuote       : '`';
VBar            : '|';

/*
 OWS = ( SP / HTAB ) ; optional whitespace
 */
OWS: SP | HTAB;

/*
 HTAB = %x09 ; horizontal tab
 */
HTAB: '\t';


/*
 VCHAR = %x21-7E ; visible (printing) characters
 */
vCHAR: ALPHA | DIGIT | VCHAR;

VCHAR:
      ExclamationMark
	| '"'
	| Hashtag
	| DollarSign
	| Percent
	| Ampersand
	| SQuote
	| LColumn
	| RColumn
	| RColumn
	| Star
	| Plus
	| Period
	| Minus
	| Dot
	| Slash
	| Colon
	| SemiColon
	| '<'
	| Equals
	| '>'
	| QuestionMark
	| At
	| '['
	| '\\'
	| Caret
	| Underscore
	| ']'
	| BackQuote
	| '{'
	| '}'
	| VBar
	| Tilde;

OBS_TEXT: '\u0080' .. '\u00ff' ;

/*
 OCTET = %x00-FF ; 8 bits of data
 */
OCTET: '\u0000' .. '\u00ff' ;

Preview: (image not preserved)

hoshsadiq avatar Feb 15 '20 16:02 hoshsadiq

Is anyone able to help at all?

hoshsadiq avatar Feb 20 '20 23:02 hoshsadiq

So I've narrowed it down to the following:

grammar http;
header_field: 'H:' OWS* 'a' OWS*;
SP: ' ';
OWS: SP | HTAB;
HTAB: '\t';

with the following input:

H: a

which gives me the following error:

line 1:2 extraneous input ' ' expecting {'a', OWS}

And the following parse tree: (image not preserved)

Oddly, when I move the definition of SP to below the definition of OWS, it works fine. Though doing the same in the real http.g4 grammar file makes things worse. Removing the space in the input also works, but this is not ideal.
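
The ordering effect is consistent with how ANTLR lexers usually resolve ties: the longest match wins, and among equally long matches the rule declared first wins. The Python sketch below simulates that policy for the reduced grammar. It is a simplified model, not ANTLR itself: tokenize, LITERALS, sp_first and ows_first are hypothetical names, and implicit literals ('H:' and 'a') are assumed to rank ahead of named rules.

```python
import re

def tokenize(rules, text):
    """Longest match wins; on a tie, the rule listed first wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earliest rule when lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Implicit literal tokens from the parser rule (assumed to come first).
LITERALS = [("'H:'", r"H:"), ("'a'", r"a")]

# Order as posted: SP is declared before OWS.
sp_first = LITERALS + [("SP", r" "), ("OWS", r"[ \t]"), ("HTAB", r"\t")]

# Order after moving SP below OWS.
ows_first = LITERALS + [("OWS", r"[ \t]"), ("SP", r" "), ("HTAB", r"\t")]

print(tokenize(sp_first, "H: a"))
print(tokenize(ows_first, "H: a"))
```

With SP declared first, the space is emitted as an SP token, so a parser rule that only accepts OWS reports it as extraneous. With OWS declared first, the same space becomes OWS and the rule matches. In the full grammar other parser rules refer to SP directly, which may be why the same reordering makes things worse there. One way around the conflict, offered only as a suggestion, is to keep SP and HTAB as the tokens and express OWS as a parser rule over them.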

Any help would be much appreciated.

hoshsadiq avatar Apr 16 '20 16:04 hoshsadiq

I've been working on this here if anyone wants to jump in.

https://github.com/teverett/grammars-v4/tree/http

teverett avatar Apr 19 '20 15:04 teverett

I don't know what I'm doing wrong, but that branch gives me the following errors:

line 1:5 mismatched input '/url?sa=t&source=web&rct=j&url=https://zh.wikipedia.org/zh-hans/111&ved=2ahUKEwjhwLuRtbjiAhUPRK0KHRSjDpwQFjAKegQIAxAB' expecting '/'
line 1:123 mismatched input 'HTTP/1.1' expecting 'HTTP/'
line 2:0 mismatched input 'Host:' expecting {'\n', TCHAR}

I've not changed anything and am using the test case in that branch as well.

hoshsadiq avatar Apr 20 '20 16:04 hoshsadiq

@hoshsadiq it's a work in progress.

teverett avatar Apr 21 '20 01:04 teverett

@teverett I noticed the branch is gone without it being merged. Were you ever able to finish it?

hoshsadiq avatar Aug 24 '20 14:08 hoshsadiq

Hello @teverett, first, thank you for the effort in providing a grammar for HTTP protocol requests. It seems that I am hitting the same bugs as the other users in this thread, so I was wondering whether you had the chance to address them or whether the project is on hold. Thank you for your time!

Jacopobracaloni avatar Jan 19 '24 14:01 Jacopobracaloni