lark permutes all the optional rules/terminals - Loading a grammar file takes >5 mins for a rule with many optionals

Open yjung-lyft opened this issue 4 years ago • 4 comments

I've created a grammar file based on the Hive CREATE syntax: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

It's ambiguous and uses some number of terminals, but loading takes too much time and the serialized parser is >100MB (>600MB when debug=True).

This is the code I used to run:

    from lark import Lark

    # Strip backslash-newline pairs so rules can span several lines.
    grammar = open("hive_create.lark").read().replace('\\\n', '')
    p2 = Lark(grammar, parser='lalr')

hive_create.lark:

`

// https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
start: create_statement

create_statement
: CREATE TEMPORARY? EXTERNAL? TABLE [IF NOT EXISTS] [db_name "."] table_name
["(" column_specification ("," column_specification)* constraint_specification? ")"]
[COMMENT table_comment]
[PARTITIONED BY "(" column_specification ("," column_specification)* ")"]
[CLUSTERED BY "(" col_name ("," col_name)* ")" [SORTED BY "(" col_name_and_order ("," col_name_and_order)* ")"]
INTO num_buckets BUCKETS]
[LOCATION hdfs_path]
[TBLPROPERTIES "(" property_name "=" property_value ("," property_name "=" property_value)* ")"]
";" | CREATE TEMPORARY? EXTERNAL? TABLE [IF NOT EXISTS] [db_name "."] table_name
LIKE existing_table_or_view_name
[LOCATION hdfs_path] ";" \

data_type
: primitive_type
| array_type
| map_type
| struct_type
| union_type

primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| DOUBLE PRECISION
| STRING
| BINARY
| TIMESTAMP
| DECIMAL
| DATE
| VARCHAR
| CHAR

array_type
: ARRAY "<" data_type ">"

map_type
: MAP "<" primitive_type "," data_type ">"

struct_type
: STRUCT "<" struct_type_col_spec ("," struct_type_col_spec)* ">"

struct_type_col_spec
: col_name ":" data_type [COMMENT col_comment]

union_type
: UNIONTYPE "<" data_type ("," data_type)* ">"

row_format
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char]

file_format
: SEQUENCEFILE
| TEXTFILE
| RCFILE
| ORC
| PARQUET
| AVRO
| JSONFILE \

column_constraint_specification
: PRIMARY KEY | UNIQUE | NOT NULL | DEFAULT [default_value] | CHECK [check_expression] ENABLE
| DISABLE NOVALIDATE [RELY | NORELY]

default_value
: LITERAL | CURRENT_USER "(" ")" | CURRENT_DATE "(" ")" | CURRENT_TIMESTAMP "(" ")" | NULL

constraint_specification
: ["," PRIMARY KEY "(" col_name ("," col_name)* ")" DISABLE NOVALIDATE (RELY|NORELY)
["," PRIMARY KEY "(" col_name ("," col_name)* ")" DISABLE NOVALIDATE (RELY|NORELY)]]
["," CONSTRAINT constraint_name FOREIGN KEY "(" col_name ("," col_name)* ")" REFERENCES table_name "(" col_name ("," col_name)* ")" DISABLE NOVALIDATE]
["," CONSTRAINT constraint_name UNIQUE "(" col_name ("," col_name)* ")" DISABLE NOVALIDATE (RELY|NORELY)]
["," CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE (RELY|NORELY)]

column_specification: col_name data_type [column_constraint_specification] [COMMENT col_comment]
col_name_and_order: col_name [ASC|DESC]
col_values_list: "(" col_values ("," col_values)* ")"
col_values: "(" col_value ("," col_value)* ")"

check_expression: "check_expression"

//// TERMINALS
existing_table_or_view_name: NAME
col_name: NAME
col_value: NAME | NUMBER_VALUE
constraint_name: NAME
db_name: NAME
hdfs_path: STRING_VALUE
property_name: NAME
table_name: NAME
col_comment: STRING_VALUE
property_value: STRING_VALUE
table_comment: STRING_VALUE
num_buckets: INT_VALUE
char: /[a-zA-Z]/

ARRAY: "ARRAY"i
AS: "AS"i
ASC: "ASC"i
AVRO: "AVRO"i
BIGINT: "BIGINT"i
BINARY: "BINARY"i
BOOLEAN: "BOOLEAN"i
BUCKETS: "BUCKETS"i
BY: "BY"i
CHAR: "CHAR"i
CHECK: "CHECK"i
CLUSTERED: "CLUSTERED"i
COLLECTION: "COLLECTION"i
COMMENT: "COMMENT"i
CONSTRAINT: "CONSTRAINT"i
CREATE: "CREATE"i
CURRENT_DATE: "CURRENT_DATE"i
CURRENT_TIMESTAMP: "CURRENT_TIMESTAMP"i
CURRENT_USER: "CURRENT_USER"i
DATE: "DATE"i
DECIMAL: "DECIMAL"i
DEFAULT: "DEFAULT"i
DEFINED: "DEFINED"i
DELIMITED: "DELIMITED"i
DESC: "DESC"i
DISABLE: "DISABLE"i
DOUBLE: "DOUBLE"i
ENABLE: "ENABLE"i
ESCAPED: "ESCAPED"i
EXISTS: "EXISTS"i
EXTERNAL: "EXTERNAL"i
FIELDS: "FIELDS"i
FLOAT: "FLOAT"i
FOREIGN: "FOREIGN"i
IF: "IF"i
INT: "INT"i
INTO: "INTO"i
ITEMS: "ITEMS"i
JSONFILE: "JSONFILE"i
KEY: "KEY"i
KEYS: "KEYS"i
LIKE: "LIKE"i
LINES: "LINES"i
LITERAL: "LITERAL"i
LOCATION: "LOCATION"i
MAP: "MAP"i
NORELY: "NORELY"i
NOT: "NOT"i
NOVALIDATE: "NOVALIDATE"i
NULL: "NULL"i
ORC: "ORC"i
PARQUET: "PARQUET"i
PARTITIONED: "PARTITIONED"i
PRECISION: "PRECISION"i
PRIMARY: "PRIMARY"i
RCFILE: "RCFILE"i
REFERENCES: "REFERENCES"i
RELY: "RELY"i
SEQUENCEFILE: "SEQUENCEFILE"i
SMALLINT: "SMALLINT"i
SORTED: "SORTED"i
STRING: "STRING"i
STRUCT: "STRUCT"i
TABLE: "TABLE"i
TBLPROPERTIES: "TBLPROPERTIES"i
TEMPORARY: "TEMPORARY"i
TERMINATED: "TERMINATED"i
TEXTFILE: "TEXTFILE"i
TIMESTAMP: "TIMESTAMP"i
TINYINT: "TINYINT"i
UNIONTYPE: "UNIONTYPE"i
UNIQUE: "UNIQUE"i
VARCHAR: "VARCHAR"i

STRING_VALUE : /[ubf]?r?('(?!'').*?(?<!\\)(\\\\)*?')/i

%import common.SIGNED_NUMBER -> NUMBER_VALUE
%import common.INT -> INT_VALUE
%import common.WS
%import common.CNAME -> NAME

COMMENT_VALUE: "--" /[^\n]/*

%ignore WS
%ignore COMMENT_VALUE

`

yjung-lyft, Dec 31 '19 00:12

@evandrocoan Why are you liking every single issue?

erezsh, Dec 31 '19 07:12

@yjung-lyft This happens because each optional grammar element (i.e. elem? or [elem]) is expanded into two alternatives of the same rule: one where the element is present and one where it is absent. When you line up many of them in a single rule, you get an exponential explosion of permutations: n optionals can produce up to 2^n alternatives.
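As a rough illustration of the expansion described above, the following pure-Python sketch enumerates the alternatives produced when every optional symbol is either kept or dropped. It is not Lark's actual implementation; `expand_optionals` and the `(name, is_optional)` representation are hypothetical names made up for this example.

```python
from itertools import product


def expand_optionals(symbols):
    """Expand a rule body into alternatives that contain no optional markers.

    `symbols` is a list of (name, is_optional) pairs. Each optional symbol
    is either kept or dropped, so n optionals give 2**n alternatives.
    """
    # A required symbol has one choice (keep it); an optional one has two
    # (keep it, or contribute nothing).
    choices = [((name,), ()) if optional else ((name,),)
               for name, optional in symbols]
    # Concatenate the chosen pieces of every combination into one tuple.
    return [sum(combo, ()) for combo in product(*choices)]


# A small rule in the style of the issue's grammar, with two optionals.
rule = [("CREATE", False), ("TEMPORARY", True), ("EXTERNAL", True),
        ("TABLE", False), ("table_name", False)]
alternatives = expand_optionals(rule)
for alt in alternatives:
    print(" ".join(alt))
print(len(alternatives))  # 4

# Ten optionals in one rule already give 1024 alternatives.
print(len(expand_optionals([(f"opt{i}", True) for i in range(10)])))  # 1024
```

The count doubles with every optional added to the same rule, which is consistent with the long load time and large serialized parser reported for `create_statement`.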

It's fair to consider this a bug, and perhaps it's about time to fix it.

But also, the solution for you is very simple: Break the rules apart so that only a few optionals exist in the same rule at the same time.
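A minimal sketch of what breaking the rules apart might look like, using a shortened fragment of `create_statement`. The helper rule names `create_head`, `table_ref` and `table_options` are hypothetical, and `expanded_alternatives` is only a crude estimate that counts `?` and `[` markers per rule and ignores nesting; it does not call Lark.

```python
import re

# Fragment in the style of the issue's grammar: one rule holding six optionals.
SINGLE_RULE = """
create_statement: CREATE TEMPORARY? EXTERNAL? TABLE [IF NOT EXISTS] [db_name "."] table_name [COMMENT table_comment] [LOCATION hdfs_path] ";"
"""

# The same fragment split so that no rule holds more than two optionals.
# create_head, table_ref and table_options are hypothetical helper rules.
SPLIT_RULES = """
create_statement: create_head table_ref table_options ";"
create_head: CREATE TEMPORARY? EXTERNAL? TABLE
table_ref: [IF NOT EXISTS] [db_name "."] table_name
table_options: [COMMENT table_comment] [LOCATION hdfs_path]
"""


def expanded_alternatives(grammar):
    """Estimate alternatives per rule: k optional markers give 2**k."""
    counts = {}
    for line in grammar.strip().splitlines():
        name, body = line.split(":", 1)
        optionals = len(re.findall(r"\?|\[", body))
        counts[name.strip()] = 2 ** optionals
    return counts


print(sum(expanded_alternatives(SINGLE_RULE).values()))  # 64
print(sum(expanded_alternatives(SPLIT_RULES).values()))  # 1 + 4 + 4 + 4 = 13
```

With the split, the total grows additively with the number of sub-rules rather than multiplicatively with the number of optionals. Note that the extra rules add extra nodes to the parse tree.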

erezsh, Dec 31 '19 07:12

Because they are good. And it is not every issue. From the last 10 I liked 5.

evandrocoan, Dec 31 '19 16:12

Thanks Erez for the hint! The workaround of splitting the rule into several smaller rules worked, and loading now takes ~2 secs. My own issue is resolved, but feel free to use this issue to track the bug fix you described. I also changed the title :)

yjung-lyft, Dec 31 '19 19:12