grammar files get opened an unnecessary number of times, causing an enormous loading time when creating a parser
it seems that, when using a `FromPackageLoader` object, a grammar file is opened and read from every time another grammar uses a rule imported from it. this means opening the same file over and over again, for each occurrence of a rule contained in that file.

while this may not be noticeable for parsers that only use grammar files contained in the same directory (meaning no custom `FromPackageLoader` is necessary), it becomes highly problematic when using many `FromPackageLoader`s, as the time required to construct a parser goes up by an absurd amount.
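(for context, a rough sketch of the kind of setup being described; all package and directory names below are made up:)

```python
# rough sketch of a multi-package setup; package/path names are placeholders
from lark import Lark
from lark.load_grammar import FromPackageLoader

# lets %import statements resolve against grammars shipped inside another package
shared_grammars = FromPackageLoader("my_shared_package", ("grammars",))

parser = Lark.open_from_package(
    "my_app",            # package containing main.lark
    "main.lark",
    ("grammars",),       # sub-directory of my_app that holds the grammar
    import_paths=[shared_grammars],
)
```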
by placing a `print(resource_name)` in the `get_data()` function of the python lib `pkgutil.py`, i was able to count how many times each of my grammar files was loaded. for example, the `common.lark` grammar provided by lark gets opened 61 (!) times, one of my own grammars 25 times, another 16, etc.
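(for reference, the same count can be taken without editing `pkgutil.py`, by wrapping `pkgutil.get_data` before lark is used; the parser-construction line is a placeholder:)

```python
# count how often each grammar resource is read via pkgutil.get_data
from collections import Counter
import pkgutil

_counts = Counter()
_orig_get_data = pkgutil.get_data

def counting_get_data(package, resource):
    _counts[(package, resource)] += 1
    return _orig_get_data(package, resource)

pkgutil.get_data = counting_get_data

# ... construct the parser(s) here, e.g. with Lark.open_from_package(...) ...

for (package, resource), n in _counts.most_common():
    print(f"{package}:{resource} read {n} times")
```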
Can you provide an example grammar (or grammars)? lark should not do more than one import of `common.lark` per grammar file.
i have 46 grammar files, some of which don't use `common.lark`.
> lark should not do more than one import of `common.lark` per grammar file.

do you mean the `common.lark` file gets opened and read from once per grammar file? wouldn't it be better to cache the grammar once and for all?
> Can you provide an example grammar (or grammars)?
let me cook smth up so that i don't have to link my whole project lol
> wouldn't it be better to cache the grammar once and for all?
Parsing/loading the grammar file is almost certainly not the problem. Also, consider these grammars:
```
// common_rules.lark
name: CNAME
%import common.CNAME

// a.lark
start_a: "A:" name
%import common_rules.name

// b.lark
start_b: "B:" name
%import common_rules.name

// main.lark
start: start_a | start_b
%import a.start_a
%import b.start_b
```
when you now parse `"A:hello"`, you get the tree `Tree('start', [Tree('start_a', [Tree('a__name', [Token('a__common_rules__CNAME', 'hello')])])])`. Notice how the import path (`main` (no prefix) imports `a`, which imports `common_rules`) is included in the rule and token names. This would be quite hard if we cached the parsed grammar, for almost no benefit, because parsing is so fast.
The slowdown you have is because of the many, many rules, which is hard to avoid since the prefixes create all those new rules (e.g. `a__name` and `b__name` are completely distinct rules, each analyzed on its own).
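(for reference, saving the four snippets above as `common_rules.lark`, `a.lark`, `b.lark` and `main.lark` in one directory, the behaviour can be reproduced with a minimal sketch like this:)

```python
# minimal reproduction; assumes the four grammar files sit in the working directory
from lark import Lark

parser = Lark.open("main.lark", start="start")
print(parser.parse("A:hello"))
# Tree('start', [Tree('start_a', [Tree('a__name', [Token('a__common_rules__CNAME', 'hello')])])])
```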
i get that (i used to merge transformers manually before `merge_transformers` existed, so i'm familiar with the whole naming of imported rules). what i'm questioning is not the parsing of the grammar each time, rather the opening of the file containing the grammar each time. or am i completely mistaken in thinking what's slowing down the construction of the parser is the opening and closing of the files (we're talking literally 100+ file openings)?
Parsing the lark grammar is done using LALR, so unless it's thousands of lines, I don't think it will be noticeable.
However, reading the file again and again might be noticeable on an extremely slow filesystem, such as NFS.
> or am i completely mistaken in thinking what's slowing down the construction of the parser is the opening and closing of the files (we're talking literally 100+ file openings)?

Probably, although I can't be 100% sure. You can add a simple print/timing statement in `Lark.__init__` near line 300, after `load_grammar`. At that point all file access (except cache) has been done.
> You can add a simple print/timing statement in `Lark.__init__` near line 300, after `load_grammar`. At that point all file access (except cache) has been done.
how will that help me distinguish between the opening of the files and the parsing of the grammars though? at that point both have been done if i'm not mistaken
@ornariece I would expect that whole operation to be the fast part and the slow part to come afterwards (when compiling the grammar).
i just measured. in my case, the part after it takes less time in each call to `Lark.__init__`. notably, for the call that takes the most time of all, i get 0.78s of loading and 0.55s of compiling.
so yea, opening the files just one time each would probably reduce that loading time to at most hundredths of a second
@ornariece Can you check if the loading part is the slow stuff by adding

```python
from functools import lru_cache
import pkgutil

pkgutil.get_data = lru_cache(pkgutil.get_data)
```

before `import lark`?
(This should at the very least remove all file access for `common.lark`.)
do you mean `lru_cache()(pkgutil.get_data)`?
For me it works without `()`, but yes. (probably python version differences)
ah yes i was using 3.7. well, no notable difference noticed... so it's mostly parsing time. well i guess if i want to avoid that i should limit my use of imported grammars...
@ornariece Can you provide the full project? It is not impossible to cache the parsing, but I would like to do the profiling myself.
But also note that `cache=True` for the whole `Lark` instance should be really fast either way.
> @ornariece Can you provide the full project? It is not impossible to cache the parsing, but I would like to do the profiling myself.

i can't do that, unless you're willing to sign an NDA :/
Ok, I will try to provide a patch without the project for the moment.
> But also note that `cache=True` for the whole `Lark` instance should be really fast either way.

i've tried that before, without seeing any improvement or a cache file
> i've tried that before, without seeing any improvement or a cache file

That, on the other hand, seems like a bug. The cache file will get stored in your `temp` folder. Can you try to manually specify a cache file name (e.g. `cache="grammar_cache.tmp"`)?
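(a sketch of that suggestion with `open_from_package`; package/grammar names are placeholders, and note that lark documents the `cache` option as LALR-only:)

```python
from lark import Lark

parser = Lark.open_from_package(
    "my_package",               # placeholder package name
    "main.lark",                # placeholder grammar file
    parser="lalr",              # cache is documented as LALR-only
    cache="grammar_cache.tmp",  # explicit cache file instead of the temp folder
)
```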
that creates the file alright, but i can't notice any measurable improvement.
(i'm using `open_from_package` to create the parser, maybe that matters, although from what i've seen it shouldn't)
Iirc, in order to detect changes in imported files, we end up reading and parsing all of the grammars, even when `cache=True`.
@erezsh But only once. I would be surprised if that would be as slow as reading and parsing everything repeatedly. (And yes, `open_from_package` shouldn't matter.)
@ornariece Can you try this branch? `pip install git+https://github.com/MegaIng/lark.git@load_files_once`
no noticeable change still. i checked that the change you made in `load_grammar` is indeed reached, and it is.
so yea, i think we can be pretty certain now that it's the parsing of each imported library that's eating up the time. i didn't expect it to take that much time for such small grammars tbh (390 lines in total). i get the trickiness of handling imported grammars without re-parsing them, but is there hope you'll go down that path at some point haha?
@ornariece That branch should also only parse each grammar once.
@ornariece Could you use something like this:
```python
from functools import lru_cache
from time import time

def log_wrap(f):
    i = 0
    def wraps(*args, **kwargs):
        nonlocal i
        i += 1
        print(f"{i:3}: {f} called with {args=} and {kwargs=}")
        start = time()
        r = f(*args, **kwargs)
        end = time()
        print(f"{i:3}: {f} took {end - start}s")
        return r
    return wraps

import pkgutil
import io
io.open = log_wrap(io.open)
pkgutil.get_data = log_wrap(pkgutil.get_data)

from lark import Lark
from lark import load_grammar
load_grammar._parse_grammar = log_wrap(load_grammar._parse_grammar)
```
And try to figure out if any of that takes long?
> @ornariece That branch should also only parse each grammar once.
ok now i'm confused
> And try to figure out if any of that takes long?

no function is taking significantly long. what's curious though is that many calls to `get_data` don't output their elapsed time. there are 970 calls to `get_data`, and only 235 prints of the elapsed time (counting all functions)
@ornariece That is expected, since not all calls to `get_data` succeed (they throw `FileNotFound` errors).
Can you do a full-on profile of the entire application and look at which functions take the most time?
I would suggest https://nejc.saje.info/pstats-viewer.html or https://jiffyclub.github.io/snakeviz/ to look at the results.
It might be `deepcopy` taking a lot of time.
Why not just use the regular profile module? Sort by cumulative time and you should see who's at fault here.
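(a minimal sketch of that, profiling only the parser construction and sorting by cumulative time; package/grammar names are placeholders:)

```python
import cProfile
import pstats

from lark import Lark

pr = cProfile.Profile()
pr.enable()
parser = Lark.open_from_package("my_package", "main.lark")  # placeholder names
pr.disable()

# sort by cumulative time to see which phase of parser construction dominates
pstats.Stats(pr).sort_stats("cumulative").print_stats(30)
# pstats.Stats(pr).dump_stats("parser.prof")  # then inspect with: snakeviz parser.prof
```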
i've already done that actually (using snakeviz). but here i'm talking about the time the application spends initializing in the imports (i.e. including creating the parsers). i've analyzed the import times using https://pypi.org/project/importtime-waterfall/, and what comes out of it is that creating the parsers is ~100% of it. but wait, let me analyze the creation of the parsers specifically