
grammar files get opened an unnecessary number of times, causing an enormous loading time when creating a parser

ornariece opened this issue on Sep 12, 2021 • 72 comments

it seems that, when using a FromPackageLoader object, a grammar file is opened and read each time another grammar uses a rule imported from that grammar. this means opening the same file over and over again, for each occurrence of a rule contained in that file.

while this may not be noticeable for parsers that only use grammar files contained in the same directory (meaning no custom FromPackageLoader is necessary), it becomes highly problematic when using many FromPackageLoaders, as the time required to construct a parser goes up by an absurd amount.

by placing a print(resource_name) in the get_data() function of the python lib pkgutil.py, i was able to count how many times each of my grammar files was loaded. for example, the common.lark grammar provided by lark gets opened 61 (!) times, one of my own grammars 25 times, another 16, etc.
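for reference, this is roughly the kind of setup i'm describing (package, directory and grammar names here are made up):

from lark import Lark
from lark.load_grammar import FromPackageLoader

# made-up package/directory names; each loader points at grammars shipped in another package
loader_a = FromPackageLoader("package_a", ("grammars",))
loader_b = FromPackageLoader("package_b", ("grammars",))

# every %import in main.lark (and in the grammars it imports) goes through these loaders,
# which is where the repeated pkgutil.get_data calls come from
parser = Lark.open_from_package(
    "my_package", "main.lark", ("grammars",),
    import_paths=[loader_a, loader_b],
)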

ornariece avatar Sep 12 '21 15:09 ornariece

Can you provide an example grammar (or grammars)? lark should not do more than one import of common.lark per grammar file.

MegaIng avatar Sep 12 '21 15:09 MegaIng

i have 46 grammar files, some of which don't use common.lark.

lark should not do more than one import of common.lark per grammar file.

do you mean the common.lark file gets opened and read from once per grammar file? wouldn't it be better to cache the grammar once and for all?

Can you provide an example grammar (or grammars)?

let me cook smth up so that i don't have to link my whole project lol

ornariece avatar Sep 12 '21 15:09 ornariece

wouldn't it be better to cache the grammar once and for all?

Parsing/loading the grammar file is almost certainly not the problem. Also, consider these grammars:

// common_rules.lark
name: CNAME

%import common.CNAME

// a.lark

start_a: "A:" name

%import common_rules.name


// b.lark

start_b: "B:" name

%import common_rules.name


// main.lark

start: start_a | start_b

%import a.start_a
%import b.start_b

when you now parse "A:hello", you get the tree Tree('start', [Tree('start_a', [Tree('a__name', [Token('a__common_rules__CNAME', 'hello')])])]). Notice how the import path (main, which has no prefix, imports a, which imports common_rules) is included in the rule and token names. This would be quite hard if we cached the parsed grammar, for almost no benefit, because parsing is so fast.

The slowdown you have is because of many, many rules, which is hard to avoid since the prefixes create all those new rules (e.g. a__name and b__name are completely distinct rules, each analyzed on its own).
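To see that, load main.lark and parse (a rough sketch, assuming the four files above sit next to each other):

from lark import Lark

# main.lark imports a.lark and b.lark, which both import common_rules.lark
parser = Lark.open("main.lark")
tree = parser.parse("A:hello")
print(tree)  # the rule/token names carry the a__ / a__common_rules__ prefixes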

MegaIng avatar Sep 12 '21 16:09 MegaIng

i get that (i used to merge transformers manually before merge_transformers existed, so i'm familiar with the whole naming of imported rules). what i'm questioning is not the parsing of the grammar each time, but rather the opening of the file containing the grammar each time. or am i completely mistaken in thinking that what's slowing down the construction of the parser is the opening and closing of the files (we're talking literally 100+ file openings)?

ornariece avatar Sep 12 '21 16:09 ornariece

Parsing the lark grammar is done using LALR, so unless it's thousands of lines, I don't think it will be noticeable.

However, reading the file again and again might be noticeable on an extremely slow filesystem, such as NFS.

erezsh avatar Sep 12 '21 16:09 erezsh

or am i completely mistaken in thinking what's slowing down the construction of the parser is the opening and closing of the files (we're talking literally 100+ openings of file)?

Probably, although I can't be 100% sure. You can add a simple print/timing statement in Lark.__init__ near line 300, after load_grammar. At that point all file access (except cache) has been done.

MegaIng avatar Sep 12 '21 16:09 MegaIng

You can add a simple print/timing statement in Lark.__init__ near line 300, after load_grammar. At that point all file access (except cache) has been done.

how will that help me distinguish between the opening of the files and the parsing of the grammars though? at that point both have been done if i'm not mistaken

ornariece avatar Sep 12 '21 16:09 ornariece

@ornariece I would expect that whole operation to be the fast part and the slow part to come afterwards (when compiling the grammar).

MegaIng avatar Sep 12 '21 16:09 MegaIng

i just measured. in my case, the part after it takes less time in each call of Lark.__init__. notably, for the call that takes the most time of all, i get 0.78s of loading and 0.55s of compiling. so yea, opening the files just one time each would probably reduce that loading time to at most a few hundredths of a second

ornariece avatar Sep 12 '21 16:09 ornariece

@ornariece Can you check whether the loading part is the slow part by adding

from functools import lru_cache
import pkgutil
pkgutil.get_data = lru_cache(pkgutil.get_data)

before import lark?

(This should at the very least remove all file access for common.lark)

MegaIng avatar Sep 12 '21 16:09 MegaIng

do you mean lru_cache()(pkgutil.get_data)?

ornariece avatar Sep 12 '21 16:09 ornariece

For me it works without (), but yes. (probably python version differences)
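A form that should work on 3.7 and 3.8+ alike (passing maxsize explicitly):

from functools import lru_cache
import pkgutil

# cache the raw grammar bytes so repeated imports don't hit the filesystem again;
# as suggested above, do this before importing lark
pkgutil.get_data = lru_cache(maxsize=None)(pkgutil.get_data)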

MegaIng avatar Sep 12 '21 16:09 MegaIng

ah yes i was using 3.7. well, no notable difference noticed... so it's mostly parsing time. well i guess if i want to avoid that i should limit my use of imported grammars...

ornariece avatar Sep 12 '21 17:09 ornariece

@ornariece Can you provide the full project? It is not impossible to cache the parsing, but I would like to do the profiling myself.

MegaIng avatar Sep 12 '21 17:09 MegaIng

But also note that cache=True for the whole lark instance should be really fast either way.

MegaIng avatar Sep 12 '21 17:09 MegaIng

@ornariece Can you provide the full project? It is not impossible to cache the parsing, but I would like to do the profiling myself.

i can't do that, unless you're willing to sign a NDA :/

ornariece avatar Sep 12 '21 17:09 ornariece

Ok, I will try to provide a patch without the project for the moment.

MegaIng avatar Sep 12 '21 17:09 MegaIng

But also note that cache=True for the whole lark instance should be really fast either way.

i've tried that before, without seeing any improvement nor cache file

ornariece avatar Sep 12 '21 17:09 ornariece

i've tried that before, without seeing any improvement nor cache file

That, on the other hand, seems like a bug. The cache file will get stored in your temp folder. Can you try to manually specify a cache file name (e.g. cache="grammar_cache.tmp")?
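For example (package/grammar names are placeholders; iirc the cache option needs the LALR parser):

from lark import Lark

# placeholder names; the point is passing an explicit file name instead of cache=True
parser = Lark.open_from_package(
    "my_package", "main.lark", ("grammars",),
    parser="lalr", cache="grammar_cache.tmp",
)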

MegaIng avatar Sep 12 '21 17:09 MegaIng

that creates the file alright, but i can't notice any measurable improvement. (i'm using open_from_package to create the parser, maybe that matters, although from what i've seen it shouldn't)

ornariece avatar Sep 12 '21 17:09 ornariece

Iirc, in order to detect changes in imported files, we end up reading and parsing all of the grammars, even when cache=True.

erezsh avatar Sep 12 '21 18:09 erezsh

@erezsh But only once. I would be surprised if that were as slow as reading and parsing everything repeatedly. (And yes, open_from_package shouldn't matter)

MegaIng avatar Sep 12 '21 18:09 MegaIng

@ornariece Can you try this branch? pip install git+https://github.com/MegaIng/lark.git@load_files_once

MegaIng avatar Sep 12 '21 18:09 MegaIng

no noticeable change still. i checked that the change you made in load_grammar is indeed reached, and it is. so yea, i think we can be pretty certain now that it's the parsing of each imported library that's eating up time. i didn't expect it to take that much time for such small grammars tbh (390 lines in total). i get the trickiness of handling imported grammars without re-parsing them, but is there hope you'll go down that path at some point haha?

ornariece avatar Sep 12 '21 21:09 ornariece

@ornariece That branch should also only parse each grammar once.

MegaIng avatar Sep 12 '21 21:09 MegaIng

@ornariece Could you use something like this:

from time import time


def log_wrap(f):
    # number, log and time every call to f
    i = 0

    def wrapper(*args, **kwargs):
        nonlocal i
        i += 1
        print(f"{i:3}: {f} called with args={args} and kwargs={kwargs}")  # plain formatting so it also runs on 3.7
        start = time()
        r = f(*args, **kwargs)
        end = time()
        print(f"{i:3}: {f} took {end - start}s")
        return r

    return wrapper


import pkgutil
import io

# patch the file-access entry points before lark is imported
io.open = log_wrap(io.open)
pkgutil.get_data = log_wrap(pkgutil.get_data)

from lark import Lark

from lark import load_grammar
load_grammar._parse_grammar = log_wrap(load_grammar._parse_grammar)

And try to figure out if any of that takes long?

MegaIng avatar Sep 12 '21 21:09 MegaIng

@ornariece That branch should also only parse each grammar once.

ok now i'm confused

And try to figure out if any of that takes long?

no function is taking significantly long. what's curious though is that many calls to get_data don't output their elapsed time. there are 970 calls to get_data, and only 235 prints of elapsed time (counting all functions)

ornariece avatar Sep 12 '21 22:09 ornariece

@ornariece That is expected, since not all calls to get_data succeed (they raise FileNotFoundError).

Can you do a full-on profile of the entire application and look at which functions take the most time?

I would suggest https://nejc.saje.info/pstats-viewer.html or https://jiffyclub.github.io/snakeviz/ to look at the results.

It might be that deepcopy takes a lot of time.

MegaIng avatar Sep 12 '21 22:09 MegaIng

Why not just use the regular profile module? Sort by cumulative time and you should see who's at fault here.
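Something like (the grammar name is a placeholder for however the parsers get constructed):

import cProfile
import pstats

from lark import Lark

profiler = cProfile.Profile()
profiler.enable()
parser = Lark.open("main.lark")  # placeholder: build the parser(s) here
profiler.disable()

# sort by cumulative time to see which call tree dominates parser construction
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)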

erezsh avatar Sep 12 '21 22:09 erezsh

i've already done that actually (using snakeviz). but here i'm talking about the time the application spends initializing in the imports (i.e. including creating the parsers). i've analyzed the import times using https://pypi.org/project/importtime-waterfall/, and what comes out of it is that creating the parsers is ~100% of it. but wait, let me analyze the creation of the parsers specifically

ornariece avatar Sep 12 '21 22:09 ornariece