Exact matching is slow on local run of refparse
During my work on this PR I ran refparse locally against the 70 documents in the latest MSF scrape, using:

```shell
python -m policytool.refparse.refparse \
    --scraper-file "s3://datalabs-dev/reach-airflow/output/policy/parsed-pdfs/msf/parsed-pdfs-msf.json.gz" \
    --references-file "s3://datalabs-data/wellcome_publications/uber_api_publications.csv" \
    --model-file "s3://datalabs-data/reference_parser_models/reference_parser_pipeline.pkl" \
    --output-url "file://./tmp/parser-output/output_folder_name"
```
This took 504 seconds and found 10 doc-publication matches. Without these changes it took 8 seconds and found 2 doc-publication matches (reassuringly, both of these were also among the 10 full text search matches).
- I tried flashtext to make this quicker, but it didn't work for looking up whole sentences in large amounts of text - it only works for phrases of 1 or 2 words.
- I tried Whoosh to make this quicker, but it took 1774 seconds and found 7538 matches - there is probably a bug somewhere, but I didn't investigate any further.
- I started to try spaCy's PhraseMatcher (`pip install spacy` and `python -m spacy download en_core_web_sm`) but stopped; this might be a good starting point.
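For anyone picking this up, a minimal sketch of the PhraseMatcher approach might look like the following (spaCy v3 API; the titles and text here are made up - in refparse they would come from the publications CSV and the scraped documents):

```python
# Sketch: matching known publication titles against document text with
# spaCy's PhraseMatcher. Titles and text below are illustrative only.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # a blank pipeline is enough for exact phrase matching
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching

titles = [
    "Global tuberculosis report 2018",
    "Guidelines for the treatment of malaria",
]
# make_doc tokenises without running the full pipeline, which keeps this fast
matcher.add("TITLES", [nlp.make_doc(t) for t in titles])

doc = nlp.make_doc("See the Global Tuberculosis Report 2018 for details.")
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

PhraseMatcher builds a trie over the phrase tokens, so matching should scale much better than running a regex or substring search per title.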
We can potentially instrument this with cProfile to see what's eating up all the time while it's processing. I can have a look next week at running this wrapped with that to see where the time is being spent.
Good time to try Python 3.8's cProfile as a context manager?
https://docs.python.org/3/whatsnew/3.8.html#cprofile
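For reference, the 3.8 context-manager form is just `cProfile.Profile` used in a `with` block. A small sketch (the repeated-`re.sub` workload below is made up to stand in for `clean_text`):

```python
# Requires Python 3.8+: cProfile.Profile gained __enter__/__exit__ support.
import cProfile
import io
import pstats
import re

def clean_text_many_times():
    # Stand-in workload resembling repeated clean_text calls
    for _ in range(1000):
        re.sub(r"\s+", " ", "some   text with \n irregular whitespace")

with cProfile.Profile() as pr:
    clean_text_many_times()

# Print the top entries by cumulative time
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
print(s.getvalue())
```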
This looks like it's down to the use of regex in exact_match.py for substitutions: each time the clean_text method is called, it's recompiling the regular expressions. Regex isn't the fastest thing on the planet unfortunately, and it may be worth exploring different options for doing these substitutions if the speed of this is a major issue.
```python
string = re.sub(r"\n", " ", string)
string = re.sub(r"\s{1,}", " ", string)
string = re.sub(r"[^A-Za-z0-9 ]", "", string)
```
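One cheap option would be hoisting the compilation out of the function. A sketch of what that could look like (the `clean_text` name is from exact_match.py; the module-level pattern names are mine):

```python
import re

# Compile once at module level instead of passing pattern strings to re.sub
# on every call. re.sub with a string pattern does hit re's internal cache,
# but the cache lookup itself still costs something on hot paths.
_WHITESPACE = re.compile(r"\s+")
_NON_ALNUM = re.compile(r"[^A-Za-z0-9 ]")

def clean_text(string):
    # \n is itself whitespace, so one \s+ pass covers the first two
    # substitutions from the original
    string = _WHITESPACE.sub(" ", string)
    return _NON_ALNUM.sub("", string)

print(clean_text("Hello,\n  world! (2019)"))  # -> "Hello world 2019"
```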
As a side note: a (relatively easy) way to diagnose these things is using cProfile and kcachegrind.
Normally you can just import cProfile into the file that you want to run and set it up like so:

```python
import cProfile

def some_func():
    # Do things
    ...

if __name__ == "__main__":
    cProfile.run("some_func()", "output.pyprof")
```
which will output a profile for the run of `some_func()`.
Once you have that output file you can install pyprof2calltree and qcachegrind (a Qt UI on top of kcachegrind):

```shell
brew install qcachegrind
pip install pyprof2calltree
```
Then run `pyprof2calltree -i output.pyprof -k`, which will format the pyprof file and open it in the qcachegrind UI. You can then click through and see where time is being spent in the code's run.
In Python 3.8+ you can also run cProfile from the command line against a module endpoint (i.e. `python -m cProfile -m reach.refparse.refparse ...`), but module targeting for cProfile didn't land until 3.8, I think.
@jdu thanks for the analysis and info, I didn't expect it to be the regex!
@liz also worth noting that if you have a license for PyCharm or IntelliJ, there's a profiling GUI built on top of cProfile in the IDE, which you can use to run the command and get some output about where time is spent in the code. It's not as robust as qcachegrind's UI, but it does the job. I'm not sure whether VSCode has Python profiling, but it may have something similar if you're using that. If you're using vim/neovim (like me) then the above is probably the best route.
> each time the clean_text method is called, it's recompiling the regular expressions.
One note: I do think regex compilations are cached, per https://stackoverflow.com/questions/12514157/how-does-pythons-regex-pattern-caching-work. At least on Python 3.6, looking at the source, that's the case.
I still have the pyprof, so I had a quick look: the regex compilation only happens about 15 times across the life of the script, so it is caching. It may just come down to the cumulative time that repeated calls to `sub` take, given the size of the text strings and the number of them being iterated through.
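A quick way to sanity-check that compilation isn't the cost is to time string-pattern `re.sub` against a precompiled pattern's `sub` - if recompilation were happening on every call, the first would be far slower (the workload below is a made-up stand-in):

```python
# Compare re.sub with a pattern string vs a precompiled pattern. Because re
# caches compiled patterns, the two timings should be in the same ballpark;
# the remaining gap is just the per-call cache lookup overhead.
import re
import timeit

text = "some text with   irregular \n whitespace " * 200

def with_strings():
    re.sub(r"\s+", " ", text)

_pat = re.compile(r"\s+")

def with_compiled():
    _pat.sub(" ", text)

t1 = timeit.timeit(with_strings, number=2000)
t2 = timeit.timeit(with_compiled, number=2000)
print(f"string pattern: {t1:.3f}s, precompiled: {t2:.3f}s")
```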