authors: catastrophic backtracking in regex

Open jacquerie opened this issue 8 years ago • 3 comments

How to reproduce:

>>> from refextract import extract_references_from_string
>>> extract_references_from_string('G. W. and L. B. and M. M. G. and T. A. and E. L. I. and E. P. and X. M. and B. Urbaszek, Magneto-optics in transition metal diselenide monolayers. 2D Mater. 2, 34002 (2015).')

This hangs refextract for days, at least.
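To reproduce the hang without losing the interactive session, the call can be run in a child interpreter with a timeout. The following is a minimal sketch, not part of refextract: the helper name `finishes_within` and the 10-second limit are hypothetical choices made for illustration, and the snippet assumes refextract is installed in the same environment.

```python
import subprocess
import sys

# The reproduction from this issue, kept as source text so it can be
# handed to a separate interpreter process.
REPRO = """
from refextract import extract_references_from_string
extract_references_from_string('G. W. and L. B. and M. M. G. and T. A. and E. L. I. and E. P. and X. M. and B. Urbaszek, Magneto-optics in transition metal diselenide monolayers. 2D Mater. 2, 34002 (2015).')
"""


def finishes_within(snippet, seconds):
    """Run `snippet` in a child interpreter (hypothetical helper).

    Returns True if the child exits before the timeout, False if it had
    to be killed. The exit status is not checked, so a snippet that
    fails quickly (for example, a missing import) also returns True.
    """
    try:
        subprocess.run(
            [sys.executable, "-c", snippet],
            timeout=seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return False
    return True


if __name__ == "__main__":
    # Expected to report False on an affected refextract version.
    print(finishes_within(REPRO, 10))
```

A separate process is used because a thread stuck inside a single regex match cannot be cancelled from Python, whereas a child process can be killed.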

The reason appears to be catastrophic backtracking in this regex: https://github.com/inspirehep/refextract/blob/27588da5611f34266fd54fdbf8784814fffa0e7b/refextract/authors/regexs.py#L491-L494.
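For readers unfamiliar with the failure mode, here is a small illustration of catastrophic backtracking. The pattern below is a simplified, hypothetical one and is NOT the regex from `regexs.py`; it only mimics the general shape of "a repeated group of initials inside another repetition". When the overall match fails, the engine tries every way of splitting the run of initials between the inner and outer repetition, which roughly doubles the work for each extra initial.

```python
import re
import time

# Hypothetical pattern, not the actual refextract regex: the nested
# quantifier lets a run of initials be grouped in exponentially many ways.
AMBIGUOUS = re.compile(r'^(?:(?:[A-Z]\. ?)+)+$')

# Matches the same strings, but each initial can be consumed in only
# one way, so a failed match is detected in linear time.
UNAMBIGUOUS = re.compile(r'^(?:[A-Z]\. ?)+$')


def failing_input(n_initials):
    """Return n initials followed by a character that forces a failure."""
    return 'A. ' * n_initials + '!'


def time_match(pattern, text):
    """Return (matched, elapsed_seconds) for a single match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: the ambiguous pattern slows down quickly as n grows.
    for n in (10, 14, 18):
        text = failing_input(n)
        _, slow = time_match(AMBIGUOUS, text)
        _, fast = time_match(UNAMBIGUOUS, text)
        print(f"n={n}: ambiguous {slow:.4f}s, unambiguous {fast:.6f}s")
```

Exact timings depend on the machine and Python version. The general remedy is to rewrite the pattern so that no part of the input can be matched in more than one way; whether that is the fix eventually adopted for this issue is not stated in the thread.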

jacquerie avatar Apr 12 '17 14:04 jacquerie

The article that triggers the issue is arXiv:1704.00841; it should be reharvested once this is fixed.

david-caro avatar Apr 19 '17 16:04 david-caro

@tsgit are you by chance going to work on this issue in the near future? For the time being we have a workaround, but the approach you outlined in chat sounded way better than a workaround.

kaplun avatar Jun 15 '17 14:06 kaplun

@kaplun Yes, this is very high on my to-do list. Unfortunately it got pushed back by AAHEP, vacation, surgery and some other business. I'll get to it by next week!

tsgit avatar Jun 15 '17 19:06 tsgit