Very slow extraction for specific string

Open Schwankenson opened this issue 2 years ago • 6 comments

I have one site with HTML strings, where I have really slow extraction times (~60 seconds). I just call extruct.extract with this string:

https://pastebin.com/QJbUdaA6

Other strings take around 1-5 seconds. Does anybody have an idea what's wrong with this string? Is there something I can do?

Thank you all for working on this great python package!

Apr 18 '22 14:04 Schwankenson

@Schwankenson I haven't checked the string yet, but what might help is restricting the supported dialects by passing a custom syntaxes argument to extruct.extract, if you can afford that. Depending on the data you deal with, you might find that some dialects are very rare but take a long time to process, so it can make sense to disable them by default. For example, in one project we only use syntaxes=['microdata', 'opengraph', 'json-ld'], as they cover most kinds of semantic markup and are fast.
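A minimal sketch of the suggestion above, assuming extruct is installed. The names FAST_SYNTAXES and extract_fast are hypothetical and exist only for this illustration; the syntax list is the one quoted in the comment, not an official recommendation.

```python
# Dialects quoted in the comment above as covering most markup while staying fast.
FAST_SYNTAXES = ["microdata", "opengraph", "json-ld"]


def extract_fast(html, base_url=None, syntaxes=FAST_SYNTAXES):
    """Run extruct.extract with a restricted list of syntaxes.

    Hypothetical helper: it only forwards its arguments to extruct.extract.
    """
    import extruct  # third-party; imported lazily so the module loads without it

    return extruct.extract(html, base_url=base_url, syntaxes=list(syntaxes))


if __name__ == "__main__":
    sample = "<html><head><title>Example</title></head><body></body></html>"
    data = extract_fast(sample)
    # The result is a dict keyed by syntax name.
    print(sorted(data))
```

Whether this subset is acceptable depends on which kinds of markup you actually need from your pages.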

Apr 18 '22 15:04 lopuhin

@lopuhin Great, thank you! Limiting it to json-ld and microdata shortens time to below one second!

Apr 18 '22 15:04 Schwankenson

Glad it helped, and thanks for checking 👍 I'd rather keep the issue open to see whether we can fix this, update the defaults, or update the README.

Apr 18 '22 15:04 lopuhin

For one html string, I waited 10 hours. Finally found out that the problem is just in 'microformat'. After skipping that format, it takes just 1 second.
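One way to find out which dialect is responsible is to time each syntax separately. This is only a sketch: time_syntaxes and ALL_SYNTAXES are hypothetical names, and the syntax names listed are an assumption that should be checked against the installed extruct version.

```python
import time

# Assumed syntax names; verify against the extruct version you have installed.
ALL_SYNTAXES = ["microdata", "opengraph", "json-ld", "microformat", "rdfa", "dublincore"]


def time_syntaxes(html, syntaxes=ALL_SYNTAXES, extract=None):
    """Return {syntax name: seconds spent} by extracting one syntax at a time.

    `extract` defaults to extruct.extract; any callable that accepts
    (html, syntaxes=[...]) can be passed instead, e.g. for testing.
    """
    if extract is None:
        import extruct  # third-party; only needed when no callable is supplied

        extract = extruct.extract

    timings = {}
    for name in syntaxes:
        start = time.perf_counter()
        try:
            extract(html, syntaxes=[name])
        except Exception as exc:  # keep going so the other syntaxes still get timed
            print(f"{name}: failed with {exc!r}")
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    html = "<html><head><title>Example</title></head><body></body></html>"
    for name, seconds in time_syntaxes(html).items():
        print(f"{name}: {seconds:.3f}s")
```

The loop has no timeout, so a pathological input will still block on the slow syntax; the last name printed before it stalls is the likely culprit.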

Jul 08 '22 07:07 sitems

> For one html string, I waited 10 hours. Finally found out that the problem is just in 'microformat'. After skipping that format, it takes just 1 second.

Super helpful, thank you. This was the case for me too.

Aug 22 '22 07:08 azcarraga