extruct
Very slow extraction for specific string
I have one site with HTML strings where extraction is really slow (~60 seconds). I just call extruct.extract with this string:
https://pastebin.com/QJbUdaA6
Other strings take around 1-5 seconds. Does anybody have an idea what's wrong with this string? Is there something I can do?
Thank you all for working on this great python package!
@Schwankenson I didn't check the string yet, but what might help is restricting the supported dialects by passing a custom syntaxes argument to extruct.extract, in case you can afford that. Depending on the data you deal with, you might find that some dialects are very rare but take a long time to process, so it can make sense to disable them by default. For example, in one project we only use syntaxes=['microdata', 'opengraph', 'json-ld'], as they cover most kinds of semantic markup and are fast.
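A minimal sketch of the suggestion above, assuming extruct is installed. The helper name extract_fast and the constant FAST_SYNTAXES are made up for illustration; only the syntaxes argument and the three dialect names come from the comment.

```python
# Dialects named in the comment above as fast and broadly useful.
FAST_SYNTAXES = ["microdata", "opengraph", "json-ld"]


def extract_fast(html, base_url=None, syntaxes=FAST_SYNTAXES):
    """Run extruct on `html`, limited to the given dialects.

    Hypothetical wrapper: it only forwards to extruct.extract with a
    restricted `syntaxes` list, so slow dialects are skipped entirely.
    """
    import extruct  # third-party; imported lazily so the module loads without it

    return extruct.extract(html, base_url=base_url, syntaxes=list(syntaxes))
```

Calling extract_fast(html_string) should then return a dict with one key per requested dialect. Anything embedded only in a skipped dialect (for example microformat or rdfa) will not appear in the result, which is the trade-off mentioned above.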
@lopuhin Great, thank you! Limiting it to json-ld and microdata shortens time to below one second!
Glad it helped, and thanks for checking 👍 I'd rather keep the issue open to see if we can fix this or update defaults or README
For one html string, I waited 10 hours. Finally found out that the problem is just in 'microformat'. After skipping that format, it takes just 1 second.
Super helpful thank you - this was the case for me too.
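To find out which dialect is the slow one for a given page, as the last comments did by hand, one option is to time each dialect separately. This is a rough sketch: time_syntaxes is a hypothetical helper, the extractor is passed in as a callable (for example extruct.extract), and the dialect list is an assumption that may differ between extruct versions.

```python
import time

# Assumed dialect names; check the list against your installed extruct version.
CANDIDATE_SYNTAXES = ["microdata", "opengraph", "json-ld", "microformat", "rdfa"]


def time_syntaxes(html, extract, syntaxes=CANDIDATE_SYNTAXES):
    """Return {syntax: seconds} by calling `extract` once per dialect.

    `extract` is any callable accepting (html, syntaxes=[...]),
    such as extruct.extract.
    """
    timings = {}
    for syntax in syntaxes:
        start = time.perf_counter()
        extract(html, syntaxes=[syntax])  # result discarded; only the time matters
        timings[syntax] = time.perf_counter() - start
    return timings


# Example usage (requires extruct):
#   import extruct
#   timings = time_syntaxes(html_string, extruct.extract)
#   slowest = max(timings, key=timings.get)
```

Since a pathological page can take a very long time in one dialect, it may be worth running this in a separate process with a timeout rather than waiting for it to finish.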