
Major optional performance boost suggestion when operating on url input

Open guylando opened this issue 1 year ago • 1 comments

This is relevant to the operation on url input.

socid-extractor sends a request to the url and only then tries to parse the response against its list of supported websites.

On the one hand, this makes it possible to handle generic platforms such as vBulletin, which can appear under different domains and urls. On the other hand, most if not all of the other supported websites have a specific domain/url, so for those there could be a check of whether the website is supported before sending the request, which avoids an unnecessary request for an unsupported website.

So by sacrificing support of vBulletin and adding a pre-request url support check, you get a major performance improvement.

To make it optional for those who do not want to sacrifice vBulletin, this can be dependent on a new flag.

For around 180 urls, of which 25 are supported, it can lower execution time from around 400 seconds to around 200 seconds.

However, for the url support check to work, each key in the dictionary of supported websites needs to contain some word that appears in the url, so the dictionary names also need to be fixed (or a domain property needs to be added for those websites which have a specific domain).
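The domain property alternative could look roughly like the sketch below. It is only an illustration: the `domains` key, the `SCHEMES` excerpt and the function name `url_matches_known_domain` are hypothetical and are not part of the current schemes.py. Matching is done on the parsed hostname rather than on the whole url, so a supported domain appearing in a path does not count.

```python
from urllib.parse import urlparse

# Hypothetical excerpt: the real schemes dictionary has no "domains" key.
SCHEMES = {
    "Telegram": {"domains": ["t.me"]},
    "Odnoklassniki": {"domains": ["ok.ru"]},
    "Linktree": {"domains": ["linktr.ee"]},
    "vBulletin": {},  # generic platform, no fixed domain to match on
}


def url_matches_known_domain(url, schemes=SCHEMES):
    """Return True if the url's host is, or is a subdomain of, a listed domain."""
    # hostname is None when the url has no scheme (e.g. "t.me/durov"),
    # so such input is treated as not matching.
    host = (urlparse(url).hostname or "").lower()
    for scheme_data in schemes.values():
        for domain in scheme_data.get("domains", []):
            if host == domain or host.endswith("." + domain):
                return True
    return False
```

With this approach generic platforms such as vBulletin simply have no `domains` entry, so they are never matched by the pre-request check.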

So the following needs to be added (as a temporary solution, without adding a domain property for every supported website which has a specific domain):

  1. in cli.py:

         def check_url_relevance(url):
             lowercaseUrl = url.lower()
             for scheme_name, scheme_data in schemes.items():
                 for name_part in scheme_name.lower().split():
                     if len(name_part) > 1 and name_part not in ['api', 'user', 'profile', 'group', 'page', 'file', 'html'] and name_part in lowercaseUrl:
                         return True
             return False

  2. in cli.py run method after "print(f'Analyzing URL {url}...')" put everything inside the following conditional check: if check_url_relevance(args.url):

  3. in schemes.py change dictionary keys:

         'Linktree' -> 'Linktree linktr.ee'
         'Odnoklassniki' -> 'Odnoklassniki ok.ru'
         'Habrahabr HTML (old)' -> 'Habrahabr HTML (old) habra'
         'Habrahabr JSON' -> 'Habrahabr JSON habra'
         'Telegram' -> 'Telegram t.me'

  4. add an optional parameter which triggers this behavior and which can be added to the "if check_url_relevance(args.url):" condition
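Put together, the four steps might look like the following self-contained sketch. It is not the actual cli.py: the flag name `--skip-unsupported`, the helpers `should_request` and `build_parser`, and passing the scheme names in as an argument are all assumptions made for illustration. The matching logic itself is the same as in step 1.

```python
import argparse

# Words that appear in scheme names but are too generic to identify a site.
STOP_WORDS = {"api", "user", "profile", "group", "page", "file", "html"}


def check_url_relevance(url, scheme_names):
    """Return True if any distinctive word of a scheme name occurs in the url."""
    lowercase_url = url.lower()
    for scheme_name in scheme_names:
        for name_part in scheme_name.lower().split():
            if (len(name_part) > 1
                    and name_part not in STOP_WORDS
                    and name_part in lowercase_url):
                return True
    return False


def should_request(url, scheme_names, skip_unsupported):
    """Decide whether to send the request at all.

    Without the (hypothetical) flag every url is requested, which keeps
    generic platforms such as vBulletin working as before.
    """
    return not skip_unsupported or check_url_relevance(url, scheme_names)


def build_parser():
    # Stand-in parser for illustration; the real cli.py defines its own.
    parser = argparse.ArgumentParser()
    parser.add_argument("--url")
    parser.add_argument(
        "--skip-unsupported",  # hypothetical flag name
        action="store_true",
        help="do not request urls that match no known scheme name",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    names = ["Telegram t.me", "Odnoklassniki ok.ru", "vBulletin"]
    if args.url and should_request(args.url, names, args.skip_unsupported):
        print(f"Analyzing URL {args.url}...")
```

Because this is substring matching on the whole url, it can produce false positives (a scheme word appearing anywhere in an unrelated url), which only costs one extra request. False negatives are the real risk, which is why the dictionary keys in step 3 need a word that is guaranteed to be in the url.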

guylando, Aug 28 '22 13:08

@guylando thank you for the good idea. Can you tell me a little bit more about your use case for socid-extractor? I assumed that anyone checking a large list of URLs would already have the HTTP responses anyway (unless the list contains random links).

Can you also make a draft PR with the proposed changes?

soxoj, Sep 11 '22 16:09