python-seo-analyzer

Looping analyze() over multiple websites writes results to the same object

Open mayouf opened this issue 4 years ago • 5 comments

Hi, I want to analyze multiple websites by looping over a list and write the results to a JSON file.

I noticed that when we crawl 2 different websites and store the output in two different variables (let's say A and B), the second variable, B, gets A's results added to it... and so on for every further crawl.

It is as if analyze() writes to the same object!!

And it gets even weirder: when I delete A and B with del A, B, the analyze() function does not re-run, it recovers the old results from nowhere!!

I tried the %reset magic to erase the memory... but it still recovers the results from local memory!!!

Here is an example:

from seoanalyzer import analyze

A = analyze("https://krugerwildlifesafaris.com/")

# the length is 90
print(len(A['pages']))

B = analyze("http://www.vintage.co.bw/")

# the length is still 90
print(len(A['pages']))

# the length is 100, but it should be 10
print(len(B['pages']))

A has 90 pages and B should have only 10 pages, but B ends up with 100: the 90 from A plus its own 10.

How can I avoid this? Why this erratic behavior?
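My guess is that the results are accumulating in some shared, module-level state inside the library. A minimal, hypothetical sketch (not seoanalyzer's actual code) that reproduces exactly this pattern:

_page_cache = []  # module-level: shared by every call to fake_analyze()

def fake_analyze(url, n_pages):
    # each crawl appends its pages to the shared cache...
    _page_cache.extend('{}/page-{}'.format(url, i) for i in range(n_pages))
    # ...and the result is built from everything cached so far
    return {'pages': list(_page_cache)}

A = fake_analyze("https://site-a.example", 3)
B = fake_analyze("https://site-b.example", 1)
print(len(A['pages']))  # 3
print(len(B['pages']))  # 4 -- B carries A's pages, just like 90 + 10 above
# del A, B would not help either: the pages live in _page_cache, not in A or B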

regards,

karim.m

mayouf avatar Jan 01 '20 20:01 mayouf

Same problem guyz !

ghost avatar Jan 04 '20 09:01 ghost

I fixed the issue by doing this: go to the Manifest class in the implementation and look for the analyze method.

At the end of the method, just before return output, write: Manifest.clear_cache()

Everything will be cool!
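To make the shape of that fix concrete, here is a hypothetical sketch (the real Manifest class certainly looks different) of a class-level cache plus a clear_cache() classmethod called just before returning, so the next call starts from a clean state:

class Manifest:
    # class-level cache: shared by every instance and every call
    _cached_pages = []

    @classmethod
    def clear_cache(cls):
        # drop everything accumulated by previous runs
        cls._cached_pages.clear()

def analyze(url):
    Manifest._cached_pages.append(url)  # crawl results accumulate here
    output = {'pages': list(Manifest._cached_pages)}
    Manifest.clear_cache()  # the suggested one-liner, just before "return output"
    return output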

ghost avatar Jan 04 '20 09:01 ghost

Hi Ghezaielm,

Thanks for your quick feedback. In the meantime, I used another workaround, see below:

import os

for website in list_of_website:
    file_name = ...  # whatever file name you want
    command = 'seoanalyze {} -f json > "{}"'.format(website, file_name)
    returned_value = os.system(command)
    print(str(returned_value) + ' name= ' + file_name + ' ' + website)

And it is convenient if you want to parallelize your crawl using ThreadPoolExecutor.

I have an 8-core / 20-thread CPU, and it is damn fast... I crawled 20k websites in a few hours!!

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=80) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(analyze_SEO, url): url for url in list_website}
    # print(future_to_url)

    for future_url in concurrent.futures.as_completed(future_to_url):
        url_completed = future_to_url[future_url]
        try:
            data = future_url.result()
            if data is not None:
                print(data)
        except Exception as exc:
            print('%r generated an exception: %s' % (url_completed, exc))
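analyze_SEO here is my own small helper; roughly (this is just a sketch, the exact details may differ) it wraps the same seoanalyze command as above for a single URL:

import os

def analyze_SEO(url):
    # one JSON report file per site (hypothetical naming scheme)
    file_name = url.replace('https://', '').replace('http://', '').rstrip('/').replace('/', '_') + '.json'
    command = 'seoanalyze {} -f json > "{}"'.format(url, file_name)
    returned_value = os.system(command)
    print(str(returned_value) + ' name= ' + file_name + ' ' + url)
    return returned_value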

(PS: sorry, I did not know how to format the code blocks properly in a GitHub comment.)

mayouf avatar Jan 04 '20 15:01 mayouf

Did you submit the correction on GitHub?

mayouf avatar Jan 04 '20 15:01 mayouf

Ah, right. I'm putting this on my roadmap for v4.1. 👍

sethblack avatar Feb 01 '20 18:02 sethblack