
scraper_main -> TypeError: object of type 'NoneType' has no len()

Open carbbin opened this issue 1 year ago • 6 comments

Hi!

I am working with python 3.11.7.

I created a virtual environment, and after hitting the error I also installed lower versions of beautifulsoup4==4.12.2 and bs4==0.0.1 to check whether that was the cause.

It creates the LOGS folder for me, and I already have the NLTK_DATA folder.

What could be causing the error?

The error: ##################

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[INFO]12:20:32 Built LOG folder for session
[INFO]12:20:32 https://link.springer.com/search/page/ start_url has been received
[INFO]12:20:32 https://link.springer.com/search/page/0?facet-content-type="Article"&query=Western+Ghats+Conservation&facet-language="En" has been obtained
Traceback (most recent call last):
  File "c:\Users\vscode\pyresearch\data.py", line 16, in <module>
    scraper_main(keywords_to_search, abstracts_log_name, status_logger_name)
  File "c:\Users\vscode\pyresearch\.venv\Lib\site-packages\pyResearchInsights\Scraper.py", line 396, in scraper_main
    urls_to_scrape = url_generator(start_url, query_string, status_logger_name)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\vscode\pyresearch\.venv\Lib\site-packages\pyResearchInsights\Scraper.py", line 65, in url_generator
    test_soup = bs(url_reader(total_url, status_logger_name), 'html.parser')
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\vscode\pyresearch\.venv\Lib\site-packages\bs4\__init__.py", line 315, in __init__
    elif len(markup) <= 256 and (

##################
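The failing line suggests that `url_reader(total_url, status_logger_name)` returned `None` (presumably because the page fetch failed), and BeautifulSoup then called `len()` on that `None` markup. A minimal sketch of the same failure mode, with no bs4 dependency:

```python
# bs4's BeautifulSoup.__init__ evaluates `len(markup) <= 256`, so handing
# it None (e.g. a page fetch that returned nothing) raises this TypeError.
markup = None  # stand-in for a url_reader() call that came back empty

try:
    len(markup)
except TypeError as error:
    print(error)  # object of type 'NoneType' has no len()
```

In other words, the traceback points at bs4 but the root cause is upstream: whatever fetched the page produced no HTML.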

The code: ##################

from pyResearchInsights.common_functions import pre_processing
from pyResearchInsights.Scraper import scraper_main

'''Abstracts containing these keywords will be queried from Springer'''
keywords_to_search = "Western Ghats Conservation"

'''Calling the pre_processing functions here so that abstracts_log_name and status_logger_name are available across the code.'''
abstracts_log_name, status_logger_name = pre_processing(keywords_to_search)

'''Runs the scraper here to scrape the details from the scientific repository'''
scraper_main(keywords_to_search, abstracts_log_name, status_logger_name)

##################

carbbin avatar Mar 21 '24 11:03 carbbin

Hi @carbbin, can you post the full traceback? It looks like you've cut off the error right at the start of the actual error description.

Thanks!

SarthakJShetty avatar Mar 22 '24 14:03 SarthakJShetty

Hi @SarthakJShetty,

Yes sorry for that. Here it is:

File "c:\Users\.venv\Lib\site-packages\bs4\__init__.py", line 315, in __init__
    elif len(markup) <= 256 and (
         ^^^^^^^^^^^
TypeError: object of type 'NoneType' has no len()

sustianovich avatar Mar 25 '24 13:03 sustianovich

Thank you for reporting this error. Indeed, it looks like there have been site-wide changes at Elsevier that are preventing page retrieval. I fear it may no longer be possible to retrieve the HTML at all. I will try some other ways to retrieve the HTML and get back to you on this. This was also reported in #11
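Whether the retrieval itself is blocked can be checked outside the package. A hedged diagnostic sketch (the `probe` helper below is hypothetical, not part of pyResearchInsights), assuming the publisher may reject the default Python user agent or return an error status:

```python
import urllib.request
import urllib.error

def probe(url, timeout=10):
    """Return (HTTP status, first bytes of body), or (None, error detail)
    if the page could not be retrieved at all."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0"}  # some sites block Python's default UA
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read(200)
    except urllib.error.URLError as error:  # also catches HTTPError (403, 404, ...)
        return None, str(error)
```

If `probe(start_url)` returns `(None, ...)` or a non-200 status, the failure is in page retrieval rather than in the parsing code, which would match the `NoneType` error above.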

SarthakJShetty avatar Mar 25 '24 20:03 SarthakJShetty

Ok, thank you very much @SarthakJShetty!

sustianovich avatar Mar 26 '24 17:03 sustianovich

Hey @SarthakJShetty, did you manage to retrieve the HTML a different way, and if so, are you planning to implement it? I'm looking for a tool like yours at the moment...

SebastianLeimbacher avatar May 17 '24 12:05 SebastianLeimbacher

Hi @SebastianLeimbacher! Thank you for trying out pyResearchInsights. I'll take a look today and try to get back to you. Apologies for the delay; the situation looks to be a bit trickier than I anticipated at first :sweat:

SarthakJShetty avatar May 19 '24 13:05 SarthakJShetty

Hi @SarthakJShetty Do you have any update?

devans18 avatar May 30 '24 16:05 devans18

Hi @devans18, @SebastianLeimbacher, and @sustianovich,

Sorry for the delay, but I've finally figured this out. I will build and post a new package in a few hours and get back on this issue with an update.

SarthakJShetty avatar Jun 28 '24 18:06 SarthakJShetty

Thank you for being patient. The v1.60 release should solve this issue. Feel free to reopen this issue if you still run into it.

SarthakJShetty avatar Jun 28 '24 20:06 SarthakJShetty