LangChain - Perform Web scraping

Open FlorentLvr opened this issue 2 years ago • 17 comments

This notebook scrapes content from the web and runs an LLM over it. It is useful for organizations looking to break through and achieve their goals.

FlorentLvr avatar Oct 08 '23 07:10 FlorentLvr

🚀 Branch and template have been created and pushed. You should work on:

FlorentLvr avatar Oct 08 '23 07:10 FlorentLvr

I am a developer and I would love to work on this issue. Please assign it to me.

Mohitraut07 avatar Oct 08 '23 07:10 Mohitraut07

Hi @Mohitraut07, I am glad you want to contribute! Please follow the instructions in the awesome-notebooks README.md to start contributing: https://github.com/jupyter-naas/awesome-notebooks/blob/master/README.md#how-to-contribute Let me know if you have any questions! 🙏 Cheers!

FlorentLvr avatar Oct 08 '23 08:10 FlorentLvr

@Mohitraut07, just checking in! I didn't receive your application: https://bit.ly/3F8Jsjr Let me know if you have any questions.

FlorentLvr avatar Oct 10 '23 09:10 FlorentLvr

@Mohitraut07, just checking in. Is everything okay?

FlorentLvr avatar Oct 16 '23 07:10 FlorentLvr

Hi @FlorentLvr, I want to work on this issue. Can you please assign it to me?

hope205 avatar Oct 18 '23 22:10 hope205

@hope205! Sure, let us know if you have any questions :) @srini047

FlorentLvr avatar Oct 19 '23 06:10 FlorentLvr

Awesome @hope205, feel free to reach out in case you need anything. I can assist you further. Looking forward to the contribution.

srini047 avatar Oct 19 '23 10:10 srini047

When I cloned this repo, I couldn't find the LangChain web scraping notebook.

hope205 avatar Oct 19 '23 13:10 hope205

Did you switch to the right branch? I can see the template on GitHub: [image]

FlorentLvr avatar Oct 19 '23 15:10 FlorentLvr

@hope205 Make sure you are on the right branch, then head to the directory suggested by @FlorentLvr: [image]

srini047 avatar Oct 19 '23 16:10 srini047

Thanks @srini047. I found it and have started working on it.

hope205 avatar Oct 19 '23 17:10 hope205

Hello @FlorentLvr, I have been working on the notebook, but I am encountering errors from the LangChain framework itself. The AsyncChromiumLoader has some internal issues:

from langchain.document_loaders import AsyncChromiumLoader
from langchain.document_transformers import BeautifulSoupTransformer

# Load HTML
loader = AsyncChromiumLoader([url])
html = loader.load()

It raises an error at this point. Here is the traceback:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [3], in <cell line: 3>()
      1 # Load HTML
      2 loader = AsyncChromiumLoader([url])
----> 3 html = loader.load()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\document_loaders\chromium.py:90, in AsyncChromiumLoader.load(self)
     81 def load(self) -> List[Document]:
     82     """
     83     Load and return all Documents from the provided URLs.
     84 
   (...)
     88 
     89     """
---> 90     return list(self.lazy_load())

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\langchain\document_loaders\chromium.py:77, in AsyncChromiumLoader.lazy_load(self)
     66 """
     67 Lazily load text content from the provided URLs.
     68 
   (...)
     74 
     75 """
     76 for url in self.urls:
---> 77     html_content = asyncio.run(self.ascrape_playwright(url))
     78     metadata = {"source": url}
     79     yield Document(page_content=html_content, metadata=metadata)

File ~\AppData\Local\Programs\Python\Python310\lib\asyncio\runners.py:33, in run(main, debug)
      9 """Execute the coroutine and return the result.
     10 
     11 This function runs the passed coroutine, taking care of
   (...)
     30     asyncio.run(main())
     31 """
     32 if events._get_running_loop() is not None:
---> 33     raise RuntimeError(
     34         "asyncio.run() cannot be called from a running event loop")
     36 if not coroutines.iscoroutine(main):
     37     raise ValueError("a coroutine was expected, got {!r}".format(main))

RuntimeError: asyncio.run() cannot be called from a running event loop

hope205 avatar Oct 25 '23 08:10 hope205
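
The traceback above ends in `asyncio.run()` refusing to start because an event loop is already running, which is the normal state inside a Jupyter kernel. One possible workaround, sketched here with standard-library calls only, is to invoke the blocking method from a worker thread, where no loop is running. The helper name `call_outside_running_loop` and the stand-in functions `fake_load` and `_fake_scrape` are hypothetical and only imitate the failing pattern; this has not been verified against the LangChain or Playwright versions used in this thread.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def call_outside_running_loop(blocking_fn, *args):
    """Run blocking_fn(*args) in a worker thread that has no running event loop."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(blocking_fn, *args).result()


async def _fake_scrape(url):
    # Hypothetical stand-in for the browser call; no network access happens here.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


def fake_load(url):
    # Imitates the failing pattern: a synchronous method that calls asyncio.run().
    coro = _fake_scrape(url)
    try:
        return asyncio.run(coro)
    finally:
        # No-op after a successful run; avoids a "never awaited" warning otherwise.
        coro.close()


async def notebook_cell():
    # Inside a coroutine a loop is already running, as it is in a Jupyter cell.
    try:
        direct = fake_load("https://example.com")
    except RuntimeError as exc:
        direct = str(exc)
    via_thread = call_outside_running_loop(fake_load, "https://example.com")
    return direct, via_thread


direct_result, thread_result = asyncio.run(notebook_cell())
print(direct_result)
print(thread_result)
```

In the notebook, the equivalent call would presumably be `html = call_outside_running_loop(loader.load)`. Another commonly used option in Jupyter is the third-party `nest_asyncio` package (`nest_asyncio.apply()`), which patches the running loop to allow nested `asyncio.run()` calls.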

Hey @hope205 !

Sorry for the delayed response. Did you install Playwright, and are you trying it in Naas Lab? That can be a problem, as the browser drivers may not be able to run in cloud-based Jupyter environments.

srini047 avatar Oct 27 '23 16:10 srini047

No problem @srini047. I installed Playwright, but I am not using Naas Lab. I am running it on my PC.

hope205 avatar Oct 27 '23 22:10 hope205

Hey @hope205! Just checking in, did you make some progress? 🙏

FlorentLvr avatar Nov 01 '23 10:11 FlorentLvr

I am still working on it.

hope205 avatar Nov 02 '23 20:11 hope205