generative-ai
[Bug]: rag_google_documentation.ipynb has issues in execution
File Name
/search/retrieval-augmented-generation/examples/rag_google_documentation.ipynb
What happened?
# Given a Google documentation URL, retrieve a list of all text chunks within h2 sections
def get_sections(url: str) -> list[str]:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    sections = []
    paragraphs = []
    body_div = soup.find("div", class_="devsite-article-body")
    for child in body_div.findChildren():
        if child.name == "p":
            paragraphs.append(child.get_text().strip())
        if child.name == "h2":
            sections.append(" ".join(paragraphs))
            break

    for header in soup.find_all("h2"):
        paragraphs = []
        nextNode = header.nextSibling
        while nextNode:
            if isinstance(nextNode, Tag):
                if nextNode.name in {"p", "ul"}:
                    paragraphs.append(nextNode.get_text().strip())
                elif nextNode.name == "h2":
                    sections.append(" ".join(paragraphs))
                    break
            nextNode = nextNode.nextSibling
    return sections
This needs to be fixed to handle pages that have no h2 headings or no devsite-article-body element. Currently, if soup.find("div", class_="devsite-article-body") finds no such tag it returns None, and the line for child in body_div.findChildren(): raises an AttributeError.
Relevant log output
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-440b9131ebc9> in <cell line: 1>()
----> 1 all_text = [t for url in URLS for t in get_sections(url) if t]
1 frames
<ipython-input-6-440b9131ebc9> in <listcomp>(.0)
----> 1 all_text = [t for url in URLS for t in get_sections(url) if t]
<ipython-input-5-73e0f3cdcce1> in get_sections(url)
8
9 body_div = soup.find("div", class_="devsite-article-body")
---> 10 for child in body_div.findChildren():
11 if child.name == "p":
12 paragraphs.append(child.get_text().strip())
AttributeError: 'NoneType' object has no attribute 'findChildren'
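One way to make the parsing defensive is sketched below. This is a minimal sketch, not the notebook's fix: parse_sections is a hypothetical helper that takes raw HTML (separating parsing from fetching, so the guard can be exercised without network access), and it returns an empty list instead of raising when the devsite-article-body div is absent. It also appends the text following the last h2, which the original dropped because it only appended a section on encountering the next h2.

```python
from bs4 import BeautifulSoup, Tag


def parse_sections(html: str) -> list[str]:
    """Collect text chunks grouped by h2 sections; tolerate missing markup."""
    soup = BeautifulSoup(html, "html.parser")
    sections: list[str] = []

    body_div = soup.find("div", class_="devsite-article-body")
    if body_div is None:
        # Page has no devsite article body: nothing to extract, no crash.
        return sections

    # Text that appears before the first h2, if any h2 exists at all.
    paragraphs = []
    for child in body_div.find_all(True):
        if child.name == "p":
            paragraphs.append(child.get_text().strip())
        if child.name == "h2":
            sections.append(" ".join(paragraphs))
            break

    # Text under each h2, up to the next h2; the final section is kept too.
    for header in body_div.find_all("h2"):
        paragraphs = []
        node = header.next_sibling
        while node is not None:
            if isinstance(node, Tag):
                if node.name in {"p", "ul"}:
                    paragraphs.append(node.get_text().strip())
                elif node.name == "h2":
                    break
            node = node.next_sibling
        sections.append(" ".join(paragraphs))
    return sections
```

get_sections(url) would then reduce to requests.get(url) followed by parse_sections(page.content), keeping the notebook's list comprehension over URLS unchanged.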
CC: @holtskinner
@grivescorbett is the creator of this notebook.
Possible improvement to be made to this notebook:
The Document AI Layout Parser can handle HTML pages. This could be a way to extract paragraph, title, and other structural information without doing the manual HTML parsing.
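A rough sketch of that direction, assuming a Layout Parser processor has already been created in a Google Cloud project. The client call and field names (document_layout, text_block) reflect my reading of the google-cloud-documentai client and should be verified against the current library; group_by_heading is a hypothetical helper that does the same h2-grouping as the notebook, but on the parser's typed blocks instead of raw HTML.

```python
def layout_blocks(html_bytes: bytes, processor_name: str) -> list[tuple[str, str]]:
    """Send HTML to a Document AI Layout Parser processor; return (type, text) pairs."""
    # Imported here so group_by_heading below stays usable without the package.
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient()
    request = documentai.ProcessRequest(
        name=processor_name,  # projects/{project}/locations/{loc}/processors/{id}
        raw_document=documentai.RawDocument(content=html_bytes, mime_type="text/html"),
    )
    result = client.process_document(request=request)
    return [
        (block.text_block.type_, block.text_block.text)
        for block in result.document.document_layout.blocks
    ]


def group_by_heading(blocks: list[tuple[str, str]]) -> list[str]:
    """Join paragraph text between consecutive heading blocks into one chunk each."""
    sections: list[str] = []
    current: list[str] = []
    for type_, text in blocks:
        if type_.startswith("heading"):
            if current:
                sections.append(" ".join(current))
            current = []
        elif type_ == "paragraph":
            current.append(text.strip())
    if current:
        sections.append(" ".join(current))
    return sections
```

This would replace the BeautifulSoup traversal entirely, and a missing article body simply yields no blocks rather than an AttributeError.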