chat-langchain
It does not work. Either take it down or fix it.
Has many errors. URL is incorrect for fetch, ingest.py gives an error even when the URL is changed to something that works and enough content is fetched. Too much effort just to try what is meant to be a demo application.
Unfortunately I run into the same issues. The URL seems to be incorrect and FAISS throws out of range errors.
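The FAISS out-of-range error seems to happen when the loader returns zero documents (nothing was fetched to the local directory), so FAISS is handed an empty embeddings list. A minimal check, assuming the docs were mirrored to api.python.langchain.com/en/latest/ as in the fork discussed below:

from langchain.document_loaders import ReadTheDocsLoader

# If this prints 0, FAISS.from_documents will later fail with an
# index-out-of-range error because there is nothing to embed.
loader = ReadTheDocsLoader("api.python.langchain.com/en/latest/", "utf-8-sig")
raw_documents = loader.load()
print(f"loaded {len(raw_documents)} documents")
if not raw_documents:
    raise SystemExit("no documents loaded - check that the docs were mirrored locally")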
I encountered the same problem, which prevented me from loading the LangChain docs. Isn't the main point of this project to make interacting with the LangChain docs more convenient? This bug undermines the whole project, and I hope it gets resolved as soon as possible.
Perhaps we can manually save the doc for the program to parse?
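ReadTheDocsLoader already parses from a local directory, so yes: mirror the site manually and point the loader at the folder. A rough sketch, assuming wget is installed and the mirror lands under api.python.langchain.com/ (the wget flags here are my guess, not the repo's exact ingest.sh):

import subprocess
from langchain.document_loaders import ReadTheDocsLoader

# Mirror the docs locally first; partial mirrors are still usable,
# so don't treat a non-zero wget exit code (e.g. from a 404) as fatal.
subprocess.run(
    ["wget", "--recursive", "--no-parent", "--quiet",
     "https://api.python.langchain.com/en/latest/"],
    check=False,
)

# Then parse the saved HTML instead of fetching at ingest time.
loader = ReadTheDocsLoader("api.python.langchain.com/en/latest/", "utf-8-sig")
docs = loader.load()
print(f"parsed {len(docs)} pages from the local mirror")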
The issue is one of intent. The use case of this application is to ease us into the langchain ecosystem, and it is not doing that. Second, the lack of attention and care from the team, who do not even look at the bug reports, makes me feel that going back to writing my own solution from scratch is far simpler than learning a broken, unsupported, unmaintained framework. Perhaps harsh, but it is factual.
...and no, ingest.py will fail regardless.
broken. why have it up.
I was able to get it working after fixing the errors. Try this fork. I already made a pull request, but it has not been approved yet. The one issue you might hit is the limit on OpenAI API requests, because there is a lot of content on the site to ingest.
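If you do hit that rate limit, one workaround is to embed in smaller batches and merge the partial FAISS indexes; langchain's FAISS store exposes merge_from for this. A sketch, with the batch size and pause being arbitrary assumptions:

import time
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.faiss import FAISS

def build_index_in_batches(documents, batch_size=200, pause_seconds=1.0):
    """Embed documents a batch at a time to stay under the API rate limit."""
    embeddings = OpenAIEmbeddings()
    vectorstore = None
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        partial = FAISS.from_documents(batch, embeddings)
        if vectorstore is None:
            vectorstore = partial
        else:
            vectorstore.merge_from(partial)  # fold the partial index into the main one
        time.sleep(pause_seconds)  # crude backoff between batches
    return vectorstore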
thanks @joaocarlosleme
I could not get your fork to work (it got further than this repo, though). It would be good to have it running as an interactive knowledge base when working in langchain.
this is where I ended up
python ./ingest.py
/Users/alexfuchs/opt/anaconda3/envs/langchain/lib/python3.11/site-packages/langchain/document_loaders/readthedocs.py:48: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 48 of the file /Users/alexfuchs/opt/anaconda3/envs/langchain/lib/python3.11/site-packages/langchain/document_loaders/readthedocs.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
_ = BeautifulSoup(
Traceback (most recent call last):
File "/Users/~/github/chat-langchain/./ingest.py", line 36, in
@urbanscribe usually a warning is just that: it should work fine despite the warning. If there is an ERROR, that is where the attention should be. Did you get the content from the URL saved to the local directory 'api.python.langchain.com'? What is the ERROR given when running ingest.py?
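A quick way to answer that yourself: count the HTML files under the mirrored directory before running ingest.py. If it prints 0, the loader has nothing to parse:

import os

# Count the .html files under the mirrored docs directory; if this prints 0,
# ingest.py has nothing to work with and will fail downstream.
root = "api.python.langchain.com/en/latest/"
html_files = [
    os.path.join(dirpath, name)
    for dirpath, _, names in os.walk(root)
    for name in names
    if name.endswith(".html")
]
print(f"{len(html_files)} HTML files under {root}")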
I'm using Python 3.10 and noticed you are on 3.11. Not sure if that changes anything.
@joaocarlosleme thanks very much for writing. I rebuilt the environment with 3.10
./ingest.sh from your repo runs and downloads all the URLs locally, but no vectorstore.pkl is created, so the main script fails as shown below.
ingest.sh ends with:
--2023-07-29 13:07:12-- https://api.python.langchain.com/en/latest/_modules/index.html
Reusing existing connection to api.python.langchain.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘api.python.langchain.com/en/latest/_modules/index.html’
api.python.langchain.com/en/latest/_ [ <=> ] 78.89K --.-KB/s in 0.005s
2023-07-29 13:07:13 (15.1 MB/s) - ‘api.python.langchain.com/en/latest/_modules/index.html’ saved [80781]
--2023-07-29 13:07:13-- https://api.python.langchain.com/en/latest/_modules/pydantic/config.html
Reusing existing connection to api.python.langchain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2023-07-29 13:07:13 ERROR 404: Not Found.
--2023-07-29 13:07:13-- https://api.python.langchain.com/en/latest/_modules/pydantic/env_settings.html
Reusing existing connection to api.python.langchain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2023-07-29 13:07:13 ERROR 404: Not Found.
--2023-07-29 13:07:13-- https://api.python.langchain.com/en/latest/_modules/pydantic/utils.html
Reusing existing connection to api.python.langchain.com:443.
HTTP request sent, awaiting response... 404 Not Found
2023-07-29 13:07:14 ERROR 404: Not Found.
FINISHED --2023-07-29 13:07:14--
Total wall clock time: 5m 1s
Downloaded: 2030 files, 69M in 8.2s (8.37 MB/s)
The main script fails like this:
make start
uvicorn main:app --reload --port 9000
INFO: Will watch for changes in these directories: ['/Users/alexfuchs/Developer/chat-langchain']
INFO: Uvicorn running on http://127.0.0.1:9000 (Press CTRL+C to quit)
INFO: Started reloader process [84795] using StatReload
INFO: Started server process [84797]
INFO: Waiting for application startup.
ERROR: Traceback (most recent call last):
File "/Users/alexfuchs/anaconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 677, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/Users/alexfuchs/anaconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 566, in __aenter__
await self._router.startup()
File "/Users/alexfuchs/anaconda3/envs/langchain/lib/python3.10/site-packages/starlette/routing.py", line 654, in startup
await handler()
File "/Users/alexfuchs/Developer/chat-langchain/main.py", line 24, in startup_event
raise ValueError("vectorstore.pkl does not exist, please run ingest.py first")
ValueError: vectorstore.pkl does not exist, please run ingest.py first
I will post this comment on your repo as well; perhaps the conversation is better there.
@urbanscribe the 404 errors must have prevented ingest.py from running. Just run ingest.py manually and check that it creates vectorstore.pkl before running make start.
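A quick way to check, assuming you run it from the repo root:

from pathlib import Path

# main.py's startup check raises if this file is missing, so verify it first.
pkl = Path("vectorstore.pkl")
if pkl.exists():
    print(f"vectorstore.pkl exists ({pkl.stat().st_size} bytes)")
else:
    print("vectorstore.pkl missing - ingest.py did not finish")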
I preferred to separate out the readthedocs fetcher from the embeddings and hacked away at the embed.py
This runs for me, in case it's helpful to anyone else. Thanks @joaocarlosleme
"""Load html from files, clean up, split, ingest into Weaviate."""
import pickle
import platform
from dotenv import load_dotenv
from langchain.document_loaders import ReadTheDocsLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
load_dotenv()
import os
def load_html_docs():
"""Load documents from web pages and return them."""
if platform.system() == "Windows":
loader = ReadTheDocsLoader("api.python.langchain.com/en/latest/", "utf-8-sig")
print("\nusing utf-8-sig windows")
print(f"Current working directory: {os.getcwd()}")
else:
loader = ReadTheDocsLoader("api.python.langchain.com/en/latest/", "utf-8-sig")
print("\nusing utf-8-sig")
print(f"Current working directory: {os.getcwd()}")
raw_documents = loader.load()
return raw_documents
def create_vectors_and_save(raw_documents):
print("Raw documents length:", len(raw_documents)) # Print raw_documents length
"""Create vectors from raw documents and save them to a pickle file."""
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=200,
)
documents = text_splitter.split_documents(raw_documents)
print("Documents length:", len(documents)) # Print documents length
print("Documents split into chunks.")
embeddings = OpenAIEmbeddings()
print("Embeddings:", embeddings) # Print embeddings if it's feasible
print("OpenAI Embeddings created.")
vectorstore = FAISS.from_documents(documents, embeddings)
print("Vectorstore created from documents.")
# Save vectorstore
with open("vectorstore.pkl", "wb") as f:
pickle.dump(vectorstore, f)
print("Vectorstore saved to pickle file.")
def load_local_html_docs():
"""Load locally saved HTML documents and return them."""
path = "api.python.langchain.com/en/latest/" # Adjust the path as needed
if platform.system() == "Windows":
loader = ReadTheDocsLoader(path, "utf-8-sig")
print("\nusing utf-8-sig windows")
else:
loader = ReadTheDocsLoader(path, "utf-8-sig")
print("\nusing utf-8-sig")
raw_documents = loader.load()
return raw_documents
if __name__ == "__main__":
raw_documents = load_local_html_docs()
create_vectors_and_save(raw_documents)
# if __name__ == "__main__":
# raw_documents = load_html_docs()
# create_vectors_and_save(raw_documents)
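As a quick smoke test that the pickle is usable, you can load it back (roughly what main.py does on startup, judging by the error above) and run a similarity search. The query string is just an example:

import pickle

# Load the saved index and query it to confirm the ingest actually worked.
with open("vectorstore.pkl", "rb") as f:
    vectorstore = pickle.load(f)

results = vectorstore.similarity_search("How do I use ReadTheDocsLoader?", k=2)
for doc in results:
    print(doc.metadata.get("source"), "-", doc.page_content[:100])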