langchain
langchain copied to clipboard
langchain.document_loaders.generic GenericLoader not working on Azure OpenAI - InvalidRequestError: Resource Not Found, cannot detect declared resource
System Info
langchain=0.0.225, python=3.9.17, openai=0.27.8 openai.api_type = "azure", openai.api_version = "2023-05-15" api_base, api_key, deployment_name environment variables all configured.
Who can help?
No response
Information
- [X] The official example notebooks/scripts
- [X] My own modified scripts
Related Components
- [ ] LLMs/Chat Models
- [ ] Embedding Models
- [ ] Prompts / Prompt Templates / Prompt Selectors
- [ ] Output Parsers
- [X] Document Loaders
- [ ] Vector Stores / Retrievers
- [ ] Memory
- [ ] Agents / Agent Executors
- [ ] Tools / Toolkits
- [ ] Chains
- [ ] Callbacks/Tracing
- [ ] Async
Reproduction
Steps to reproduce the behavior: Note: This code is pulled directly from document loaders chapter of Langchain Chat With Your Data course with Harrison Chase and Andrew Ng. It downloads an audio file of a public youtube video and generates a transcript.
- In a Jupyter notebook, configure your Azure OpenAI environment variables and add this code:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
- Create and run a new cell with this inside:
url="<https://www.youtube.com/watch?v=jGwO_UgTS7I>"
save_dir="docs/youtube/"
loader = GenericLoader( YoutubeAudioLoader([url],save_dir), OpenAIWhisperParser() )
docs = loader.load()
- At the transcribing step, it will fail on "InvalidRequestError".
Successfully executes the following steps:
[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs/youtube//Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of 69.76MiB
[ExtractAudio] Not converting audio docs/youtube//Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a
Transcribing part 1!
InvalidRequestError Traceback (most recent call last)
Cell In[14], line 8
3 save_dir="docs/youtube/"
4 loader = GenericLoader(
5 YoutubeAudioLoader([url],save_dir),
6 OpenAIWhisperParser()
7 )
----> 8 docs = loader.load()
File /usr/local/lib/python3.9/site-packages/langchain/document_loaders/generic.py:90, in GenericLoader.load(self)
88 def load(self) -> List[Document]:
89 """Load all documents."""
---> 90 return list(self.lazy_load())
File /usr/local/lib/python3.9/site-packages/langchain/document_loaders/generic.py:86, in GenericLoader.lazy_load(self)
84 """Load documents lazily. Use this when working at a large scale."""
85 for blob in self.blob_loader.yield_blobs():
---> 86 yield from self.blob_parser.lazy_parse(blob)
File /usr/local/lib/python3.9/site-packages/langchain/document_loaders/parsers/audio.py:51, in OpenAIWhisperParser.lazy_parse(self, blob)
49 # Transcribe
50 print(f"Transcribing part {split_number+1}!")
---> 51 transcript = openai.Audio.transcribe("whisper-1", file_obj)
53 yield Document(
54 page_content=transcript.text,
55 metadata={"source": blob.source, "chunk": split_number},
56 )
File /usr/local/lib/python3.9/site-packages/openai/api_resources/audio.py:65, in Audio.transcribe(cls, model, file, api_key, api_base, api_type, api_version, organization, **params)
55 requestor, files, data = cls._prepare_request(
56 file=file,
57 filename=file.name,
(...)
62 **params,
63 )
64 url = cls._get_url("transcriptions")
---> 65 response, _, api_key = requestor.request("post", url, files=files, params=data)
66 return util.convert_to_openai_object(
67 response, api_key, api_version, organization
68 )
File /usr/local/lib/python3.9/site-packages/openai/api_requestor.py:298, in APIRequestor.request(self, method, url, params, headers, files, stream, request_id, request_timeout)
277 def request(
278 self,
279 method,
(...)
286 request_timeout: Optional[Union[float, Tuple[float, float]]] = None,
287 ) -> Tuple[Union[OpenAIResponse, Iterator[OpenAIResponse]], bool, str]:
288 result = self.request_raw(
289 method.lower(),
290 url,
(...)
296 request_timeout=request_timeout,
297 )
--> 298 resp, got_stream = self._interpret_response(result, stream)
299 return resp, got_stream, self.api_key
File /usr/local/lib/python3.9/site-packages/openai/api_requestor.py:700, in APIRequestor._interpret_response(self, result, stream)
692 return (
693 self._interpret_response_line(
694 line, result.status_code, result.headers, stream=True
695 )
696 for line in parse_stream(result.iter_lines())
697 ), True
698 else:
699 return (
--> 700 self._interpret_response_line(
701 result.content.decode("utf-8"),
702 result.status_code,
703 result.headers,
704 stream=False,
705 ),
706 False,
707 )
File /usr/local/lib/python3.9/site-packages/openai/api_requestor.py:763, in APIRequestor._interpret_response_line(self, rbody, rcode, rheaders, stream)
761 stream_error = stream and "error" in resp.data
762 if stream_error or not 200 <= rcode < 300:
--> 763 raise self.handle_error_response(
764 rbody, rcode, resp.data, rheaders, stream_error=stream_error
765 )
766 return resp
InvalidRequestError: Resource not found
Usually, with "resource not found" errors, the message will tell you to input api_key or deployment_name. I'm not sure what this means, as none of the Loader methods take in these as params.
Expected behavior
Expected behavior is to finish four parts of transcription and "load" as doc in docs variable.