langchain.document_loaders.generic GenericLoader not working on Azure OpenAI - InvalidRequestError: Resource Not Found, cannot detect declared resource

Open marielaquino opened this issue 1 year ago • 0 comments

System Info

langchain=0.0.225, python=3.9.17, openai=0.27.8. Azure settings: openai.api_type = "azure", openai.api_version = "2023-05-15". The api_base, api_key, and deployment_name environment variables are all configured.
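A quick way to double-check the configuration described above is to list any settings that are unset. This is only a sketch: it assumes the settings are supplied through the OPENAI_* environment variables that the pre-1.0 openai package reads at import time. The issue does not name the variables actually used, and `missing_azure_settings` is a hypothetical helper, not part of any library.

```python
import os

# Environment variables the pre-1.0 openai package reads at import time.
# Assumption: the Azure settings in this report are supplied this way.
REQUIRED_VARS = (
    "OPENAI_API_TYPE",
    "OPENAI_API_VERSION",
    "OPENAI_API_BASE",
    "OPENAI_API_KEY",
)


def missing_azure_settings(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    # An empty list means every variable above has a non-empty value.
    print("Missing settings:", missing_azure_settings())
```

An empty result only shows that the values exist; it says nothing about whether the endpoint being called is available on the Azure resource.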

Who can help?

No response

Information

  • [X] The official example notebooks/scripts
  • [X] My own modified scripts

Related Components

  • [ ] LLMs/Chat Models
  • [ ] Embedding Models
  • [ ] Prompts / Prompt Templates / Prompt Selectors
  • [ ] Output Parsers
  • [X] Document Loaders
  • [ ] Vector Stores / Retrievers
  • [ ] Memory
  • [ ] Agents / Agent Executors
  • [ ] Tools / Toolkits
  • [ ] Chains
  • [ ] Callbacks/Tracing
  • [ ] Async

Reproduction

Note: This code is taken directly from the document loaders chapter of the LangChain "Chat With Your Data" course with Harrison Chase and Andrew Ng. It downloads the audio of a public YouTube video and generates a transcript.

Steps to reproduce the behavior:

  1. In a Jupyter notebook, configure your Azure OpenAI environment variables and add this code:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
  2. Create and run a new cell containing:
url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"
loader = GenericLoader(YoutubeAudioLoader([url], save_dir), OpenAIWhisperParser())
docs = loader.load()
  3. The run fails at the transcription step with InvalidRequestError.

The download and audio extraction steps complete successfully, then the first transcription call fails:

[youtube] Extracting URL: https://www.youtube.com/watch?v=jGwO_UgTS7I
[youtube] jGwO_UgTS7I: Downloading webpage
[youtube] jGwO_UgTS7I: Downloading ios player API JSON
[youtube] jGwO_UgTS7I: Downloading android player API JSON
[youtube] jGwO_UgTS7I: Downloading m3u8 information
[info] jGwO_UgTS7I: Downloading 1 format(s): 140
[download] docs/youtube//Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a has already been downloaded
[download] 100% of   69.76MiB
[ExtractAudio] Not converting audio docs/youtube//Stanford CS229: Machine Learning Course, Lecture 1 - Andrew Ng (Autumn 2018).m4a; file is already in target format m4a
Transcribing part 1!
InvalidRequestError                       Traceback (most recent call last)
Cell In[14], line 8
      3 save_dir="docs/youtube/"
      4 loader = GenericLoader(
      5     YoutubeAudioLoader([url],save_dir),
      6     OpenAIWhisperParser()
      7 )
----> 8 docs = loader.load()

File /usr/local/lib/python3.9/site-packages/langchain/document_loaders/generic.py:90, in GenericLoader.load(self)
     88 def load(self) -> List[Document]:
     89     """Load all documents."""
---> 90     return list(self.lazy_load())

File /usr/local/lib/python3.9/site-packages/langchain/document_loaders/generic.py:86, in GenericLoader.lazy_load(self)
     84 """Load documents lazily. Use this when working at a large scale."""
     85 for blob in self.blob_loader.yield_blobs():
---> 86     yield from self.blob_parser.lazy_parse(blob)

File /usr/local/lib/python3.9/site-packages/langchain/document_loaders/parsers/audio.py:51, in OpenAIWhisperParser.lazy_parse(self, blob)
     49 # Transcribe
     50 print(f"Transcribing part {split_number+1}!")
---> 51 transcript = openai.Audio.transcribe("whisper-1", file_obj)
     53 yield Document(
     54     page_content=transcript.text,
     55     metadata={"source": blob.source, "chunk": split_number},
     56 )

File /usr/local/lib/python3.9/site-packages/openai/api_resources/audio.py:65, in Audio.transcribe(cls, model, file, api_key, api_base, api_type, api_version, organization, **params)
     55 requestor, files, data = cls._prepare_request(
     56     file=file,
     57     filename=file.name,
   (...)
     62     **params,
     63 )
     64 url = cls._get_url("transcriptions")
---> 65 response, _, api_key = requestor.request("post", url, files=files, params=data)
     66 return util.convert_to_openai_object(
     67     response, api_key, api_version, organization
     68 )

File /usr/local/lib/python3.9/site-packages/openai/api_requestor.py:298, in APIRequestor.request(self, method, url, params, headers, files, stream, request_id, request_timeout)
    277 def request(
    278     self,
    279     method,
   (...)
    286     request_timeout: Optional[Union[float, Tuple[float, float]]] = None,
    287 ) -> Tuple[Union[OpenAIResponse, Iterator[OpenAIResponse]], bool, str]:
    288     result = self.request_raw(
    289         method.lower(),
    290         url,
   (...)
    296         request_timeout=request_timeout,
    297     )
--> 298     resp, got_stream = self._interpret_response(result, stream)
    299     return resp, got_stream, self.api_key

File /usr/local/lib/python3.9/site-packages/openai/api_requestor.py:700, in APIRequestor._interpret_response(self, result, stream)
    692     return (
    693         self._interpret_response_line(
    694             line, result.status_code, result.headers, stream=True
    695         )
    696         for line in parse_stream(result.iter_lines())
    697     ), True
    698 else:
    699     return (
--> 700         self._interpret_response_line(
    701             result.content.decode("utf-8"),
    702             result.status_code,
    703             result.headers,
    704             stream=False,
    705         ),
    706         False,
    707     )

File /usr/local/lib/python3.9/site-packages/openai/api_requestor.py:763, in APIRequestor._interpret_response_line(self, rbody, rcode, rheaders, stream)
    761 stream_error = stream and "error" in resp.data
    762 if stream_error or not 200 <= rcode < 300:
--> 763     raise self.handle_error_response(
    764         rbody, rcode, resp.data, rheaders, stream_error=stream_error
    765     )
    766 return resp

InvalidRequestError: Resource not found

Usually, a "resource not found" error comes with a message telling you to supply api_key or deployment_name. I'm not sure how to act on that here, since none of the loader or parser classes take these as parameters.
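One possible explanation, offered as a hedged sketch rather than a confirmed diagnosis: the traceback shows the request path being built by `cls._get_url("transcriptions")` with no deployment name, while Azure OpenAI routes requests through a per-deployment path. The snippet below only contrasts the two path shapes. The function names and the deployment name `my-whisper` are hypothetical, and the non-Azure shape reflects my reading of the openai 0.27.x source.

```python
from urllib.parse import urlencode


def openai_style_path(action):
    """Path with no deployment segment, as _get_url("transcriptions") appears to build it."""
    return f"/audio/{action}"


def azure_style_path(deployment, action, api_version):
    """Per-deployment path shape used by Azure OpenAI REST endpoints."""
    query = urlencode({"api-version": api_version})
    return f"/openai/deployments/{deployment}/audio/{action}?{query}"


if __name__ == "__main__":
    # "my-whisper" is a placeholder deployment name, not taken from the issue.
    print(openai_style_path("transcriptions"))
    print(azure_style_path("my-whisper", "transcriptions", "2023-05-15"))
```

If the first shape is sent to an Azure api_base, a "Resource not found" response would be consistent with what the traceback shows. Whether a Whisper deployment is available for a given resource and api_version is a separate question that this sketch does not address.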

Expected behavior

The loader should finish transcribing all four parts and return the resulting documents in the docs variable.
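For context on where "four parts" could come from, here is a small sketch of the arithmetic. It assumes the parser splits the audio into fixed 20-minute chunks, which is my recollection of OpenAIWhisperParser in this langchain version and not something the issue states. The 75-minute duration is an illustrative value, and `expected_parts` is a hypothetical helper.

```python
import math


def expected_parts(duration_minutes, chunk_minutes=20):
    """Number of fixed-length chunks needed to cover a recording."""
    return math.ceil(duration_minutes / chunk_minutes)


if __name__ == "__main__":
    # Assumed values: a recording of roughly 75 minutes, 20-minute chunks.
    print(expected_parts(75))
```

Under those assumptions, each chunk would produce one "Transcribing part N!" log line and one Document in the returned list.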

marielaquino, Jul 06 '23 19:07