PermissionError when I Add a PDF to Database

Open pepeto opened this issue 1 year ago • 8 comments

I think I followed all the instructions, but once Streamlit is running, I drag in a PDF and click Add to Database, and this error is shown. Any idea?

THANK YOU !!!

pepeto avatar May 13 '23 21:05 pepeto

Ouch, probably something Windows-specific 😬

Where did you install quiver, and do you have access to the D: folder mentioned in the error?

StanGirard avatar May 13 '23 22:05 StanGirard

I can see three drive letters in your error output. That's probably the issue. When you upload a file, it is written to a temporary file by the app, and once it has been uploaded as embeddings, that file is deleted. I don't know why this "duplication" is needed.

Klaudioz avatar May 13 '23 23:05 Klaudioz

This is what is shown in the console:

2023-05-13 18:19:16.063 Uncaught app exception
Traceback (most recent call last):
  File "M:\Working- ENVS\Python3.10B\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "N:- GoogleDrive USAL\Working\PYTHON\quiver-main\main.py", line 57, in <module>
    file_uploader(supabase, openai_api_key, vector_store)
  File "n:- GoogleDrive USAL\Working\PYTHON\quiver-main\files.py", line 37, in file_uploader
    file_processors[file_extension](vector_store, file)
  File "n:- GoogleDrive USAL\Working\PYTHON\quiver-main\loaders\pdf.py", line 6, in process_pdf
    return process_file(vector_store, file, PyPDFLoader, ".pdf")
  File "n:- GoogleDrive USAL\Working\PYTHON\quiver-main\loaders\common.py", line 19, in process_file
    documents = loader.load()
  File "M:\Working- ENVS\Python3.10B\lib\site-packages\langchain\document_loaders\pdf.py", line 113, in load
    return list(self.lazy_load())
  File "M:\Working- ENVS\Python3.10B\lib\site-packages\langchain\document_loaders\pdf.py", line 120, in lazy_load
    yield from self.parser.parse(blob)
  File "M:\Working- ENVS\Python3.10B\lib\site-packages\langchain\document_loaders\base.py", line 87, in parse
    return list(self.lazy_parse(blob))
  File "M:\Working- ENVS\Python3.10B\lib\site-packages\langchain\document_loaders\parsers\pdf.py", line 16, in lazy_parse
    with blob.as_bytes_io() as pdf_file_obj:
  File "C:\Program Files\Python310\lib\contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "M:\Working- ENVS\Python3.10B\lib\site-packages\langchain\document_loaders\blob_loaders\schema.py", line 86, in as_bytes_io
    with open(str(self.path), "rb") as f:
PermissionError: [Errno 13] Permission denied: 'D:\TEMP\tmpim3u4796.pdf'

D:\TEMP has no permission problems: it's the system's temporary directory, and all programs and users have access to it.

pepeto avatar May 14 '23 03:05 pepeto

Hey, I was asked to help someone trying to use your project who was getting the same error. Below is the reply I gave them, which includes the likely cause.

https://github.com/StanGirard/quiver/blob/adbb41eb40f20fc264dbd68df2079649518e381d/loaders/common.py#L14

https://github.com/StanGirard/quiver/blob/adbb41eb40f20fc264dbd68df2079649518e381d/loaders/common.py#L20

https://github.com/StanGirard/quiver/blob/adbb41eb40f20fc264dbd68df2079649518e381d/utils.py#L4

Looks like they create a temp file, then pass its file name to a function that tries to open it while the temp file is still open.

https://docs.python.org/3.9/library/tempfile.html#tempfile.NamedTemporaryFile

The documentation says: "Whether the name can be used to open the file a second time, while the named temporary file is still open, varies across platforms (it can be so used on Unix; it cannot on Windows)."

(and I knew what to look for thanks to https://stackoverflow.com/questions/23212435/permission-denied-to-write-to-my-temporary-file)
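The documented behavior can be checked with a short script. This is a minimal sketch that is independent of the project's code: the helper name `can_reopen_by_name` and the dummy file contents are made up for illustration. It should report success on Unix and failure on Windows.

```python
import tempfile


def can_reopen_by_name() -> bool:
    """Return True if a NamedTemporaryFile can be opened a second time,
    by name, while the original handle is still open."""
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp_file:
        tmp_file.write(b"dummy contents")
        tmp_file.flush()
        try:
            # Second open by name while tmp_file is still open.
            # This is the step that raises PermissionError on Windows.
            with open(tmp_file.name, "rb") as second_handle:
                second_handle.read()
            return True
        except PermissionError:
            return False


if __name__ == "__main__":
    print("Reopen while still open succeeded:", can_reopen_by_name())
```

If this reports a failure on your machine, the directory permissions are not the cause: the second `open()` is refused only because the first handle has not been closed yet.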

D1firehail avatar May 14 '23 09:05 D1firehail

That looks exactly like the problem I have. Any idea how to catch the error?

pepeto avatar May 16 '23 09:05 pepeto

I have the same problem, on Windows as well.

adamengberg avatar May 16 '23 18:05 adamengberg

This worked for me:

import os
import tempfile
import time
from utils import compute_sha1_from_file
from langchain.schema import Document
import streamlit as st
from langchain.text_splitter import RecursiveCharacterTextSplitter

def process_file(vector_store, file, loader_class, file_suffix):
    # Metadata recorded alongside every chunk
    file_name = file.name
    file_size = file.size
    dateshort = time.strftime("%Y%m%d")

    # Create a temporary file using mkstemp
    fd, tmp_file_name = tempfile.mkstemp(suffix=file_suffix)

    with os.fdopen(fd, 'wb') as tmp_file:
        tmp_file.write(file.getvalue())

    loader = loader_class(tmp_file_name)
    documents = loader.load()
    file_sha1 = compute_sha1_from_file(tmp_file_name)

    chunk_size = st.session_state['chunk_size']
    chunk_overlap = st.session_state['chunk_overlap']

    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

    documents = text_splitter.split_documents(documents)

    # Add the document sha1 as metadata to each document
    docs_with_metadata = [Document(page_content=doc.page_content, metadata={"file_sha1": file_sha1,"file_size":file_size ,"file_name": file_name, "chunk_size": chunk_size, "chunk_overlap": chunk_overlap, "date": dateshort}) for doc in documents]

    vector_store.add_documents(docs_with_metadata)

    # Don't forget to remove the temporary file when you're done with it
    os.remove(tmp_file_name)

    return

This version of common.py should avoid the permission issue you were encountering on Windows, because the temporary file is closed before the loader opens it by name.

seaberry0620 avatar May 16 '23 18:05 seaberry0620

I encountered a PermissionError when trying to open a temporary file on a Windows platform. The issue originates from this block of code in common.py:

with tempfile.NamedTemporaryFile(delete=True, suffix=file_suffix) as tmp_file:
    tmp_file.write(file.getvalue())
    tmp_file.flush()

    loader = loader_class(tmp_file.name)
    documents = loader.load()
    file_sha1 = compute_sha1_from_file(tmp_file.name)

The PermissionError arises because the file created by tempfile.NamedTemporaryFile() is still open when loader.load() and compute_sha1_from_file() try to open it again by name. Unix allows that second open; Windows does not.

To resolve this issue, I modified the code to use tempfile.mkstemp() instead. It returns an open file descriptor together with the file's path and leaves deletion to the caller. That makes it possible to close the file before anything else opens it, which is what fixes the error. The file then has to be removed explicitly.

Here's the modified block of code:

# Create a temporary file using `tempfile.mkstemp`.
tmp_fd, tmp_file_name = tempfile.mkstemp(suffix=file_suffix)

try:
    # Write to the temporary file.
    with os.fdopen(tmp_fd, 'wb') as tmp_file:
        tmp_file.write(file.getvalue())
        tmp_file.flush()

    # Now you can pass the temporary file's name to `loader_class` and `compute_sha1_from_file`.
    loader = loader_class(tmp_file_name)
    documents = loader.load()
    file_sha1 = compute_sha1_from_file(tmp_file_name)
    
finally:
    # Clean up the temporary file.
    if os.path.exists(tmp_file_name):
        os.remove(tmp_file_name)
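
If you would rather keep NamedTemporaryFile, passing delete=False and closing the file before reopening it should work the same way. Below is a minimal sketch of that variant wrapped in a context manager. The name `closed_temp_file` is hypothetical and not part of the project.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def closed_temp_file(data: bytes, suffix: str = ""):
    """Write data to a temporary file, close it, yield its path,
    and delete the file when the with-block exits."""
    # delete=False stops the file from being removed when it is closed,
    # so it can safely be reopened by name afterwards.
    tmp_file = tempfile.NamedTemporaryFile(delete=False, suffix=suffix)
    try:
        try:
            tmp_file.write(data)
        finally:
            # Close before yielding so that other code can open the path.
            tmp_file.close()
        yield tmp_file.name
    finally:
        # Clean up even if the caller's block raised.
        with contextlib.suppress(FileNotFoundError):
            os.remove(tmp_file.name)


if __name__ == "__main__":
    with closed_temp_file(b"hello", suffix=".pdf") as path:
        with open(path, "rb") as reopened:
            print(reopened.read())
```

In process_file, the loader and hash calls would then go inside `with closed_temp_file(file.getvalue(), file_suffix) as tmp_file_name:`, assuming the same variable names as in the block above.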

adamengberg avatar May 16 '23 19:05 adamengberg