private-gpt
soffice command was not found. Please install libreoffice
Running ingest.py, I get the error:
FileNotFoundError: soffice command was not found. Please install libreoffice on your system and try again.
Installing LibreOffice does not fix the problem.
I found that certain Office files (.xlsx and .docx) had problems that triggered this; I'm not sure what they have in common. Otherwise it worked fine on the majority of .docx and .xlsx files.
Windows 11, Python 3.10.7, NVIDIA 3080
It seems to be part of a Python bug. The fix seemed two-pronged for me, though I really don't like doing the explicit call. Maybe someone can suggest a better way.
1: Install LibreOffice. 2: In C:\Python310 (or whatever version of Python you have on your system)\Lib\site-packages\unstructured\partition\common.py, line 178 is "soffice". Change that to wherever the soffice.exe file from your LibreOffice install is. For me it was "C:\Program Files\LibreOffice\program\soffice.exe".
Once I did that, I was able to get it to tell me which files were actually triggering the bug. What was activating the bug for me (I only had .txt files in my source_documents folder) was that some of my text files have lines in them containing only hyphens. They look like this:
Once I got rid of those lines, the ingester happily took it. Hope this helps!
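For anyone who wants to automate the cleanup described above, here is a minimal sketch that removes hyphen-only lines from the .txt files under source_documents. The function names (`is_hyphen_only`, `strip_hyphen_lines`, `clean_folder`) are made up for illustration, and it assumes the files are UTF-8 encoded. It rewrites files in place, so back up the folder first.

```python
from pathlib import Path


def is_hyphen_only(line: str) -> bool:
    """True for lines made of nothing but hyphens (surrounding whitespace ignored)."""
    stripped = line.strip()
    return bool(stripped) and set(stripped) == {"-"}


def strip_hyphen_lines(text: str) -> str:
    """Return text with hyphen-only lines removed; every other line is kept unchanged."""
    return "".join(
        line for line in text.splitlines(keepends=True) if not is_hyphen_only(line)
    )


def clean_folder(folder: str = "source_documents") -> None:
    """Rewrite each .txt file under folder that contains hyphen-only lines."""
    for txt in Path(folder).glob("**/*.txt"):
        # Assumption: files are UTF-8; a different encoding raises UnicodeDecodeError
        original = txt.read_text(encoding="utf-8")
        cleaned = strip_hyphen_lines(original)
        if cleaned != original:
            txt.write_text(cleaned, encoding="utf-8")
            print("Cleaned", txt)


if __name__ == "__main__":
    clean_folder()
```

Lines that merely contain a hyphen (for example "a - b") are left alone; only lines consisting entirely of hyphens are dropped.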
I had errors with LibreOffice not being able to find the resulting converted file when going from .doc to .docx. I made something quick to convert them myself:
https://gist.github.com/igoforth/80b86cc4a256db502b5d8bed3b857113
I was able to get a stable fix that addresses this issue and one other I was having. This was done after I installed LibreOffice, on a Windows system with Office installed. Run this BEFORE you run ingest.py:
```python
from pathlib import Path
import re
import os
import win32com.client as win32  # provided by the pywin32 package
from win32com.client import constants

# Collect every file under source_documents whose extension starts with .doc
paths = list(Path('source_documents').glob('**/*.doc*'))


def save_as_docx(path):
    # Uses Word on the system to open and resave any .docx files
    # to make sure they are valid.
    # Also converts .doc to .docx and renames .doc to .old to
    # prevent future issues with ingest.py
    try:
        # Opening MS Word
        print('Processing file - ', path)
        old_file_abs = os.path.abspath(path)
        word = win32.gencache.EnsureDispatch('Word.Application')
        doc = word.Documents.Open(old_file_abs)
        doc.Activate()

        # Rename path with .docx
        new_file_abs = re.sub(r'\.\w+$', '.docx', old_file_abs)

        # Save and Close
        word.ActiveDocument.SaveAs(
            new_file_abs, FileFormat=constants.wdFormatXMLDocument
        )
        doc.Close(False)

        # Rename .doc files so they don't continue to cause issues
        if path.suffix == '.doc':
            ren_file_abs = re.sub(r'\.\w+$', '.old', old_file_abs)
            os.rename(old_file_abs, ren_file_abs)
    except OSError as err:
        print('OS Error - ', err)
    except Exception as err:
        print(f"Unexpected {err=}, {type(err)=}")


for path in paths:
    save_as_docx(path)
```
> 1: Install LibreOffice. 2: In C:\Python310 (or whatever version of Python you have on your system)\Lib\site-packages\unstructured\partition\common.py, line 178 is "soffice". Change that to wherever the soffice.exe file from your LibreOffice install is. For me it was "C:\Program Files\LibreOffice\program\soffice.exe".
Alternatively, I think you can try adding the "C:\Program Files\LibreOffice\program" directory to your Path in Windows.
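If you would rather not edit the system-wide Path, a rough sketch of the same idea is to prepend the LibreOffice directory to PATH for the current Python process only. This assumes the library launches `soffice` by its bare name (which the `"soffice"` string on line 178 suggests, though I have not verified it), and the directory below is the install location mentioned in this thread, so adjust it for your machine. `ensure_on_path` is a made-up helper name.

```python
import os
import shutil

# Assumption: install location reported in this thread; change it to match your system.
LIBREOFFICE_DIR = r"C:\Program Files\LibreOffice\program"


def ensure_on_path(directory: str) -> None:
    """Prepend directory to this process's PATH unless it is already listed."""
    entries = [e for e in os.environ.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join([directory] + entries)


if __name__ == "__main__":
    ensure_on_path(LIBREOFFICE_DIR)
    # shutil.which returns the resolved executable path, or None if nothing is found
    print("soffice resolves to:", shutil.which("soffice"))
```

The change only affects the running process and anything it launches, so the call would have to go at the top of ingest.py (or in a wrapper script that then starts it) to have any effect on ingestion.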
1: Installed LibreOffice. 2: In C:\Python310 (or whatever version of Python you have on your system)\Lib\site-packages\unstructured\partition\common.py, line 178 is "soffice". 3: Added LibreOffice to my Path in Windows. 4: Ran soffice in cmd. 5: But it still gives an error.