Possible class conflict between faiss-cpu and pymupdf

Open hairuoguo opened this issue 1 year ago • 6 comments

Summary

Hello,

I am currently using the ColBERT model, which depends on faiss, for a work project. We had pymupdf installed in the same conda environment because we are trying to work with scanned documents as a data source.

ColBERT calls faiss's kmeans.train(), which led to an AssertionError on line 109 in vector_to_array.py (assert classname.endswith('Vector')). When I inspected the input to that function, it was a pymupdf proxy object rather than an instance of one of the expected "[dtype]Vector" classes defined in faiss.
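For anyone who wants to check what is being passed in, here is a minimal diagnostic sketch. It assumes that the classname in the failing assertion is the object's class name; describe_object and the two stand-in classes are hypothetical names made up for illustration and are not part of faiss or pymupdf.

```python
def describe_object(obj):
    """Report where an object's class comes from and whether its name
    would satisfy a check like `classname.endswith('Vector')`."""
    cls = type(obj)
    return {
        "module": cls.__module__,
        "classname": cls.__name__,
        "passes_vector_check": cls.__name__.endswith("Vector"),
    }


# Hypothetical stand-ins, only to show the two outcomes.
class Float32Vector:  # name shaped like the expected "[dtype]Vector" classes
    pass


class SomeProxy:  # name shaped like an unrelated proxy object
    pass


if __name__ == "__main__":
    print(describe_object(Float32Vector()))
    print(describe_object(SomeProxy()))
```

Calling describe_object on the real value just before the assertion should show which module the unexpected proxy class comes from.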

This error disappeared after uninstalling pymupdf.

Platform

OS: Ubuntu 20.04.5 LTS (in docker container)

Faiss version: faiss-cpu 1.8.0.post1

Installed from: pip

Faiss compilation options: default flags

Running on:

  • [x] CPU
  • [ ] GPU

Interface:

  • [ ] C++
  • [x] Python

Reproduction instructions

Install faiss-cpu and pymupdf in a conda environment using pip. Import fitz (pymupdf), then attempt to train faiss's kmeans class.

OR

Install ColBERT from the ColBERT repo following its instructions, install pymupdf, and import fitz (pymupdf) in code that runs ColBERT's Indexer class.
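The first variant of the steps above could be scripted roughly as follows. This is a minimal sketch and not code from the original report: the function name run_repro, the array sizes, niter=5, and importing fitz before faiss are all assumptions. It reports "skipped" if either package is missing, so it can be run in any environment.

```python
def run_repro(n=1000, d=32, k=8):
    """Try to train faiss k-means with pymupdf loaded in the same process."""
    try:
        import numpy as np
        import fitz  # noqa: F401  (pymupdf; import order is an assumption)
        import faiss
    except ImportError as exc:
        return f"skipped: {exc}"

    # Random float32 training data; the sizes are arbitrary.
    x = np.random.rand(n, d).astype("float32")
    kmeans = faiss.Kmeans(d, k, niter=5)
    try:
        kmeans.train(x)
    except AssertionError as exc:
        return f"reproduced: AssertionError {exc!r}"
    return "not reproduced: training completed"


if __name__ == "__main__":
    print(run_repro())
```

Running the same script with the fitz import removed gives a baseline for comparison.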

hairuoguo avatar Jul 25 '24 15:07 hairuoguo

This may also be a security vulnerability, depending on what is actually happening under the hood. For example, I could modify the pymupdf vector class to include malicious code in its data() function; if the pymupdf proxy class is inadvertently used in place of the faiss one, that code would run whenever the .data() method is called.

hairuoguo avatar Jul 25 '24 18:07 hairuoguo

This may be because both Faiss and pymupdf are wrapped with SWIG. Let me check whether there is a workaround for this case.

mdouze avatar Jul 29 '24 07:07 mdouze

I think we could use SWIG_TYPE_TABLE to make a unique type table for Faiss: https://www.swig.org/Doc4.2/Modules.html#Modules_nn2 It seems that this simply ensures the table holding type names is distinct for Faiss.

mdouze avatar Jul 29 '24 07:07 mdouze

@hairuoguo could you try installing Faiss through conda? The instructions are here: https://github.com/facebookresearch/faiss/blob/main/INSTALL.md . Thanks

junjieqi avatar Jul 31 '24 16:07 junjieqi

Will try this out when I have the time (next week or so), thanks.

hairuoguo avatar Aug 01 '24 19:08 hairuoguo

@hairuoguo I faced the same issue while using fitz, but there is no issue when I use PDFplumber. You could try PDFplumber; it might work for you. However, I need to use fitz. Is there any way to do that without using conda?
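One direction that might be worth trying without conda, offered as an untested assumption rather than a confirmed fix: keep fitz out of the process that imports faiss by running the PDF work in a separate Python interpreter. If the conflict comes from both SWIG-wrapped modules being loaded in one process, as suggested earlier in the thread, this should sidestep it. The helper name run_isolated and the EXTRACT_SNIPPET string are hypothetical.

```python
import subprocess
import sys


def run_isolated(snippet, *args):
    """Run a Python snippet in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", snippet, *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    return result.stdout


# Child-side code: only this process ever imports fitz (pymupdf).
EXTRACT_SNIPPET = """
import sys
import fitz

with fitz.open(sys.argv[1]) as doc:
    for page in doc:
        print(page.get_text())
"""

if __name__ == "__main__":
    # Expects a PDF path as the first command-line argument.
    if len(sys.argv) > 1:
        print(run_isolated(EXTRACT_SNIPPET, sys.argv[1]))
```

The parent process can then import faiss and work on the returned text without ever importing fitz itself.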

Luffy241 avatar Sep 09 '24 12:09 Luffy241