faiss
read_index and write_index fail if the path contains Unicode characters
Summary
faiss.read_index and faiss.write_index fail if the path contains Unicode (non-ASCII) characters.
Platform
OS: Windows 11
Python: Python 3.11.4
Faiss version: 1.7.4
Installed from: pip install faiss-cpu
Running on:
- [ ] CPU
Interface:
- [ ] Python
Reproduction instructions
Here is the code:
import pathlib

import faiss
import numpy
import torch


class FlatL2Index:
    def __init__(self, root: pathlib.Path, dim: int = 1024):
        self.root = root  # path of the index file on disk
        self.dim = dim
        param = "Flat"
        measure = faiss.METRIC_L2
        self.faiss_index = faiss.index_factory(dim, param, measure)

    def load(self):
        f = str(self.root)
        self.faiss_index = faiss.read_index(f)

    def dump(self):
        f = str(self.root)
        faiss.write_index(self.faiss_index, f)

    def train(self, dataset: numpy.ndarray | torch.Tensor | None = None):
        if dataset is None:
            # No data supplied: train on random vectors and stop here.
            train_points = max(self.dim * 10, 39936)
            random_train_dataset: numpy.ndarray = numpy.random.random((train_points, self.dim)).astype(numpy.float32)
            self.faiss_index.train(random_train_dataset)
            return
        if isinstance(dataset, torch.Tensor):
            dataset = dataset.cpu().detach().numpy()
        dataset = dataset.astype(numpy.float32)
        self.faiss_index.train(dataset)

    def append(self, iv: numpy.ndarray | torch.Tensor):
        if isinstance(iv, torch.Tensor):
            iv = iv.cpu().detach().numpy()
        iv = iv.astype(numpy.float32)
        self.faiss_index.add(iv)

    def search(self, query_iv: numpy.ndarray | torch.Tensor, top_k: int = 10):
        if isinstance(query_iv, torch.Tensor):
            query_iv = query_iv.cpu().detach().numpy()
        query_iv = query_iv.astype(numpy.float32)
        return self.faiss_index.search(query_iv, top_k)


if __name__ == "__main__":
    dim = 768
    # The "中文" directory makes the path non-ASCII, which triggers the error.
    root = pathlib.Path("Z:\\") / "中文" / "flatL2.index"
    index = FlatL2Index(root, dim)
    # index.load()
    print(type(index.faiss_index))
    train_dataset: numpy.ndarray = numpy.random.random((20, dim)).astype(numpy.float32)
    test_dataset: numpy.ndarray = numpy.random.random((20, dim)).astype(numpy.float32)
    query_iv: numpy.ndarray = numpy.random.random((1, dim)).astype(numpy.float32)
    index.train(train_dataset)
    index.append(test_dataset)
    index.search(query_iv)
    index.dump()
When performing faiss.read_index and faiss.write_index operations, if the path contains Unicode characters, you may encounter the following error:
RuntimeError: Error in __cdecl faiss::FileIOWriter::FileIOWriter(const char *)
at D:\a\faiss-wheels\faiss-wheels\faiss\faiss\impl\io.cpp:98:
Error: 'f' failed: could not open Z:\中文\flatL2.index
for writing: No such file or directory
This is because there is no unambiguous way of converting unicode to char * in the C++ code.
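The ambiguity can be illustrated without Faiss. The sketch below assumes that the path reaches the C++ layer as UTF-8 bytes and that the C runtime then interprets those bytes in a legacy single-byte code page. cp1252 is used purely as an example, and the real code page depends on the Windows locale. The helper name is_ascii_path is hypothetical and not part of Faiss.

```python
def is_ascii_path(path) -> bool:
    """Return True if every character in the path is plain ASCII."""
    return str(path).isascii()


path = "Z:\\中文\\flatL2.index"

# Assumption: the Python str is handed to C++ as UTF-8 bytes.
utf8_bytes = path.encode("utf-8")

# Assumption: the C runtime reads those bytes in a legacy code page
# (cp1252 here, only as an example), which yields a different file name.
misread = utf8_bytes.decode("cp1252", errors="replace")

print(is_ascii_path(path))  # False
print(misread == path)      # False
```

A check like is_ascii_path can be used to decide up front whether a path is safe to pass to faiss.read_index or faiss.write_index directly.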
How can this problem be solved? Does anyone have an idea?
I also encountered the same issue. While it's not a fundamental solution, I resolved it by saving the index file to a temporary path and then copying the file. Below is the code example.
import os
import shutil
import tempfile
import faiss
import numpy as np
from pathlib import Path
from uuid import uuid4

def get_temp_dir():
    # windows
    if os.name == "nt":
        return "/Temp"
    # linux, macos
    return "/tmp"

features = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
]
d = len(features[0])  # dimension
index = faiss.IndexFlatL2(d)
for ft in features:
    parsed = np.array([ft], dtype=np.float32)
    index.add(parsed)
dest_path = "/path/to/save/faiss.idx"
temp_dir = get_temp_dir()
if not Path(temp_dir).is_dir():
    Path(temp_dir).mkdir()
with tempfile.TemporaryDirectory(dir=temp_dir) as p:
    temp_file_path = Path(p) / str(uuid4())
    faiss.write_index(index, str(temp_file_path))
    shutil.move(str(temp_file_path), dest_path)
Since the OS user name can be part of the default temp directory path, I specified a separate temp_dir: if the user name contains Unicode characters, the same problem can occur. If the user name is guaranteed not to contain Unicode characters, the dir argument can be omitted from tempfile.TemporaryDirectory.
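Whether the dir argument is needed can also be checked at runtime rather than assumed. The following is a minimal sketch: pick_temp_root is a hypothetical helper, and the "/Temp" fallback simply mirrors the value used in the workaround above.

```python
import tempfile
from typing import Optional


def pick_temp_root(fallback: str = "/Temp") -> Optional[str]:
    """Return None when the default temp directory is ASCII-only,
    otherwise return an ASCII-only fallback directory.

    None can be passed straight to tempfile.TemporaryDirectory(dir=...),
    which then uses the system default.
    """
    default = tempfile.gettempdir()
    return None if default.isascii() else fallback


temp_root = pick_temp_root()
```

If the fallback is returned, the directory still has to exist before it is used, as in the is_dir()/mkdir() check of the workaround.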
Okay, that's a workaround for write_index but what do you do for read_index? If I understand this issue correctly, this problem also occurs for read_index so that you should encounter this problem as well when you want to read from the (now moved to the correct path) index file.
I ran into the same problem. My use case requires paths with Chinese characters. Has anyone solved this?
Okay, that's a workaround for write_index but what do you do for read_index? If I understand this issue correctly, this problem also occurs for read_index so that you should encounter this problem as well when you want to read from the (now moved to the correct path) index file.
Reading or writing an index is the same. Copy the index to be read to a temporary path with a filename that does not contain Unicode characters, then read the file using faiss.read_index.
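The read direction described above could look roughly like this. It is a sketch, not a tested Faiss recipe: read_via_ascii_copy is a hypothetical helper, the reader is passed in as a callable (for example faiss.read_index), and it assumes the reader loads the whole file into memory, since the temporary copy is deleted as soon as the call returns.

```python
import shutil
import tempfile
import uuid
from pathlib import Path
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def read_via_ascii_copy(
    src: Path,
    reader: Callable[[str], T],
    temp_root: Optional[str] = None,
) -> T:
    """Copy src to an ASCII-only temporary file and call reader on the copy.

    temp_root should itself be an ASCII-only directory; None falls back
    to the system default temp directory.
    """
    with tempfile.TemporaryDirectory(dir=temp_root) as tmp:
        # uuid4().hex contains only ASCII hex digits, so the name is safe.
        tmp_file = Path(tmp) / f"{uuid.uuid4().hex}.index"
        # Python's own file APIs handle the Unicode source path.
        shutil.copyfile(src, tmp_file)
        return reader(str(tmp_file))
```

With the class from the original report, usage would be along the lines of `self.faiss_index = read_via_ascii_copy(self.root, faiss.read_index)`. The extra copy costs one full read and write of the index file, which may matter for large indexes.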
Hi there, this is more of a C++ issue than a Faiss one, so we will close it for now. Feel free to continue the conversation on the discussion page.