
read_index and write_index fail if the path contains Unicode characters

Open huanggefan opened this issue 1 year ago • 6 comments

Summary

faiss.read_index and faiss.write_index fail when the path contains Unicode (non-ASCII) characters.

Platform

OS: Windows 11

Python: Python 3.11.4

Faiss version: 1.7.4

Installed from: pip install faiss-cpu

Running on:

  • [ ] CPU

Interface:

  • [ ] Python

Reproduction instructions

Here is the code:

import pathlib

import faiss
import numpy
import torch


class FlatL2Index():
    def __init__(self, root: pathlib.Path, dim: int = 1024):
        self.root = root  # path of the index file; used by load() and dump()
        self.dim = dim

        param = 'Flat'
        measure = faiss.METRIC_L2

        self.faiss_index = faiss.index_factory(dim, param, measure)

    def load(self):
        f = str(self.root)
        self.faiss_index = faiss.read_index(f)

    def dump(self):
        f = str(self.root)
        faiss.write_index(self.faiss_index, f)

    def train(self, dataset: numpy.ndarray | torch.Tensor | None = None):
        if dataset is None:
            # No data supplied: fall back to random training vectors.
            train_points = max(self.dim * 10, 39936)
            dataset = numpy.random.random((train_points, self.dim))

        if isinstance(dataset, torch.Tensor):
            dataset = dataset.cpu().detach().numpy()

        dataset = dataset.astype(numpy.float32)

        self.faiss_index.train(dataset)

    def append(self, iv: numpy.ndarray | torch.Tensor):
        if isinstance(iv, torch.Tensor):
            iv = iv.cpu().detach().numpy()

        iv = iv.astype(numpy.float32)

        self.faiss_index.add(iv)

    def search(self, query_iv: numpy.ndarray | torch.Tensor, top_k: int = 10):
        if isinstance(query_iv, torch.Tensor):
            query_iv = query_iv.cpu().detach().numpy()

        query_iv = query_iv.astype(numpy.float32)

        return self.faiss_index.search(query_iv, top_k)


if __name__ == "__main__":
    dim = 768
    root = pathlib.Path("Z:\\") / "中文" / "flatL2.index"

    index = FlatL2Index(root, dim)

    # index.load()

    print(type(index.faiss_index))

    train_dataset: numpy.ndarray = numpy.random.random((20, dim)).astype(numpy.float32)
    test_dataset: numpy.ndarray = numpy.random.random((20, dim)).astype(numpy.float32)
    query_iv: numpy.ndarray = numpy.random.random((1, dim)).astype(numpy.float32)

    index.train(train_dataset)
    index.append(test_dataset)
    index.search(query_iv)

    index.dump()

When performing faiss.read_index and faiss.write_index operations, if the path contains Unicode characters, you may encounter the following error:

RuntimeError: Error in __cdecl faiss::FileIOWriter::FileIOWriter(const char *)
    at D:\a\faiss-wheels\faiss-wheels\faiss\faiss\impl\io.cpp:98: 
        Error: 'f' failed: could not open Z:\中文\flatL2.index 
    for writing: No such file or directory

huanggefan avatar Sep 25 '23 02:09 huanggefan

This is because there is no unambiguous way to convert Unicode to char * in the C++ code.
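
To illustrate the ambiguity, here is a minimal sketch. It assumes (this is not confirmed in this thread) that the Python wrapper hands the path to C++ as UTF-8 bytes, while the Windows C runtime interprets char * paths in the active ANSI code page. The code pages cp1252 and gbk are only examples.

```python
# Illustration only: the same path has different byte representations,
# and bytes decoded with the wrong code page name a different file.
path = "Z:\\中文\\flatL2.index"

utf8_bytes = path.encode("utf-8")  # assumed form passed through char *
gbk_bytes = path.encode("gbk")     # what a Chinese-locale ANSI API would expect

# Decoding the UTF-8 bytes as cp1252 (Western European) yields mojibake.
misread = utf8_bytes.decode("cp1252", errors="replace")

print(utf8_bytes != gbk_bytes)  # do the two encodings disagree?
print(misread == path)          # does the misread path match the original?
```

If that assumption holds, the file system is asked for a path that does not exist, which would match the "No such file or directory" message above.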

mdouze avatar Sep 27 '23 14:09 mdouze

Oh, how can this problem be solved? Does anyone have an idea?

sulmz avatar Oct 13 '23 14:10 sulmz

I also encountered the same issue. While it's not a fundamental solution, I resolved it by saving the index file to a temporary path and then copying the file. Below is the code example.

import os
import shutil
import tempfile
import faiss
import numpy as np
from pathlib import Path
from uuid import uuid4


def get_temp_dir():
    # windows
    if os.name == "nt":
        return "/Temp"
    # linux, macos
    return "/tmp"

features = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
]

d = len(features[0]) # dimension
index = faiss.IndexFlatL2(d)

for ft in features:
    parsed = np.array([ft], dtype=np.float32)
    index.add(parsed)


dest_path = "/path/to/save/faiss.idx"
temp_dir = get_temp_dir()
if not Path(temp_dir).is_dir():
    Path(temp_dir).mkdir()

with tempfile.TemporaryDirectory(dir=temp_dir) as p:
    temp_file_path = Path(p) / str(uuid4())
    faiss.write_index(index, str(temp_file_path))
    shutil.move(str(temp_file_path), dest_path)

Since the default temp directory can include the OS user name, I specified a separate temp_dir; if the user name contains Unicode, the same problem can occur. If the user name is guaranteed not to contain Unicode, the dir argument of tempfile.TemporaryDirectory can be omitted.
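
Whether dir can be omitted could also be checked at runtime rather than guaranteed up front. A minimal sketch, where the helper names default_temp_dir_is_ascii and pick_temp_root are hypothetical and the fallback directory is whatever ASCII-only location you choose:

```python
import tempfile
from pathlib import Path


def default_temp_dir_is_ascii() -> bool:
    """Return True if the platform's default temp directory path is pure ASCII."""
    return tempfile.gettempdir().isascii()


def pick_temp_root(fallback: str) -> Path:
    """Use the default temp directory when it is ASCII, else the given fallback."""
    if default_temp_dir_is_ascii():
        return Path(tempfile.gettempdir())
    return Path(fallback)


# Hypothetical usage: "/Temp" mirrors the Windows choice made in the code above.
temp_root = pick_temp_root("/Temp")
print(temp_root)
```

The result can then be passed as dir to tempfile.TemporaryDirectory, after creating the directory if it does not exist yet.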

soonbee avatar Mar 10 '24 07:03 soonbee

Okay, that's a workaround for write_index but what do you do for read_index? If I understand this issue correctly, this problem also occurs for read_index so that you should encounter this problem as well when you want to read from the (now moved to the correct path) index file.
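
A different technique from the copy-based workaround, which may cover both directions, is to keep the path away from Faiss entirely: serialize the index to bytes in memory and let Python do the file I/O. This is a sketch I have not tested on Windows. It relies on faiss.serialize_index and faiss.deserialize_index from the Python bindings; the helper names and the injectable serialize/deserialize parameters are hypothetical and exist only so the helpers can be exercised without Faiss installed.

```python
from pathlib import Path


def write_index_unicode(index, path, serialize=None):
    """Serialize `index` in memory and let Python write the file."""
    if serialize is None:
        import faiss  # imported lazily; only needed for the default behaviour
        serialize = faiss.serialize_index  # returns a numpy uint8 array
    Path(path).write_bytes(bytes(serialize(index)))


def read_index_unicode(path, deserialize=None):
    """Read the file with Python and rebuild the index from its bytes."""
    raw = Path(path).read_bytes()
    if deserialize is None:
        import faiss
        import numpy as np
        return faiss.deserialize_index(np.frombuffer(raw, dtype=np.uint8))
    return deserialize(raw)
```

With Faiss installed, write_index_unicode(index, root) and read_index_unicode(root) would be called with the Unicode path directly. The whole index is held in memory as bytes, so this is probably less suitable for very large indexes.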

hansblafoo avatar Jun 27 '24 12:06 hansblafoo

I ran into the same problem. My use case requires paths that contain Chinese characters. Has anyone solved this?

Algabeno avatar Jun 28 '24 02:06 Algabeno

> Okay, that's a workaround for write_index but what do you do for read_index? If I understand this issue correctly, this problem also occurs for read_index so that you should encounter this problem as well when you want to read from the (now moved to the correct path) index file.

Reading an index works the same way as writing one. Copy the index to be read to a temporary path that does not contain Unicode characters, then read the copy using faiss.read_index.
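
A minimal sketch of that read path follows. The helper name read_via_ascii_copy and its parameters are hypothetical; reader stands for faiss.read_index, passed in as an argument so the helper itself does not depend on Faiss.

```python
import shutil
import tempfile
from pathlib import Path
from uuid import uuid4


def read_via_ascii_copy(src, reader, temp_root=None):
    """Copy `src` to an ASCII-only temp path, then call `reader` on the copy."""
    with tempfile.TemporaryDirectory(dir=temp_root) as tmp:
        tmp_path = Path(tmp) / f"{uuid4().hex}.index"
        if not str(tmp_path).isascii():
            raise ValueError(f"temp path is not ASCII: {tmp_path}")
        # Python's own file APIs handle the Unicode source path.
        shutil.copyfile(src, tmp_path)
        return reader(str(tmp_path))
```

A call would look like read_via_ascii_copy(root, faiss.read_index, temp_root=temp_dir). The temporary copy is deleted when the function returns, so this assumes the reader loads the whole index into memory rather than memory-mapping the file.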

soonbee avatar Jun 30 '24 23:06 soonbee

Hi there, this is more of a C++ issue than a Faiss issue, so we will close it for now. Feel free to continue the discussion on the Discussions page.

junjieqi avatar Jul 31 '24 18:07 junjieqi