Missing query indices in results of `text_sim.search`

Open aishwaryap opened this issue 1 year ago • 0 comments

There seems to be a bug in the way query_idx is filled in for the results of a call to text_sim.search. I would have expected one of the following two behaviors for the values of query_idx in the output of text_sim.search:

  1. (Preferred) The output results are in the same order as the input queries and each result also has query_idx correctly identifying the index of the query it is a result of (which will be equal to the index of the result in the output)
  2. (Less preferred but still usable) The output results may be in a different order to the input queries but each result has query_idx correctly identifying the index of the query it is a result of so that results can be sorted by query_idx to match the order of queries.
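A minimal sketch of what either behavior would allow a caller to do. It is written against a hypothetical stand-in `Result` class, not unisim's actual result type; only the `query_idx` attribute name is taken from this issue. If every query position appears exactly once among the `query_idx` values, sorting by `query_idx` restores query order under both behaviors.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for one search result; not unisim's class."""
    query_idx: int


def restore_query_order(results, num_queries):
    """Return results sorted by query_idx, or raise if the indices are unreliable."""
    seen = sorted(r.query_idx for r in results)
    if seen != list(range(num_queries)):
        raise ValueError("query_idx values are not a permutation of the query positions")
    return sorted(results, key=lambda r: r.query_idx)


# Behavior 2: results arrive in a different order but carry correct indices.
shuffled = [Result(2), Result(0), Result(1)]
ordered = restore_query_order(shuffled, num_queries=3)
print([r.query_idx for r in ordered])
```

With the output reported in this issue, a check like this would raise, because some query positions never appear in query_idx.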

Correct values of query_idx are important for debugging if an index is created with store_data=False (desirable for large indexes). However, in my use of the package, the results from text_sim.search seem to contain fewer unique values of query_idx than there are input queries. Example to reproduce:

import nltk
# nltk.download('punkt') # Needs to be run once
# nltk.download('gutenberg') # Needs to be run once
import numpy as np
from tqdm import tqdm
from unisim import TextSim

hamlet = nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt')
sents = nltk.sent_tokenize(hamlet)

text_sim = TextSim(store_data=True, index_type="approx", batch_size=1024)
text_sim.reset_index()
for sent in sents:
    text_sim.add([sent])

queries = sents
retrieval_results = text_sim.search(queries, similarity_threshold=0.9, k=1, drop_closest_match=False)

results = retrieval_results.results
# Positions where query_idx differs from the result's position in the results list
idx_mismatch = [i for i, r in enumerate(results) if r.query_idx != i]
# Positions where the stored query data differs from the input query
data_mismatch = [i for i, r in enumerate(results) if r.query_data != queries[i]]
# Positions where the top match differs from the input query
match_mismatch = [i for i, r in enumerate(results) if r.matches[0].data != queries[i]]

print("Num queries =", len(queries))
print("Num results =", len(results))
print("Num results where query_idx in result != idx of result in results list =", len(idx_mismatch))
print("Num unique queries in input =", len(set(queries)))
print("Num unique queries in output =", len({r.query_data for r in results}))
print("Num results where query data at result idx != input query at idx =", len(data_mismatch))
print("Num results where match 0 data at result idx != input query at idx =", len(match_mismatch))

My output:

Num queries = 2355
Num results = 2355
Num results where query_idx in result != idx of result in results list = 1331
Num unique queries in input = 1991
Num unique queries in output = 1991
Num results where query data at result idx != input query at idx = 0
Num results where match 0 data at result idx != input query at idx = 9
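
One observation on these numbers: 1331 = 2355 - 1024, and 1024 is the batch_size passed to TextSim. That is the count you would see if query_idx were assigned relative to each batch rather than to the full query list. This is only a hypothesis and has not been checked against the unisim source. The sketch below does not call unisim at all; it only models that assumed per-batch numbering.

```python
def simulate_per_batch_query_idx(num_queries, batch_size):
    """Model the ASSUMED behavior: indices restart from 0 at every batch."""
    return [position % batch_size for position in range(num_queries)]


query_idx = simulate_per_batch_query_idx(num_queries=2355, batch_size=1024)
# Count positions whose simulated query_idx differs from the position itself.
mismatches = sum(1 for position, idx in enumerate(query_idx) if idx != position)
unique_idx = len(set(query_idx))

print("Simulated mismatches =", mismatches)
print("Simulated unique query_idx values =", unique_idx)
```

If the hypothesis holds, the number of unique query_idx values in the real output should equal the batch size, which would be a quick way to confirm or rule it out.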

In this case I created the index with store_data=True, so I could use the query_data field to verify that the results were in the same order as the queries. However, without reliable query indices, more detailed debugging is challenging for indexes where store_data=False, for example when we observe unexpected retrieval results.
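
As a stopgap, positions can be attached by hand instead of relying on query_idx. This assumes results really are returned in query order, which the run above supports but which this issue does not establish as a guarantee. The `Result` class and the helper name are hypothetical stand-ins for illustration, not part of unisim.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Hypothetical stand-in for one search result; not unisim's class."""
    query_idx: int
    matches: list = field(default_factory=list)


def index_results_by_position(queries, results):
    """Map each query position to its result, ignoring the query_idx field."""
    if len(queries) != len(results):
        raise ValueError("expected exactly one result per query")
    return {position: result for position, result in enumerate(results)}


queries = ["a", "b", "c"]
# The query_idx values here are deliberately unreliable (they restart at 0).
results = [Result(0, ["a"]), Result(1, ["b"]), Result(0, ["c"])]
by_position = index_results_by_position(queries, results)
print(by_position[2].matches)
```

The length check matters because the positional pairing is only meaningful when there is exactly one result per query, as in the output above (2355 queries, 2355 results).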

I am using unisim==1.0.1 with Python 3.10.12 on a workstation with 2 A6000 GPUs.

In case it is relevant, I also get the following warnings:

2024-10-21 11:05:35.186726: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-21 11:05:35.186754: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-21 11:05:35.187653: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-21 11:05:35.191876: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-21 11:05:35.852691: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-10-21 11:05:36.482330: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:36.486731: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:36.523215: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:36.527291: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:36.531250: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:36.535234: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.468180: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.469837: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.471392: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.472804: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.474295: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.475709: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.486683: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.488149: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.489640: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.491081: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.492567: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.493970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13450 MB memory:  -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:2b:00.0, compute capability: 8.6
2024-10-21 11:05:42.494316: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-21 11:05:42.495743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 46146 MB memory:  -> device: 1, name: NVIDIA RTX A6000, pci bus id: 0000:41:00.0, compute capability: 8.6
/home/aishwarya/Documents/venvs/copyright_env/lib/python3.10/site-packages/keras/src/initializers/initializers.py:120: UserWarning: The initializer RandomNormal is unseeded and being called multiple times, which will return identical values each time (even if the initializer is unseeded). Please update your code to provide a seed to the initializer, or avoid using the same initializer instance more than once.
  warnings.warn(
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.

aishwaryap, Oct 21 '24 18:10