bolt
Python unit test is broken?
OS: macOS Monterey 12.5 (Intel chip)
Python: 3.10.5
❯ pytest tests
============================================================================== test session starts ===============================================================================
platform darwin -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0
rootdir: /Users/xiao/development/github.com/XiaoConstantine/bolt-1
collected 4 items
tests/test_encoder.py ..F. [100%]
==================================================================================== FAILURES ====================================================================================
________________________________________________________________________________ test_unquantize _________________________________________________________________________________
def test_unquantize():
X, Q = _load_digits_X_Q(nqueries=20)
> enc = bolt.Encoder('dot', accuracy='high').fit(X)
tests/test_encoder.py:151:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:466: in fit
centroids = _learn_centroids(X, ncentroids=ncentroids,
../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:142: in _learn_centroids
centroids, labels = kmeans(X_in, ncentroids)
../../dblalock/bolt/venv/lib/python3.10/site-packages/pybolt-0.1.4-py3.10-macosx-11-x86_64.egg/bolt/bolt_api.py:106: in kmeans
seeds = kmc2.kmc2(X, k).astype(np.float32)
kmc2.pyx:97: in kmc2.kmc2
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E ValueError: probabilities contain NaN
mtrand.pyx:935: ValueError
I've made a PR to the kmc2 code and added a hack to the bolt API, with a PR here: https://github.com/dblalock/bolt/pull/38
The issue is that each row is padded with 0's. Since there are 16 rows but we only get 15 values per codebook from Python, I zeroed out the last row in all columns
here: https://github.com/dblalock/bolt/issues/29#issuecomment-1162339472.
When we pass in columns one at a time to get centroids for each column, the first column is all 0's. The kmc2 code errors when it has only 1 unique row: it updates points with the normalized distances of every row from each other. This is NaN if all the rows are the same, since the sum of the distances is 0 and the normalization divides by that sum.
This is mentioned in the thread where the external KMC2 package is included: https://github.com/dblalock/bolt/issues/4#issuecomment-942381565.
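To make the failure mode concrete, here is a minimal sketch in plain NumPy. It is not the kmc2 source: the `sampling_probs` helper and the uniform fallback are assumptions made for illustration, and the actual fix in the PR may work differently. It only shows that normalizing all-zero distances yields NaN probabilities, which `RandomState.choice` rejects with a `ValueError`.

```python
import numpy as np


def sampling_probs(dists):
    """Hypothetical helper: normalize distances into probabilities.

    Falls back to a uniform distribution when every distance is 0
    (i.e. all rows are identical), instead of computing 0 / 0.
    """
    total = dists.sum()
    if total == 0:
        return np.full(len(dists), 1.0 / len(dists))
    return dists / total


# 16 identical rows, like the all-zero column described above.
X = np.zeros((16, 1), dtype=np.float32)

# Squared distance of every row from the first row (stand-in for a seed point).
dists = ((X - X[0]) ** 2).sum(axis=1)

# Naive normalization: 0 / 0 produces NaN for every entry.
with np.errstate(invalid="ignore", divide="ignore"):
    naive = dists / dists.sum()

rng = np.random.RandomState(0)
try:
    rng.choice(len(X), p=naive)
    raised = False
except ValueError:
    # NumPy refuses to sample from probabilities that contain NaN.
    raised = True

# With the guard, sampling succeeds even though all rows are identical.
safe = sampling_probs(dists)
idx = rng.choice(len(X), p=safe)
```

Falling back to uniform sampling is only one possible choice here; returning the single unique row as every centroid would be another.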
Makes sense to me 👍 I'll wait for @dblalock to take a look when he gets time.
I'm using Python 3.10.0 on my Intel Mac. I couldn't pip install kmc2 because the Cython interface has changed. I cloned the kmc2 repository and ran cython kmc2, which then built. However, I still got the NaN error reported above.
Did you run python setup.py install inside the bolt repo after checking out the branch with the updated python/bolt/bolt_api.py?
Following the steps here https://github.com/dblalock/bolt/issues/4#issuecomment-1163934659 .
I just tested this on Python 3.7 and ran python setup.py install in both repos; I've not tried with Cython.
Here are the commands that pass the pytests on macOS 12.5:
git clone https://github.com/dblalock/bolt.git
cd bolt/
pip install -r requirements.txt
python setup.py install
cd ..
git clone git@github.com:clark-hive/kmc2.git
cd kmc2/
git checkout clark/allow_duplicated_inputs
python setup.py install
cd ../bolt/
pytest tests/