Crash when running with larger dataset
Hello,
I am having some trouble running umap with a large dataset (0.75M rows with 75 columns) on 0.4.1. After running for around 10-15 minutes, the Python session just crashes.
I was able to run the exact same thing on a bigger dataset (2.5M rows with 75 columns) on 0.3.10.
Cheers,
That sounds troubling, but I can't say too much without a little more information. Presumably the whole thing is segfaulting somewhere inside numba's workload. Are you using any different metrics, or is this with the euclidean metric? There have been some reports of possible issues in mahalanobis and/or correlation.
I was using all the default parameters, so the metric is euclidean.
Here is the verbose output before it crashed:
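For reference, a minimal sketch of that kind of default-parameter run (the data loading is assumed here; only the shape matches what is described above):

import numpy as np
import umap

# stand-in for the real data: roughly 0.75M rows x 75 columns (hypothetical random values)
X = np.random.rand(750_000, 75).astype(np.float32)

reducer = umap.UMAP(verbose=True)      # all defaults: metric='euclidean', n_neighbors=15, min_dist=0.1
embedding = reducer.fit_transform(X)   # the reported crash happens during the NN-descent stage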
UMAP(a=None, angular_rp_forest=False, b=None,
force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
local_connectivity=1.0, low_memory=False, metric='euclidean',
metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
n_neighbors=15, negative_sample_rate=5, output_metric='euclidean',
output_metric_kwds=None, random_state=None, repulsion_strength=1.0,
set_op_mix_ratio=1.0, spread=1.0, target_metric='categorical',
target_metric_kwds=None, target_n_neighbors=-1, target_weight=0.5,
transform_queue_size=4.0, transform_seed=42, unique=False, verbose=True)
Construct fuzzy simplicial set
Wed Apr 29 02:30:21 2020 Finding Nearest Neighbors
Wed Apr 29 02:30:21 2020 Building RP forest with 48 trees
Wed Apr 29 02:32:40 2020 NN descent for 20 iterations
Any chance you could try installing pynndescent and see if that makes any difference?
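For anyone following along, trying this is usually just a pip install plus a successful import; UMAP 0.4.x picks pynndescent up automatically for the nearest-neighbor search when it is importable (environment details here are assumed):

# pip install pynndescent        (run in a shell or notebook cell)
import pynndescent               # if this import succeeds, UMAP 0.4.x will use it for NN search
print("pynndescent found at:", pynndescent.__file__)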
It works after pynndescent was installed (tested on 2.5m x 75), and the performance is much better as well.
0.3.10 (w/o pynndescent)
CPU times: user 2h 24min 10s, sys: 49.7 s, total: 2h 25min
Wall time: 2h 22min 37s
0.4.1 (w/ pynndescent)
CPU times: user 3h 29min 4s, sys: 1min 17s, total: 3h 30min 21s
Wall time: 21min 30s
UMAP(a=None, angular_rp_forest=False, b=None,
force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
local_connectivity=1.0, low_memory=False, metric='euclidean',
metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
n_neighbors=15, negative_sample_rate=5, output_metric='euclidean',
output_metric_kwds=None, random_state=None, repulsion_strength=1.0,
set_op_mix_ratio=1.0, spread=1.0, target_metric='categorical',
target_metric_kwds=None, target_n_neighbors=-1, target_weight=0.5,
transform_queue_size=4.0, transform_seed=42, unique=False, verbose=True)
Construct fuzzy simplicial set
Wed Apr 29 14:01:29 2020 Finding Nearest Neighbors
Wed Apr 29 14:01:29 2020 Building RP forest with 82 trees
Wed Apr 29 14:02:53 2020 NN descent for 21 iterations
0 / 21
1 / 21
2 / 21
3 / 21
4 / 21
5 / 21
6 / 21
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pynndescent/pynndescent_.py:1155: RuntimeWarning: invalid value encountered in sqrt
self._distance_correction(self._neighbor_graph[1]),
Wed Apr 29 14:09:49 2020 Finished Nearest Neighbor Search
Wed Apr 29 14:10:15 2020 Construct embedding
completed 0 / 200 epochs
completed 20 / 200 epochs
completed 40 / 200 epochs
completed 60 / 200 epochs
completed 80 / 200 epochs
completed 100 / 200 epochs
completed 120 / 200 epochs
completed 140 / 200 epochs
completed 160 / 200 epochs
completed 180 / 200 epochs
Wed Apr 29 14:22:55 2020 Finished embedding
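The "CPU times / Wall time" lines above are in the format produced by IPython's %time magic; a minimal sketch of how such a comparison could be re-run in a notebook (X is an assumed 2.5M x 75 array):

# in an IPython/Jupyter cell; X is an assumed (n_samples, 75) float array
import umap
reducer = umap.UMAP(verbose=True)
%time embedding = reducer.fit_transform(X)   # prints the CPU times and Wall time lines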
I'm afraid this may have to suffice as a workaround for now -- I'll try to figure out what the issue might be, but it will likely be hard to track down, so it will take some time.
I wonder if this is the same issue I'm having in #430. I'll try the pynndescent resolution too.
I installed pynndescent=0.3.3 and my pipeline still failed at exactly the same place as before, ughhhhh :-( I'll return to posting my updates on #430.
So pynndescent=0.4.7 worked for me when installed in an env with just:
- ipykernel
- seaborn
- pandas
- numba
- hdbscan
- umap-learn
- pynndescent
So I'm not sure exactly what's going on. The env I have now has numba 0.46.0. Either way, it's going now and it's going fast 😄🎉
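As a quick way to check whether an environment matches that combination, something like the following could be used (just a sketch; the package list mirrors the one above and versions will vary):

import pkg_resources

# print the installed versions of the packages listed above
for pkg in ("numba", "pynndescent", "umap-learn", "hdbscan", "pandas", "seaborn"):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, "not installed")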
I'm glad it is working. The crash is very puzzling. I am seeing some crash issues with a new metric I am implementing in pynndescent (it won't be the cause of your issues) that are very hard to track down, but that may stem from a similar root cause. I'll let you know if I manage to find something reproducible on my end that might solve the problem more permanently for you.
Glad I was able to confirm that your suggestion works, at least as a temporary fix, for others! Thank you for this amazing library and all your hard work!
Installing pynndescent solved it for us as well. Is it worth adding to the requirements, or should it even be a hard dependency?
I had this problem and found it was solved by scaling the data using sklearn StandardScaler.
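For reference, a minimal sketch of that workaround (X is an assumed feature matrix):

from sklearn.preprocessing import StandardScaler
import umap

X_scaled = StandardScaler().fit_transform(X)                  # zero mean, unit variance per column
embedding = umap.UMAP(verbose=True).fit_transform(X_scaled)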
I have the same issue: a silent exit. In my case, UMAP with the cosine metric fails if I use RobustScaler but works if I use MinMaxScaler or StandardScaler. Python 3.8.12, with:
umap_model = umap.UMAP(n_components=6, verbose=True, metric='cosine', low_memory=True)
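A sketch of that comparison (X is an assumed feature matrix; the RobustScaler variant is the one reported to exit silently):

from sklearn.preprocessing import MinMaxScaler, RobustScaler
import umap

umap_model = umap.UMAP(n_components=6, verbose=True, metric='cosine', low_memory=True)

# reported to crash (silent exit):
# embedding = umap_model.fit_transform(RobustScaler().fit_transform(X))

# reported to work:
embedding = umap_model.fit_transform(MinMaxScaler().fit_transform(X))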