Crash when running with larger dataset

Open DicksonK opened this issue 4 years ago • 13 comments

Hello,

I am having some trouble running umap on a large dataset (0.75m rows with 75 columns) on 0.4.1. After running for around 10-15 minutes, the Python session just crashes.

I was able to run exactly the same thing on a slightly bigger dataset (2.5m rows with 75 columns) on 0.3.10.

Cheers,

DicksonK avatar Apr 28 '20 10:04 DicksonK

That sounds troubling, but I can't say too much without a little more information. Presumably the whole thing is segfaulting somewhere inside numba's workload. Are you using any different metrics, or is this with the euclidean metric? There have been some reports of possible issues in mahalanobis and/or correlation.

lmcinnes avatar Apr 28 '20 17:04 lmcinnes

I was using all the default parameters, so the metric is euclidean.

Here is the verbose output before it crashes:

UMAP(a=None, angular_rp_forest=False, b=None,
     force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
     local_connectivity=1.0, low_memory=False, metric='euclidean',
     metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
     n_neighbors=15, negative_sample_rate=5, output_metric='euclidean',
     output_metric_kwds=None, random_state=None, repulsion_strength=1.0,
     set_op_mix_ratio=1.0, spread=1.0, target_metric='categorical',
     target_metric_kwds=None, target_n_neighbors=-1, target_weight=0.5,
     transform_queue_size=4.0, transform_seed=42, unique=False, verbose=True)
Construct fuzzy simplicial set
Wed Apr 29 02:30:21 2020 Finding Nearest Neighbors
Wed Apr 29 02:30:21 2020 Building RP forest with 48 trees
Wed Apr 29 02:32:40 2020 NN descent for 20 iterations

DicksonK avatar Apr 29 '20 02:04 DicksonK

Any chance you could try installing pynndescent and see if that makes any difference?

lmcinnes avatar Apr 29 '20 03:04 lmcinnes
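Before retrying, it can help to confirm which of the relevant packages are actually present in the active environment. The following is a minimal sketch, not part of umap itself: the helper names `installed_version` and `report_versions` are hypothetical, and it assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report_versions(
    dist_names: Sequence[str] = ("umap-learn", "pynndescent", "numba"),
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version (None if missing)."""
    return {name: installed_version(name) for name in dist_names}


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If `pynndescent` shows up as not installed, installing it into the same environment is the workaround suggested above.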

It works after installing pynndescent (tested on 2.5m x 75). Performance is much better as well.

0.3.10 (w/o pynndescent)

CPU times: user 2h 24min 10s, sys: 49.7 s, total: 2h 25min
Wall time: 2h 22min 37s

0.4.1 (w/ pynndescent)

CPU times: user 3h 29min 4s, sys: 1min 17s, total: 3h 30min 21s
Wall time: 21min 30s
UMAP(a=None, angular_rp_forest=False, b=None,
     force_approximation_algorithm=False, init='spectral', learning_rate=1.0,
     local_connectivity=1.0, low_memory=False, metric='euclidean',
     metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
     n_neighbors=15, negative_sample_rate=5, output_metric='euclidean',
     output_metric_kwds=None, random_state=None, repulsion_strength=1.0,
     set_op_mix_ratio=1.0, spread=1.0, target_metric='categorical',
     target_metric_kwds=None, target_n_neighbors=-1, target_weight=0.5,
     transform_queue_size=4.0, transform_seed=42, unique=False, verbose=True)
Construct fuzzy simplicial set
Wed Apr 29 14:01:29 2020 Finding Nearest Neighbors
Wed Apr 29 14:01:29 2020 Building RP forest with 82 trees
Wed Apr 29 14:02:53 2020 NN descent for 21 iterations
	 0  /  21
	 1  /  21
	 2  /  21
	 3  /  21
	 4  /  21
	 5  /  21
	 6  /  21
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/pynndescent/pynndescent_.py:1155: RuntimeWarning: invalid value encountered in sqrt
  self._distance_correction(self._neighbor_graph[1]),
Wed Apr 29 14:09:49 2020 Finished Nearest Neighbor Search
Wed Apr 29 14:10:15 2020 Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Wed Apr 29 14:22:55 2020 Finished embedding

DicksonK avatar Apr 29 '20 14:04 DicksonK
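To put the two timings above side by side, here is a small worked calculation using only the figures reported in this thread. The interpretation in the comments (that a CPU-to-wall ratio well above 1 suggests the work is spread over several cores) is an assumption, not something confirmed by the maintainers.

```python
from datetime import timedelta

# Timings copied from the report above.
wall_old = timedelta(hours=2, minutes=22, seconds=37)  # 0.3.10, without pynndescent
cpu_old = timedelta(hours=2, minutes=25)               # total CPU time, 0.3.10
wall_new = timedelta(minutes=21, seconds=30)           # 0.4.1, with pynndescent
cpu_new = timedelta(hours=3, minutes=30, seconds=21)   # total CPU time, 0.4.1

# Dividing one timedelta by another yields a plain float.
wall_speedup = wall_old / wall_new

# CPU time per unit of wall time: about 1 means effectively single-core,
# larger values suggest (as an assumption) parallel execution.
cpu_per_wall_old = cpu_old / wall_old
cpu_per_wall_new = cpu_new / wall_new

print(f"wall-clock speedup: {wall_speedup:.2f}x")
print(f"CPU/wall ratio, 0.3.10: {cpu_per_wall_old:.2f}")
print(f"CPU/wall ratio, 0.4.1:  {cpu_per_wall_new:.2f}")
```

So the run with pynndescent used more total CPU time but finished roughly 6.6 times sooner in wall-clock terms.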

I'm afraid this may have to suffice as a workaround for now -- I'll try to figure out what the issue might be, but it will likely be hard to track down, so it will take some time.

lmcinnes avatar Apr 29 '20 17:04 lmcinnes

I wonder if this is the same issue I'm having in #430. I'll try the pynndescent resolution too.

AlexMRuch avatar May 19 '20 15:05 AlexMRuch

Installed pynndescent=0.3.3 and my pipeline still failed at exactly the same place as before, ughhhhh :-( I'll return to posting my updates on #430.

AlexMRuch avatar May 19 '20 16:05 AlexMRuch

So pynndescent=0.4.7 worked for me when installed in an env with just

- ipykernel
- seaborn
- pandas
- numba
- hdbscan
- umap-learn
- pynndescent

So not sure exactly what's going on. The env I have now has numba 0.46.0. Either way, it's going now and it's going fast 😄🎉

AlexMRuch avatar May 19 '20 17:05 AlexMRuch

I'm glad it is working. The crash is very puzzling. I am seeing some crash issues with a new metric I am implementing in pynndescent (it won't be the cause of your issues) that are very hard to track down, but are, possibly, stemming from a similar root cause. I'll let you know if I manage to find something reproducible on my end that might solve the problem more permanently for you.

lmcinnes avatar May 19 '20 19:05 lmcinnes

Glad to hear that I was able to confirm your suggestion worked at least as a temp fix for others! Thank you for this amazing library and all your hard work!

AlexMRuch avatar May 19 '20 19:05 AlexMRuch

Installing pynndescent solved it for us as well. Worth adding it to the requirements or shouldn't it be a hard dependency?

VolkerBergen avatar Jun 13 '20 07:06 VolkerBergen

I had this problem and found it was solved by scaling the data using sklearn StandardScaler.

eafpres avatar Jul 29 '20 18:07 eafpres
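A minimal sketch of the scaling step described above, assuming scikit-learn and NumPy are installed. The helper names `scale_features` and `embed_scaled` and the demo data are hypothetical, and nothing here explains why scaling avoids the crash; it only shows one way to apply `StandardScaler` before fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_features(X):
    """Standardize each column to zero mean and unit variance."""
    return StandardScaler().fit_transform(X)


def embed_scaled(X, **umap_kwargs):
    """Scale X, then fit UMAP on the scaled data and return the embedding."""
    import umap  # imported lazily so the scaling step can be used on its own

    return umap.UMAP(**umap_kwargs).fit_transform(scale_features(X))


# Synthetic demo data with very different column scales (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=[1.0, 10.0, 1000.0], size=(200, 3))
X_scaled = scale_features(X_demo)
```

Calling `embed_scaled(X, verbose=True)` would then run UMAP with its default parameters on the standardized data.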

I have the same issue: a silent exit. In my case UMAP with the cosine metric fails if I use a robust scaler, but works if I use a minmax or standard scaler.

python 3.8.12

umap_model = umap.UMAP(n_components=6, verbose=True, metric='cosine', low_memory=True)

seniordatascientist avatar Aug 10 '23 15:08 seniordatascientist