Memory Allocation Failed - Large Datasets

Open TonyX26 opened this issue 2 years ago • 7 comments

Hi all,

I've been trying to run FIt-SNE on an FCS file containing about 20 million events. Despite allocating 1.5TB of memory, it fails with the error below. The error does not occur when the same file is downsampled to 2 or 5 million cells. To isolate the problem I have been running only 20 iterations, but it never gets that far...

Has anyone encountered this error before? I've attached the error file, the output, and my script.

Thanks!

=============== t-SNE v1.2.1 ===============
fast_tsne data_path: <path> 
fast_tsne result_path: <path>
fast_tsne nthreads: 96
Read the following parameters:
	 n 19113296 by d 17 dataset, theta 0.500000,
	 perplexity 50.000000, no_dims 2, max_iter 20,
	 stop_lying_iter 250, mom_switch_iter 250,
	 momentum 0.500000, final_momentum 0.800000,
	 learning_rate 1592774.666667, max_step_norm 5.000000,
	 K -1, sigma -30.000000, nbody_algo 2,
	 knn_algo 1, early_exag_coeff 12.000000,
	 no_momentum_during_exag 0, n_trees 50, search_k 7500,
	 start_late_exag_iter -1, late_exag_coeff 1.000000
	 nterms 3, interval_per_integer 1.000000, min_num_intervals 50, t-dist df 1.000000
Read the 19113296 x 17 data matrix successfully. X[0,0] = 71838.656250
Read the initialization successfully.
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width.  K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 7500
Going to allocate memory. N: 19113296, K: 150, N*K = -1427972896
Memory allocation failed!

Resource Usage on 2021-08-05 16:59:31:
   Job Id:             job_ID
   Project:            ##
   Exit Status:        1
   Service Units:      6.20
   NCPUs Requested:    48                     NCPUs Used: 48
                                           CPU Time Used: 00:02:26
   Memory Requested:   1.46TB                Memory Used: 37.03GB
   Walltime requested: 20:00:00            Walltime Used: 00:02:35
   JobFS requested:    30.0GB                 JobFS used: 15.18KB

Error file:

FIt-SNE R wrapper loading.
FIt-SNE root directory was set to <directory>
Using rsvd() to compute the top PCs for initialization.
Error in fftRtsne(dsobject_s[, -c(1, 19:24)], perplexity = 50, max_iter = 20) : 
  tsne call failed
Execution halted

Script:

library(flowCore)

## Sourcing FITSNE 
fast_tsne_path  <- "<path>/fast_tsne" 
source(paste0(fast_tsne_path,".R"))

## Loading in File
object <- exprs(read.FCS("<file>.fcs"))

## Running FIt-SNE 
## fast_tsne_path is an argument to fftRtsne(), not a column for cbind()
tsne_object <- fftRtsne(object[,-c(1, 19:24)], perplexity = 50, max_iter = 20, fast_tsne_path = fast_tsne_path)
export_obj <- cbind(object, tSNEX = tsne_object[,1], tSNEY = tsne_object[,2])

## Saving Object
saveRDS(export_obj, "fitSNE_alltube_simple20.rds")

TonyX26, Aug 12 '21 04:08

Thanks for posting the issue, @TonyX26. I think the problem is integer overflow. See how N*K gives a negative number here?

Going to allocate memory. N: 19113296, K: 150, N*K = -1427972896

This is because N*K = 19113296 × 150 ≈ 2.87E9, which is larger than the maximum 32-bit signed integer, 2^31 - 1 ≈ 2.147E9.

I think you need to change the definition of the function computeGaussianPerplexity() so that N and D are not declared as integer, but long integer instead. This would also need to be changed in the header file, of course.

Then, N*K should not overflow, and calloc() will not be trying to allocate a "negative" amount of memory. If you have trouble making that change, I can also make it for you.
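
A minimal sketch of the kind of change described above, assuming the buffer holds N*K elements (as the log line suggests). `element_count` and `alloc_neighbor_buffer` are hypothetical names for illustration, not functions from FIt-SNE; the only point is that the product is formed in a 64-bit type before it reaches calloc().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Hypothetical helper (not part of FIt-SNE): number of elements in an
// n-by-k buffer, computed in std::size_t rather than int.
// Returns 0 if the count would not fit in std::size_t.
std::size_t element_count(std::int64_t n, std::int64_t k) {
    assert(n >= 0 && k >= 0);
    if (n == 0 || k == 0) return 0;
    const std::size_t max_count = std::numeric_limits<std::size_t>::max();
    if (static_cast<std::uint64_t>(n) > max_count / static_cast<std::uint64_t>(k)) {
        return 0;  // n * k would overflow std::size_t
    }
    return static_cast<std::size_t>(n) * static_cast<std::size_t>(k);
}

// Hypothetical helper: zero-initialised buffer of n * k unsigned ints.
// Returns nullptr on failure; the caller releases it with std::free().
unsigned int *alloc_neighbor_buffer(std::int64_t n, std::int64_t k) {
    const std::size_t count = element_count(n, k);
    if (count == 0) return nullptr;
    return static_cast<unsigned int *>(std::calloc(count, sizeof(unsigned int)));
}
```

On a 64-bit platform, `element_count(19113296, 150)` gives the full 2866994400 rather than a wrapped value. Note that simply assigning an `int * int` product to a wider variable is not enough, because the multiplication is still carried out in `int`; at least one operand has to be widened first.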

If that fixes the problem, please make a pull request so we can update the repo.

linqiaozhi, Aug 12 '21 14:08

Thanks for the reply!

That fixed the negative memory problem, but sadly it is still not working and I'm not sure why. It normally quits after 2 or 3 minutes, but this run went on for much longer, so I think that is hopeful!

Thanks, Tony

Error output:

FIt-SNE R wrapper loading.
FIt-SNE root directory was set to <directory>
Using rsvd() to compute the top PCs for initialization.
Error in fftRtsne(data, perplexity = 80, max_iter = 3000,  : 
  tsne call failed
Execution halted

Full output:

=============== t-SNE v1.2.1 ===============
fast_tsne data_path: <path>.dat
fast_tsne result_path:  <path>.dat
fast_tsne nthreads: 64
Read the following parameters:
	 n 19113296 by d 17 dataset, theta 0.500000,
	 perplexity 80.000000, no_dims 2, max_iter 3000,
	 stop_lying_iter 250, mom_switch_iter 250,
	 momentum 0.500000, final_momentum 0.800000,
	 learning_rate 1592774.666667, max_step_norm 5.000000,
	 K -1, sigma -30.000000, nbody_algo 2,
	 knn_algo 1, early_exag_coeff 12.000000,
	 no_momentum_during_exag 0, n_trees 50, search_k 12000,
	 start_late_exag_iter -1, late_exag_coeff 1.000000
	 nterms 3, interval_per_integer 1.000000, min_num_intervals 50, t-dist df 0.800000
Read the 19113296 x 17 data matrix successfully. X[0,0] = 71838.656250
Read the initialization successfully.
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width.  K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 12000
Going to allocate memory. N: 19113296, K: 240, N*K = 292223744
Building Annoy tree...
Done building tree. Beginning nearest neighbor search... 
parallel (64 threads):
[>                                                           ] 0% 0.036s
======================================================================================
                  Resource Usage on 2021-08-13 11:09:37:
   Job Id:             <job>
   Project:            <job>
   Exit Status:        1
   Service Units:      45.22
   NCPUs Requested:    32                     NCPUs Used: 32              
                                           CPU Time Used: 01:07:32                                   
   Memory Requested:   2.93TB                Memory Used: 54.25GB         
   Walltime requested: 20:00:00            Walltime Used: 01:07:50        
   JobFS requested:    30.0GB                 JobFS used: 2.71GB          
======================================================================================

Script:

library(flowCore)
fast_tsne_path <- "<path>/fast_tsne"
source(paste0(fast_tsne_path,".R"))
object <- exprs( read.FCS("ConcatAnnaB_AllTube_1.fcs"))
tsne_object <-fftRtsne(object[,-c(1, 19:24)],perplexity = 80,max_iter = 3000, df = 0.8, fast_tsne_path = fast_tsne_path)
export_obj <- cbind(object, tSNEX = tsne_object[,1], tSNEY = tsne_object[,2])
saveRDS(export_obj, "fitSNE_alltube_full.rds")

TonyX26, Aug 13 '21 01:08

Same story. It is crashing here, and you can see the indices are integers, so they are overflowing. Try changing those to long int, particularly n.
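
The N*K values printed in both logs are consistent with 32-bit wraparound, which can be checked with a little arithmetic. The sketch below is not FIt-SNE code; `wrap_to_int32` is a hypothetical helper that emulates what a 32-bit signed int would hold after the exact product wraps modulo 2^32. The N and K values are taken from the logs in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Exact product, computed in 64 bits so nothing overflows.
std::int64_t exact_product(std::int64_t n, std::int64_t k) {
    return n * k;
}

// Hypothetical helper (not part of FIt-SNE): the value a 32-bit signed
// int would hold if the exact product n * k wrapped modulo 2^32.
std::int64_t wrap_to_int32(std::int64_t n, std::int64_t k) {
    assert(n >= 0 && k >= 0);
    const std::int64_t two32 = std::int64_t{1} << 32;
    const std::int64_t two31 = std::int64_t{1} << 31;
    std::int64_t r = exact_product(n, k) % two32;  // keep the low 32 bits
    if (r >= two31) r -= two32;                    // read them as two's complement
    return r;
}
```

For N = 19113296 and K = 150 this gives -1427972896, the value in the first log. For K = 240 the exact product is 4587191040, which wraps to 292223744, the value in the second log. So a positive N*K in the output does not by itself show that the overflow is gone.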

Although the algorithm should work, we did not test the actual code with a dataset of this size. There are likely other places too where we used int and it should be long int instead. It is a big help if you can go through and make those changes and then pull request. If you have difficulty or it's too much work, let me know and I can do it.
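
One pattern worth searching for is a flat offset such as `row * K + col` into an N-by-K array, since it overflows `int` in the same way as the allocation size. A minimal sketch with a hypothetical helper name, `flat_index`, which is not taken from the FIt-SNE source. It uses `std::size_t` rather than `long int` because `long` is only 32 bits on 64-bit Windows, so a fixed-width or pointer-sized type is the safer choice if the code has to build there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static_assert(sizeof(std::size_t) >= 8, "this sketch assumes a 64-bit platform");

// Hypothetical helper: flat offset of element (row, col) in a row-major
// array with `cols` entries per row. Each operand is widened to
// std::size_t before multiplying, so the result is not limited to the
// range of int even though the arguments are.
std::size_t flat_index(int row, int cols, int col) {
    assert(row >= 0 && cols > 0 && col >= 0 && col < cols);
    return static_cast<std::size_t>(row) * static_cast<std::size_t>(cols)
         + static_cast<std::size_t>(col);
}
```

With the dimensions from this issue, the last element sits at offset `flat_index(19113295, 150, 149)`, which is 2866994399 and already beyond the largest 32-bit signed integer.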

linqiaozhi, Aug 13 '21 17:08

Thanks for all the help! I've had a shot at doing it myself, but sadly haven't managed to get it to work still. If it's possible to get some help, that'd be much appreciated. I'll pull the request though, seeing that it is solved! Thank you

TonyX26, Aug 15 '21 03:08

Hey Tony, "pull request" does not mean closing the issue :-) I am reopening it, as it's clearly a bug.

dkobak, Aug 17 '21 21:08

I'm so sorry. First time as you may have realised :D I've put in a pull request now.

TonyX26, Aug 19 '21 11:08

Hi All,

I've implemented the above changes, and have been trying to find additional ways around it. The output below is the furthest I've managed to get sadly. I note the memory allocation is still negative, but I'm unsure of what else to change to avoid this problem. Any help would be very much appreciated!

fast_tsne data_path: <path> RtmpfICJOc/fftRtsne_data_1e6bc64af50344.dat
fast_tsne result_path: <path> RtmpfICJOc/fftRtsne_result_1e6bc69d15fd9.dat
fast_tsne nthreads: 96
Read the following parameters:
	 n 19113296 by d 17 dataset, theta 0.500000,
	 perplexity 50.000000, no_dims 2, max_iter 20,
	 stop_lying_iter 250, mom_switch_iter 250,
	 momentum 0.500000, final_momentum 0.800000,
	 learning_rate 1592774.666667, max_step_norm 5.000000,
	 K -1, sigma -30.000000, nbody_algo 2,
	 knn_algo 1, early_exag_coeff 12.000000,
	 no_momentum_during_exag 0, n_trees 50, search_k 7500,
	 start_late_exag_iter -1, late_exag_coeff 1.000000
	 nterms 3, interval_per_integer 1.000000, min_num_intervals 50, t-dist df 1.000000
Read the 19113296 x 17 data matrix successfully. X[0,0] = 71838.656250
Read the initialization successfully.
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width.  K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 7500
Going to allocate memory. N: 19113296, K: 150, N*K = -1427972896
Building Annoy tree...
Done building tree. Beginning nearest neighbor search... 
parallel (96 threads):
[>                                                           ] 0% 0.005s
[>                                                           ] 0% 0.564s
[>                                                           ] 0% 1.101s
[>                                                           ] 0% 1.663s
[>                                                           ] 0% 2.181s
[>                                                           ] 0% 2.777s
[>                                                           ] 0% 3.315s
[>                                                           ] 0% 3.988s
[>                                                           ] 0% 4.601s
[>                                                           ] 0% 5.105s
[>                                                           ] 0% 5.698s
[>                                                           ] 0% 6.276s
[>                                                           ] 0% 6.759s
[>                                                           ] 0% 7.307s
[>                                                           ] 0% 7.807s
[>                                                           ] 0% 8.303s
[>                                                           ] 0% 8.858s
[>                                                           ] 0% 9.383s
[>                                                           ] 0% 9.907s
[>                                                           ] 0% 10.397s
[>                                                           ] 0% 10.951s
[>                                                           ] 1% 11.444s
[>                                                           ] 1% 11.91s
[>                                                           ] 1% 12.352s
[>                                                           ] 1% 12.803s

This continues until:

[===========================================================>] 98% 1002.15s
[===========================================================>] 99% 1018.77s

Where the process then stops and ends. The error output is:

FIt-SNE R wrapper loading.
FIt-SNE root directory was set to /scratch/nd12/tx2668
Using rsvd() to compute the top PCs for initialization.
Error in fftRtsne(dsobject_s[, -c(1, 19:24)], perplexity = 50, max_iter = 20) : 
  tsne call failed
Execution halted

TonyX26, Sep 13 '21 01:09