
JoinLayers() takes too long

Open ryanhchung opened this issue 1 year ago • 19 comments

Hi Team,

I merged 16 Seurat objects into a single Seurat object and then tried to combine them into a single layer, but it takes forever.

The merged data is pretty large, but it is strange that it has been running for more than 5 hrs.

I've attached the code and relevant info.

Thank you!

data_combined

An object of class Seurat 
24440 features across 126178 samples within 1 assay 
Active assay: RNA (24440 features, 0 variable features)
 16 layers present: counts.Normal_Prostate, counts.Normal_Prostate.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.1.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, 
counts.2.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject

data_combined[["RNA"]] <- JoinLayers(data_combined)

And it is running for more than 5 hours. Do I just have to wait? Or is there something wrong?

Thank you!

P.S. I let it run overnight (more than 13 hrs, including the first 5 hrs), but it is still running.

ryanhchung avatar Dec 20 '23 03:12 ryanhchung

Hi, this issue seems unlikely to be due to JoinLayers, but the runtime is definitely unusually long here. Based on the output of your object, the layer names look very odd -- would you be able to share the full code for creating your data_combined object?

igrabski avatar Dec 21 '23 21:12 igrabski

@igrabski Thank you for your reply.

Here you are!

I repeated this 15 more times, as there were a total of 16 raw count matrices to be merged.

Thank you!

data <- Read10X(data.dir = path1)
data <- CreateSeuratObject(counts = data, project = "Normal_Prostate", min.cells = 3, min.features = 200)

data2 <- Read10X(data.dir = path2)
data2 <- CreateSeuratObject(counts = data2, project = "Normal_Prostate", min.cells = 3, min.features = 200)

data_combined <- merge(data, data2)

ryanhchung avatar Dec 22 '23 22:12 ryanhchung

Hi Ryan, it's possible the issue is related to the very long and similar layer names produced by this pairwise merging process. We will investigate whether naming causes an issue on our end, but in the meantime, can you try merging all 16 objects at once instead of doing them in pairs like this? You can do so by passing one object as the first argument, and the rest to the second argument, for example:

merge(obj1, y = c(obj2, obj3, obj4))

but replacing the object names above with your Seurat objects.
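As a fuller sketch of that pattern (the directory paths and `add.cell.ids` labels below are placeholders, not from the original post), the 16 samples could be loaded into a list and merged in a single call:

```r
library(Seurat)

# Hypothetical paths to the 16 Read10X directories
paths <- file.path("data", paste0("sample_", 1:16))

objs <- lapply(paths, function(p) {
  CreateSeuratObject(counts = Read10X(data.dir = p),
                     project = "Normal_Prostate",
                     min.cells = 3, min.features = 200)
})

# One merge call: the first object as x, the remaining 15 as y;
# add.cell.ids keeps barcodes unique and the layer names readable
data_combined <- merge(objs[[1]], y = objs[-1],
                       add.cell.ids = paste0("s", seq_along(objs)))
```

Merging once like this avoids the nested `counts.Normal_Prostate.SeuratProject...` layer names produced by repeated pairwise merges.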

If that doesn't fix the issue, would you be able to share, say, 3 of your objects with us to look at, if possible? You can email them to igrabski [at] nygenome [dot] org.

igrabski avatar Jan 05 '24 21:01 igrabski

Hi @igrabski

Yes, as you suggested, I loaded only three datasets and merged them in one call as a test, but it still takes too long. (Merging only three and running JoinLayers taking more than 1 hr is abnormal, right?)

I attach the code here

data1 <- Read10X(data.dir = dir1)
data1 <- CreateSeuratObject(counts = data1, project = "Normal_Prostate", min.cells = 3, min.features = 200)
data2 <- Read10X(data.dir = dir2)
data2 <- CreateSeuratObject(counts = data2, project = "Normal_Prostate", min.cells = 3, min.features = 200)
data3 <- Read10X(data.dir = dir3)
data3 <- CreateSeuratObject(counts = data3, project = "Normal_Prostate", min.cells = 3, min.features = 200)

data_combined <- merge(data1, y = c(data2, data3))

data_combined
An object of class Seurat 
22677 features across 23659 samples within 1 assay 
Active assay: RNA (22677 features, 0 variable features)
 3 layers present: counts.1, counts.2, counts.3

data_combined[["RNA"]] <- JoinLayers(data_combined)

If nothing is wrong with the code, I can share these three objects with you. But I was wondering whether it is a problem with Seurat v5 itself, as this worked fine in previous Seurat versions, which didn't require the JoinLayers() function.

Thank you!

ryanhchung avatar Jan 08 '24 20:01 ryanhchung

Yes, that amount of time is definitely abnormal! Would you be able to share your objects with me at this email address: igrabski [at] nygenome [dot] org? It's possible it could be an issue with Seurat v5, but we're not able to reproduce this problem under Seurat v5 with data we have, so it would still be helpful to see your objects.

igrabski avatar Jan 12 '24 21:01 igrabski

I am pretty curious about your problem, because I encountered a similar one. I have 100 samples. First, I preprocessed each sample to filter low-quality cells and find doublets. Then I tried to merge the data. Similarly, the merge function took a very long time, and the resulting combine_sce object had 100 counts layers, 100 data layers, and 100 scale.data layers. In Seurat v4, merging didn't take nearly as long. The JoinLayers function then took a pretty long time too. How can multiple Seurat objects be merged efficiently if they were preprocessed beforehand? And how can this be linked with BPCells in this situation?

lisch7 avatar Jan 13 '24 13:01 lisch7

Hi Lisch, if you are able to share your data, you are also welcome to send me 3 or so samples and I can investigate why this process is so slow.

igrabski avatar Jan 19 '24 21:01 igrabski

Thanks. How can I send my data to you? @igrabski

lisch7 avatar Jan 20 '24 08:01 lisch7

You can email them to igrabski [at] nygenome [dot] org!

igrabski avatar Jan 22 '24 14:01 igrabski

I just had the same problem with more than one hundred samples. I followed your advice of merging them all at once, but what actually seemed to help was parallelizing the process; within minutes it was done.

angelasanzo avatar Jan 23 '24 12:01 angelasanzo

I have a related issue: JoinLayers() also seems to be very memory hungry. Merging 83 small objects (= 249 layers, 189,619 cells in total) seems to take 38 GB of memory.

merge() completes successfully with 160 GB of RAM; JoinLayers(), however, fails, and I need a node with 600* GB of RAM, which is pretty crazy and problematic on my setup. Joining involves both the counts and scale.data layers.

I do not explicitly tell it to use multiple cores, but maybe it uses them anyway — could that explain the footprint?

I will now try deleting all scale.data slots with removeLayersByPattern().
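For reference, a similar cleanup can be sketched with base Seurat v5 accessors instead (assuming the object is called `obj`, and assuming that assigning `NULL` via `LayerData<-` removes a layer, as recent SeuratObject releases support):

```r
library(Seurat)

# Drop every scale.data layer from the RNA assay before joining,
# which should shrink the object's memory footprint considerably
scale_layers <- grep("^scale\\.data", Layers(obj[["RNA"]]), value = TRUE)
for (lyr in scale_layers) {
  LayerData(obj[["RNA"]], layer = lyr) <- NULL
}
```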


*Maybe less would be enough, but 600 GB is the next step up from 160 GB.

vertesy avatar Feb 04 '24 11:02 vertesy

Hi, if you are able to send a couple of your samples to me, I can investigate why this might be using so much memory! You can email them to igrabski [at] nygenome [dot] org.

igrabski avatar Feb 09 '24 20:02 igrabski

Hey, thank you so much. I am working on another part now; let me get back to you next week. I experienced a few odd things (removing the scale.data layers and then joining failed, as far as I remember), and I want to make sure I understand what's going on first.

vertesy avatar Feb 13 '24 08:02 vertesy

I have the same problem; can I send my data to you too? Thanks a lot! @igrabski

hkhllyzh avatar Mar 23 '24 19:03 hkhllyzh

Yes, please feel free to email your data to igrabski [at] nygenome [dot] org! I actually have not received data yet from anyone but I am still happy to investigate if anyone can send me an example!

igrabski avatar Mar 24 '24 20:03 igrabski

Same problem. I noticed that my server's CPU load is minimal when I run the code on large data. (My Seurat object was composed of more than 100 datasets and was 28 GB.)

xietianlei avatar Apr 08 '24 14:04 xietianlei

There have been a lot of reports of this issue, but no one has sent me any example data yet, so I have still been unable to debug this. However, if you are able to send 3 or so of your samples to igrabski [at] nygenome [dot] org, I am happy to take a look!

igrabski avatar Apr 09 '24 13:04 igrabski

I encountered the same issue, but only when integrating >50 samples / 300K cells. I have to wait 20 min for JoinLayers to finish joining 80 samples, which is crazy. I think it is impossible to share such a big file (5 GB in size) as a reprex.

Pentayouth avatar Apr 12 '24 17:04 Pentayouth

How long does joining, say, 3 or 5 of your samples take? If that's longer than usual, I could take a look at just those.

igrabski avatar Apr 12 '24 17:04 igrabski

I have two guesses:

  1. The problem is caused by calling BPCells on each sample separately, so the matrices were created in different folders.
  2. The genes between samples don't match perfectly.
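The second guess can be checked cheaply before merging. A minimal sketch (assuming the per-sample Seurat objects are in a list called `objs`, a name not from this thread):

```r
# Collect the feature (gene) names of each sample
feature_sets <- lapply(objs, rownames)

# TRUE only if every sample has exactly the same genes in the same order
all_match <- all(vapply(feature_sets[-1], identical, logical(1),
                        y = feature_sets[[1]]))

# Size of the shared gene set, to compare against each sample's total
shared_genes <- length(Reduce(intersect, feature_sets))
```

Note that per-sample `min.cells` filtering (as in the code earlier in the thread) will typically make the gene sets differ, since different low-expression genes get dropped in each sample.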

Pentayouth avatar May 16 '24 12:05 Pentayouth

Hi @igrabski ,

I can't see a resolution here, but I'm wondering if this has been solved? I'm having a similar issue where running JoinLayers() on 50 samples takes an age (the function just runs for hours with no end in sight).

It's tough to give a reproducible example here given the data size, but it's following a very standard pipeline i.e.

## read in data
library(Seurat)
library(parallel) # for mclapply()

folders <- list.files("sc_data", full.names = TRUE)
df <- mclapply(folders, function(x) Read10X(data.dir = x), mc.cores = 12)

## add ADT data
df <- mclapply(df, function(x) {
      combined_data <- CreateSeuratObject(x[["Gene Expression"]])
      combined_data[["ADT"]] <- CreateAssayObject(x[["Antibody Capture"]][, colnames(combined_data)])
      combined_data <- PercentageFeatureSet(combined_data, pattern = mito_pattern, col.name = "percent_mt")
      return(combined_data)
    }, mc.cores = cores)

## merge samples
df <- merge(x = df[[1]], y = df[2:length(df)])

## look at object
df

An object of class Seurat 
36662 features across 983938 samples within 2 assays 
Active assay: RNA (36601 features, 0 variable features)
 53 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, counts.9, counts.10, counts.11, counts.12, counts.13, counts.14, counts.15, counts.16, counts.17, counts.18, counts.19, counts.20, counts.21, counts.22, counts.23, counts.24, counts.25, counts.26, counts.27, counts.28, counts.29, counts.30, counts.31, counts.32, counts.33, counts.34, counts.35, counts.36, counts.37, counts.38, counts.39, counts.40, counts.41, counts.42, counts.43, counts.44, counts.45, counts.46, counts.47, counts.48, counts.49, counts.50, counts.51, counts.52, counts.53
 1 other assay present: ADT

## join layers (both of these just run for hours...)
df <- JoinLayers(df)
df[["RNA"]] <- JoinLayers(df[["RNA"]])

Thanks Jack

jackbibby1 avatar May 29 '24 16:05 jackbibby1

Hi Jack, unfortunately we have been unable to reproduce this issue on our end with any of our own data, so we haven't been able to debug it. I haven't actually received any data from anyone else either -- if you are already experiencing longer runtimes than expected with just a subset (say 3 or 4 of your samples), I would be very happy to take a look and see if we can figure out what's going on! You are welcome to send anything to igrabski [at] nygenome [dot] org.

igrabski avatar May 29 '24 17:05 igrabski

Hi,

Things run fine with smaller sample sizes, and unfortunately I can't share the full dataset. I'll try to find a public dataset that replicates the issue. I've run a few things to narrow down what the cause could be...

  1. Fewer layers and cells
  • The JoinLayers() function runs fine with fewer layers, e.g. it completes in a few minutes for 25 layers.
  2. Fewer cells but the same number of layers
  • Downsampling the cells while keeping the layer number constant is also fine up to around 500k cells, which runs in around 7 minutes, but things seem to clog up at 800k cells (see below -- output shows the cell number followed by the time in minutes). The 800k-cell run has been going for an hour or so.
downsample <- c(1e4, 5e4, 1e5, 5e5, 8e5, ncol(df))

time_results <- lapply(downsample, function(x) {
    message(x)
    downsampled_seurat <- df[, sample(colnames(df), x)]
    time_taken <- system.time(downsampled_seurat[["RNA"]] <- JoinLayers(downsampled_seurat[["RNA"]]))
    message(time_taken[3]/60)
    return(time_taken)
    })

-----

  cells     time (min)
  10000     0.14
  50000     0.37
  100000    0.90
  500000    7.04
  800000    (still running)

I've also tried combining the data manually as below, but this also just runs for hours, so maybe the slowdown is independent of the JoinLayers() function itself:

library(parallel) # for mclapply()

count_names <- names(df@assays$RNA@layers)
all_count_data <- mclapply(count_names, function(x) {
    LayerData(df, layer = x)
}, mc.cores = 20)

combined_data <- do.call(cbind, all_count_data)

jackbibby1 avatar May 31 '24 15:05 jackbibby1

With great thanks to a user who emailed me with a possible solution -- it seems that in their case, the long runtime for JoinLayers() could be resolved by ensuring that the rownames of the metadata match the order of the cells in the object. We will investigate this further in the future, but for now, I recommend that anyone experiencing long runtimes try this and see if it resolves their issue. If it does not, please open a new issue!
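A minimal sketch of that check and fix (assuming the object is called `obj`; reordering metadata rows this way is a generic data.frame operation, not an official Seurat recipe):

```r
# The cell sets should be identical; only the order may differ
stopifnot(setequal(rownames(obj@meta.data), colnames(obj)))

# Reorder the metadata rows to match the cell order in the object
if (!identical(rownames(obj@meta.data), colnames(obj))) {
  obj@meta.data <- obj@meta.data[colnames(obj), , drop = FALSE]
}

obj[["RNA"]] <- JoinLayers(obj[["RNA"]])
```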

igrabski avatar Jun 20 '24 13:06 igrabski

Hi all, I'm having this issue as well with four samples. I don't know whether the rownames match the order of the cells -- how might I go about fixing that? My original input data are Read10X-style folders containing barcodes, features, and matrix files, and I currently have them in v5 Seurat objects. Thanks!

hgildea avatar Jul 16 '24 18:07 hgildea

Hi, can you check whether the rownames of your metadata match the column names of your object? i.e., all(rownames(object@meta.data) == colnames(object)) if your object is called object?

igrabski avatar Jul 19 '24 19:07 igrabski

It returns TRUE, so I believe they do match. I am experiencing long-ish run times, but they were much improved by using so[["RNA"]] <- JoinLayers(so[["RNA"]]) instead of so[["RNA"]] <- JoinLayers(so)

hgildea avatar Jul 19 '24 20:07 hgildea

Same problem. Even with just 10 samples, the merge() function is taking forever. I've left it running overnight, but there's still no result.

256wangliu avatar Jul 29 '24 22:07 256wangliu

Hi Liu, can you check whether the rownames of your metadata match the column names of your object? i.e., all(rownames(object@meta.data) == colnames(object)) if your object is called object? Otherwise, what if you try hgildea's suggestion above, i.e. object[["RNA"]] <- JoinLayers(object[["RNA"]])? If you are still experiencing very long runtimes, you are welcome to send me your samples (or a subset of your samples) at igrabski[at]nygenome[dot]org and I am happy to take a look.

igrabski avatar Aug 16 '24 19:08 igrabski