JoinLayers() takes too long
Hi Team,
I merged 16 Seurat objects into a single object and then tried to join the layers into one, but it takes forever.
The merged data is fairly large, but it seems odd that it has been running for more than 5 hours.
I've attached the code and relevant info below.
Thank you!
data_combined
An object of class Seurat
24440 features across 126178 samples within 1 assay
Active assay: RNA (24440 features, 0 variable features)
16 layers present: counts.Normal_Prostate, counts.Normal_Prostate.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.Normal_Prostate.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, counts.1.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject, 
counts.2.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject.SeuratProject
data_combined[["RNA"]] <- JoinLayers(data_combined)
And it is running for more than 5 hours. Do I just have to wait? Or is there something wrong?
Thank you!
P.S. I let it run overnight (more than 13 hrs, including the first 5) and it is still running.
Hi, this issue seems unlikely to be due to JoinLayers, but the runtime is definitely unusually long here. Based on the output of your object, the layer names look very odd -- would you be able to share the full code for creating your data_combined object?
@igrabski Thank you for your reply.
Here you are!
I repeated this 15 more times, as there were 16 raw count matrices in total to be merged.
Thank you!
data <- Read10X(data.dir = path1)
data <- CreateSeuratObject(counts = data, project = "Normal_Prostate", min.cells = 3, min.features = 200)
data2 <- Read10X(data.dir = path2)
data2 <- CreateSeuratObject(counts = data2, project = "Normal_Prostate", min.cells = 3, min.features = 200)
data_combined <- merge(data, data2)
Hi Ryan, it's possible the issue is related to the very long and similar layer names produced by this pairwise merging process. We will investigate whether naming causes an issue on our end, but in the meantime, can you try merging all 16 objects at once instead of doing them in pairs like this? You can do so by passing one object as the first argument, and the rest to the second argument, for example:
merge(obj1, y = c(obj2, obj3, obj4))
but replacing the object names above with your Seurat objects.
If that doesn't fix the issue, would you be able to share, say, 3 of your objects with us to look at, if possible? You can email them to igrabski [at] nygenome [dot] org.
Hi @igrabski
Yes, as you suggested, I tried loading only three datasets and merging them as a test, but it still takes too long (merging only three and running JoinLayers for more than 1 hr is abnormal, right?)
I've attached the code here:
data1 <- Read10X(data.dir = dir1)
data1 <- CreateSeuratObject(counts = data1, project = "Normal_Prostate", min.cells = 3, min.features = 200)
data2 <- Read10X(data.dir = dir2)
data2 <- CreateSeuratObject(counts = data2, project = "Normal_Prostate", min.cells = 3, min.features = 200)
data3 <- Read10X(data.dir = dir3)
data3 <- CreateSeuratObject(counts = data3, project = "Normal_Prostate", min.cells = 3, min.features = 200)
data_combined <- merge(data1, y = c(data2, data3))
data_combined
An object of class Seurat
22677 features across 23659 samples within 1 assay
Active assay: RNA (22677 features, 0 variable features)
3 layers present: counts.1, counts.2, counts.3
data_combined[["RNA"]] <- JoinLayers(data_combined)
If nothing is wrong with the code, I can share these three objects with you. But I was wondering whether it is a problem with Seurat v5 itself, as this worked fine in previous Seurat versions, which didn't require the JoinLayers function.
Thank you!
Yes, that amount of time is definitely abnormal! Would you be able to share your objects with me at this email address: igrabski [at] nygenome [dot] org? It's possible it could be an issue with Seurat v5, but we're not able to reproduce this problem under Seurat v5 with data we have, so it would still be helpful to see your objects.
I am pretty curious about your problem. Actually, I encountered a similar problem. I have 100 samples. First, I preprocessed each sample to filter low-quality cells and find doublets. Then I tried to merge these data. Similarly, the merge function took a very long time, and the resulting combine_sce object had 100 counts layers, 100 data layers, and 100 scale.data layers. But in Seurat v4, merge wouldn't take so much time. The JoinLayers function then took a pretty long time too. I wonder how to efficiently merge multiple Seurat objects when they have already been preprocessed? In addition, how can this be linked with BPCells in this situation?
Hi Lisch, if you are able to share your data, you are also welcome to send me 3 or so samples and I can investigate why this process is so slow.
Thanks. How can I send my data to you? @igrabski
You can email them to igrabski [at] nygenome [dot] org!
I just had the same problem with more than one hundred samples. I followed your advice of merging them at the same time, but what seemed to help the process was parallelizing it. Within minutes it was solved.
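For anyone wanting to try the parallelization route: Seurat's parallelization is built on the future framework. A minimal sketch (the worker count and memory limit below are assumptions to adjust for your own machine):

```r
library(future)

# Run work in multiple background R sessions; tune workers to your hardware.
plan("multisession", workers = 4)

# Large merged objects often exceed future's default export limit,
# so raise it (here: 8 GB) before running Seurat commands.
options(future.globals.maxSize = 8 * 1024^3)
```

Set this up before calling merge()/JoinLayers(); plan("sequential") restores the default single-process behavior.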
I have a related issue - JoinLayers() also seems to be very memory hungry.
Merging 83 small objects (= 249 layers), 189,619 cells in total, seems to take 38 GB of memory. merge() completes successfully with 160 GB of RAM; JoinLayers(), however, fails, and I need a node with 600* GB of RAM, which is pretty crazy and problematic on my setup. Joining involves the counts, data, and scale.data slots.
I do not explicitly tell it to use multiple cores - but maybe it uses them anyway, and that's why the footprint?
I will now try deleting all scale.data slots with removeLayersByPattern().
*maybe less would be enough too, but 600 GB is the next step up from 160 GB
Hi, if you are able to send a couple of your samples to me, I can investigate why this might be using so much memory! You can email them to igrabski [at] nygenome [dot] org.
Hey, thank you so much - I am working on another part now; let me get back to you next week. I experienced a few odd things (removing the scale.data layers and then joining failed, as far as I remember), and I want to make sure I understand what's going on.
I have the same problem; can I send my data to you too? Thanks a lot! @igrabski
Yes, please feel free to email your data to igrabski [at] nygenome [dot] org! I actually have not received data yet from anyone but I am still happy to investigate if anyone can send me an example!
Same problem, I noticed that my server's CPU load is minimal when I run the code on large data. (My Seurat object was composed of more than 100 datasets and was 28GB)
There have been a lot of reports of this issue, but no one has sent me any example data yet, so I have still been unable to debug this. However, if you are able to send 3 or so of your samples to igrabski [at] nygenome [dot] org, I am happy to take a look!
I encountered the same issue, though only when integrating >50 samples / 300K cells. I have to wait 20 min for JoinLayers to finish joining 80 samples, which is crazy. I think it is impossible to share such a big file [5 GB in size] as a reprex.
How long does joining, say, 3 or 5 of your samples take? If that's longer than usual, I could take a look at just those.
I have two guesses:
- the problem is caused by calling BPCells on each sample separately, so the matrices were created in different folders.
- the genes between samples don't match perfectly.
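If the second guess applies, a quick sanity check is to compare feature names across objects before merging. A minimal sketch, assuming three Seurat objects named obj1, obj2, obj3 (hypothetical names):

```r
# Collect the feature (gene) names of each object.
feats <- lapply(list(obj1, obj2, obj3), rownames)

# TRUE only if every object has identical features in identical order;
# FALSE means merge() must reconcile differing gene sets, which is slower.
all(vapply(feats[-1], identical, logical(1), feats[[1]]))
```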
Hi @igrabski ,
I can't see a resolution anywhere, but I'm wondering if this has been solved? I'm having a similar issue where running JoinLayers() on 50 samples takes an age (the function just runs for hours with no end in sight).
It's tough to give a reproducible example here given the data size, but it's following a very standard pipeline i.e.
## read in data
folders <- list.files("sc_data", full.names = TRUE)
df <- mclapply(folders, function(x) Read10X(data.dir = x), mc.cores = 12)
## add ADT data
df <- mclapply(df, function(x) {
combined_data <- CreateSeuratObject(x[["Gene Expression"]])
combined_data[["ADT"]] <- CreateAssayObject(x[["Antibody Capture"]][, colnames(combined_data)])
combined_data <- PercentageFeatureSet(combined_data, pattern = mito_pattern, col.name = "percent_mt")
return(combined_data)
}, mc.cores = cores)
## merge samples
df <- merge(x = df[[1]], y = df[2:length(df)])
## look at object
df
An object of class Seurat
36662 features across 983938 samples within 2 assays
Active assay: RNA (36601 features, 0 variable features)
53 layers present: counts.1, counts.2, counts.3, counts.4, counts.5, counts.6, counts.7, counts.8, counts.9, counts.10, counts.11, counts.12, counts.13, counts.14, counts.15, counts.16, counts.17, counts.18, counts.19, counts.20, counts.21, counts.22, counts.23, counts.24, counts.25, counts.26, counts.27, counts.28, counts.29, counts.30, counts.31, counts.32, counts.33, counts.34, counts.35, counts.36, counts.37, counts.38, counts.39, counts.40, counts.41, counts.42, counts.43, counts.44, counts.45, counts.46, counts.47, counts.48, counts.49, counts.50, counts.51, counts.52, counts.53
1 other assay present: ADT
## join layers (both of these just run for hours...)
df <- JoinLayers(df)
df[["RNA"]] <- JoinLayers(df[["RNA"]])
Thanks Jack
Hi Jack, unfortunately we have been unable to reproduce this issue on our end with any of our own data, so we haven't been able to debug it. I haven't actually received any data from anyone else either -- if you are already experiencing longer runtimes than expected with just a subset (say 3 or 4 of your samples), I would be very happy to take a look and see if we can figure out what's going on! You are welcome to send anything to igrabski [at] nygenome [dot] org.
Hi,
Things run fine at smaller sample sizes, and I can't share the full dataset unfortunately. I'll try and find a public dataset to replicate the issue. I've run a few things to see what the issue could be...
- Fewer layers and cells: the JoinLayers() function runs fine on fewer layers, e.g. it completes in a few minutes for 25 layers.
- Fewer cells but the same number of layers: downsampling the cells while keeping the layer number constant is also fine up to around 500k cells, which runs in around 7 minutes, but things seem to clog up at 800k cells (see below -- output includes the cell number first and then the time in minutes). This 800k-cell run has been going for an hour or so.
downsample <- c(1e4, 5e4, 1e5, 5e5, 8e5, ncol(df))
time_results <- lapply(downsample, function(x) {
message(x)
downsampled_seurat <- df[, sample(colnames(df), x)]
time_taken <- system.time(downsampled_seurat[["RNA"]] <- JoinLayers(downsampled_seurat[["RNA"]]))
message(time_taken[3]/60)
return(time_taken)
})
-----
10000
0.142733333333338
50000
0.37178333333333
100000
0.898649999999998
500000
7.03813333333333
800000
I've also tried to combine the data manually as below, but this also just runs for hours, so maybe the problem is independent of the JoinLayers() function:
count_names <- df@assays$RNA@layers %>% names()
all_count_data <- mclapply(count_names, function(x) {
all_count_data <- LayerData(df, layer = x)
}, mc.cores = 20)
combined_data <- do.call(cbind, all_count_data)
With great thanks to a user who emailed me with a possible solution -- it seems that in their case, the long runtime for JoinLayers() could be resolved by ensuring that the rownames of the metadata match the order of the cells in the object. We will investigate this further in the future, but for now, I recommend that anyone experiencing long runtimes try this and see if it resolves their issue. If it does not, please open a new issue!
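To check and apply that fix, a minimal sketch (assuming your Seurat object is called object):

```r
# Check whether the metadata rows are in the same order as the cells.
if (!all(rownames(object@meta.data) == colnames(object))) {
  # Reorder the metadata rows to match the cell order in the object.
  object@meta.data <- object@meta.data[colnames(object), , drop = FALSE]
}
```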
Hi all, I'm having this issue as well with four samples. I don't know that the rownames match the order of cells - how might I go about fixing that to ameliorate this? My original input data are Read10x style folders containing barcodes, features and matrix files, and I have them in V5 seurat objects currently. Thanks!
Hi, can you check if the rownames of your metadata match the column names of your object? i.e., all(rownames(object@meta.data) == colnames(object)) if your object is called object?
It returns TRUE, so I believe they do match. I am experiencing long-ish run times, but they were much improved by using so[["RNA"]] <- JoinLayers(so[["RNA"]]) instead of so[["RNA"]] <- JoinLayers(so)
Same problem. Even with just 10 samples, the merge() function is taking forever. I've left it running overnight, but there's still no result.
Hi Liu, can you check if the rownames of your metadata match the column names of your object? i.e., all(rownames(object@meta.data) == colnames(object)) if your object is called object? Otherwise, what if you try the suggestion of hgildea above, i.e. object[["RNA"]] <- JoinLayers(object[["RNA"]])? If you are still experiencing very long runtimes, you are welcome to send me your samples (or a subset of your samples) at igrabski[at]nygenome[dot]org and I am happy to take a look.