TCGA data download

Open LITTLEKKKK opened this issue 3 years ago • 21 comments

When I come to the website, it says: “All slide and diagnostic images from the TCGA program are currently unavailable for download”. Could you share the lung datasets by using a Google Cloud link? : )

LITTLEKKKK avatar Sep 04 '21 08:09 LITTLEKKKK

I believe the Google Drive link is posted in the readme. I have emphasized the link and updated the readme file. Could you check if the link in the section Processing raw WSI data->Download WSIs->From Google Drive works for you?

binli123 avatar Sep 04 '21 22:09 binli123

Hi Bin,

Do you have any advice on how to download the Google Drive folder with the TCGA files from a terminal? I tried using gdown, but it only allows downloading folders with at most 50 files.

Best wishes, George

GeorgeBatch avatar Oct 26 '21 16:10 GeorgeBatch

I have never tried it with a terminal. But I think one of the appropriate ways to download a large number of files is to use the Google Drive desktop app and select the folder to sync to your local device.

binli123 avatar Oct 26 '21 16:10 binli123

Thanks! How large is the Google Drive folder that you provided?

GeorgeBatch avatar Oct 26 '21 16:10 GeorgeBatch

I think it is around 800GB.

binli123 avatar Oct 26 '21 16:10 binli123

Thanks, I'll try using the desktop app, but I do not have 800GB of free disk space on my machine.

GeorgeBatch avatar Oct 26 '21 17:10 GeorgeBatch

You could also just use the cropped patches I uploaded; they are less than 100GB.

binli123 avatar Oct 26 '21 17:10 binli123

Thank you!

GeorgeBatch avatar Oct 26 '21 17:10 GeorgeBatch

Hi Bin,

I am trying to understand which of the files from the Google Drive folder I actually need.

In the TCGA-lung-WSI folder, all the .svs files are enclosed in folders, e.g. the ffa686dc-0f3c-4fb8-af3b-ee82a940752a folder for the ffa686dc-0f3c-4fb8-af3b-ee82a940752a.svs WSI. Each of them also seems to have a corresponding logs folder. Can you please explain what is in there and why it is needed?

A similar thing is true of the TCGA-lung-WSI-corrupt folder, but here each of the WSI subfolders also has an annotations.txt file. Can you please also explain why the corrupted WSIs have annotations, while all the other WSIs don't?

Many thanks, George

GeorgeBatch avatar Oct 28 '21 08:10 GeorgeBatch

Those are just download logs that are automatically generated when you download something from the NCI data portal. A small portion of the WSIs have coarse annotations that come with the slide, and those low-quality ones (also scanned at a lower magnification) just happen to have them. I guess those were uploaded by a specific facility that also annotated the slides.

binli123 avatar Oct 28 '21 14:10 binli123

Makes sense, thank you!

GeorgeBatch avatar Oct 28 '21 14:10 GeorgeBatch

I didn't find the cropped patches in the Google Drive folder. Where is the link? Thanks.

LITTLEKKKK avatar Nov 01 '21 18:11 LITTLEKKKK

https://drive.google.com/file/d/17zCn-WRNzxxxh8kkdBTbDLDZy0XZ3RIu/view?usp=sharing

binli123 avatar Nov 01 '21 18:11 binli123

Thanks a lot. The download of the cropped patches zip file often breaks off and is not stable. Did you ever upload the cropped patches as unzipped files? : (

LITTLEKKKK avatar Nov 03 '21 00:11 LITTLEKKKK
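
For anyone whose download of the zip file keeps breaking off, a quick way to tell whether the archive on disk is complete is to test it before unzipping. The following is only a minimal sketch using Python's standard zipfile module; the file name `cropped-patches.zip` is a placeholder, not the actual name of the archive shared in this thread.

```python
import zipfile
from pathlib import Path


def check_zip(path):
    """Return (ok, detail) for the archive at `path`.

    ok is True only if the file is a readable zip and every member
    passes its CRC check.
    """
    path = Path(path)
    if not path.is_file():
        return False, "file not found"
    if not zipfile.is_zipfile(path):
        # Typical for a download that stopped early: the directory
        # record at the end of the file is missing.
        return False, "not a valid zip (possibly truncated)"
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the name of the
            # first one that fails its CRC check, or None if all pass.
            bad_member = archive.testzip()
    except zipfile.BadZipFile as exc:
        return False, f"corrupt archive: {exc}"
    if bad_member is not None:
        return False, f"first corrupt member: {bad_member}"
    return True, "all members passed the CRC check"


if __name__ == "__main__":
    # Placeholder file name; replace with the path of the downloaded archive.
    print(check_zip("cropped-patches.zip"))
```

Reading every member of a large archive takes a while, but it is usually cheaper than finding out halfway through extraction that the download has to be repeated.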

Also, it looks like the command should include the download subcommand:

  $ cd tcga-download
  $ gdc-client download -m gdc_manifest.2020-09-06-TCGA-LUAD.txt --config config-LUAD.dtt
  $ gdc-client download -m gdc_manifest.2020-09-06-TCGA-LUSC.txt --config config-LUSC.dtt

instead of

  $ cd tcga-download
  $ gdc-client -m gdc_manifest.2020-09-06-TCGA-LUAD.txt --config config-LUAD.dtt
  $ gdc-client -m gdc_manifest.2020-09-06-TCGA-LUSC.txt --config config-LUSC.dtt

GeorgeBatch avatar Nov 03 '21 14:11 GeorgeBatch
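
If the downloads need to be scripted, the corrected commands can be wrapped as follows. This is a rough sketch, not part of the repository: the `download` subcommand, the `-m` and `--config` flags, and the file names are copied from the commands above, while the helper names `build_download_command` and `run_downloads` are made up for illustration. It assumes gdc-client is on the PATH, and newer versions of the client may behave differently.

```python
import shlex
import subprocess


def build_download_command(manifest, config=None, client="gdc-client"):
    """Build the argument list for one manifest download."""
    command = [client, "download", "-m", manifest]
    if config is not None:
        command += ["--config", config]
    return command


def run_downloads(jobs, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for manifest, config in jobs:
        command = build_download_command(manifest, config)
        print(shlex.join(command))
        if not dry_run:
            # check=True raises CalledProcessError if the client fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # File names as quoted in this thread; run from the tcga-download folder.
    run_downloads(
        [
            ("gdc_manifest.2020-09-06-TCGA-LUAD.txt", "config-LUAD.dtt"),
            ("gdc_manifest.2020-09-06-TCGA-LUSC.txt", "config-LUSC.dtt"),
        ],
        dry_run=True,  # set to False to actually start the downloads
    )
```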

They also updated the download client, so I might just remove this part from the readme.

binli123 avatar Nov 03 '21 15:11 binli123

Thanks a lot. The cropped patches zip file is often broken off and not stable. Did you upload unzip files of cropped patches before? : (

Which operating system do you use?

binli123 avatar Nov 03 '21 17:11 binli123

Windows. I use IDM to download the file.

LITTLEKKKK avatar Nov 04 '21 04:11 LITTLEKKKK

TCGA slides are back online, but I needed to generate the manifest files from scratch. I originally wanted to use yours, but some of the file names were not found; maybe they were changed.

Are these TCGA-LUAD (541 slides) and TCGA-LUSC (512 slides) the links you used to get the manifest files?

I ended up there by clicking on "diagnostic slides" from the main links:

  • https://portal.gdc.cancer.gov/projects/TCGA-LUAD
  • https://portal.gdc.cancer.gov/projects/TCGA-LUSC

GeorgeBatch avatar Nov 05 '21 15:11 GeorgeBatch
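
To see exactly which entries differ between an old manifest and a freshly generated one, the two files can be compared by file name. The sketch below assumes the usual GDC manifest layout, a tab-separated file with a header row containing at least `id` and `filename` columns; that layout is an assumption here and should be checked against the actual files. The function names are illustrative only.

```python
import csv
import io


def read_manifest(text):
    """Parse manifest text into a dict mapping filename -> file id."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["filename"]: row["id"] for row in reader}


def compare_manifests(old_text, new_text):
    """Report file names that were dropped, added, or given a new id."""
    old = read_manifest(old_text)
    new = read_manifest(new_text)
    shared = set(old) & set(new)
    return {
        "only_in_old": sorted(set(old) - set(new)),
        "only_in_new": sorted(set(new) - set(old)),
        "id_changed": sorted(name for name in shared if old[name] != new[name]),
    }
```

With the manifests on disk, the texts can be loaded with `pathlib.Path(...).read_text()` and passed to `compare_manifests`. Names listed under `only_in_old` are the ones the portal no longer serves under that file name.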

Is there a Google Drive link for Camelyon 16 cropped patches? Thanks.

I didn't find cropped patches in Google Drive folder. Where is the link? Thanks.

https://drive.google.com/file/d/17zCn-WRNzxxxh8kkdBTbDLDZy0XZ3RIu/view?usp=sharing

LITTLEKKKK avatar Nov 06 '21 13:11 LITTLEKKKK

You could also just use the cropped patches I uploaded, they are less than 100GB

What is the magnification of these patches, 20x or 5x? The images look blurry.

Raymvp avatar Feb 12 '24 15:02 Raymvp