Downsample with .cs particles as source sometimes reads more particles than parse_pose and parse_ctf

Open epkumpu opened this issue 2 years ago • 5 comments

I used a particles.cs file (with the --datadir argument to specify the folder) from an exported job in CS as input for downsample, and it loads all particles nicely. In fact, it loads ALL particles in the extract folders located within the exported CS job, even though some of those particles are excluded from the actual particle stack in CS. parse_pose and parse_ctf, on the other hand, read the correct number of particles from the .cs file, so when I try to run training I get an error because of the mismatch in particle numbers. Would it be possible to read in only the particles that are part of the final stack, and not all particles located within the extract folder?

I circumvented this for now by re-extracting the particles, which worked just fine but takes up unnecessary space.

epkumpu avatar Aug 31 '22 11:08 epkumpu

Thanks for reporting. Are you providing a different .cs file to cryodrgn downsample than to the cryodrgn parse_* commands?

zhonge avatar Sep 01 '22 14:09 zhonge

No, it is the same file.

epkumpu avatar Sep 02 '22 07:09 epkumpu

Could you email the .cs file to me ([email protected]) and Vineet Bansal @vineetbansal ([email protected])? We will take a look. Thanks!

zhonge avatar Sep 03 '22 12:09 zhonge

Hi @epkumpu - we're not seeing anything obvious in the cryodrgn codebase that would cause this behavior. If you can send us your particles.cs file where you're seeing a different number of processed particles in downsample vs parse_pose, it will be immensely helpful for us to squash this bug. Thanks!

vineetbansal avatar Sep 09 '22 22:09 vineetbansal

Hello @epkumpu - thanks for the sample .cs data. We have a suspicion about what the problem might be. To be sure, could you check whether the directory you're specifying as --datadir has your master mrc files directly inside it, without any intervening folders, for example:

  • <datadir>/011268377466662442732_FoilHole_25024273_Data_25002032_25002034_20220227_031106_fractions_patch_aligned_doseweighted_particles.mrc

in addition to where you expect them to be, i.e.:

  • <datadir>/J777/extract/011268377466662442732_FoilHole_25024273_Data_25002032_25002034_20220227_031106_fractions_patch_aligned_doseweighted_particles.mrc

If that is the case, can I ask you to move those files out of --datadir and redo the downsample step and see if cryoDRGN picks the correct number of particles? Thanks!

vineetbansal avatar Sep 19 '22 13:09 vineetbansal
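
The check requested in the last comment (are the same .mrc files present both directly inside --datadir and in a nested folder such as J777/extract?) can be scripted. The following is a minimal sketch using only the Python standard library; it is not part of cryoDRGN, and the function name `find_duplicate_mrcs` is hypothetical. It only reports duplicated file names; whether such duplicates actually cause the particle-count mismatch is the suspicion raised in this thread, not something the script confirms.

```python
from pathlib import Path


def find_duplicate_mrcs(datadir):
    """Map each .mrc file name found directly in `datadir` to the nested
    copies (paths relative to `datadir`) that share the same file name.

    Hypothetical helper for illustration; not a cryoDRGN function.
    """
    root = Path(datadir)

    # File names of .mrc files sitting directly inside datadir.
    top_level = {p.name for p in root.glob("*.mrc") if p.is_file()}

    duplicates = {}
    # rglob also yields the top-level files, so skip those explicitly.
    for p in sorted(root.rglob("*.mrc")):
        if p.parent != root and p.name in top_level:
            duplicates.setdefault(p.name, []).append(p.relative_to(root))
    return duplicates
```

Calling `find_duplicate_mrcs("<datadir>")` with the folder passed as --datadir returns an empty dict when no top-level .mrc file has a nested copy. Any entries it returns are the files to move out of --datadir before redoing the downsample step.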