
Support .cs file writing for export to cryoSPARC

Open zhonge opened this issue 2 years ago • 10 comments

We should have a tool cryodrgn_utils write_cs to streamline re-importing particles to cryoSPARC.

If the input to cryoDRGN originated from cryoSPARC, this tool could keep certain (all?) fields like the uid from a reference .cs file. Then, the reimport to cryoSPARC wouldn't define an entirely new dataset in their database.

I'm not 100% sure what we would need to implement yet, but maybe we can refine the idea in this thread.

zhonge avatar Sep 29 '22 14:09 zhonge

Related to issues #72, #101, #148.

zhonge avatar Sep 29 '22 14:09 zhonge

One additional complexity is that particles are typically downsampled before training cryoDRGN, so the question is whether the .cs file should point to the (new) downsampled particle stack or refer to the original extracted particles (more error-prone).

In the latter case, the information that cryoDRGN provides is an index filtering. So maybe it makes sense to have a cryodrgn_utils filter_cs tool instead.

zhonge avatar Sep 29 '22 14:09 zhonge

Being able to send a selection of particles back to cryoSPARC would be so useful!

And I think it is best as only a selection, pointing to the original particles in the cryoSPARC project (not re-importing the downsampled particles), because the typical use case is to refine a subset of particles to high resolution, using the particles at their original pixel size.

Guillawme avatar Sep 30 '22 08:09 Guillawme

Related to issues https://github.com/zhonge/cryodrgn/issues/72, https://github.com/zhonge/cryodrgn/issues/101, https://github.com/zhonge/cryodrgn/issues/148.

Also #109

Guillawme avatar Sep 30 '22 08:09 Guillawme

Just recapping some salient points from discussions with @zhonge about this:

  • The write_star command currently takes in a .mrcs/.txt file, a required ctf .pkl, an optional index .pkl, and an optional poses .pkl
  • The command will be enhanced to take in a .star input in addition to .mrcs/.txt
  • The --ref-star, --keep-micrograph, and --copy-header flags will go away.
  • If a .star is provided as input
    • All inputs other than an optional index .pkl will be ignored (and checked to make sure they are not specified).
    • All fields from the input .star will be carried over to the output .star file (including micrograph attributes, the original image name as _rlnImageName, ctf/pose information, anything else). Essentially the output .star will have exactly the same columns as in the input .star.
    • If an index .pkl is provided, the output .star will be filtered to have only those rows (row numbers are assumed 0-indexed).
  • If a .mrcs/.txt file is provided as input, the behavior will remain unchanged.

A corresponding write_cs command will be implemented exactly as outlined above, except that it will take in either an input .cs file or an input .mrcs/.txt file. Its behavior will mirror that of write_star, but for .cs files.
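To make the filtering rule concrete, here is a minimal sketch of "keep every column, keep only the rows named in the index .pkl, rows are 0-indexed". It is not the actual write_star/write_cs code: the helper names `load_index` and `filter_rows` are made up for illustration, and plain Python dicts stand in for the rows of a .star or .cs table.

```python
import pickle
import tempfile
from pathlib import Path


def load_index(path):
    """Load a pickled collection of 0-indexed row numbers (hypothetical helper)."""
    with open(path, "rb") as f:
        return [int(i) for i in pickle.load(f)]


def filter_rows(rows, ind=None):
    """Return all rows if no index is given, else only the selected rows.

    Rows are passed through untouched, so every column of the input survives.
    """
    if ind is None:
        return list(rows)
    n = len(rows)
    bad = [i for i in ind if not 0 <= i < n]
    if bad:
        raise IndexError(f"index out of range for {n} rows: {bad[:5]}")
    return [rows[i] for i in ind]


if __name__ == "__main__":
    # Toy table: the 'uid' column stands in for whatever fields the input has.
    rows = [{"uid": 101}, {"uid": 102}, {"uid": 103}, {"uid": 104}]
    with tempfile.TemporaryDirectory() as tmp:
        ind_path = Path(tmp) / "ind_keep.pkl"
        with open(ind_path, "wb") as f:
            pickle.dump([0, 2], f)
        kept = filter_rows(rows, load_index(ind_path))
    print([r["uid"] for r in kept])
```

Rejecting out-of-range indices up front is a design choice in this sketch. It catches the case where an index .pkl made for one particle stack is applied to a different one.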

A couple of additional points I'd like to propose here:

  • The cryodrgn filter_star command will still be supported for existing users, though it will be refactored to internally use write_star. A DeprecationWarning will be emitted on its use and it may become unsupported at some future point (write_star can provide filtering as outlined above).
  • A cryodrgn filter_cs command will not be implemented (cryodrgn write_cs can provide filtering as outlined above).
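The deprecation path for filter_star could look roughly like the sketch below. Both functions are hypothetical stand-ins, not the real cryoDRGN entry points. The point is only the pattern: the old command emits a DeprecationWarning and then delegates to the new one.

```python
import warnings


def write_star(input_star, output_star, ind=None):
    """Hypothetical stand-in for the refactored write_star entry point."""
    return {"input": input_star, "output": output_star, "ind": ind}


def filter_star(input_star, output_star, ind):
    """Deprecated wrapper: warn the caller, then delegate to write_star."""
    warnings.warn(
        "filter_star is deprecated; use write_star with an index .pkl instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return write_star(input_star, output_star, ind=ind)
```

Python hides DeprecationWarning by default outside of `__main__` and test runners, so a command-line tool may want to log the message as well if users are meant to see it.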

vineetbansal avatar Oct 10 '22 21:10 vineetbansal

Seems like a great plan!

Sorry to bring this up again (https://github.com/zhonge/cryodrgn/pull/70#issuecomment-941061328), maybe you already discussed it and chose to not use an external library, but just in case you didn't: check out the starfile library, its goal is compatibility with star files from RELION, and it might simplify all your handling of star files (it turns star files into pandas dataframes and vice versa).

Guillawme avatar Oct 11 '22 07:10 Guillawme

Hi @Guillawme - I agree on using the starfile library. However, I'd like to handle that as a separate issue so the migration can be independent of anything we do here. I'll create an issue on that and hopefully we can get it done quickly.

vineetbansal avatar Oct 11 '22 13:10 vineetbansal

I think write_cs will also need to write a .csg file, right? In order to be able to import into cryoSPARC using "Import Result Group"?

e.g. something like this - simple metadata file (see here for details):

group:
  description: A stack of imported particles. May or may not contain data, ctfs, pick
    locations, etc.
  name: imported_particles
  title: Imported particles
  type: particle
results:
  blob:
    metafile: '>J4539_imported_particles_exported.cs'
    num_items: 1616
    type: particle.blob
  ctf:
    metafile: '>J4539_imported_particles_exported.cs'
    num_items: 1616
    type: particle.ctf
version: v4.0.1

olibclarke avatar Oct 27 '22 19:10 olibclarke

Here's how to reimport a particle stack filtered by cryoDRGN back into cryoSPARC, while pointing to the original particles in the cryoSPARC project.

  1. Export the original cryoSPARC particles from the associated Job so that there's a single .cs and .csg file describing the particle stack (i.e. no PXXX_JYYY_passthrough_particles.cs file). You can do this with the "Export" button in the Outputs tab.


  2. You'll find the .cs and .csg files in the exports subdirectory of your project directory: /path/to/project/directory/PXXX/exports/groups/JYYY_particles

  3. Filter the .cs file with the index selection .pkl file using cryodrgn_utils write_cs. For example, here is the command to filter J929_particles_exported.cs by a selection saved in ind_keep.214511_particles.pkl and save out a new J929_particles_filtered.cs file:

(cryodrgn) $ cryodrgn_utils write_cs J929_particles_exported.cs --ind ind_keep.214511_particles.pkl -o J929_particles_filtered.cs
  4. Make a copy of the .csg text file and replace the metafile field with the new .cs filename and the num_items field with the new number of particles. Here's a comparison of the before and after:
(cryodrgn) [Sat Mar 11 23:50 J929_particles] sdiff J929_particles_exported.csg J929_particles_filtered.csg
created: 2023-03-12 03:52:35.411011				created: 2023-03-12 03:52:35.411011
group:								group:
  description: All particles that were processed, including a	  description: All particles that were processed, including a
  name: particles						  name: particles
  title: All particles						  title: All particles
  type: particle						  type: particle
results:							results:
  alignments2D:							  alignments2D:
    metafile: '>J929_particles_exported.cs'		      |	    metafile: '>J929_particles_filtered.cs'
    num_items: 286801					      |	    num_items: 214511
    type: particle.alignments2D					    type: particle.alignments2D
  alignments3D:							  alignments3D:
    metafile: '>J929_particles_exported.cs'		      |	    metafile: '>J929_particles_filtered.cs'
    num_items: 286801					      |	    num_items: 214511
    type: particle.alignments3D					    type: particle.alignments3D
  blob:								  blob:
    metafile: '>J929_particles_exported.cs'		      |	    metafile: '>J929_particles_filtered.cs'
    num_items: 286801					      |	    num_items: 214511
    type: particle.blob						    type: particle.blob
  ctf:								  ctf:
    metafile: '>J929_particles_exported.cs'		      |	    metafile: '>J929_particles_filtered.cs'
    num_items: 286801					      |	    num_items: 214511
    type: particle.ctf						    type: particle.ctf
  location:							  location:
    metafile: '>J929_particles_exported.cs'		      |	    metafile: '>J929_particles_filtered.cs'
    num_items: 286801					      |	    num_items: 214511
    type: particle.location					    type: particle.location
  pick_stats:							  pick_stats:
    metafile: '>J929_particles_exported.cs'		      |	    metafile: '>J929_particles_filtered.cs'
    num_items: 286801					      |	    num_items: 214511
    type: particle.pick_stats					    type: particle.pick_stats
version: v4.1.2							version: v4.1.2
  5. In cryoSPARC, use the "Import Results Group" job type and reimport the new .csg file. :tada:

We can probably have cryodrgn_utils write_cs write out the .csg file as well, as @olibclarke suggested, so that Step 4 can be skipped. It may be worth looking at the new cryosparc-tools API to see if there's a better way to write out the .csg file.
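Until that exists, the manual .csg edit can be scripted. The sketch below does a plain text substitution with the standard library. The function name `retarget_csg` is made up. It assumes every metafile entry should point at the same new .cs file, which holds for a single-file export like the one in the sdiff above.

```python
import re


def retarget_csg(csg_text, new_cs_name, num_items):
    """Point every metafile entry at new_cs_name and update every num_items."""
    # Keep the quoting and the leading '>' exactly as they appear in the export.
    text = re.sub(
        r"^(\s*metafile:\s*)'>[^']*'",
        lambda m: f"{m.group(1)}'>{new_cs_name}'",
        csg_text,
        flags=re.MULTILINE,
    )
    # A callable replacement avoids ambiguity between the group reference
    # and the digits of the new particle count.
    text = re.sub(
        r"^(\s*num_items:\s*)\d+",
        lambda m: f"{m.group(1)}{num_items}",
        text,
        flags=re.MULTILINE,
    )
    return text
```

For example, reading J929_particles_exported.csg, calling `retarget_csg(text, "J929_particles_filtered.cs", 214511)`, and writing the result to J929_particles_filtered.csg should reproduce the right-hand column of the sdiff. All other lines are left untouched.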

zhonge avatar Mar 12 '23 05:03 zhonge

I just tried this and it seems it is going to work! :tada: (The file was generated, but I will know for sure when I'm able to copy these newly generated .csg and .cs files to the correct location; on our cluster we don't have write permission to the cryosparc project directory, but cryosparc will only import result groups from there, so I need somebody else to copy the files for me or change permissions.)

It would be great for usability if this tool could work this way (merging steps 3 and 4 above, as you say):

  • we point it to the original .csg file (in the cryosparc exports directory) and the ind.pkl file (from the cryoDRGN job), and provide a file name for the .csg file to be newly created
  • the tool then automatically
    • finds the original .cs file to filter (the one that the original .csg file points to)
    • saves the filtered .cs file in the current directory with the same base name as the newly created .csg file
    • makes a copy of the original .csg file to the file name provided
    • and finally edits this newly created .csg file to point to the filtered .cs file and contain the correct number of particles.
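A rough sketch of the first two automatic steps in that list (find the original .cs from the .csg, then save the filtered .cs under the new base name) is below. It is only an illustration under stated assumptions: the function names and the `out_dir` parameter are made up, the .cs file is treated as a NumPy structured array readable with `np.load`, the index .pkl is treated as a pickled array of 0-indexed row numbers, and the leading '>' in the metafile field is taken to mean "relative to the .csg file's directory".

```python
import pickle
import re
from pathlib import Path

import numpy as np


def find_metafile(csg_path):
    """Return the single .cs path referenced by a .csg file."""
    csg_path = Path(csg_path)
    names = set(re.findall(r"metafile:\s*'>([^']+)'", csg_path.read_text()))
    if len(names) != 1:
        raise ValueError(f"expected exactly one metafile, found {sorted(names)}")
    # Assumption: '>' marks a path relative to the .csg file's directory.
    return csg_path.parent / names.pop()


def filter_cs(orig_csg, ind_pkl, new_csg, out_dir=None):
    """Filter the .cs behind orig_csg; save it with the base name of new_csg."""
    particles = np.load(find_metafile(orig_csg))
    with open(ind_pkl, "rb") as f:
        ind = np.asarray(pickle.load(f), dtype=int)
    kept = particles[ind]  # all fields are kept, only rows are selected
    out_dir = Path.cwd() if out_dir is None else Path(out_dir)
    new_cs = out_dir / (Path(new_csg).stem + ".cs")
    # Pass a file handle so np.save does not append a ".npy" suffix.
    with open(new_cs, "wb") as f:
        np.save(f, kept)
    return new_cs, len(kept)
```

The returned file name and particle count are exactly the two values that then need to go into the metafile and num_items fields of the copied .csg.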

Guillawme avatar Sep 14 '23 14:09 Guillawme