cryodrgn icon indicating copy to clipboard operation
cryodrgn copied to clipboard

`downsample` can't process `star` file with particles in multiple folders.

Open sholalkere opened this issue 1 year ago • 4 comments

Describe the bug If a star file has particles in multiple folders (e.g. EMPIAR 10028) then StarfileSource is unable to load the data. This is due to appending only the particle paths' basename to the datadir. In the case of EMPIAR 10028 the parent folder name is needed as well as there are duplicate basenames.

To Reproduce Here is the file structure:

.
+-- Particles
|    +-- MRC_0601
|        +-- 001_particles_shiny_nb50_new.mrcs
|    +-- MRC_1901
|        +-- 001_particles_shiny_nb50_new.mrcs
|    +-- shiny_2sets.star

Here is the command: cryodrgn downsample Particles/shiny_2sets.star -D 256 -o particles.256.mrcs

Expected behavior The command is expected to not error. Instead there is an (expected) FileNotFoundError.

Additional context It is impossible to use --datadir to fix this as suggested in the README. The issue arises from the fact that for StarfileSource the _MRCDataFrameSource constructor is called with a datadir argument (even if none is provided through the command line) which always triggers this branch. A hacky way to fix this is to set up the file structure as shown above and uncomment the aforementioned branch.

sholalkere avatar Jul 01 '24 19:07 sholalkere

Hey, thanks for the heads up, we are hoping to refactor this part of the code to make it easier to support .star files with multiple folders some time soon!

In this case, without the --datadir flag, datadir will be set to Particles/, since in StarfileSource.__init__() we have:

if not datadir:
    datadir = os.path.dirname(filepath)

and filepath in this case is Particles/shiny_2sets.star. This conflicts with the parent class' _MRCDataFrameSource.__init__(), where letting datadir=None allows for use of basename() to be avoided, as mentioned above, thus allowing file names with different folder names to be used:

if datadir:
    self.df["__mrc_filepath"] = self.df["__mrc_filename"].apply(
        lambda filename: os.path.join(datadir, os.path.basename(filename))
        )
else:
    self.df["__mrc_filepath"] = self.df["__mrc_filename"]

So a somewhat less hacky fix might be to add better logic in StarfileSource.__init__() when setting the value of datadir — I will try to look into this for v3.4.0!

michal-g avatar Jul 01 '24 20:07 michal-g

Were you able to get it to work using --datadir? IIRC, the behavior should be to append the relative path in the .star file to the provided --datadir, and if that doesn't work, then append just the filename to --datadir.

We may have changed this behavior after refactoring to ImageSource, but I recall fixing this.

Alternatively, assuming your .mrcs filename are all unique, you can symlink them all to a new directory that you can then provide to --datadir, e.g.:

mkdir new_datadir
cd new_datadir
for i in ../path/to/MRC_0601/*; do ln -s $i .; done
for i in ../path/to/MRC_1901/*; do ln -s $i .; done

zhonge avatar Jul 19 '24 21:07 zhonge

I wasn't able to get it to work without modifying the code. I don't think it appends the starfile path to the datadir (see 1 and 2). Since StarfileSource is always initialized with a non-empty datadir, the constructor for _MRCDataFrameSource always appends the os.basename of the particle paths, perhaps it was intended to append the full path instead? Doing this fixes the issue.

That solution seems reasonable, however the 10028 starfile provided in the cryodrgn_empiar repo doesn't have unique filenames.

sholalkere avatar Jul 19 '24 22:07 sholalkere

For the upcoming v3.4.0 release, I have ended up doing a small refactor of the ImageSource class hierarchy — and especially the classes that handle .star files — in order to resolve issues like this one and also to simplify and consolidate the code used to parse and operate on .star input.

I have tried an early version of the v3.4.0 release on the same 80S case you described above. When running using v3.3.3, it produces the following error as expected, with and without --datadir=empiar_10028_80S/360:

FileNotFoundError: [Errno 2] No such file or directory: 'empiar_10028_80S/360/001_particles_shiny_nb50_new.mrcs'

However, now with v3.4.0, we can use shiny_2sets.star as input without any problems with and without --datadir. We can also still use --datadir to specify alternate input directories in the case of e.g. downsampling (--datadir=empiar_10028_80S/256), but the default behaviour when --datadir is not given remains treating the directory in which the input .star file is located as the --datadir.

In the codebase, we have consolidated the --datadir parsing logic into MRCDataFrameSource.parse_filename() which is used by __init__(), thus making this logic also accessible by other parts of the code. We only use basename in the specific circumstance where --datadir is given, and the file names are absolute but don't match --datadir:

def StarfileSource.__init__():
    ...
    self.df["__mrc_filepath"] = self.df["__mrc_filename"].apply(self.parse_filename)
...
def parse_filename(self, filename: str) -> str:
    newname = (
        os.path.abspath(filename) if os.path.isabs(filename) else str(filename)
    )
    if self.datadir is not None:
        if os.path.isabs(newname):
            if os.path.commonprefix([newname, self.datadir]) != self.datadir:
                newname = os.path.join(self.datadir, os.path.basename(newname))
        else:
            newname = os.path.join(self.datadir, newname)

    return newname

This in some part recapitulates (and now replaces) the logic in the old starfile.prefix_paths() method, which was used in some older versions of cryoDRGN but not in v3.3.3:

def prefix_paths(mrcs: List, datadir: str):
    mrcs1 = ["{}/{}".format(datadir, os.path.basename(x)) for x in mrcs]
    mrcs2 = ["{}/{}".format(datadir, x) for x in mrcs]
    try:
        for path in set(mrcs1):
            assert os.path.exists(path)
        mrcs = mrcs1
    except AssertionError:
        for path in set(mrcs2):
            assert os.path.exists(path), f"{path} not found"
        mrcs = mrcs2
    return mrcs

I would much rather explicitly write out the cases where we want to use basename, rather than trying basename and then using the full filename which may lead unexpected behaviour — but if we decide to re-employ this method at some point in the future, it should be easier to do now by incorporating it into parse_filename.

You can access this early version (v3.4.0-a6 and above) using pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ 'cryodrgn<=3.4.0' --pre*. I have also created a write-up explaining the motivation and implementation details of this refactoring.

Please let us know if this works for you, and if you spot any other problems! I may still make some minor tweaks to this code over the next few days as I finalize testing for the v3.4.0 release, so stay tuned!

* note the use of single-quotes when specifying the cryoDRGN version, which is necessary in most bash environments!

michal-g avatar Aug 15 '24 17:08 michal-g