Filtered particles lead to many repetitive reads of files

Open cphyc opened this issue 5 months ago • 6 comments

Take the following example:

import yt

@yt.particle_filter(filtered_type="io")
def DM_lores(pfilter, data):
    return data[pfilter.filtered_type, "particle_mass"].to("code_mass").d > 3.1e-6

ds = yt.load_sample("output_00080")
ds.add_particle_filter("DM_lores")
yt.set_log_level(10)  # to see the IO footprint
ds.r["DM_lores", "particle_position"]

This will lead to each file being read twice, once to filter on particle mass, and a second time to obtain the positions. This is suboptimal, since all the reading could be done in one pass.

Running on `main` with the following diff:
diff --git a/yt/frontends/ramses/io.py b/yt/frontends/ramses/io.py
index 6f241631a..0ba375075 100644
--- a/yt/frontends/ramses/io.py
+++ b/yt/frontends/ramses/io.py
@@ -91,6 +91,8 @@ def _ramses_particle_binary_file_handler(particle_handler, subset, fields, count
     ds = subset.domain.ds
     foffsets = particle_handler.field_offsets
     fname = particle_handler.fname
+    fields = list(fields)
+    mylog.debug("Reading %s: %s", fname, fields)
     data_types = particle_handler.field_types
     with FortranFile(fname) as fd:
         # We do *all* conversion into boxlen here.

Logs

yt : [DEBUG    ] 2025-08-04 11:21:25,431 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00001: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,437 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00002: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,439 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00003: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,440 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00004: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,441 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00005: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,447 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00006: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,447 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00007: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,447 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00008: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,447 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00009: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,448 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00010: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,448 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00011: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,448 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00012: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,449 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00013: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,451 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00014: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,452 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00015: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,453 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00016: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,462 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00001: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,466 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00002: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,467 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00003: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,469 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00004: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,469 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00005: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,476 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00006: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,476 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00007: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,477 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00008: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,477 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00009: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,477 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00010: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,477 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00011: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,478 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00012: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,479 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00013: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,481 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00014: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,481 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00015: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
yt : [DEBUG    ] 2025-08-04 11:21:25,482 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00016: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]

This gets even worse when a filter is built on top of another filtered particle type, for example:

import yt

@yt.particle_filter(filtered_type="io")
def DM_lores(pfilter, data):
    return data[pfilter.filtered_type, "particle_mass"].to("code_mass").d > 3.1e-6

@yt.particle_filter(filtered_type="DM_lores")
def DM_lores_some_ids(pfilter, data):
    return data[pfilter.filtered_type, "particle_identity"] < 10000

ds = yt.load_sample("output_00080")
ds.add_particle_filter("DM_lores")
ds.add_particle_filter("DM_lores_some_ids")

yt.set_log_level(10)  # to see the IO footprint
ds.r["DM_lores_some_ids", "particle_position"]

The files will now be read 4 times!

yt : [DEBUG    ] 2025-08-04 11:26:15,891 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00001: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
[...]
yt : [DEBUG    ] 2025-08-04 11:26:16,065 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00001: [('io', 'particle_identity'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
[...]
yt : [DEBUG    ] 2025-08-04 11:26:16,099 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00001: [('io', 'particle_mass'), ('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
[...]
yt : [DEBUG    ] 2025-08-04 11:26:16,123 Reading /home/XXX/Documents/prog/yt-data/output_00080/part_00080.out00001: [('io', 'particle_position_x'), ('io', 'particle_position_y'), ('io', 'particle_position_z')]
[...]

In addition to repeatedly reading the same parts of the files (positions are read every time), this also isn't cache-friendly, since we read pieces of each file instead of the whole file. Here, we read x y z m, then x y z id, then x y z m again, then x y z to finish. It would be far better to read x y z id m once and then apply the filter(s).
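To make the single-pass idea concrete, here is a minimal sketch that does not use yt at all. `FAKE_FILE`, `read_fields` and `read_filtered` are hypothetical names invented for this illustration, and the reader only simulates file access by logging each call. The point is just that collecting the union of fields first means one read, after which the masks are applied in order.

```python
import numpy as np

# Hypothetical stand-in for one particle file: field name -> array.
FAKE_FILE = {
    "particle_mass": np.array([1e-6, 4e-6, 5e-6, 2e-6]),
    "particle_identity": np.array([5, 20000, 7, 9]),
    "particle_position_x": np.array([0.1, 0.2, 0.3, 0.4]),
}

read_log = []  # records every simulated file access


def read_fields(fields):
    """Simulated reader: one call corresponds to one pass over the file."""
    read_log.append(tuple(sorted(fields)))
    return {name: FAKE_FILE[name] for name in fields}


def read_filtered(requested, filters):
    """Read requested fields plus all filter inputs once, then mask in order.

    filters is a list of (required_field, predicate) pairs, innermost
    (closest to the base particle type) first.
    """
    needed = set(requested) | {field for field, _ in filters}
    data = read_fields(needed)
    for field, predicate in filters:
        mask = predicate(data[field])
        # Apply the mask to every field so later filters see filtered data.
        data = {name: values[mask] for name, values in data.items()}
    return {name: data[name] for name in requested}


filters = [
    ("particle_mass", lambda m: m > 3.1e-6),      # plays the role of DM_lores
    ("particle_identity", lambda i: i < 10000),   # plays the role of DM_lores_some_ids
]
result = read_filtered(["particle_position_x"], filters)
```

With this layout `read_log` holds a single entry covering mass, identity and position, instead of one entry per filter level plus one for the final field.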

cphyc avatar Aug 04 '25 09:08 cphyc

I wonder if we could improve this by making sure that a ParticleFilter records all of its base requirements rather than simply its direct requirements. So, at the time of adding, we recursively evaluate requirements and store the whole list. That might cause them to all be read in one go.
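As a rough illustration of what "recording all base requirements" could look like, here is a small standalone sketch. It is not yt's actual data model: the `FILTERS` registry and the `base_requirements` helper are hypothetical, and only the filter and field names are taken from the example in this issue.

```python
# Hypothetical registry: filter name -> (filtered_type, directly required fields).
FILTERS = {
    "DM_lores": ("io", ["particle_mass"]),
    "DM_lores_some_ids": ("DM_lores", ["particle_identity"]),
}


def base_requirements(ptype, registry):
    """Return (base_type, fields) for ptype, following the chain of filters.

    fields lists everything needed from the base particle type to evaluate
    ptype, including the requirements of the filters it is built on.
    """
    if ptype not in registry:
        # Not a filter: this is an on-disk particle type with no requirements.
        return ptype, []
    parent, requires = registry[ptype]
    base, inherited = base_requirements(parent, registry)
    merged = inherited + [name for name in requires if name not in inherited]
    return base, merged


base, fields = base_requirements("DM_lores_some_ids", FILTERS)
```

Here `base` is `"io"` and `fields` contains both `particle_mass` and `particle_identity`, so a reader given this list could fetch everything the nested filter needs in one request.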

brittonsmith avatar Aug 04 '25 12:08 brittonsmith

Someone should correct me, but it looks to me as if a particle filter's required fields are never considered directly. Instead, the work is being done indirectly in ParticleFilter.apply. It reads fields as they are needed, but never takes stock of everything it is going to need and asks for all of it at once.

brittonsmith avatar Aug 04 '25 13:08 brittonsmith

Ok, last contribution from me for today. The application of particle filters occurs in a completely separate place from field reading, namely in yt/data_objects/selection_objects/data_selection_objects.YTSelectionContainer.get_data. More specifically, filtering happens before the dependencies of the requested field are evaluated. I think that's where the work would have to happen.

brittonsmith avatar Aug 04 '25 13:08 brittonsmith

I am going through the code and trying to figure out why it's done this way, and if it was the result of beating my head at a problem and giving up or if it was an oversight/failure on my part.

I think the problem may be related to the fact that we allow the fields to filter based on the fields we return, rather than explicitly not including those, but I am not entirely sure.

matthewturk avatar Aug 04 '25 14:08 matthewturk

OK, I've convinced myself that `requires` is specifically designed to avoid overreading and this is indeed an error, not the result of giving up.

matthewturk avatar Aug 04 '25 14:08 matthewturk

I'm not sure I understood your last message - are you saying that the reason we have multiple reads is because of a bug?

cphyc avatar Aug 05 '25 12:08 cphyc