DC2-analysis
Dask version of Validation Object Table for DC2 Run 2.2i - U/wmwv/dr6 dask refactor
This PR introduces a Dask-based Notebook to do some top-level validation of the DC2 Run2.2i DR6 Object Table.
- It uses Holoviews + Datashader to present interactive plots.
- It details memory management in Dask, and loads and drops data for each section of the Notebook to keep memory usage down.
- It also uses a Linux-specific solution to keep allocated memory down to what's actually needed and used, instead of the default behavior of keeping allocated memory around "just in case".
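For anyone who hasn't seen this trick before, the basic idea looks roughly like the following (a minimal sketch, not the exact Notebook code; glibc reads the threshold from the environment as MALLOC_TRIM_THRESHOLD_, with a trailing underscore, and the 128*1024 value is only illustrative):

```python
# Minimal sketch (illustrative, not the Notebook's exact code): glibc reads the
# trim threshold from the environment of each worker process, so it must be set
# before the Dask workers are spawned.
import os
os.environ["MALLOC_TRIM_THRESHOLD_"] = str(128 * 1024)  # 128 KiB, illustrative value

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4)  # worker processes inherit the parent environment
client = Client(cluster)
```

A LocalCluster is used here only for illustration; the same idea applies to workers started on other nodes, as long as the variable is set in their environment before launch.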
This is based on the existing validation/validate_dc2_run2.2i_object_table.ipynb
Notebook, which uses Pandas DataFrames and is therefore limited to the memory available on a single machine; that is not enough to hold all of the needed data from the DC2 DR6 Object Table.
Dask both allows the use of the full dataset and is actually faster for some operations as it makes better use of the available CPUs.
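As a flavor of the pattern (a rough sketch only; the path and column names here are placeholders, not the actual DC2 file layout):

```python
# Rough sketch: read the Object Table lazily from Parquet, pull in only the
# needed columns, and let Dask run the aggregation in parallel across
# partitions and workers.
import dask.dataframe as dd

df = dd.read_parquet("dc2_object_run2.2i_dr6/", columns=["ra", "dec", "mag_r"])
print(df["mag_r"].describe().compute())  # executes across all available workers
```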
I've tagged @cwwalter @nsevilla @SimonKrughoff for reviewers. I have some more specific questions for each of you to focus on:
- @cwwalter Could you specifically look at:
  A. The memory usage solution with MALLOC_TRIM_THRESHOLD.
  B. Figuring out if/how one can use dask-mpi instead of having the user cut-and-paste commands to set up the workers (a rough sketch of the idea is included after this list).
- @SimonKrughoff I send this to you:
  A. For fun, as thanks for pointing us to Holoviews+Dask in the first place.
  B. As an example of using Dask for next-to-database processing to compute aggregate statistics and visualizations on the Object Table. This is a specific use case that I would like to provide to DESC, and that I would like Rubin to be aware of, even if the implied resources are beyond the baseline.
- @nsevilla I'm not sure how much you've been following the Spark+Dask work.
  A. Does this run for you?
  B. How could the discussion and introduction of Dask be improved?
  C. How might we use such an approach for V&V?
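For reference, the dask-mpi idea in point B above would look roughly like this (an untested sketch, not something I have verified at NERSC):

```python
# Untested sketch of the dask-mpi alternative: run this script under an MPI
# launcher (e.g. `srun -n 8 python launch_dask.py`). dask_mpi.initialize()
# makes rank 0 the scheduler, lets rank 1 keep running this script as the
# client, and turns the remaining ranks into workers -- no cut-and-paste
# worker setup in separate terminals.
from dask_mpi import initialize
from dask.distributed import Client

initialize()
client = Client()  # connects to the scheduler started by initialize()
```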
@yymao @heather999 I of course always welcome your feedback and suggestions. Please feel free to check this out, but no obligation.
@boutigny If you're interested, here's a fuller-fledged Dask example beyond the RA, Dec Notebook you tried out the other month. If you have a chance to run this at IN2P3, I would be interested in your experience. But no obligation.
And if you know more on the specific technical level about the use of the MALLOC_TRIM_THRESHOLD memory management specification, I would be most happy to learn more.
Pretty complete and thorough intro to using Dask for visualization/V&V, thanks. Comments:
- I was not able to load graphviz using the desc-python-bleed kernel. Also found a problem of missing libthrift.so.0.14.1 when running dd.read_parquet (found a workaround, see below, response A.).
- Mention/link what DPDD is.
- A few typos, not important, the most relevant are:
- Don't understand: "Basically you just machine that can hold the data in memory on the reader."
- Move the sentence "The nodes will be printed out..." to the cell above where it says "Then exit the environment and go to the second Node."
- My SLURM_NODELIST variable shows up empty.
- Memory Management and Expected Messages should have ### headings (one additional level) as I think they are part of "Start our Dask cluster"
- sum(compute) is mentioned at one point. I think it should be "sum(star)" for example
- Will I need different malloc sizes, depending on the tests run? Don't know where 128*1024 comes from.
- Maybe a cell summarizing the important ideas for working with Dask (creating indexes, persistence, client cancelling)? Something along the lines of the sketch after this list.
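To make that last point concrete, I mean something like the following working pattern (a rough sketch, assuming a connected `client` and a Dask DataFrame `df`; names are illustrative):

```python
# Sketch of the per-section working pattern: persist only what the section
# needs, compute the results, then explicitly release the worker memory.
section = df[["ra", "dec", "mag_r"]].persist()  # materialize on the workers
n_rows = len(section)                           # run the section's computations
client.cancel(section)                          # free the memory held on the workers
del section                                     # and drop the local reference
```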
To your concrete questions:
A. I was able to run it once on Friday on desc-python-bleed (great!), but then I ran into the missing libthrift.so.0.14.1 issue mentioned above. I was able to run with desc-python, though.
B. Given that you decided to go all in on introducing many concepts, I think it would be great to point to resources explaining how Dask works at NERSC. The things I have to do according to this tutorial seem a bit like black magic at times; maybe add a couple of sentences on what the commands are doing, if you think it is appropriate here. Also, things like: what happens to the nodes I spin up to run with Dask after I stop using them? Are they still assigned to me the following day, or do I have to allocate them again? I guess these are more 'NERSC' questions that I should know, but still.
C. We could definitely use this as a basis for runs on complete, final or quasi-final data releases. For fast turnaround, I think there are quicker ways that don't involve having the Dask infrastructure (unless the overhead of having Dask and prepared Parquet files is compensated by this quick response).
I have concrete suggestions or comments about the tests themselves, TBD at some other time.
@wmwv I was looking at your question about MALLOC_TRIM_THRESHOLD.
The section at https://distributed.dask.org/en/latest/worker.html#memory-not-released-back-to-the-os is really interesting; this was not in the documentation in January when I was trying to solve my issues. I have been doing some tests with my code again, and it seems I might not be having the issue anymore. I'm still testing, but this could be a change at Cori; it could also be because I wrote the healpixel into a copy of the files so that I didn't have to do in-place repartitions, which was a huge memory and resource issue. Perhaps the problem was coming from there. Anyway, I will find out.
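In case it helps, what I did is roughly the following (a sketch with illustrative paths and names, not my exact code):

```python
# Sketch (illustrative names): after adding a healpixel column, do the
# expensive shuffle once and write the result back out, so later sessions
# read files already partitioned by healpixel instead of repartitioning
# in memory every time.
import dask.dataframe as dd

df = dd.read_parquet("catalog_with_healpixel/")  # copy that already has the column
df = df.set_index("healpixel")                   # expensive shuffle, done only once
df.to_parquet("catalog_by_healpixel/")           # later runs read this directly
```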
But the main thing I wanted to ask is whether you have any evidence that you really need this. I had never had a problem in the past, and only hit this leak issue when calculating 2pt-functions on the entire skysim 5000 data set. I never saw a possible need for it before.
Are you just doing this out of safety, or because you saw an issue? We should avoid it if possible since it is a bit confusing and it will degrade performance.
Sadly
> and it seems I might not be having the issue anymore
is not true. When I bumped the sizes up, I started seeing the same thing again. Later versions of Dask's dashboard now do a better job of showing how much unmanaged memory you have. I can see I have a lot, but I tried the "one time debugging" fix in the MALLOC_TRIM_THRESHOLD section and it didn't clear most of it. So I have isolated the problem more, but that doesn't seem to be it.
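For context, the "one time debugging" fix I tried is, if I understand it correctly, along the lines of the manual trim described on the Dask page linked above:

```python
# One-off manual trim of freed memory back to the OS (as described in the
# Dask documentation linked above); run on every worker via the client.
import ctypes

def trim_memory() -> int:
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

client.run(trim_memory)  # assumes a connected `client`
```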
I'm also puzzling a little over how we would get that env variable set on the worker nodes on the other machines. I'll keep working on it.
> But the main thing I wanted to ask is whether you have any evidence that you really need this. Are you just doing this out of safety, or because you saw an issue?
Yes, my testing explicitly showed that without this set, I hit memory limits and workers were killed very often (specifically at certain steps that generated large temporary arrays). With this set, memory limits were not hit, reported memory usage was 3-4x lower, and the Notebook ran fine.
> We should avoid it if possible since it is a bit confusing
It is deeply confusing and I really hate having to do it. As @nsevilla says, this looks like deep black magic.
> and it will degrade performance.
I'm not deeply concerned about this. In principle, yes, but in practice I don't think the breaks to the kernel to release memory are impacting performance that much. I don't see any evidence of it in the speed tests with and without.
> I'm also puzzling a little over how we would get that env variable set on the worker nodes on the other machines. I'll keep working on it.
Thank you. I would be most grateful if you can figure out how to run all of this through more civilized options. I would really like to eliminate that ridiculously awkward cell with the various parts to cut and paste in different terminals and different loaded environments.
OK good news is I think I see how we can easily set this in the wrapper setup script. We are already setting some other variables that are set on each worker.
Still struggling with my problem, though.