BlendingToolKit
BlendingToolKit copied to clipboard
General refactoring issue
In a lot of ways the BTK library has matured enough to be released, and the API for the user is very clear. We have very comprehensive documentations and tutorials which is awesome. However, I still feel like there are some shortcomings in the code structure that make it hard to maintain, extend, and invite new contributors to the library to work on a subset of the code. This will become more important as both the CL and PZ groups are looking to incorporate blending effects into their codes.
I have been thinking about this a lot in the past few weeks and I wanted to write some of my thoughts down here in case other people have ideas too. I think we should still write the paper before this and make some 0.9.X.Y
release as well. Our plans should stay the same but it's good to think ahead a little in terms of what could benefit the community the most.
Brainstorm
Modularization
- Currently BTK is designed around a couple of generators
DrawBlendGenerator
,MeasureGenerator
,MetricsGenerator
,BlendGenerator
which are passed one after the other and are coupled together. This makes everything sequential and makes it difficult for the user to extract a particular functionality of BTK. - For example, a user who wants to generate images of galaxy blends need to fetch the sampling function and create a separate BTK catalog class which feels burdensome. The same thing happens with the metrics module which is still hard to use as a standalone as pointed out in #286.
- In general, I think we should avoid using Public classes when possible since that adds a layer of mental burden to the user. Example:
images = get_blend_images(catalog, survey, sample='uniform')
is a lot easier to understand thannext(draw_generator)
. Even if they makes sense to us the user might be confused by them and just want a getter function. (this is jut an example, not immediately suggesting we get rid of the Blend Generators)
Galaxy Model
- Currently our code for drawing single galaxies is together with the code creating the blends. I think we should have a separate module for individual galaxies. Is this possible?
Measurement
- I keep wondering if
measure.py
is necessary are at all or if it just makes the whole pipeline more difficult to maintain. If our simulation has clear outputs in the form of images and a truth catalog, the user should know how to run their detection / deblending algorithms. - We could still have a couple of functions in
measure.py
that run e.g. sextractor and scarlet. But this would really just be barebones implementations to test that our outputs are sensible. I don't think we need a Measure class like we have now.
Galcheat / Surveys
- Galcheat is very well documented and the available surveys and parameters are readily available to users even from the CLI. I still wonder if we should move the ability to customize/create new surveys to galcheat. The inheritance works well for now, but I think it's a bit strange that a user can't use a galcheat survey directly.
Galsim YAML
- As we discussed in the past and talked to Mike Jarvis about, Galsim already has a comprehensive module for reading inputs/outputs (including catalogs) and creating complex simulations. We should take advantage of this module and not reinvent the wheel.
- We should look at how to adapt Bulge+Disk galaxies to a galsim class that takes inputs in this YAML format. It would make it trivial for users to extend or change the simulation without have to add more features to BTK.
- Using the galsim I/O module will also allow us to eliminate the need of a separate Catalog class, since the input catalog must be compatible with the galsim module (we could point users to that documentation instead for how to input a catalog).
- I also heard of GCRCatalogs and I've been curious if we can adapt that interface.
Multiple types of inputs/outputs from batches + multi-resolution
- This probably comes up everytime we do a PR causing a bug or another. As @aboucaud has pointed out, it's hard to maintain a list for MR and a single object for everything else. This branches in the control flow make the code more complicated and hard to read (as well as making docstrings more complicated too!)
- There is a similar problem with batches handling and the ordering of the bands.
Supporting helpful messages for user incorrect input
- This might be fixed if we do the other things but for now I will mention it. The code that raises errors for the user is scattered throughout the code for drawing the images, etc. This makes it hard to read and know when additional error messages are required. We should separate it somehow, perhaps a separate function/class that validates all inputs at once.
Inspiration
- Other pipeline codes for simulations and catalogs have been written in DESC including TXPipe, RAIL, CLMM, etc. we should perhaps learn more about these and see if we can adapt some of their best practices.
Tasks
- [ ] ...
Additional discussion and ideas would be much appreciated 🙏 - tagging some people who might have opinions @aboucaud @thuiop @aguinot @b-biswas @herjy @mpaillassa
Several thoughts :
Modularization
Your last suggestion can probably be done without breaking anything, just adding a layer on top of the DrawBlendsGenerator. The generators are still useful though, if you wish to work with batches. The other suggestions require more thinking.
Galaxy Model
I do not have a strong opinion on this ; this could be easily achieved by having a function corresponding to each generator and simply calling it in the draw_blends module.
Measure
I appreciate the measurement module ; I think it should be pretty useful if you want to compare your algorithm with an already existing one. For instance if someone provides the function as required by BTK, any BTK user should be able to run his algorithm effortlessly.
Galsim YAML
I still feel that it is not really relevant to us and that we are still better off using the python interface directly. It seems more suited for doing large scale simulations instead of on-the-fly stamps like BTK. The GCRcatalog point is interesting (not directly related though ?) ; it would be good to interface BTK with it (probably by having a dedicated BTK Catalog and/or DBGenerator). This would be out of scope of the 1.0 though.
Error messages
There is probably a proper way to do that, yes.
Thanks @thuiop for your comments. I'll think more about this but probably not until later (after the release and we have a paper draft). This was more to brainstorm long-term improvements. Of course, if you would like we can discuss more in the meeting this Wednesday.
I thiink it could be nice to define a clear roadmap for the release/paper. I feel like we are pretty close to it and some of the stuffs you mentioned here are as Thomas said out of the scope for a release in a near future. We can also have a separate roadmap for future plan.
Modularization
It is always nice to have some modularity but as you mentioned, it would be nice to have clear delimitation between modules with a good description of input/output and try to make them as friendly as possible. I think that it could help to know what we want to target exactly and it will be easier to design the different modules. Which might already exist.
Galaxy models
- Wouldn't it be easier to have a main class that can create a single galaxy, which you can call if you don't want blend. And then a child class which call multiple times the single galaxy one to create blends? I think it could be very useful to ave the possibility to make single objects specially for testing purposes.
- A comment I have here, correct me if I am wrong, I don't think that right know you have the possibility to create a model without using an input catalogue right? I think it could be very useful to be able to create simple galaxy models without the need of an external catalogue, specially for testing.
Measure
If you already have a measure module I would keep it. Most of the time when you develop a new approach you are asked to compare it to already existing one. If you have an implementation of scarlet/SExtractor that could help users to make comparison in a controlled environnement. But it is more work to maintain it..
Galcheat
Using galcheat is a good idea and if you go in this direction I would recommend to handle all survey related stuff through galcheat to avoid having similar things in different places.
Galsim YAML
- Regarding the galsim YAML, I don't think you have to interface directly with Galsim. But it could be nice to re-use the parser (if possible) and have a very similar syntax. It could enable some nice functionality. For example if some one want to create his own galaxy model he could directly configure it in section of the config file and you could directly parse the information in a galsim format without having an overhead in BTK.
- Regarding the GCRCatalogue, I think it could be a nice feature. Though, it requires to have the catalogue on your computer (and it is a huge one) or to run it at NERSC or CC where the catalogue is already dowloaded. I think this will mainly concern LSST users.
Additional comments from recent BTK meeting (04/24/22)
We should define an end goal for the future of BTK
- more modular, means its hard to maintain; but should have simple/consistent/straightforward numpy inputs/outputs anyways
- Using galsim parser (not yaml necessarily)
- PIFF uses the galsim parser, see that repo for examples
- take a look at
desc_shear_sims
for galsim catalog - Function that takes in galaxy parametres and return BTK catalog
- add a very basic model w/out catalog with range of fluxes/sizes
- can enable simple test
- gcr catalog can be used but would make the project DESC focused (which is ok )
- easy way to enable DC2 catalog BTK usage in NERSC
Notes from meeting 09/02/22
@aboucaud recently proposed to wrap all the generators into a Pipeline
class which contains different methods for accessing images, measurements, and metrics. This was based on the observation that generators are an implementation detail rather than an actual intuitive element of the UI.
Another suggestion was to add shredder
as another deblender
Most of suggestions have been implemented in #412