Friendliness of Histogram pooling
Since filling histograms is a common task especially for new developers of ldmx-sw, I think focusing on improving its friendliness is helpful. First and foremost, the error being thrown from get is hard to parse for people new to ldmx-sw and ROOT files, so I'd like to reword this to help more.
https://github.com/LDMX-Software/ldmx-sw/blob/613a9d10b87235eb4ad3a6d3fc8229752692ad98/Framework/src/Framework/Histograms.cxx#L46-L53
Necessity of Singleton Pool
I can't remember why we need a singleton pool. Can we get away with a pool for each processor? This would allow us to avoid adding the processor's name as a prefix to the histogram name (I think) which is another source of confusion/annoyance. The goal here would get to a point where the histogram file could look like
hist.root:
analyzer1/
histogram
analyzer2/
histogram
so both analyzer1 and analyzer2 could share histogram names but the two histograms are distinct both in memory and in the output hist.root file.
Creation Helpers for Categorical Axes
I've written some helpers that create categorical axes that I think we could use within our own DQM analyzers. Maybe they don't get used immediately within the DQM in order to avoid breaking anything, but having them around for writing new categorical histograms may be helpful.
std::tuple<std::size_t,double,double> category_bins(
const std::vector<std::string>& categories,
int offset = 0
) {
std::size_t n_categories = categories.size();
double min = offset - 0.5;
double max = offset + n_categories + 1.5;
return std::make_tuple(n_categories, min, max);
}
void label_axis(TAxis* axis, const std::vector<std::string>& categories) {
for (std::size_t ibin{1}; ibin <= categories.size(); ibin++) {
axis->SetBinLabel(ibin, categories[ibin-1].c_str());
}
}
void create_categorical_histogram(
const std::string& name,
const std::vector<std::string>& categories
) {
auto [nbin, xmin, xmax] = category_bins(categories);
histograms_.create(name, "", nbin, xmin, xmax);
auto h{histograms_.get(name)};
label_axis(h->GetXaxis(), categories);
}
void create_categorical_histogram(
const std::string& name,
const std::vector<std::string>& categories,
const std::string& ylabel, int nybin, double ymin, double ymax
) {
auto [nxbin, xmin, xmax] = category_bins(categories);
histograms_.create(name, "", nxbin, xmin, xmax, ylabel, nybin, ymin, ymax);
label_axis(histograms_.get(name)->GetXaxis(), categories);
}
void create_categorical_histogram(
const std::string& name,
const std::vector<std::string>& xcategories,
const std::vector<std::string>& ycategories
) {
auto [nxbin, xmin, xmax] = category_bins(xcategories);
auto [nybin, ymin, ymax] = category_bins(ycategories);
histograms_.create(name, "", nxbin, xmin, xmax, "", nybin, ymin, ymax);
label_axis(histograms_.get(name)->GetXaxis(), xcategories);
label_axis(histograms_.get(name)->GetYaxis(), ycategories);
}
processor's name as a prefix to the histogram name (I think) which is another source of confusion/annoyance.
I second that point.
The other two make sense too!
I was able to get a basic pool working without using a central singleton, allowing us to have the histograms in directories defined when the pool is constructed.
// in file named: pool.C
class HistogramPool {
private:
TDirectory *d_;
std::unordered_map<std::string, TH1*> histograms_;
public:
HistogramPool(TFile& f, std::string dir) {
d_ = f.mkdir(dir.c_str());
}
void create(const std::string& name, const std::string& xlabel, int bins, float xmin, float xmax) {
auto old_pwd = gDirectory;
d_->cd();
// Create a histogram of type T
auto hist = new TH1F(name.c_str(), name.c_str(), bins, xmin, xmax);
// Set the title
hist->SetTitle("");
// Set the x-axis label
hist->GetXaxis()->SetTitle(xlabel.c_str());
histograms_[name] = hist;
old_pwd->cd();
}
TH1* get(const std::string& name) {
auto histo = histograms_.find(name);
if (histo == histograms_.end()) {
throw std::runtime_error("Histogram "+name+" not found.");
}
return histograms_[name];
}
void fill(const std::string& name, float x) {
get(name)->Fill(x);
}
};
int pool() {
TFile f{"test.root", "recreate"};
HistogramPool p1(f, "p1");
HistogramPool p2(f, "p2");
p1.create("h1", "one", 10, 0.0, 1.0);
p2.create("h1", "two", 100, 0.0, 1.0);
p1.fill("h1", 0.1);
p2.fill("h1", 0.9);
f.Write();
return 0;
}
running this with denv root pool.C produces a test.root file that contains two histograms both named "h1" but distinct because they reside within different TDirectory. It seems like the directory movement is the way to go.
I don't think I want to do this, but we could enable the following crazy fun syntax using a singleton.
"my_histogram"_h.fill(2);
https://godbolt.org/z/cvb4YKq5P
But as mentioned, since we are defining a "string literal", we'd need to have static access to the pool (and the other pool-related things like the output directory) which is not necessarily something I want to do. I'll think about it some more, maybe there's a way I can share this cool code with downstream analyzers without requiring a singleton.