pyhf icon indicating copy to clipboard operation
pyhf copied to clipboard

xml2json extremely slow

Open cburgard opened this issue 3 years ago • 12 comments

Summary

When trying to convert some XMLs to pyhf JSON, I found that the runtime was unacceptably slow (order of hours).

OS / Environment

NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
LOGO=archlinux-logo

Steps to Reproduce

The XMLs are ATLAS internal, but I'm happy to share them with ATLAS members of the dev team on request.

File Upload (optional)

No response

Expected Results

I was hoping it would finish relatively fast.

Actual Results

It took hours.

pyhf Version

pyhf, version 0.6.3
uproot 4.2.2

Code of Conduct

  • [X] I agree to follow the Code of Conduct

cburgard avatar Jun 10 '22 12:06 cburgard

Thanks for the report @cburgard. Can you please share the files as a private GitHub Gist with the dev team?

matthewfeickert avatar Jun 10 '22 14:06 matthewfeickert

The files are too large for a gist, but I've shared them with you privately on CERNbox. Please distribute them to the team.

cburgard avatar Jun 10 '22 16:06 cburgard

@cburgard Please first verify that you have read through the translations docs and tell us how the workspaces were created. Then please tell us the exact commands that you are using.

It is important to understand how you created them, because the files you've provided as is are not reproducible given that you have hard paths throughout them like /home/cburgard/Physics/statistics/pyfittools/hww-xml/histograms.root.

matthewfeickert avatar Jun 10 '22 18:06 matthewfeickert

The workspaces were created with SFramwork, specifically with this function: https://gitlab.cern.ch/atlas-caf/CAFCore/-/blob/b1d40c29b0b4635c28d563db1483317d4a8b25d5/SFramework/Root/TSModelFactory.cxx#L176

The exact command I was using is

pyhf xml2json hww-xml/HWWRun2GGF.xml

cburgard avatar Jun 11 '22 15:06 cburgard

The workspaces were created with SFramwork, specifically with this function: https://gitlab.cern.ch/atlas-caf/CAFCore/-/blob/b1d40c29b0b4635c28d563db1483317d4a8b25d5/SFramework/Root/TSModelFactory.cxx#L176

@cburgard Okay, well as you'll note we have no translation recipe for SFramework in the docs, so we can't make any statements of support for SFramework until we have one (maybe you can suggest one that we can iterate on and add to the docs?). If SFramework is only capable of hardcoding full paths into the XML then to do any debugging we would need the original files. Though I'm assuming that this is just a choice that was enabled somewhere in SFramework? Can you produce the workspaces with relative paths? Otherwise this would mean the workspaces would be unusable anywhere but on the machine where they were originally produced.

matthewfeickert avatar Jun 14 '22 03:06 matthewfeickert

SFramework doesn't really use these XMLs, they are just produced as a debugging tool. SFramework mostly relies on the C++ interface of HistFactory.

If you'd be aware of a simple way to instruct the C++ implementation of HistFactory to output xmls with relative rather than absolute paths, I'd be happy to reproduce the xmls with that setting, otherwise, sed will have to suffice I fear.

cburgard avatar Jun 16 '22 14:06 cburgard

If you'd be aware of a simple way to instruct the C++ implementation of HistFactory to output xmls with relative rather than absolute paths, I'd be happy to reproduce the xmls with that setting, otherwise, sed will have to suffice I fear.

Thanks for the info as this helps move things forward. Manually changing paths isn't a reproducible workflow, and so can't be considered seriously. :grimacing:

Perhaps we he's back from holiday we can ask @kratsg if he has any insight onto how HistFitter is able to do things unless @longjon929 is able to comment on that sooner.

matthewfeickert avatar Jun 17 '22 17:06 matthewfeickert

@cburgard great news (I hope)! @kratsg is adding a feature in PR #1909 that is able to automatically deal with hardcoded absolute paths and when he does the conversion on an old 2011 macbook running macOS 10.11 (his current MBP is in for repairs :cry:) he gets the following

$ time pyhf xml2json --hide-progress -v $PWD:/home/cburgard/Physics/statistics/pyfittools/ --output-file test.json hww-xml/HWWRun2GGF.xml

real	1m7.212s
user	1m6.320s
sys	0m1.180s

so hopefully you'll get things even faster than that on your local machine.

matthewfeickert avatar Jul 01 '22 06:07 matthewfeickert

Hi @matthewfeickert , sorry for being slow and I imagine this isn't very useful. It looks like HistFitter writes out the XML by hand making a top level file from the fitconfig here and then each channel is looped over and dumps a file here. I imagine HF was developed with this structure in mind from HistFactory, which makes it straightforward to write out the xml.

longjon929 avatar Jul 01 '22 07:07 longjon929

PrintXML allows using relative paths too. The downside of using relative paths there is that you (currently) need to call pyhf xml2json from the same place where the relative paths originate from, which is not necessarily the place where the top-level xml is located.

alexander-held avatar Jul 01 '22 08:07 alexander-held

PrintXML allows using relative paths too. The downside of using relative paths there is that you (currently) need to call pyhf xml2json from the same place where the relative paths originate from, which is not necessarily the place where the top-level xml is located.

This will be better with #1909 but you can already use --basedir to change the base of the relative paths, by prefixing them, so it doesn't necessarily need to be in the same place.

kratsg avatar Jul 01 '22 11:07 kratsg

Thanks to some fantastic work by @kratsg, PR #1909 is in master now. So I believe that this should be resolved if people install from the development HEAD. I haven't actually taken the time to look at this particular Issue with PR #1909 in yet, but I can try to later this week.

matthewfeickert avatar Aug 11 '22 18:08 matthewfeickert

A final note on this, as I write up the release notes in PR #1705:

$ ls
hww-xml
$ time pyhf xml2json \
    --hide-progress \
    --mount $PWD:/home/cburgard/Physics/statistics/pyfittools/ \
    --output-file workspace.json \
    hww-xml/HWWRun2GGF.xml
pyhf xml2json --hide-progress --mount  --output-file workspace.json   14.71s user 2.33s system 119% cpu 14.269 total

:+1: Great work by @kratsg! :rocket:

matthewfeickert avatar Sep 22 '22 23:09 matthewfeickert