xml2json extremely slow
Summary
When trying to convert some XMLs to pyhf JSON, I found that the runtime was unacceptably slow (order of hours).
OS / Environment
NAME="Arch Linux"
PRETTY_NAME="Arch Linux"
ID=arch
BUILD_ID=rolling
ANSI_COLOR="38;2;23;147;209"
HOME_URL="https://archlinux.org/"
DOCUMENTATION_URL="https://wiki.archlinux.org/"
SUPPORT_URL="https://bbs.archlinux.org/"
BUG_REPORT_URL="https://bugs.archlinux.org/"
LOGO=archlinux-logo
Steps to Reproduce
The XMLs are ATLAS internal, but I'm happy to share them with ATLAS members of the dev team on request.
File Upload (optional)
No response
Expected Results
I was hoping it would finish relatively fast.
Actual Results
It took hours.
pyhf Version
pyhf, version 0.6.3
uproot 4.2.2
Code of Conduct
- [X] I agree to follow the Code of Conduct
Thanks for the report @cburgard. Can you please share the files as a private GitHub Gist with the dev team?
The files are too large for a gist, but I've shared them with you privately on CERNbox. Please distribute them to the team.
@cburgard Please first verify that you have read through the translations docs and tell us how the workspaces were created. Then please tell us the exact commands that you are using.
It is important to understand how you created them, because the files you've provided as is are not reproducible given that you have hard paths throughout them like /home/cburgard/Physics/statistics/pyfittools/hww-xml/histograms.root.
The workspaces were created with SFramwork, specifically with this function: https://gitlab.cern.ch/atlas-caf/CAFCore/-/blob/b1d40c29b0b4635c28d563db1483317d4a8b25d5/SFramework/Root/TSModelFactory.cxx#L176
The exact command I was using is
pyhf xml2json hww-xml/HWWRun2GGF.xml
The workspaces were created with SFramwork, specifically with this function: https://gitlab.cern.ch/atlas-caf/CAFCore/-/blob/b1d40c29b0b4635c28d563db1483317d4a8b25d5/SFramework/Root/TSModelFactory.cxx#L176
@cburgard Okay, well as you'll note we have no translation recipe for SFramework in the docs, so we can't make any statements of support for SFramework until we have one (maybe you can suggest one that we can iterate on and add to the docs?). If SFramework is only capable of hardcoding full paths into the XML then to do any debugging we would need the original files. Though I'm assuming that this is just a choice that was enabled somewhere in SFramework? Can you produce the workspaces with relative paths? Otherwise this would mean the workspaces would be unusable anywhere but on the machine where they were originally produced.
SFramework doesn't really use these XMLs, they are just produced as a debugging tool. SFramework mostly relies on the C++ interface of HistFactory.
If you'd be aware of a simple way to instruct the C++ implementation of HistFactory to output xmls with relative rather than absolute paths, I'd be happy to reproduce the xmls with that setting, otherwise, sed will have to suffice I fear.
If you'd be aware of a simple way to instruct the C++ implementation of HistFactory to output xmls with relative rather than absolute paths, I'd be happy to reproduce the xmls with that setting, otherwise,
sedwill have to suffice I fear.
Thanks for the info as this helps move things forward. Manually changing paths isn't a reproducible workflow, and so can't be considered seriously. :grimacing:
Perhaps we he's back from holiday we can ask @kratsg if he has any insight onto how HistFitter is able to do things unless @longjon929 is able to comment on that sooner.
@cburgard great news (I hope)! @kratsg is adding a feature in PR #1909 that is able to automatically deal with hardcoded absolute paths and when he does the conversion on an old 2011 macbook running macOS 10.11 (his current MBP is in for repairs :cry:) he gets the following
$ time pyhf xml2json --hide-progress -v $PWD:/home/cburgard/Physics/statistics/pyfittools/ --output-file test.json hww-xml/HWWRun2GGF.xml
real 1m7.212s
user 1m6.320s
sys 0m1.180s
so hopefully you'll get things even faster than that on your local machine.
Hi @matthewfeickert , sorry for being slow and I imagine this isn't very useful. It looks like HistFitter writes out the XML by hand making a top level file from the fitconfig here and then each channel is looped over and dumps a file here. I imagine HF was developed with this structure in mind from HistFactory, which makes it straightforward to write out the xml.
PrintXML allows using relative paths too. The downside of using relative paths there is that you (currently) need to call pyhf xml2json from the same place where the relative paths originate from, which is not necessarily the place where the top-level xml is located.
PrintXMLallows using relative paths too. The downside of using relative paths there is that you (currently) need to callpyhf xml2jsonfrom the same place where the relative paths originate from, which is not necessarily the place where the top-level xml is located.
This will be better with #1909 but you can already use --basedir to change the base of the relative paths, by prefixing them, so it doesn't necessarily need to be in the same place.
Thanks to some fantastic work by @kratsg, PR #1909 is in master now. So I believe that this should be resolved if people install from the development HEAD. I haven't actually taken the time to look at this particular Issue with PR #1909 in yet, but I can try to later this week.
A final note on this, as I write up the release notes in PR #1705:
$ ls
hww-xml
$ time pyhf xml2json \
--hide-progress \
--mount $PWD:/home/cburgard/Physics/statistics/pyfittools/ \
--output-file workspace.json \
hww-xml/HWWRun2GGF.xml
pyhf xml2json --hide-progress --mount --output-file workspace.json 14.71s user 2.33s system 119% cpu 14.269 total
:+1: Great work by @kratsg! :rocket: