Unable to use EOS tokens with RDataFrame since 6.32

Open chrisburr opened this issue 1 year ago • 4 comments

Check duplicate issues.

  • [X] Checked for duplicates

Description

EOS tokens no longer work with RDataFrame in 6.32.04. In 6.30.08 everything is fine:

$ python3
Python 3.9.18 (main, Aug 23 2024, 00:00:00)
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT
>>> url = 'root://eosuser.cern.ch//eos/user/c/cburr/hsimple.root?xrd.wantprot=unix&authz=' + open("token.txt").read().strip()
>>> ROOT.TFile.Open(url).ls()
TNetXNGFile**		root://eosuser.cern.ch//eos/user/c/cburr/hsimple.root	Demo ROOT file with histograms
 TNetXNGFile*		root://eosuser.cern.ch//eos/user/c/cburr/hsimple.root	Demo ROOT file with histograms
  KEY: TH1F	hpx;1	This is the px distribution
  KEY: TH2F	hpxpy;1	py vs px
  KEY: TProfile	hprof;1	Profile of pz versus px
  KEY: TNtuple	ntuple;1	Demo ntuple
>>> df = ROOT.RDataFrame("ntuple", url)
>>>

Reproducer

On lxplus:

$ source /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.32.04/x86_64-almalinux9.4-gcc114-opt/bin/thisroot.sh
$ cp /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.32.04/x86_64-almalinux9.4-gcc114-opt/tutorials/hsimple.root /eos/user/c/cburr/hsimple.root
$ EOS_MGM_URL=root://eoshome-c.cern.ch eos token --path /eos/user/c/cburr/hsimple.root --permission=rx --expires=$(date +%s -d "30 minutes") > token.txt
$ kdestroy
$ python3
Python 3.9.18 (main, Aug 23 2024, 00:00:00)
[GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT
>>> url = 'root://eosuser.cern.ch//eos/user/c/cburr/hsimple.root?xrd.wantprot=unix&authz=' + open("token.txt").read().strip()
>>> ROOT.TFile.Open(url).ls()
TNetXNGFile**		root://eosuser.cern.ch//eos/user/c/cburr/hsimple.root	Demo ROOT file with histograms
 TNetXNGFile*		root://eosuser.cern.ch//eos/user/c/cburr/hsimple.root	Demo ROOT file with histograms
  KEY: TH1F	hpx;1	This is the px distribution
  KEY: TH2F	hpxpy;1	py vs px
  KEY: TProfile	hprof;1	Profile of pz versus px
  KEY: TNtuple	ntuple;1	Demo ntuple
>>> df = ROOT.RDataFrame("ntuple", url)
Error in <TNetXNGSystem::GetDirEntry>: Unable to give access - user access restricted - unauthorized identity used ; Permission denied
 *** Break *** segmentation violation

ROOT version

6.32.04

Installation method

sft.cern.ch

Operating system

Linux (lxplus)

Additional context

No response

chrisburr avatar Sep 19 '24 12:09 chrisburr

Dear @chrisburr ,

Thank you for reaching out and for the reproducer. I am on it. Meanwhile, note that in the first case (6.30), just calling ROOT.RDataFrame does not attempt to open the file, whereas 6.32 opens the file at construction time (to homogenise the way different data formats are processed). As a confirmation, could you try running any operation that needs to read data from the file with 6.30?

vepadulano avatar Sep 20 '24 07:09 vepadulano

Thanks! This definitely used to be working (with 6.28 IIRC). If I find a minute I'll check with 6.30.

chrisburr avatar Sep 20 '24 09:09 chrisburr

The problem is that RDF tries to open the file to check that it is valid. The logic for the file opening is at https://github.com/root-project/root/blob/962009b8c2057199c2229c3ef9938ac4d315d10a/tree/dataframe/src/RLoopManager.cxx#L1133 . In particular, because of the presence of the ? character, the string is parsed as a glob. In many cases that would be harmless apart from a tiny overhead (it would just return the same file name to open), but in this particular case it triggers faulty behaviour. The glob parsing attempts to traverse the remote xrootd directory, but since the token grants permission only for the single file and not for the entire directory, this leads to the "user access restricted" error you posted above.
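To illustrate why the token URL ends up on the glob path, here is a minimal Python sketch. It assumes the check simply looks for any of the characters [, ], * or ? anywhere in the string; the function name, the wildcard set and the placeholder token are illustrative assumptions, not the actual RLoopManager code.

```python
# Assumed wildcard set for illustration only; not copied from ROOT.
WILDCARDS = ("[", "]", "*", "?")


def naive_is_glob(file_name: str) -> bool:
    """Return True if any wildcard character appears anywhere in the name."""
    return any(w in file_name for w in WILDCARDS)


# Same URL shape as in the reproducer, with a placeholder instead of a real token.
token_url = (
    "root://eosuser.cern.ch//eos/user/c/cburr/hsimple.root"
    "?xrd.wantprot=unix&authz=PLACEHOLDER_TOKEN"
)
plain_url = "root://eosuser.cern.ch//eos/user/c/cburr/hsimple.root"

print(naive_is_glob(token_url))  # True: the '?' of the query string matches
print(naive_is_glob(plain_url))  # False: no wildcard characters present
```

With a check of this kind, any URL carrying a query string is classified as a glob, which is what sends the single-file token URL down the directory-traversal path.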

Now, I believe the most sane course of action would be to refine the logic that checks whether the input file name is a glob. I could simply add a check for the xrd.wantprot token, but probably we want to have a more authoritative list of all the tokens that should make the file name not be parsed as a glob. This probably includes not only xrootd tokens but also anything https-related. Or we could adopt a different strategy for glob detection altogether. Thoughts @dpiparo @pcanal ?

vepadulano avatar Sep 21 '24 16:09 vepadulano

Ah, that makes sense. Extending the definition of strings to add metadata to paths (globbing, the # syntax in TFile::Open, ...) is always going to be error prone.

but probably we want to have a more authoritative list of all the tokens that should make the file name not be parsed as a glob

This feels like an impossible task to define.

Maybe a simpler solution would be to not support ? when globbing and only apply globbing to the text before the query string? Or maybe just have a dedicated method (or argument type) for creating an RDataFrame from a glob rather than relying on heuristics?

chrisburr avatar Sep 22 '24 21:09 chrisburr

Maybe a simpler solution would be to not support ? when globbing and only apply globbing to the text before the query string? Or maybe just have a dedicated method (or argument type) for creating an RDataFrame from a glob rather than relying on heuristics?

The first option was implemented in the linked PR, aligning the functionality with that of TChain::Add. The second option would be a good possibility, I agree, but it requires extending the RDF interface, so I didn't want to introduce it in a bugfix. This can be reassessed later on when necessary.
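As a rough illustration of the first option, the Python sketch below splits the name at the first ? and looks for wildcards only in the part before it. The helper names and the exact wildcard set are hypothetical; this is not the code from the PR, just the idea under those assumptions.

```python
# Hypothetical wildcard set: '?' is deliberately excluded, since it now
# only marks the start of the query string.
GLOB_CHARS = ("[", "]", "*")


def split_query(file_name: str) -> tuple[str, str]:
    """Split a name into (path, query); the query keeps its leading '?'."""
    path, sep, query = file_name.partition("?")
    return path, sep + query


def is_glob(file_name: str) -> bool:
    """Check for wildcard characters only in the text before the query string."""
    path, _ = split_query(file_name)
    return any(ch in path for ch in GLOB_CHARS)


# Placeholder token; a real one would come from `eos token`.
token_url = (
    "root://eosuser.cern.ch//eos/user/c/cburr/hsimple.root"
    "?xrd.wantprot=unix&authz=PLACEHOLDER_TOKEN"
)
glob_url = "root://eosuser.cern.ch//eos/user/c/cburr/*.root?authz=PLACEHOLDER_TOKEN"

print(is_glob(token_url))  # False: the only '?' belongs to the query string
print(is_glob(glob_url))   # True: '*' appears in the path part
```

Under this scheme the query string can be set aside during expansion and re-attached to each expanded file name, so a token URL for a single file is opened directly without any directory listing.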

vepadulano avatar Nov 12 '24 11:11 vepadulano