digitalearthau

lustre paths should be normalised to the symlink "/g/data"

Open jeremyh opened this issue 7 years ago • 11 comments

A few products have been indexed with their location resolved to a specific lustre drive (eg. /g/data2) rather than the symlink /g/data.

Query of unique file:/// prefixes:

prefix      example
/g/data     ///g/data/rs0/datacube/002/LS8_OLI_NBART/9_-49/LS8_OLI_NBART_3577_9_-49_20180618000358000000_v1536402046.nc
/g/data1b   ///g/data1b/if87/datacube/002/S2_MSI_ARD/packaged/2017-12-31/S2A_OPER_MSI_ARD_TL_SGS__20171231T044842_A013184_T52KBU_N02.06/ARD-METADATA.yaml
/g/data2    ///g/data2/v10/AGDCv2/datacube-ingestion/indexed-products/geophysics/radiometrics.yaml
/short/v10  ///short/v10/scenes/nbar-scenes-tmp/ls8/2016/10/output/nbar/LS8_OLITIRS_NBAR_P54_GANBAR01-032_113_085_20161005/ga-metadata.yaml

These could cause several problems. Our scripts will treat the symlinked and drive-specific forms as different locations, which could create duplicates or, worse, lead to data being archived because "no other dataset points to it".

They could also break in the future when NCI moves data between drives.

Blame me

I've just added more datasets with this problem! (the new telemetry data paths).

Even though I was working within a /g/data/ folder, datacube dataset update translated my path to the specific /g/data2 drive.

No file location:

16:46:02 [jmh547@raijin3:/g/data/v10/archived/rawdata/0]$ datacube dataset info f5ccb63a-56aa-11e5-a073-ac162d791418

id: f5ccb63a-56aa-11e5-a073-ac162d791418
product: ls8_satellite_telemetry_data
status: active
indexed: 2016-08-11 11:37:49.173278+10:00
locations:
- mdss://v27/EODS_DATA/rawdata/0/2014/10/LS8_OLI-TIRS_STD-MDF_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041_1.tar
fields:
    creation_time: 2015-09-09 04:26:33.266492
    format: MD
    gsi: LGN
    instrument: OLI_TIRS
    label: LS8_OLITIRS_STD-MD_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041
    orbit: null
    platform: LANDSAT_8
    product_type: satellite_telemetry_data
    time: {begin: '2014-10-06T02:28:02.848000', end: '2014-10-06T02:30:41.631000'}

Add file location:

16:46:10 [jmh547@raijin3:/g/data/v10/archived/rawdata/0]$ datacube dataset update 2014/10/LS8_OLI-TIRS_STD-MDF_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041_1/ga-metadata.yaml
Updated f5ccb63a-56aa-11e5-a073-ac162d791418
1 successful, 0 failed

It added "/g/data2":

16:46:19 [jmh547@raijin3:/g/data/v10/archived/rawdata/0]$ datacube dataset info f5ccb63a-56aa-11e5-a073-ac162d791418
id: f5ccb63a-56aa-11e5-a073-ac162d791418
product: ls8_satellite_telemetry_data
status: active
indexed: 2016-08-11 11:37:49.173278+10:00
locations:
- file:///g/data2/v10/archived/rawdata/0/2014/10/LS8_OLI-TIRS_STD-MDF_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041_1/ga-metadata.yaml
- mdss://v27/EODS_DATA/rawdata/0/2014/10/LS8_OLI-TIRS_STD-MDF_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041_1.tar
fields:
    creation_time: 2015-09-09 04:26:33.266492
    format: MD
    gsi: LGN
    instrument: OLI_TIRS
    label: LS8_OLITIRS_STD-MD_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041
    orbit: null
    platform: LANDSAT_8
    product_type: satellite_telemetry_data
    time: {begin: '2014-10-06T02:28:02.848000', end: '2014-10-06T02:30:41.631000'}
16:46:27 [jmh547@raijin3:/g/data/v10/archived/rawdata/0]$ 

What should we do?

  • Each of our dea scripts (such as sync) probably needs to normalise its paths. eodatasets has similar normalisation.
  • Perhaps we should stop cowboys like myself from running dataset update manually.
  • Fix the ODC/datacube commands so they make input paths absolute without resolving symlinks?

jeremyh avatar Oct 23 '18 06:10 jeremyh

Datacube is using Path(dir).absolute(). If dir is supplied as an absolute path it will remain as /g/data/...; if, however, dir is relative to begin with, it will be resolved to /g/data{2,1a,..etc}, as we can't know about those symlinks and their meaning.

So I don't think datacube update resolves symlinks when given absolute paths as input. Have you (@jeremyh) observed that behaviour when supplying absolute paths?

Kirill888 avatar Oct 24 '18 01:10 Kirill888

You're right, datacube is only using absolute(). It looks like Python's os.getcwd() shows different results to bash's pwd and $PWD, so the plain absolute() gets "resolved":

>>> !pwd
/g/data/v10/agdc
>>> import os
>>> os.getcwd()
'/g/data2/v10/agdc'

(& it's fine with absolute paths)
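
For anyone wanting to reproduce this away from NCI, here is a minimal sketch using a temporary symlink. The directory names data and data2 are stand-ins for /g/data and /g/data2, and demo_cwd_through_symlink is a made-up name for illustration, not part of datacube. It assumes a POSIX system where creating symlinks is allowed.

```python
import os
import tempfile
from pathlib import Path


def demo_cwd_through_symlink():
    """Show that os.getcwd() and Path.absolute() lose the symlink we cd'd through."""
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        # Resolve first, in case the temp dir itself sits behind a symlink.
        base = Path(tmp).resolve()
        real = base / "data2"   # stand-in for the physical drive (/g/data2)
        link = base / "data"    # stand-in for the symlink (/g/data)
        real.mkdir()
        link.symlink_to(real, target_is_directory=True)
        try:
            os.chdir(link)      # like `cd /g/data` in the shell
            cwd = os.getcwd()
            # A relative path is made absolute against os.getcwd().
            relative = str(Path("ga-metadata.yaml").absolute())
            # An already-absolute path is left alone, symlink and all.
            absolute = str((link / "ga-metadata.yaml").absolute())
        finally:
            os.chdir(start)     # leave the temp dir before it is deleted
        return str(real), str(link), cwd, relative, absolute


if __name__ == "__main__":
    for value in demo_cwd_through_symlink():
        print(value)
```

The relative path comes back under the physical directory, while the path that was already absolute keeps the symlink.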

jeremyh avatar Oct 24 '18 02:10 jeremyh

Python's using the underlying syscall, which doesn't remember symlinks:

Unfortunately, all the kernel maintains for each process is the i-node number and device identification for the current working directory. The kernel does not maintain the full pathname of the directory.

jeremyh avatar Oct 24 '18 02:10 jeremyh

Hm, I wonder how pwd does its thing. It would be nice to be able to just cd into a symlinked path and index relative to that, without worrying about symlinks upstream.

Kirill888 avatar Oct 24 '18 03:10 Kirill888

It looks like it reads the $PWD environment variable. We could do that ourselves with os.environ['PWD'] (it's in the POSIX standard, not specific to bash).

>>> import os
>>> os.environ['PWD']
'/g/data/v10/agdc'

There's always still the possibility of users making mistakes, so having dea's own scripts normalise the path on NCI still seems worthwhile to me. (Or we can stop having people index manually).

jeremyh avatar Oct 24 '18 04:10 jeremyh

I think the only sane way is going to be normalising to an absolute path, which ignores the symlinks. If the NCI or anyone else moves data around, we rewrite the locations.

Either that, or we put in a special case for the NCI which rewrites /g/data*/ to /g/data, but that just seems unnecessarily complicated.
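
For reference, the special case could be fairly small. This is only a sketch: the pattern assumes the physical mounts always look like /g/data followed by digits and an optional letter (as in /g/data2 and /g/data1b above), and normalise_lustre_path / normalise_lustre_uri are hypothetical names, not existing functions in digitalearthau or datacube.

```python
import re

# Assumption: physical lustre mounts are named /g/data<digits><optional letter>,
# e.g. /g/data2 or /g/data1b. The lookahead stops us matching /g/data2foo.
_LUSTRE_MOUNT = re.compile(r"^/g/data\d+[a-z]?(?=/|$)")


def normalise_lustre_path(path: str) -> str:
    """Rewrite a drive-specific prefix to the /g/data symlink; leave other paths alone."""
    return _LUSTRE_MOUNT.sub("/g/data", path, count=1)


def normalise_lustre_uri(uri: str) -> str:
    """Apply the same rewrite to file:// URIs; other schemes are returned unchanged."""
    prefix = "file://"
    if uri.startswith(prefix):
        return prefix + normalise_lustre_path(uri[len(prefix):])
    return uri
```

Paths already under /g/data, paths under /short, and non-file locations such as mdss:// pass through untouched.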

omad avatar Oct 25 '18 00:10 omad

@omad I think using PWD if available is sane enough, so long as you are careful to check that it matches your actual working directory (not too hard to do). Having NCI specific rules is ok in digitalearthau repo, but not in odc.
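
A rough sketch of that check, to make the idea concrete. The helper names logical_cwd and absolute_keeping_symlinks are made up for illustration and do not exist in datacube; the only real calls are os.getcwd(), os.environ and os.path.samefile().

```python
import os
from pathlib import Path


def logical_cwd() -> Path:
    """Return $PWD if it really is the current directory, else os.getcwd()."""
    physical = os.getcwd()
    pwd = os.environ.get("PWD")
    if pwd and os.path.isabs(pwd):
        try:
            # samefile() compares device and inode, so a stale or wrong
            # $PWD is rejected even if it looks plausible.
            if os.path.samefile(pwd, physical):
                return Path(pwd)
        except OSError:
            pass  # $PWD points at something that no longer exists
    return Path(physical)


def absolute_keeping_symlinks(path) -> Path:
    """Make a path absolute against the logical cwd, without resolving symlinks."""
    path = Path(path)
    return path if path.is_absolute() else logical_cwd() / path
```

Falling back to os.getcwd() means the worst case is the current behaviour, not a wrong path.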

Kirill888 avatar Oct 25 '18 00:10 Kirill888

At the risk of complicating the issue, do we want to do something with band paths too?

Some of the newer products are using absolute band paths, which we previously avoided.

Band path query:

absolute prefix  count
/g/data/         3,687,814
/g/data1/        36
¤                77,704,776

jeremyh avatar Oct 25 '18 05:10 jeremyh

With product names:

product              prefix    count
bom_rainfall_grids   /g/data   41,591
srtm_dem1sv1_0       /g/data   3
s2a_level1c_granule  /g/data   2,445,425
s2b_level1c_granule  /g/data   1,200,795
gamma_ray            /g/data1  36

jeremyh avatar Oct 25 '18 05:10 jeremyh

Absolute paths in dataset documents are a bad idea and should probably generate a warning when indexing.

I think we need to fix those and, in the case of Sentinel, fix the script that generated them.
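
Something along these lines could do the warning. It is a sketch only: it assumes band paths live under image.bands.<name>.path in the dataset document, and absolute_band_paths / warn_on_absolute_band_paths are hypothetical names, not part of datacube.

```python
import warnings


def absolute_band_paths(dataset_doc: dict) -> dict:
    """Return {band_name: path} for every band whose path is absolute.

    Assumes the document keeps band paths under image.bands.<name>.path.
    """
    bands = dataset_doc.get("image", {}).get("bands", {})
    found = {}
    for name, band in bands.items():
        path = band.get("path", "")
        # Count both plain absolute paths and full URIs as absolute.
        if path.startswith("/") or "://" in path:
            found[name] = path
    return found


def warn_on_absolute_band_paths(dataset_doc: dict) -> bool:
    """Emit one warning per absolute band path; return True if any were found."""
    found = absolute_band_paths(dataset_doc)
    for name, path in found.items():
        warnings.warn(f"band {name!r} uses an absolute path: {path}")
    return bool(found)
```

The same function could be run over already-indexed documents to list what needs fixing.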

Kirill888 avatar Oct 25 '18 05:10 Kirill888

Got an email from NCI -- they're moving gdata2 projects to gdata3. If we still have any paths not normalised to the symlink they'll stop working.

https://opus.nci.org.au/pages/viewpage.action?pageId=48497032

proj  date
v10   Mon, 30-Sept
rs0   Fri, 4-Oct
fk4   Mon, 7-Oct

jeremyh avatar Sep 23 '19 04:09 jeremyh