digitalearthau
lustre paths should be normalised to the symlink "/g/data"
A few products have been indexed with their location resolved to a specific lustre drive (e.g. /g/data2) rather than the symlink /g/data.
Query of unique file:/// prefixes:
| prefix | example |
|---|---|
| /g/data | ///g/data/rs0/datacube/002/LS8_OLI_NBART/9_-49/LS8_OLI_NBART_3577_9_-49_20180618000358000000_v1536402046.nc |
| /g/data1b | ///g/data1b/if87/datacube/002/S2_MSI_ARD/packaged/2017-12-31/S2A_OPER_MSI_ARD_TL_SGS__20171231T044842_A013184_T52KBU_N02.06/ARD-METADATA.yaml |
| /g/data2 | ///g/data2/v10/AGDCv2/datacube-ingestion/indexed-products/geophysics/radiometrics.yaml |
| /short/v10 | ///short/v10/scenes/nbar-scenes-tmp/ls8/2016/10/output/nbar/LS8_OLITIRS_NBAR_P54_GANBAR01-032_113_085_20161005/ga-metadata.yaml |
These could cause many issues. Our scripts will treat them as different locations, which could cause duplicates, or worse, archiving data because "no other dataset points to it".
They could also break in the future when NCI moves data between drives.
Blame me
I've just added more datasets with this problem! (the new telemetry data paths).
Despite being within a /g/data/ folder, datacube dataset update translated my path to the specific /g/data2 drive.
No file location:
16:46:02 [jmh547@raijin3:/g/data/v10/archived/rawdata/0]$ datacube dataset info f5ccb63a-56aa-11e5-a073-ac162d791418
id: f5ccb63a-56aa-11e5-a073-ac162d791418
product: ls8_satellite_telemetry_data
status: active
indexed: 2016-08-11 11:37:49.173278+10:00
locations:
- mdss://v27/EODS_DATA/rawdata/0/2014/10/LS8_OLI-TIRS_STD-MDF_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041_1.tar
fields:
creation_time: 2015-09-09 04:26:33.266492
format: MD
gsi: LGN
instrument: OLI_TIRS
label: LS8_OLITIRS_STD-MD_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041
orbit: null
platform: LANDSAT_8
product_type: satellite_telemetry_data
time: {begin: '2014-10-06T02:28:02.848000', end: '2014-10-06T02:30:41.631000'}
Add file location:
16:46:10 [jmh547@raijin3:/g/data/v10/archived/rawdata/0]$ datacube dataset update 2014/10/LS8_OLI-TIRS_STD-MDF_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041_1/ga-metadata.yaml
Updated f5ccb63a-56aa-11e5-a073-ac162d791418
1 successful, 0 failed
It added "/g/data2":
16:46:19 [jmh547@raijin3:/g/data/v10/archived/rawdata/0]$ datacube dataset info f5ccb63a-56aa-11e5-a073-ac162d791418
id: f5ccb63a-56aa-11e5-a073-ac162d791418
product: ls8_satellite_telemetry_data
status: active
indexed: 2016-08-11 11:37:49.173278+10:00
locations:
- file:///g/data2/v10/archived/rawdata/0/2014/10/LS8_OLI-TIRS_STD-MDF_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041_1/ga-metadata.yaml
- mdss://v27/EODS_DATA/rawdata/0/2014/10/LS8_OLI-TIRS_STD-MDF_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041_1.tar
fields:
creation_time: 2015-09-09 04:26:33.266492
format: MD
gsi: LGN
instrument: OLI_TIRS
label: LS8_OLITIRS_STD-MD_P00_LC81070560672014279LGN00_107_056-067_20141006T022802Z20141006T023041
orbit: null
platform: LANDSAT_8
product_type: satellite_telemetry_data
time: {begin: '2014-10-06T02:28:02.848000', end: '2014-10-06T02:30:41.631000'}
16:46:27 [jmh547@raijin3:/g/data/v10/archived/rawdata/0]$
What should we do?
- Each of our dea scripts (such as sync) probably needs to normalise its paths. eodatasets has similar normalisation.
- Perhaps we should stop cowboys like myself from running `dataset update` manually.
- Fix ODC/`datacube` commands to absolute, not resolve, their input paths?
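As a rough illustration of the first option, here is a minimal sketch of what path normalisation in a dea script could look like. The function names and the regular expression are hypothetical, and the pattern only assumes mount names shaped like the ones in the table above (/g/data1b, /g/data2); it is not taken from eodatasets or any existing script.

```python
import re

# Assumption: physical lustre mounts are named /g/data<digits>[letter],
# e.g. /g/data1, /g/data1b, /g/data2. The bare /g/data symlink is left alone.
_LUSTRE_MOUNT = re.compile(r"^/g/data[0-9]+[a-z]?(?=/|$)")


def normalise_nci_path(path: str) -> str:
    """Rewrite a drive-specific lustre path to the /g/data symlink."""
    return _LUSTRE_MOUNT.sub("/g/data", path, count=1)


def normalise_nci_uri(uri: str) -> str:
    """Apply the same rewrite to file:// URIs; other schemes pass through."""
    prefix = "file://"
    if uri.startswith(prefix):
        return prefix + normalise_nci_path(uri[len(prefix):])
    return uri
```

Paths outside lustre (such as /short/v10) and non-file locations (such as mdss://) are returned unchanged, so this could be applied to every location before comparing or indexing.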
Datacube is using Path(dir).absolute(). If dir is supplied as an absolute path it will remain as /g/data/...; if, however, dir is relative to begin with, it will be resolved to /g/data{2,1a,..etc}, as we can't know about those symlinks and their meaning.
So I don't think datacube update resolves symlinks if given absolute paths on input, have you (@jeremyh ) observed that behaviour when supplying absolute paths?
You're right, datacube is only using absolute(). It looks like Python's os.getcwd() shows different results to bash's pwd and $PWD, so the plain absolute() gets "resolved":
>>> !pwd
/g/data/v10/agdc
>>> import os
>>> os.getcwd()
'/g/data2/v10/agdc'
(& it's fine with absolute paths)
Python's using the underlying syscall, which doesn't remember symlinks:
Unfortunately, all the kernel maintains for each process is the i-node number and device identification for the current working directory. The kernel does not maintain the full pathname of the directory.
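The behaviour can be reproduced away from NCI with a throwaway symlink. This is only a sketch: the directory names data and data2 stand in for /g/data and /g/data2, and the helper name is made up for the example. It assumes a POSIX system where symlinks can be created.

```python
import os
import tempfile
from pathlib import Path


def cwd_through_symlink() -> dict:
    """chdir into a directory via a symlink and report what Python sees."""
    original_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        # realpath() first, so the temp dir itself contains no symlinks.
        base = Path(os.path.realpath(tmp))
        real_dir = base / "data2"   # stands in for /g/data2
        link = base / "data"        # stands in for the /g/data symlink
        real_dir.mkdir()
        link.symlink_to(real_dir, target_is_directory=True)
        try:
            os.chdir(link)
            return {
                "link": str(link),
                "real": str(real_dir),
                # The physical directory, not the symlink we cd'd through.
                "getcwd": os.getcwd(),
                # absolute() builds on the working directory, so a relative
                # path inherits the physical location too.
                "absolute": str(Path("ga-metadata.yaml").absolute()),
            }
        finally:
            os.chdir(original_cwd)
```

Both `getcwd` and `absolute` come back under the data2 directory even though the chdir went through the data symlink, which matches what happened with the relative path given to datacube dataset update.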
Hm, I wonder how pwd does its thing. It would be nice to be able to just cd into a symlinked path and index relative to that without worrying about symlinks upstream.
It looks like it reads the $PWD environment variable. We could do that ourselves with os.environ['PWD'] (it's in the posix standard, not specific to bash).
>>> import os
>>> os.environ['PWD']
'/g/data/v10/agdc'
There's always still the possibility of users making mistakes, so having dea's own scripts normalise the path on NCI still seems worthwhile to me. (Or we can stop having people index manually).
I think the only sane way is going to be normalising to an absolute path, which ignores the symlinks. If the NCI or anyone else moves data around, we rewrite the locations.
Either that, or we put in a special case for the NCI which rewrites /g/data*/ to /g/data, but that just seems unnecessarily complicated.
@omad I think using PWD if available is sane enough, so long as you are careful to check that it matches your actual working directory (not too hard to do). Having NCI specific rules is ok in digitalearthau repo, but not in odc.
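A minimal sketch of that check, assuming a POSIX environment. The function name logical_cwd is hypothetical; the only library calls are os.getcwd(), os.environ and os.path.samefile(), which compares the device and inode of the two paths.

```python
import os


def logical_cwd() -> str:
    """Return $PWD if it still points at the real working directory.

    Falls back to os.getcwd() when PWD is unset, relative, stale, or
    refers to a path that no longer exists.
    """
    physical = os.getcwd()
    pwd = os.environ.get("PWD")
    if pwd and os.path.isabs(pwd):
        try:
            # Same device and inode means PWD is just another name
            # (e.g. via a symlink) for the directory we are really in.
            if os.path.samefile(pwd, physical):
                return pwd
        except OSError:
            pass  # PWD points somewhere that cannot be stat'ed
    return physical
```

A relative input path could then be joined onto logical_cwd() instead of going through Path.absolute(), which would keep the /g/data prefix the user sees in their shell.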
At the risk of complicating the issue, do we want to do something with band paths too?
Some of the newer products are using absolute band paths, which we previously avoided.
Band path query:
| absolute prefix | count |
|---|---|
| /g/data/ | 3,687,814 |
| /g/data1/ | 36 |
| ¤ | 77,704,776 |
With product names:
| product | prefix | count |
|---|---|---|
| bom_rainfall_grids | /g/data | 41,591 |
| srtm_dem1sv1_0 | /g/data | 3 |
| s2a_level1c_granule | /g/data | 2,445,425 |
| s2b_level1c_granule | /g/data | 1,200,795 |
| gamma_ray | /g/data1 | 36 |
Absolute paths in dataset documents are a bad idea and should probably generate a warning when indexing.
I think we need to fix those, and in the case of Sentinel, fix the script that generated them.
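For illustration, a warning of this kind could be sketched as below. This is not existing datacube behaviour. It assumes the band paths live under image.bands.<name>.path in the dataset document; both function names are hypothetical.

```python
import warnings


def absolute_band_paths(doc: dict) -> dict:
    """Return {band_name: path} for bands whose path is absolute.

    Assumption: bands are stored at doc["image"]["bands"][name]["path"].
    A path counts as absolute if it starts with "/" or carries a URI scheme.
    """
    bands = doc.get("image", {}).get("bands", {}) or {}
    found = {}
    for name, band in bands.items():
        path = (band or {}).get("path") or ""
        if path.startswith("/") or "://" in path:
            found[name] = path
    return found


def warn_on_absolute_band_paths(doc: dict) -> int:
    """Emit one warning per absolute band path; return how many were found."""
    found = absolute_band_paths(doc)
    for name, path in found.items():
        warnings.warn(f"band {name!r} uses an absolute path: {path}")
    return len(found)
```

Relative band paths are resolved against the dataset location, so they keep working when the location is rewritten, which is why only the absolute ones are flagged.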
Got an email from NCI -- they're moving gdata2 projects to gdata3. If we still have any paths not normalised to the symlink they'll stop working.
https://opus.nci.org.au/pages/viewpage.action?pageId=48497032
| proj | date |
|---|---|
| v10 | Mon, 30-Sept |
| rs0 | Fri, 4-Oct |
| fk4 | Mon, 7-Oct |