validate
Referential integrity check takes much longer than it seems it should
Checked for duplicates
No - I haven't checked
🐛 Describe the bug
"Bug" is a strong word, but it's the closest category.
I have a bundle (MSAM2) with ~2 million products in it. I can do product-level validation in parallel, using KDP, Nucleus, or other technology to farm it out to a bunch of nodes. However, referential integrity (verifying that the inventory files are correct and match the files present) has to be done on the bundle as a whole - I'm not aware of any way to split that up. Splitting by collection might be possible, but there are only 2 relevant collections here, so that doesn't help much.
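To make concrete what I mean by the referential integrity check, here is a minimal sketch of the idea, not how validate actually implements it. It assumes an inventory is a delimited table with a member status column followed by a LIDVID column, and that the set of identifiers found in the product labels has already been collected. The function name `check_inventory` and the example identifiers are hypothetical.

```python
import csv
import io


def check_inventory(inventory_csv, found_lidvids):
    """Compare inventory entries against identifiers found on disk.

    Returns (listed_but_missing, present_but_unlisted) as two sets.
    """
    listed = set()
    for row in csv.reader(io.StringIO(inventory_csv)):
        # Assumed layout: column 0 is member status, column 1 is the LIDVID.
        if len(row) >= 2:
            listed.add(row[1].strip())
    return listed - found_lidvids, found_lidvids - listed


# Hypothetical two-line inventory and a hypothetical set of found products.
inventory = (
    "P,urn:nasa:pds:example:data:prod_a::1.0\r\n"
    "P,urn:nasa:pds:example:data:prod_b::1.0\r\n"
)
found = {
    "urn:nasa:pds:example:data:prod_a::1.0",
    "urn:nasa:pds:example:data:prod_c::1.0",
}
missing, unlisted = check_inventory(inventory, found)
print("listed but missing:", sorted(missing))
print("present but unlisted:", sorted(unlisted))
```

Viewed this way the comparison itself is a pair of set differences, which is why the observed rate is surprising; presumably most of the time goes into gathering the identifiers from the labels rather than into the comparison.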
To run only the referential integrity check, I have product and content validation turned off. But it is still taking an inordinate amount of time. As of this writing, it has been running for 6 days and, per the log, has gotten through 1,047,476 out of 2,013,873 products - about halfway. That's a rate of about 2 products per second. It seems like it should be able to do better in this case.
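For reference, the rate and a rough projection can be worked out from the numbers above. This is a back-of-the-envelope sketch that assumes the throughput stays constant for the rest of the run, which may not hold.

```python
# Figures taken from the log as quoted above.
processed = 1_047_476
total = 2_013_873
elapsed_days = 6

SECONDS_PER_DAY = 86_400

# Average throughput so far, in products per second.
rate_per_sec = processed / (elapsed_days * SECONDS_PER_DAY)

# Projected time to finish, assuming the same average rate continues.
remaining_days = (total - processed) / rate_per_sec / SECONDS_PER_DAY

print(f"rate: {rate_per_sec:.2f} products/s")
print(f"remaining: {remaining_days:.1f} days")
print(f"projected total: {elapsed_days + remaining_days:.1f} days")
```

At that pace the run would need roughly another five and a half days, or about eleven and a half days in total.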
🕵️ Expected behavior
Well I expected what I got ;-) but I would hope the RI checks could be faster.
📜 To Reproduce
Here's the command line:
/path/to/msam2/validate-3.5.1/bin/validate -target /path/to/msam2/annex_ehlmann_caltech_msl_msam2 --report-file bundle.valrpt -R pds4.bundle --skip-content-validation --skip-product-validation
🖥 Environment Info
$ uname -a
Linux machine-name 3.10.0-1160.76.1.el7.x86_64 #1 SMP Tue Jul 26 14:15:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ java -version
java version "17.0.11" 2024-04-16 LTS
Java(TM) SE Runtime Environment (build 17.0.11+7-LTS-207)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.11+7-LTS-207, mixed mode, sharing)
📚 Version of Software Used
$ /mnt/pdsdata/scratch/rgd/msam2/validate-3.5.1/bin/validate -version
gov.nasa.pds:validate
Version 3.5.1
Release Date: 2024-05-25 17:45:47
Copyright 2019, by the California Institute of Technology ("Caltech").
All rights reserved.
🩺 Test Data / Additional context
No response
🦄 Related requirements
No response
⚙️ Engineering Details
No response
🎉 Integration & Test
No response