bids-validator icon indicating copy to clipboard operation
bids-validator copied to clipboard

HED Validator integration into bids schema based validator.

Open rwblair opened this issue 2 years ago • 13 comments

For #1639.

bids-validator/src/deps/hed-validator dependency was generated by running npm run build on latest commit for hed-javascript: https://github.com/hed-standard/hed-javascript/commit/bba927b7b713f6aad7864b9aabc8b9afe70456cf

Two main functions located in bids-validator/src/validators/hed.ts are hedAccumulator and hedValidate

hedAccumulator is called once per context like other checks (filename validation, schema checks) in bids-validator/src/validators/bids.ts. It catches any events.tsv files and json files and builds up an object to be passed to the actual validator call.

Instead of recreating the old data structures that were passed into bids-validator/validators/events/hed.js I tried to get everything in shape a bit closer to where the outgoing calls were made, not sure if I got it quite right. The contexts have the proper sidecar data in them already so no need to mess with potentialSidecars.

hedValidate is called once, after the context walk is complete.

  • [x] copy over the hed-validator issue to bids-validator issue functions, modify to work with new issue objects.
  • [x] Add new issues to bids-validator/src/issues/list.ts as needed.
  • [x] Copy over detectHed function to skip calls when no HED data present.
  • [x] Pin dependency to actual release of hed-validator.
  • [x] Add tests

Currently am getting "ERROR: [HED_SCHEMA_LOAD_FAILED] The supplied schema specification is invalid. Specification: no sche..." when run on bids-examples/eeg_ds003645s_hed

rwblair avatar Mar 27 '23 22:03 rwblair

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 85.75%. Comparing base (ad6c900) to head (5beb84f).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1648   +/-   ##
=======================================
  Coverage   85.75%   85.75%           
=======================================
  Files          91       91           
  Lines        3785     3785           
  Branches     1218     1218           
=======================================
  Hits         3246     3246           
  Misses        453      453           
  Partials       86       86           

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Mar 28 '23 18:03 codecov[bot]

@rwblair --- should we be nervous that this is PR is still a draft? Is there anything that we need to prepare for? Thx! VisLab

VisLab avatar Aug 09 '23 14:08 VisLab

I know you have been swamped, but could you give us an update on this @nellh? Thx.

VisLab avatar Oct 16 '23 19:10 VisLab

@nellh ... thx for update. I am not sure what is going on with ds004395 -- I looked at it on openNeuro and it doesn't have any HED in it. The new HED validator works directly from the list of .tsv files and their associated sidecars as produced from the BIDS side. I wonder if we are somehow copying the data inadvertently before determining whether to even look at the files. @Alexander Jones @.***> let's look at this more closely.

Version 3.13.3 should be released in the next day or so. It has some minor bug fixes.

On Thu, Jan 11, 2024 at 11:58 AM Nell Hardcastle @.***> wrote:

@.**** approved this pull request.

I think this is ready to merge. I ran through some test cases and was able to match the failures to the legacy validator. ds004395 runs out of memory with the default heap size but does pass increasing the size to 8GB. There could be some memory optimization to be done because this doesn't happen without the HED patches but that dataset is an exception, every other test case passed with the default heap size.

— Reply to this email directly, view it on GitHub https://github.com/bids-standard/bids-validator/pull/1648#pullrequestreview-1816172159, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJCJOWF6OFLORSN5CB55JTYOAR4BAVCNFSM6AAAAAAWJW4A2SVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTQMJWGE3TEMJVHE . You are receiving this because you commented.Message ID: @.***>

VisLab avatar Jan 11 '24 19:01 VisLab

~~Looks like the memory issue is coming from code I wrote, via the hedAccumulator in the bids-validator. Commenting out the call to the hed-validator and to the file accumulator allows ds004395 to finish validation. Commenting out just the call to the hed-validator but leaving the accumulator active causes it to run out of memory on default heap size.~~

I should finish profiling before running my mouth.

rwblair avatar Jan 12 '24 21:01 rwblair

I'm not sure what's going on with the Deno build. The lines in the errors do not appear in the version of that file in our code base, and we don't use Buffer objects directly (though some dependencies might).

happy5214 avatar Jan 23 '24 22:01 happy5214

@rwblair Has there been any further progress on this?

happy5214 avatar Jun 07 '24 02:06 happy5214

@happy5214 Merged master to get it up to date. The out of memory issue is from keeping an object around with all the sideccar and tsv data in memory while the file tree is walked. I think I was mirroring how the old validator called the hed validator. I'm going to try to:

  1. Detect which files have hed data at the accumulator step, to keep the arguments for hedValidator.validator.BidsDataset leaner.
  2. Have the accumulator only track file names and then reload tsv and json data at hed validation time after hopefully all the per bids file context objects have been garbage collected.
  3. Generate a hedValidator.validator.BidsDataset for each events file and all of its associated sidecars. Are there any HED rules that require multiple tsv files to be loaded at the same time for validation? or is it only that we need to ensure all the sidecars are present for each separate tsv file?

rwblair avatar Jun 12 '24 19:06 rwblair

Are there any HED rules that require multiple tsv files to be loaded at the same time for validation? or is it only that we need to ensure all the sidecars are present for each separate tsv file?

HED processes each tsv and its sidecars separately and does not need to have multiple event files at one time.

@happy5214 will need to address the details of the other issues. Thanks for looking at this -- we'd like to get this resolved.

VisLab avatar Jun 12 '24 23:06 VisLab

I rewrote bids-validator/src/validators/hed.ts to work on a file per context basis in line with the 3rd option in my last post. This allows ds004395 to complete validation.

I need to rewrite/add tests for this way of invoking HED validation.

rwblair avatar Jun 13 '24 21:06 rwblair

Thx! In the longer term, we'll probably want to build the HED schema object from the dataset description before validation and pass it in as it doesn't change. We'll get back to you on this. @happy5214 comment?

To be honest, hearing that is a little annoying, because that's how it was originally implemented, and then we implemented a design philosophy to move as much of the code as possible to hed-validator. The schema-loading function is still exported (hedValidator.validator.buildSchemas), but there's nowhere to pass it right now in the external API in order to run BIDS validation. I'll discuss more with @VisLab when we meet today.

happy5214 avatar Jun 14 '24 11:06 happy5214

For now its still passing the dataset_description to each validation call for each file. Not sure how inefficient this is CPU wise, but the memory issue is gone for now.

rwblair avatar Jun 17 '24 16:06 rwblair

That is fine for now. We will work on the underlying implementation.

On Mon, Jun 17, 2024 at 11:27 AM Ross Blair @.***> wrote:

For now its still passing the dataset_description to each validation call for each file. Not sure how inefficient this is CPU wise, but the memory issue is gone for now.

— Reply to this email directly, view it on GitHub https://github.com/bids-standard/bids-validator/pull/1648#issuecomment-2173843368, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJCJOUUFY6R34OZNAS7BE3ZH4E7TAVCNFSM6AAAAABI5ZOCD2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCNZTHA2DGMZWHA . You are receiving this because you were mentioned.Message ID: @.***>

VisLab avatar Jun 17 '24 17:06 VisLab

I believe the failing test is because of an off-by-one error (you're converting the TSV data to the old format, which you shouldn't (BidsTsvFile's constructor now expects either a string or the new ColumnMap-style format), but you're forgetting to add the header as the first row).

happy5214 avatar Jul 30 '24 23:07 happy5214