Automated JSON Schema Validation Against Sample Data

Open Copilot opened this issue 2 months ago • 11 comments

Schema Validation Against Sample Data - PR Review Fixes Applied ✅

All feedback from PR review has been addressed.


✅ Changes Made (Based on Review Comments)

1. Fixed Spelling Errors

  • VALIDATION_README.md line 84: "goverment" → "government"
  • FIELD_MAPPING.md lines 149-151: "goverment" → "government" (3 occurrences)

2. Improved Exception Handling

  • validate_schemas.py line 149: Changed broad except Exception to specific exceptions
    • Now catches: IOError, csv.Error, UnicodeDecodeError
    • Lets unexpected errors propagate instead of being silently swallowed
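
As an illustration of the narrower handler described above, here is a minimal sketch. The helper name `read_csv_header` and the UTF-8 encoding are assumptions for the example, not the actual code in validate_schemas.py; only the three exception types come from the change itself.

```python
import csv
from pathlib import Path
from typing import List, Optional


def read_csv_header(path: Path) -> Optional[List[str]]:
    """Return the header row of a CSV file, or None if it cannot be read.

    Hypothetical helper: the name and the UTF-8 assumption are illustrative.
    """
    try:
        with path.open(newline="", encoding="utf-8") as handle:
            # next(..., None) yields None for an empty file instead of raising.
            return next(csv.reader(handle), None)
    except (IOError, csv.Error, UnicodeDecodeError) as exc:
        # Expected, recoverable problems: report the file and move on.
        print(f"Skipping {path}: {exc}")
        return None
```

Any other exception type (for example a TypeError caused by a bug in the script) is not listed in the handler, so it surfaces with a traceback rather than being hidden.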

3. Fixed Mermaid Field Extraction

  • validate_schemas.py lines 81-84: Added field name validation
    • Now validates field names start with letter: ^[A-Za-z]\w*$
    • Prevents invalid fields like "8" from being extracted
    • Result: Field count reduced from 46 to 45 (politician), 43 to 42 (party)
    • Total mismatches: Reduced from 127 to 125
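
The effect of the field name check can be sketched as follows. The pattern is the one quoted above; the function name `keep_valid_field_names` and the sample candidates are made up for illustration and do not come from validate_schemas.py.

```python
import re
from typing import Iterable, List

# Pattern quoted in the change description: a letter first, then word characters.
FIELD_NAME = re.compile(r"^[A-Za-z]\w*$")


def keep_valid_field_names(candidates: Iterable[str]) -> List[str]:
    """Drop candidates that do not look like field names (hypothetical helper)."""
    return [name for name in candidates if FIELD_NAME.match(name)]


print(keep_valid_field_names(["id", "first_name", "8", "_tmp", "party2"]))
# → ['id', 'first_name', 'party2']
```

Note that the pattern also rejects names with a leading underscore, which is a side effect to keep in mind if any real field is named that way.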

4. Removed Unused Imports

  • validate_schemas.py line 15: Removed unused os import
  • validate_schemas.py line 21: Removed unused Set from typing imports

5. Added Explanatory Comment

  • validate_schemas.py line 96: Added comment explaining silent exception pass
    • "Skip invalid JSON examples in markdown"
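
A minimal sketch of the pattern this comment documents, assuming the JSON examples live in fenced json blocks inside the schema markdown files. The regular expression and the function name `parse_json_examples` are illustrative assumptions, not the script's actual extraction logic.

```python
import json
import re
from typing import Any, List

# The fence marker is built at runtime so this sketch can itself be fenced.
FENCE = "`" * 3
JSON_BLOCK = re.compile(FENCE + r"json\s*\n(.*?)" + FENCE, re.DOTALL)


def parse_json_examples(markdown: str) -> List[Any]:
    """Return every fenced JSON example that parses (hypothetical helper)."""
    examples = []
    for match in JSON_BLOCK.finditer(markdown):
        try:
            examples.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip invalid JSON examples in markdown
            continue
    return examples
```

Catching only `json.JSONDecodeError` keeps the silent skip limited to the one case the comment describes.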

6. Fixed String Concatenation

  • validate_schemas.py lines 378-379: Fixed implicit string concatenation
    • Added explicit commas between strings in report generation
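
For readers unfamiliar with this pitfall, the sketch below shows how a missing comma silently merges two adjacent string literals. The report lines are invented for the example and are not taken from validate_schemas.py.

```python
# Missing comma after the second literal: Python joins adjacent string
# literals, so the list ends up with two elements instead of three.
buggy_lines = [
    "# Schema Validation Report",
    "Schemas validated: 5"
    "Sample files analyzed: 132",
]

# With explicit commas each literal is its own list element.
fixed_lines = [
    "# Schema Validation Report",
    "Schemas validated: 5",
    "Sample files analyzed: 132",
]

print(len(buggy_lines), len(fixed_lines))
# → 2 3
print(buggy_lines[1])
# → Schemas validated: 5Sample files analyzed: 132
```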

7. Workflow Optimization

  • validate-json-schemas.yml lines 38-43: Removed unnecessary steps
    • Removed cache: 'pip' option (no dependencies to cache)
    • Removed "Install dependencies" step
    • Added comment: "No dependencies to install - script uses only Python standard library"
  • validate-json-schemas.yml line 54: Added explanatory comment
    • Explains why continue-on-error: true is needed for artifact upload

✅ Validation Results (After Fixes)

  • Schemas Validated: 5 of 5 (100%)
  • Sample Files Analyzed: 132 of 142 (93%)
  • Total Field Mismatches: 125 (reduced from 127)
  • Invalid "8" field: ❌ Eliminated

Field Count Changes:

| Schema | Before | After | Change |
| --- | --- | --- | --- |
| Politician | 46 | 45 | -1 (removed "8") |
| Party | 43 | 42 | -1 (removed "8") |
| Committee | 26 | 26 | No change |
| Ministry | 29 | 29 | No change |
| Intelligence | 0 | 0 | No change |

🧪 Testing

Validated all changes:

# Syntax check
python3 -m py_compile json-export-specs/validate_schemas.py
✓ Python syntax valid

# Full validation run
cd json-export-specs && python3 validate_schemas.py
✓ 5 schemas validated
✓ 132 files analyzed
✓ Reports generated successfully
✓ No invalid field "8" in results
✓ Total mismatches: 125 (improved)

📝 Files Modified

  • .github/workflows/validate-json-schemas.yml (8 lines changed)
  • json-export-specs/FIELD_MAPPING.md (6 lines changed)
  • json-export-specs/VALIDATION_README.md (2 lines changed)
  • json-export-specs/validate_schemas.py (20 lines changed)
  • json-export-specs/schemas/SCHEMA_VALIDATION_REPORT.md (regenerated)
  • json-export-specs/schemas/validation-results.json (regenerated)

Status: ✅ ALL REVIEW COMMENTS ADDRESSED
Quality: ✅ CODE REVIEW FEEDBACK IMPLEMENTED
Testing: ✅ VALIDATED AND WORKING

Original prompt

This section describes the original issue you should resolve.

<issue_title>Validate JSON Schemas Against Real Sample Data</issue_title>
<issue_description>
🎯 Objective

Validate 5 JSON export schemas against 142 real sample data CSV files to ensure correctness.

📋 Background

JSON export specifications define schemas for politician, party, committee, ministry, and intelligence data per json-export-specs/. These schemas need validation against actual sample data to ensure accurate data model representation.

📊 Current State (Measured Metrics)

  • JSON Schemas: 5 defined (politician-schema.md, party-schema.md, committee-schema.md, ministry-schema.md, intelligence-schema.md)
  • Sample Data Files: 142 CSV files in service.data.impl/sample-data/
  • Schema Validation: Not performed - schemas based on documentation only
  • Data Model Accuracy: Unknown - requires verification

✅ Acceptance Criteria

  • [ ] Automated validation script comparing JSON schemas to sample data structure
  • [ ] All 5 schemas validated against corresponding CSV files
  • [ ] Mismatches between schemas and data documented
  • [ ] Schemas updated to reflect actual data structure
  • [ ] Continuous validation integrated into CI/CD pipeline

🤖 Recommended Agent

Agent: @hack23-performance-engineer
Rationale: Schema validation and data pipeline optimization expertise

📚 Related Documentation

Comments on the Issue (you are @copilot in this section)

Custom agent used: intelligence-operative
Expert in political science, intelligence analysis, OSINT, behavioral analysis, and Swedish politics with focus on exposing high risk national entities

  • Fixes Hack23/cia#7982

Copilot avatar Dec 08 '25 10:12 Copilot

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2025-12-08T11:15:06.392882+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 132
  • ⚠️ Total Mismatches: 127

Schema Status

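| Schema | Fields | Views Matched | Mismatches | Status |
| --- | --- | --- | --- | --- |
| Politician | 46 | 7 | 35 | ⚠️ REVIEW |
| Party | 43 | 11 | 40 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 4 | 0 | ✅ PASS |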

📄 Full Report: See uploaded artifacts for detailed validation report

⚠️ Action Required: Please review the field mismatches and update schemas to match actual data structure.

github-actions[bot] avatar Dec 08 '25 11:12 github-actions[bot]

@copilot apply changes based on the comments in this thread

pethers avatar Dec 08 '25 13:12 pethers

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2025-12-08T13:58:17.474324+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 132
  • ⚠️ Total Mismatches: 125

Schema Status

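| Schema | Fields | Views Matched | Mismatches | Status |
| --- | --- | --- | --- | --- |
| Politician | 45 | 7 | 34 | ⚠️ REVIEW |
| Party | 42 | 11 | 39 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 4 | 0 | ✅ PASS |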

📄 Full Report: See uploaded artifacts for detailed validation report

⚠️ Action Required: Please review the field mismatches and update schemas to match actual data structure.

github-actions[bot] avatar Dec 08 '25 13:12 github-actions[bot]

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2025-12-08T13:58:24.336560+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 132
  • ⚠️ Total Mismatches: 125

Schema Status

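| Schema | Fields | Views Matched | Mismatches | Status |
| --- | --- | --- | --- | --- |
| Politician | 45 | 7 | 34 | ⚠️ REVIEW |
| Party | 42 | 11 | 39 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 4 | 0 | ✅ PASS |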

📄 Full Report: See uploaded artifacts for detailed validation report

⚠️ Action Required: Please review the field mismatches and update schemas to match actual data structure.

github-actions[bot] avatar Dec 08 '25 13:12 github-actions[bot]

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2025-12-10T00:24:40.489887+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 132
  • ⚠️ Total Mismatches: 125

Schema Status

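| Schema | Fields | Views Matched | Mismatches | Status |
| --- | --- | --- | --- | --- |
| Politician | 45 | 7 | 34 | ⚠️ REVIEW |
| Party | 42 | 11 | 39 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 4 | 0 | ✅ PASS |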

📄 Full Report: See uploaded artifacts for detailed validation report

⚠️ Action Required: Please review the field mismatches and update schemas to match actual data structure.

github-actions[bot] avatar Dec 10 '25 00:12 github-actions[bot]

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 133cfe976b0c6d08793bae7f4f41dc4d459c76f0.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

OpenSSF Scorecard

Package: actions/actions/checkout · Version: 4.*.* · Score: 🟢 6.5

| Check | Score | Reason |
| --- | --- | --- |
| Code-Review | 🟢 10 | all changesets reviewed |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Maintained | 🟢 5 | 6 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 5 |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| License | 🟢 10 | license file detected |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Signed-Releases | ⚠️ -1 | no releases found |
| Pinned-Dependencies | 🟢 3 | dependency not pinned by hash detected -- score normalized to 3 |
| Security-Policy | 🟢 9 | security policy file detected |
| Branch-Protection | 🟢 5 | branch protection is not maximal on development and all release branches |
| Vulnerabilities | 🟢 9 | 1 existing vulnerabilities detected |
| SAST | 🟢 8 | SAST tool detected but not run on all commits |

Package: actions/actions/github-script · Version: 7.*.* · Score: 🟢 6.3

| Check | Score | Reason |
| --- | --- | --- |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Code-Review | 🟢 10 | all changesets reviewed |
| Maintained | ⚠️ 0 | 0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0 |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Token-Permissions | 🟢 9 | detected GitHub workflow tokens with excessive permissions |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Pinned-Dependencies | ⚠️ 1 | dependency not pinned by hash detected -- score normalized to 1 |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| License | 🟢 10 | license file detected |
| Signed-Releases | ⚠️ -1 | no releases found |
| Security-Policy | 🟢 9 | security policy file detected |
| Branch-Protection | 🟢 5 | branch protection is not maximal on development and all release branches |
| Vulnerabilities | 🟢 3 | 7 existing vulnerabilities detected |
| SAST | 🟢 10 | SAST tool is run on all commits |

Package: actions/actions/setup-python · Version: 5.*.* · Score: 🟢 5.2

| Check | Score | Reason |
| --- | --- | --- |
| Maintained | 🟢 7 | 9 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 7 |
| Code-Review | 🟢 10 | all changesets reviewed |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| License | 🟢 10 | license file detected |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| Pinned-Dependencies | ⚠️ 0 | dependency not pinned by hash detected -- score normalized to 0 |
| Signed-Releases | ⚠️ -1 | no releases found |
| Security-Policy | 🟢 9 | security policy file detected |
| Branch-Protection | ⚠️ 0 | branch protection not enabled on development/release branches |
| Vulnerabilities | ⚠️ 0 | 10 existing vulnerabilities detected |
| SAST | 🟢 9 | SAST tool is not run on all commits -- score normalized to 9 |

Package: actions/actions/upload-artifact · Version: 4.*.* · Score: 🟢 6.5

| Check | Score | Reason |
| --- | --- | --- |
| Code-Review | 🟢 10 | all changesets reviewed |
| Packaging | ⚠️ -1 | packaging workflow not detected |
| Maintained | 🟢 10 | 30 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 10 |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| License | 🟢 10 | license file detected |
| Pinned-Dependencies | ⚠️ 1 | dependency not pinned by hash detected -- score normalized to 1 |
| Signed-Releases | ⚠️ -1 | no releases found |
| Security-Policy | 🟢 9 | security policy file detected |
| Branch-Protection | ⚠️ 0 | branch protection not enabled on development/release branches |
| Vulnerabilities | 🟢 10 | 0 existing vulnerabilities detected |
| SAST | 🟢 9 | SAST tool detected but not run on all commits |

Scanned Files

  • .github/workflows/validate-json-schemas.yml

github-actions[bot] avatar Dec 10 '25 00:12 github-actions[bot]

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2025-12-13T15:09:07.128511+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 132
  • ⚠️ Total Mismatches: 125

Schema Status

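| Schema | Fields | Views Matched | Mismatches | Status |
| --- | --- | --- | --- | --- |
| Politician | 45 | 7 | 34 | ⚠️ REVIEW |
| Party | 42 | 11 | 39 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 4 | 0 | ✅ PASS |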

📄 Full Report: See uploaded artifacts for detailed validation report

⚠️ Action Required: Please review the field mismatches and update schemas to match actual data structure.

github-actions[bot] avatar Dec 13 '25 15:12 github-actions[bot]

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2025-12-19T01:29:42.077816+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 132
  • ⚠️ Total Mismatches: 125

Schema Status

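| Schema | Fields | Views Matched | Mismatches | Status |
| --- | --- | --- | --- | --- |
| Politician | 45 | 7 | 34 | ⚠️ REVIEW |
| Party | 42 | 11 | 39 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 4 | 0 | ✅ PASS |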

📄 Full Report: See uploaded artifacts for detailed validation report

⚠️ Action Required: Please review the field mismatches and update schemas to match actual data structure.

github-actions[bot] avatar Dec 19 '25 01:12 github-actions[bot]

📊 JSON Schema Validation Results

Validation Summary:

  • ⏰ Generated: 2025-12-19T22:49:01.274352+00:00
  • 📋 Schemas Validated: 5
  • 📁 Sample Files Analyzed: 132
  • ⚠️ Total Mismatches: 125

Schema Status

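| Schema | Fields | Views Matched | Mismatches | Status |
| --- | --- | --- | --- | --- |
| Politician | 45 | 7 | 34 | ⚠️ REVIEW |
| Party | 42 | 11 | 39 | ⚠️ REVIEW |
| Committee | 26 | 10 | 24 | ⚠️ REVIEW |
| Ministry | 29 | 2 | 28 | ⚠️ REVIEW |
| Intelligence | 0 | 4 | 0 | ✅ PASS |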

📄 Full Report: See uploaded artifacts for detailed validation report

⚠️ Action Required: Please review the field mismatches and update schemas to match actual data structure.

github-actions[bot] avatar Dec 19 '25 22:12 github-actions[bot]