
Support for JSON serialization with REST API responses?

Open Querela opened this issue 4 weeks ago • 0 comments

The Heritrix REST API currently only supports application/xml responses (besides HTML?). Would it be possible to include a JSON serialization, too? I find JSON a lot easier to work with compared to XML and think this could be a useful addition.

Looking at the examples in the documentation, I saw no attributes being used, so a conversion to JSON should be straightforward. We would probably need to add the dependency org.restlet:org.restlet.ext.json (which also pulls in org.json:json; neither currently has known security vulnerabilities) to implement it the same way as for XML (source (1), source (2), plus some other locations).
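To illustrate why the absence of attributes makes the mapping simple, here is a minimal client-side sketch in Python (not the proposed server-side Java change). The helper name `element_to_obj` is hypothetical, and the rule that a run of `<value>` children becomes a list is an assumption drawn from the sample responses. Leaf text is kept as strings and empty elements become `None`; a real implementation would have to decide on types and empty collections itself.

```python
import json
import xml.etree.ElementTree as ET

# Abbreviated copy of the XML engine response shown in this issue.
SAMPLE = """<engine>
  <heritrixVersion>3.13.1-SNAPSHOT-2025-12-17T14:28:51Z</heritrixVersion>
  <heapReport>
    <usedBytes>13595632</usedBytes>
    <totalBytes>100663296</totalBytes>
    <maxBytes>268435456</maxBytes>
  </heapReport>
  <availableActions>
    <value>rescan</value>
    <value>add</value>
    <value>create</value>
  </availableActions>
  <jobs></jobs>
</engine>"""


def element_to_obj(elem):
    """Convert an attribute-free element into dicts, lists and strings (hypothetical helper)."""
    children = list(elem)
    if not children:
        # Leaf element: keep the text as-is, no type inference.
        text = (elem.text or "").strip()
        return text or None
    if all(child.tag == "value" for child in children):
        # Assumption: repeated <value> children represent a list.
        return [element_to_obj(child) for child in children]
    # Otherwise map child tag -> converted child, in document order.
    return {child.tag: element_to_obj(child) for child in children}


engine = element_to_obj(ET.fromstring(SAMPLE))
print(json.dumps(engine, indent=2))
```

Because Python dicts keep insertion order, this sketch happens to preserve the XML property order; that is a property of this client-side example, not of the org.json library.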

I did start a first implementation; there's not that much to change, really. But looking at the XML serialization, where the order of properties is fixed (manually), two questions came up. (1) Is it important that JSON guarantees the same property order? By default it seems to sort them alphabetically, which should be stable (for parsing, if that is a concern). Note that the JSON library (probably) does not preserve the original order, so that might not be possible without switching to another library. (2) Should responses be wrapped in an engine or job key, similar to XML?
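For context on both questions, here is a small Python sketch from the client's point of view. It assumes clients read properties by key, in which case property order does not matter, and it shows a hypothetical `unwrap` helper that a client could use to accept both wrapped and unwrapped responses. None of this reflects how Heritrix or org.json behaves internally.

```python
import json

# Two serializations of the same heapReport with different property order.
a = json.loads('{"usedBytes": 1, "totalBytes": 2, "maxBytes": 3}')
b = json.loads('{"maxBytes": 3, "totalBytes": 2, "usedBytes": 1}')

# Parsed objects compare equal regardless of the order on the wire.
same_content = (a == b)

# If a stable textual form is needed (e.g. for diffing), sort keys on output.
canonical_a = json.dumps(a, sort_keys=True)
canonical_b = json.dumps(b, sort_keys=True)


def unwrap(doc, key):
    """Return doc[key] if doc consists of just that wrapper key, else doc (hypothetical helper)."""
    if isinstance(doc, dict) and set(doc) == {key}:
        return doc[key]
    return doc


wrapped = {"engine": {"jobs": []}}
unwrapped = {"jobs": []}
print(same_content, canonical_a == canonical_b, unwrap(wrapped, "engine") == unwrap(unwrapped, "engine"))
```

Order would only matter to clients that compare raw response text or parse it as a stream, which is worth keeping in mind when deciding whether a fixed order is required.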

XML Engine

curl -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine:

<?xml version="1.0" standalone='yes'?>
<engine>
  <heritrixVersion>3.13.1-SNAPSHOT-2025-12-17T14:28:51Z</heritrixVersion>
  <heapReport>
    <usedBytes>13595632</usedBytes>
    <totalBytes>100663296</totalBytes>
    <maxBytes>268435456</maxBytes>
  </heapReport>
  <jobsDir>/home/user/heritrix3/dist/target/heritrix-3.13.1-SNAPSHOT/jobs</jobsDir>
  <jobsDirUrl>https://localhost:8443/engine/jobsdir/</jobsDirUrl>
  <availableActions>
    <value>rescan</value>
    <value>add</value>
    <value>create</value>
  </availableActions>
  <jobs></jobs>
</engine>

JSON Engine

curl -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine:

{
  "availableActions": [
    "rescan",
    "add",
    "create"
  ],
  "heapReport": {
    "usedBytes": 14586312,
    "totalBytes": 100663296,
    "maxBytes": 268435456
  },
  "jobsDirUrl": "https://localhost:8443/engine/jobsdir/",
  "heritrixVersion": "3.13.1-SNAPSHOT-2025-12-17T14:28:51Z",
  "jobsDir": "/home/user/heritrix3/dist/target/heritrix-3.13.1-SNAPSHOT/jobs",
  "jobs": []
}

XML Job

curl -k -u admin:admin --anyauth --location -H "Accept: application/xml" https://localhost:8443/engine/job/test:

<?xml version="1.0" standalone='yes'?>
<job>
  <shortName>test</shortName>
  <statusDescription>Unbuilt</statusDescription>
  <availableActions>
    <value>build</value>
    <value>launch</value>
  </availableActions>
  <launchCount>0</launchCount>
  <lastLaunch/>
  <isProfile>false</isProfile>
  <primaryConfig>/home/user/heritrix3/dist/target/heritrix-3.13.1-SNAPSHOT/jobs/test/crawler-beans.cxml</primaryConfig>
  <primaryConfigUrl>https://localhost:8443/engine/job/test/jobdir/crawler-beans.cxml</primaryConfigUrl>
  <url>https://localhost:8443/engine/job/test/job/test</url>
  <jobLogTail></jobLogTail>
  <uriTotalsReport/>
  <sizeTotalsReport>
    <dupByHash>0</dupByHash>
    <dupByHashCount>0</dupByHashCount>
    <novel>0</novel>
    <novelCount>0</novelCount>
    <notModified>0</notModified>
    <notModifiedCount>0</notModifiedCount>
    <total>0</total>
    <totalCount>0</totalCount>
    <sizeOnDisk>0</sizeOnDisk>
  </sizeTotalsReport>
  <rateReport/>
  <loadReport/>
  <elapsedReport/>
  <threadReport/>
  <frontierReport/>
  <crawlLogTail></crawlLogTail>
  <configFiles></configFiles>
  <isLaunchInfoPartial>false</isLaunchInfoPartial>
  <isRunning>false</isRunning>
  <isLaunchable>true</isLaunchable>
  <hasApplicationContext>false</hasApplicationContext>
  <alertCount>0</alertCount>
  <checkpointFiles></checkpointFiles>
  <reports></reports>
  <heapReport>
    <usedBytes>14723912</usedBytes>
    <totalBytes>52428800</totalBytes>
    <maxBytes>268435456</maxBytes>
  </heapReport>
</job>

JSON Job

curl -k -u admin:admin --anyauth --location -H "Accept: application/json" https://localhost:8443/engine/job/test:

{
  "availableActions": [
    "build",
    "launch"
  ],
  "launchCount": 0,
  "isProfile": false,
  "reports": [],
  "jobLogTail": [],
  "sizeTotalsReport": {
    "notModifiedCount": 0,
    "total": 0,
    "notModified": 0,
    "dupByHashCount": 0,
    "novelCount": 0,
    "totalCount": 0,
    "sizeOnDisk": 0,
    "dupByHash": 0,
    "novel": 0
  },
  "checkpointFiles": [],
  "url": "https://localhost:8443/engine/job/test/job/test",
  "crawlLogTail": [],
  "primaryConfig": "/home/user/heritrix3/dist/target/heritrix-3.13.1-SNAPSHOT/jobs/test/crawler-beans.cxml",
  "primaryConfigUrl": "https://localhost:8443/engine/job/test/jobdir/crawler-beans.cxml",
  "statusDescription": "Unbuilt",
  "heapReport": {
    "usedBytes": 16858208,
    "totalBytes": 52428800,
    "maxBytes": 268435456
  },
  "configFiles": [],
  "hasApplicationContext": false,
  "isRunning": false,
  "isLaunchable": true,
  "alertCount": 0,
  "shortName": "test",
  "isLaunchInfoPartial": false
}
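
As an illustration of the "easier to work with" point, here is a hedged Python sketch that consumes an abbreviated copy of the JSON job response above. The variable names are made up for the example. Numbers and booleans arrive already typed, whereas the XML variant delivers them as text that the client has to cast.

```python
import json

# Abbreviated copy of the JSON job response shown above.
JOB_JSON = """{
  "availableActions": ["build", "launch"],
  "launchCount": 0,
  "statusDescription": "Unbuilt",
  "heapReport": {"usedBytes": 16858208, "totalBytes": 52428800, "maxBytes": 268435456},
  "isRunning": false,
  "isLaunchable": true,
  "shortName": "test"
}"""

job = json.loads(JOB_JSON)

# Booleans and lists can be used directly, without string comparison.
can_launch = job["isLaunchable"] and "launch" in job["availableActions"]

# Numeric fields are already ints, so arithmetic needs no casting.
heap = job["heapReport"]
heap_used_fraction = heap["usedBytes"] / heap["maxBytes"]

print(f'{job["shortName"]}: {job["statusDescription"]}, '
      f'launchable={can_launch}, heap used {heap_used_fraction:.1%}')
```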

Querela, Dec 17 '25 11:12