
xml validation fails

Open simsong opened this issue 4 years ago • 4 comments

(base) simsong@nimi src % xmllint --valid out-emails1/report.xml | head -10
out-emails1/report.xml:2: validity error : Validation failed: no DTD found !
<dfxml xmloutputversion='1.0'>
                             ^
<?xml version="1.0" encoding="UTF-8"?>
<dfxml xmloutputversion="1.0">
  <metadata xmlns="http://afflib.org/bulk_extractor/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:type>Feature Extraction</dc:type>
  </metadata>
  <creator version="1.0">
    <program>BULK_EXTRACTOR</program>
    <version>2.0.0-dev</version>
    <build_environment>
      <compiler>4.2.1 (Apple LLVM 12.0.5 (clang-1205.0.22.11))</compiler>
(base) simsong@nimi src %                                                                                                                          (slg-dev)bulk_extractor

simsong avatar Sep 12 '21 10:09 simsong

Apparently I need a DTD. Perhaps @ajnelson-nist can help.

simsong avatar Sep 12 '21 10:09 simsong

The DFXML schema can be used to validate DFXML, though xmllint needs to be run with the --schema flag, not the --valid flag (--valid looks for a DTD, which is why you got "no DTD found"). The Python code base's samples Makefile demonstrates this. I would recommend tracking the schema as a Git submodule, pinned to the version you want to validate against.
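
As a rough illustration of the --schema invocation, here is a minimal Python sketch that wraps xmllint. The function names and the file names dfxml.xsd and report.xml are placeholders chosen for this example, not anything taken from the DFXML code base; only the xmllint flags come from this thread.

```python
import shutil
import subprocess


def xmllint_schema_command(schema_path, document_path):
    """Build the xmllint argument list for XSD (not DTD) validation."""
    # --schema validates against a W3C XML Schema file;
    # --noout suppresses the echo of the parsed document.
    return ["xmllint", "--noout", "--schema", schema_path, document_path]


def validate(schema_path, document_path):
    """Return True/False for valid/invalid, or None if xmllint is not installed."""
    if shutil.which("xmllint") is None:
        return None
    result = subprocess.run(
        xmllint_schema_command(schema_path, document_path),
        capture_output=True,
        text=True,
    )
    # xmllint exits with status 0 only when validation succeeds.
    return result.returncode == 0


# Hypothetical paths, for illustration only.
command = xmllint_schema_command("dfxml.xsd", "report.xml")
print(" ".join(command))
```

Calling validate() with real paths would run the same command and report whether xmllint accepted the document.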

You may also be in for a bit of a data upgrade, as the DFXML schema identified many long-standing issues with the way DFXML was originally drafted. For one thing, namespaces are scoped to the element they're attached to, so your sample has no namespace to which it's claiming to conform. See Differencing test 0 for how to declare a <dfxml> element as in the DFXML namespace.
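
To make the scoping point concrete, here is a small sketch using Python's standard xml.etree.ElementTree. The two documents are cut-down stand-ins for a real report.xml, and the DFXML namespace URI is the one that appears in the revised output later in this thread. ElementTree writes a namespaced element's tag as {uri}name, so the result shows which elements actually carry a namespace.

```python
import xml.etree.ElementTree as ET

DFXML_NS = "http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML"

# xmlns declared only on <metadata>: it applies to that element and its
# descendants, so the root <dfxml> and its sibling <creator> have no namespace.
scoped_to_child = (
    '<dfxml xmloutputversion="1.0">'
    '<metadata xmlns="http://afflib.org/bulk_extractor/"/>'
    '<creator/>'
    '</dfxml>'
)

# Default namespace declared on the root: every unprefixed descendant inherits it.
declared_on_root = (
    f'<dfxml xmlns="{DFXML_NS}" version="1.0">'
    '<metadata/>'
    '<creator/>'
    '</dfxml>'
)


def qualified_names(xml_text):
    """Return the namespace-qualified tag of every element, in document order."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


names_child = qualified_names(scoped_to_child)
names_root = qualified_names(declared_on_root)
print(names_child)
print(names_root)
```

In the first document only metadata comes back with a {…} prefix; in the second, all three elements do, which is what a schema keyed to the DFXML namespace needs to see.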

ajnelson-nist avatar Sep 13 '21 15:09 ajnelson-nist

Well, you are now the XML/DFXML expert. If you could give me a sample of how to add the namespace or other scoping tags, I'll update bulk_extractor 2.0 so that it produces conformant DFXML.

simsong avatar Sep 13 '21 22:09 simsong

@ajnelson-nist - I think I'm making progress on this. The remaining validation errors apparently require that I either update the DFXML schema or create my own namespace.

Here is the new head of the DFXML output of bulk_extractor:

<?xml version='1.0' encoding='UTF-8'?>
<dfxml version='1.0' xmlns='http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML'
  xmlns:debug='http://afflib.org/bulk_extractor/debug'
  xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
  xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <metadata>
    <dc:type>Feature Extraction</dc:type>
  </metadata>
  <creator version='1.0'>
    <program>BULK_EXTRACTOR</program>
    <version>2.0.0-dev</version>
    <build_environment>
...

And here is what happens when I try to validate it:

% xmllint --noout --schema dfxml.xsd out-domexusers-be20v3/report.xml
warning: failed to load external entity "ref/dc.xsd"
dfxml.xsd:34: element import: Schemas parser warning : Element '{http://www.w3.org/2001/XMLSchema}import': Failed to locate a schema at location 'ref/dc.xsd'. Skipping the import.
warning: failed to load external entity "ref/xml.xsd"
dfxml.xsd:43: element import: Schemas parser warning : Element '{http://www.w3.org/2001/XMLSchema}import': Failed to locate a schema at location 'ref/xml.xsd'. Skipping the import.
out-domexusers-be20v3/report.xml:14: element CPPFLAGS: Schemas validity error : Element '{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}CPPFLAGS': This element is not expected. Expected is one of ( {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}compilation_date, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}library ).
out-domexusers-be20v3/report.xml:25: element cpuid: Schemas validity error : Element '{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}cpuid': This element is not expected.
out-domexusers-be20v3/report.xml:49: element configuration: Schemas validity error : Element '{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}configuration': This element is not expected. Expected is one of ( {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}source, ##other{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}*, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}diskimageobject, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}partitionsystemobject, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}partitionobject, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}volume, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}fileobject, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}rusage, ##other{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}* ).
out-domexusers-be20v3/report.xml fails to validate
%

I guess dc: is Dublin Core, so I will need to get a Dublin Core xsd file somewhere.

I'm not sure what xsi: is about. Any clue?

simsong avatar Oct 01 '21 11:10 simsong