WAF Harvester parsing issues

Open benjwadams opened this issue 1 year ago • 1 comments

WAF harvesting can fail to parse many listings that are de facto WAFs, such as this one: https://gcoos4.tamu.edu/erddap/metadata/iso19115/xml/

Because the harvester looks explicitly for "a href", any link that doesn't follow that exact string ordering will fail to harvest. Is there any reason why a proper XML parsing library isn't used to find links, instead of a general text-parsing library, which has known pitfalls when applied to XML?
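To illustrate the kind of approach I mean, here is a minimal sketch using only Python's standard-library `html.parser`. The names `LinkCollector` and `extract_links` are hypothetical and are not part of ckanext-spatial; this is not a drop-in replacement for the existing WAF parser, just an example of collecting `href` values regardless of attribute order, case, or quoting:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so
        # <A HREF=...> and <a class="x" href=...> are both matched.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> found in the listing."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    # Resolve relative links against the listing URL.
    return [urljoin(base_url, href) for href in collector.hrefs]
```

Something like `extract_links(response_text, listing_url)` would then return every link on the page, and filtering for `.xml` records or subdirectories could happen afterwards.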

Also, for the above link, the "apache" parser is selected because of the "Server" header, even though this is clearly not an Apache directory listing but a reverse-proxied application. This was difficult to track down when I had to create custom logic for the "other" parser to work around some of the WAF parser shortcomings mentioned above.
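One possible way to avoid relying on the "Server" header alone is to also check whether the response body looks like an Apache autoindex page. The sketch below is only a heuristic under stated assumptions: `choose_parser` is a hypothetical function, and the "Index of" title check assumes the default Apache autoindex page title, which a customized listing may not have:

```python
def choose_parser(server_header, body):
    """Hypothetical heuristic: pick 'apache' only when the page itself
    looks like an Apache autoindex listing, otherwise fall back to 'other'."""
    server = (server_header or "").lower()
    # Assumption: default Apache autoindex pages are titled "Index of /path".
    looks_like_autoindex = "<title>index of" in body.lower()
    if "apache" in server and looks_like_autoindex:
        return "apache"
    return "other"
```

With this check, an application that merely sits behind an Apache reverse proxy would fall through to the "other" parser.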

benjwadams avatar May 22 '23 18:05 benjwadams

@benjwadams you are right that the parser used in the WAF harvester is very brittle. Any improvements on that front would be a great contribution.

amercader avatar Jul 06 '23 12:07 amercader