purple-hats PDFs are being scanned when they shouldn't be.

Open mgifford opened this issue 10 months ago • 1 comments

I am not setting the file type option (-i, --fileTypes). I am running:

node --max-old-space-size=6000 --no-deprecation purple-a11y/cli.js -u https://www.whitehouse.gov -c 2 -s same-domain -p 50 -a none --blacklistedPatternsFilename ./pa-gTracker-exclude-medicare.csv -k "Random Example:[email protected]"

But I am still finding PDFs in the list of URLs crawled. This shouldn't be the case. If the default is HTML only, then I shouldn't see any PDFs (or other documents) in my results.
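
For anyone trying to confirm the same symptom, here is a minimal sketch of how the crawled URLs could be checked for PDFs. It assumes the URLs have been copied out of the scan results as plain text, one per line; the helper name list_pdf_urls and the example.com URLs are hypothetical and not part of Purple A11y.

```shell
# Hypothetical helper (not part of Purple A11y): print only the URLs
# that point at PDF files. Reads URLs, one per line, from standard input.
list_pdf_urls() {
  # Match ".pdf" (any letter case) at the end of the URL, optionally
  # followed by a query string or fragment. "|| true" keeps the exit
  # status at zero when no PDF URLs are found.
  grep -iE '\.pdf([?#].*)?$' || true
}

# Example input; in practice, pipe in the URLs taken from the scan results.
printf '%s\n' \
  'https://example.com/index.html' \
  'https://example.com/report.pdf' \
  'https://example.com/files/Form.PDF?download=1' \
  | list_pdf_urls
```

If this prints anything for an html-only scan, those lines are the PDF URLs worth attaching to the report.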

Apr 15 '24 16:04 mgifford

Hi @mgifford, can I check which version of Purple A11y you are using to run the scan? E.g. 0.9.46, or newer (i.e. directly from GitHub master)?

If you are running a version from master, can you get the commit id so I can understand if this issue was already fixed? You can use the following command: git log -1 --format="%H"

I have not been able to replicate the issue of PDFs being scanned when the default strategy is html-only on the latest master commit.

Apr 19 '24 02:04 younglim
