Strange characters in PDF - potential fixes

Open calebpeffer opened this issue 9 months ago • 1 comments

The issue is that some PDFs are returning bad text. On top of fixing the root cause, it may be worth returning the original PDF so the user can parse it with their own methods.

What it looks like in playground:

image

Offending PDF: https://cha.house.gov/_cache/files/a/7/a78dca53-0f09-4177-af3f-4b0db28e1382/675BC81A2FDD21A8096B91CE41F10184.billmcfarrlandbiowith-seal-final.pdf

Ideal text output:


WILLIAM P. MCFARLAND
ACTING SERGEANT AT ARMS
UNITED STATES HOUSE OF REPRESENTATIVES
William P. McFarland was sworn in as Acting Sergeant at Arms on January 7, 2023, during the
1st session of the 118th Congress. Mr. McFarland is an experienced security administrator and
educator offering years of progressive experience related to complex intelligence and security-related activities supporting the U.S. Government.
Mr. McFarland attended the University of Maryland, where he earned a Bachelor of Arts degree
in Criminology, and attended Webster University at Bolling Air Force Base, where he earned a
Master of Arts in Security Management.
Mr. McFarland began his professional career as a Security Aide/Facility Security Officer at the
National Security Agency from 1990 to 1991. In 1991, he began his career on Capitol Hill as a
Security Aide for the United States Capitol Police, a position he held until 1995. Mr. McFarland
then transitioned to the Permanent Select Committee on Intelligence, where he was Director of
Security from 1995 to 2005. In 2005, he became Director of the Office of House Security, and
remained until 2021. Mr. McFarland then left Capitol Hill to become Vice President of Security
in the private sector from 2021 until his being sworn in as Acting Sergeant at Arms.
Mr. McFarland resides in Severn, Maryland with his wife and two children.

Thanks @sayakmaity for the suggestion.

calebpeffer avatar May 16 '24 22:05 calebpeffer

@calebpeffer I just ran a scrape on https://cha.house.gov/_cache/files/a/7/a78dca53-0f09-4177-af3f-4b0db28e1382/675BC81A2FDD21A8096B91CE41F10184.billmcfarrlandbiowith-seal-final.pdf and got the PDF with its content. It looks like the issue we had before was caused by not being able to check whether the files were actually PDFs (closed in #29). Do you know of any other URLs with this problem?
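
For anyone wondering what "checking if a file is actually a PDF" can look like, here is a minimal sketch in Python. It is not the code from #29, only an illustration: the function name `looks_like_pdf` and the 1024-byte search window are assumptions made for this example. It relies on the fact that PDF files carry a `%PDF-` header marker at or near the start of the file.

```python
# Hypothetical helper, not Firecrawl's implementation: sniff the PDF
# header marker instead of trusting the URL extension or Content-Type.
PDF_MARKER = b"%PDF-"
SEARCH_WINDOW = 1024  # assumed window; some files have leading junk bytes


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears near the start of data."""
    return PDF_MARKER in data[:SEARCH_WINDOW]


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))   # PDF-like bytes
    print(looks_like_pdf(b"<!DOCTYPE html><html></html>"))    # HTML error page
```

A check like this would catch the case where a URL ending in `.pdf` actually returns an HTML page, which would otherwise be handed to the PDF parser.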

rafaelsideguide avatar May 29 '24 17:05 rafaelsideguide

This seems to be fixed. Closing for now @calebpeffer @sayakmaity

nickscamara avatar Jun 03 '24 23:06 nickscamara