api.congress.gov

Accessing debates -- API issues

Open joguldi opened this issue 1 year ago • 1 comments

Revered Librarians and Data Scientists,

We are looking for the text of debates on the floor of the House of Representatives and the Senate since 2010, which we believe the staff of Congress.gov can access in “plain text” form.

We have had limited success gaining access to this data through the API. As currently set up, the Congress.gov API “times out” after only a handful of requests. This built-in limit on how much data the public can request means that we effectively cannot get to the data. We have also tried scraping Congress.gov for text in PDF format. From the PDFs it is possible to extract all the words said in the House of Representatives or the Senate, but it is extremely labor-intensive to recover which speaker said those words or on which date they were said.

In effect, all we need is the “raw data” with speeches, speakers, and dates, which is probably something someone at Congress.gov could hand us on a zip drive. Alternatively, if Congress.gov could “whitelist” our IP addresses, we would be able to grab the data ourselves from the API. Our Emory University addresses follow.

We are extremely grateful for your help and attention to this problem!

With best wishes,
Jo Guldi, Professor, Quantitative Theory and Methods, Emory University
Stephanie Buongiorno, Postdoctoral Fellow in Engineering, Southern Methodist University

Emory University IP addresses that need whitelisting:

Server: 170.140.1.1
Address: 170.140.1.1#53
Address: 170.140.140.248

Server: 170.140.1.1
Address: 170.140.1.1#53
Address: 170.140.140.250
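For readers hitting the same time-outs, one client-side mitigation is to retry each failed request with an exponentially growing pause. The sketch below is only an illustration under stated assumptions: `TransientError`, `fetch_with_backoff`, and the `fetch` callable are hypothetical names, not part of the Congress.gov API, and the issue does not say whether the time-outs are transient or a hard limit, so this may not resolve the problem described above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a time-out or rate-limit failure."""


def fetch_with_backoff(
    fetch: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on TransientError with exponential backoff.

    `fetch` is any zero-argument callable that performs one request and
    raises TransientError when the request times out (an assumption; adapt
    it to whatever HTTP client is actually used).
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: surface the last failure
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

In use, `fetch` would wrap a single API request and translate a time-out into `TransientError`. The `sleep` parameter is injectable so the pacing can be tested without actually waiting.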

joguldi, Sep 18 '24 17:09