
Scraping Witness Lists

Open AmyMcCullough opened this issue 7 years ago • 17 comments

Witness lists are the lists of citizens and organizations who represent themselves or others by speaking for or against a bill during its hearing before a legislative committee. Witness lists could be scraped from the Texas Legislature Online and added to INFLUENCE TX. NOTE: Witnesses should be categorized as stakeholders and not listed as pro or con speakers, as the version of the bill on which they commented may have changed via various amendments before being voted on or passed into law.

AmyMcCullough avatar Oct 07 '17 05:10 AmyMcCullough

An update for people interested in this issue: I didn't have much success trying to parse the witness lists with regexes. (see my notebook and the output of that process). When you download the bills, be sure to get them from the ftp directory (ftp://ftp.legis.state.tx.us/bills/85R/witlistbill/). I believe the house and senate lists follow slightly different formats, so you might want to write separate parsers. Even within the same chamber of the legislature, there will be inconsistencies in formatting. I think the listings are not always consistent about what kind of information goes in the parentheses, and some fields are omitted on some lines. It's just messy data in that some witnesses will have filled out the sign-up form wrong, and their mistakes end up in the documents. There are Python libraries that do named entity recognition, so I suggest looking at those.

mscarey avatar Oct 26 '17 01:10 mscarey

Updated. The notebook should be easier to read and I think the output files are about 95% correct. The notebook still just uses regexes, which will never be perfect where names are involved.

mscarey avatar Nov 01 '17 22:11 mscarey

RE: Witness List issue. I was able to download the files from the SOT web site and run @mscarey’s script. That got me a HouseWitness.csv with 722 entries. I then ran a script against that file to match the names in the FullText field against the FirstName and LastName fields, which gave me a list of 108 mismatches. I ran some of those against the regex /\w*.\w+,.+(/ and seemed to get good results. I'd like to try it in the script, but I'm not sure how to insert it.
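A minimal sketch of that kind of mismatch check, for readers following along. This is not the script used above: the function name `find_name_mismatches`, the sample rows, and the assumption that every FullText entry begins with "LastName, FirstName" are all illustrative. Only the column names come from the CSV header used in this thread.

```python
import csv
import io


def find_name_mismatches(rows):
    """Return rows whose FullText does not start with 'LastName, FirstName'.

    Assumes (hypothetically) that each witness entry begins with the
    name in 'Last, First' order; real entries may not.
    """
    mismatches = []
    for row in rows:
        expected = "{}, {}".format(row["LastName"].strip(), row["FirstName"].strip())
        if not row["FullText"].strip().lower().startswith(expected.lower()):
            mismatches.append(row)
    return mismatches


# Made-up sample data standing in for HouseWitness.csv.
SAMPLE = """FullText,LastName,FirstName
"Smith, John (Self)",Smith,John
"De La Cruz, Maria (Self)",Cruz,Maria
"O'Neil, Pat (Self)",O,Pat
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
mismatches = find_name_mismatches(rows)
for row in mismatches:
    print(row["FullText"], "->", row["LastName"], row["FirstName"])
```

Multi-word surnames and names containing apostrophes are the kind of entries a check like this tends to flag, since `\w` does not match spaces or apostrophes.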

jpolache avatar Nov 29 '17 14:11 jpolache

@jpolache Are you saying you're having trouble adding new lines to the Jupyter notebook and rerunning it? I'm not clear on where you got stuck, but maybe the Jupyter documentation would help?

mscarey avatar Nov 30 '17 04:11 mscarey

Not sure where in the code to insert my regex. Also, not sure how your regexes work. I usually figure out other people's code by stepping through it in a debugger (not that your code has bugs :) ). I'm not good enough with Python's pdb to figure it out.

jpolache avatar Dec 01 '17 04:12 jpolache

Fair enough, the code is hard to understand. I just committed a new version of the notebook that hopefully will be a little clearer. I added docstrings for each function. I also changed a few of the functions to use strings rather than lists as inputs and outputs, so I hope that doesn't break any code you've already written.

mscarey avatar Dec 05 '17 11:12 mscarey

Thanks Matt. I'm working through the new code version now.

jpolache avatar Dec 06 '17 01:12 jpolache

@mscarey Getting an error :(

import csv

def export(dir, witList):
    with open(dir,'w') as f:
        writer = csv.writer(f)
        writer.writerow(['FullText', 'Position', 'Bill', 'LastName', 'FirstName', 'Role', 'Organization', 'City', 'State'])
        writer.writerows(witList) # better just to include the FullText field.
    return None

houseDir = 'C:/Users/user/Documents/code3/venv/witness/HouseWitness1205.csv'
#houseDir = '../data/witness-lists/HouseWitness.csv'
export(houseDir, houseRows)


NameError Traceback (most recent call last)
in ()
     10 houseDir = 'C:/Users/user/Documents/code3/venv/witness/HouseWitness1205.csv'
     11 #houseDir = '../data/witness-lists/HouseWitness.csv'
---> 12 export(houseDir, houseRows)

NameError: name 'houseRows' is not defined

jpolache avatar Dec 06 '17 02:12 jpolache

@jpolache is it possible you only ran the code in that block, but you didn't run the code earlier in the notebook where a list is assigned to the variable "houseRows"?

mscarey avatar Dec 06 '17 06:12 mscarey

This time, I went to the menu and selected Kernel->Restart and Run All. Here is the error:


error Traceback (most recent call last)
in ()
     22 folderName = 'C:/Users/user/Documents/code3/venv/witness/house_bills/HB00400_HB00499/'
     23 #folderName = 'bills/85R/witlistbill/html/house_bills/'
---> 24 houseWit = extractRows(folderName)

in extractRows(folderName)
     15 wit = HBWitness(source)
     16 # trying to rejoin entries split across lines
---> 17 new = mergelines(wit)
     18 new = mergelines(new) # Will doing it twice catch 3-line entries?
     19 houseWit.extend(new)

in mergelines(wit)
     28 changed += 1
     29
---> 30 elif row[0].count(')') != row[0].count('(') and wit[lineIndex + 1][0].count(')') != wit[lineIndex + 1][0].count('(') and addName(row[0]) != [None, None] and not re.search(endWithState, row[0]):
     31 newList.append([row[0] + " " + wit[lineIndex + 1][0], row[1], row[2]])
     32 badList.append(wit[lineIndex + 1][0:3])

in addName(line)
     17 for f in flags:
     18 for r in regexes:
---> 19 nameRe = re.compile(r, f)
     20 match = re.search(nameRe, line)
     21 if match:

c:\users\user\documents\code3\venv\lib\re.py in compile(pattern, flags)
    231 def compile(pattern, flags=0):
    232 "Compile a regular expression pattern, returning a pattern object."
--> 233 return _compile(pattern, flags)
    234
    235 def purge():

c:\users\user\documents\code3\venv\lib\re.py in _compile(pattern, flags)
    299 if not sre_compile.isstring(pattern):
    300 raise TypeError("first argument must be string or compiled pattern")
--> 301 p = sre_compile.compile(pattern, flags)
    302 if not (flags & DEBUG):
    303 if len(_cache) >= _MAXCACHE:

c:\users\user\documents\code3\venv\lib\sre_compile.py in compile(p, flags)
    560 if isstring(p):
    561 pattern = p
--> 562 p = sre_parse.parse(p, flags)
    563 else:
    564 pattern = None

c:\users\user\documents\code3\venv\lib\sre_parse.py in parse(str, flags, pattern)
    853
    854 try:
--> 855 p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
    856 except Verbose:
    857 # the VERBOSE flag was switched on inside the pattern. to be

c:\users\user\documents\code3\venv\lib\sre_parse.py in _parse_sub(source, state, verbose, nested)
    414 while True:
    415 itemsappend(_parse(source, state, verbose, nested + 1,
--> 416 not nested and not items))
    417 if not sourcematch("|"):
    418 break

c:\users\user\documents\code3\venv\lib\sre_parse.py in _parse(source, state, verbose, nested, first)
    766 if not source.match(")"):
    767 raise source.error("missing ), unterminated subpattern",
--> 768 source.tell() - start)
    769 if group is not None:
    770 state.closegroup(group, p)

error: missing ), unterminated subpattern at position 10
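For context, the traceback passes through `mergelines`, which appears to rejoin entries split across lines by comparing counts of opening and closing parentheses. A simplified sketch of that idea follows. It is not the notebook's implementation: `merge_split_entries`, `_unbalanced`, and the sample lines are hypothetical, and the notebook's extra checks (`addName`, `endWithState`) are left out.

```python
def _unbalanced(text):
    """True when the line has a different number of '(' and ')'."""
    return text.count("(") != text.count(")")


def merge_split_entries(lines):
    """Join a line with the following line when both have unbalanced parens.

    A single pass only handles entries split across two lines.
    """
    merged = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        if nxt is not None and _unbalanced(line) and _unbalanced(nxt):
            merged.append(line + " " + nxt)
            i += 2  # skip the line that was just absorbed
        else:
            merged.append(line)
            i += 1
    return merged


# Made-up sample: the first entry is split across two lines.
sample_lines = [
    "Smith, John (Texas Association of",
    "Example Groups)",
    "Doe, Jane (Self)",
]
result = merge_split_entries(sample_lines)
print(result)
```

In this sketch, an entry split across three lines whose middle line contains no parentheses would not be merged, because the middle line counts as balanced.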

jpolache avatar Dec 06 '17 14:12 jpolache

Fixed the "error: missing ), unterminated subpattern at position 10" issue by using an escape "\(" on the open paren in my regex pattern. But the regex pattern is not compatible with the rest of the script, apparently because I am not using grouping (?:). So far I am not able to duplicate the regex I came up with using grouping and am hesitant to try and rewrite everything else to resolve the error. Research continues.
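A short sketch of the failure and the fix described above. In the regex posted earlier in the thread, the trailing `(` sits at index 10, which matches the position in the error message. The named-group pattern `NAME_RE` and the sample line are hypothetical illustrations, not the patterns used in the notebook.

```python
import re

BROKEN = r"\w*.\w+,.+("   # final "(" opens a group that is never closed
FIXED = r"\w*.\w+,.+\("   # "\(" matches a literal opening parenthesis


def try_compile(pattern):
    """Return (compiled_pattern, None) on success or (None, error) on failure."""
    try:
        return re.compile(pattern), None
    except re.error as exc:
        return None, exc


_, err = try_compile(BROKEN)
fixed_re, _ = try_compile(FIXED)
print("broken pattern:", err)

# Hypothetical variant with named groups to pull out the name parts.
# (?P<name>...) captures; (?:...) would group without capturing.
NAME_RE = re.compile(r"(?P<last>\w*.\w+),\s*(?P<first>.+?)\s*\(")

sample = "Smith, John (Self)"  # made-up witness line
m = NAME_RE.search(sample)
if m:
    print(m.group("last"), "/", m.group("first"))
```

Whether a pattern like this fits the rest of the script depends on how the existing functions read their match groups, which this sketch does not attempt to reproduce.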

jpolache avatar Dec 09 '17 21:12 jpolache

@jpolache I can't quite tell which regular expression is triggering the error for you. Maybe the issue is that I was running a different version of Python or a library. I added lines to the notebook (https://github.com/open-austin/influence-texas/blob/master/notebooks/witness-lists.ipynb) that show what I was running. Here's what they show:

Python 3.6.1 :: Continuum Analytics, Inc.
bs4 4.6.0
pandas 0.20.3
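One way to produce a report like this is sketched below. It is an assumption about approach, not the lines actually added to the notebook: `report_versions` is a hypothetical helper, and it relies on `importlib.metadata`, which requires Python 3.8 or later (newer than the interpreter versions discussed in this thread).

```python
import sys
from importlib import metadata


def report_versions(packages):
    """Return one line for the interpreter and one per distribution name."""
    lines = ["Python " + sys.version.split()[0]]
    for name in packages:
        try:
            lines.append("{} {}".format(name, metadata.version(name)))
        except metadata.PackageNotFoundError:
            # Keep going so the report still lists every requested package.
            lines.append("{} (not installed)".format(name))
    return lines


report = report_versions(["beautifulsoup4", "pandas"])
print("\n".join(report))
```

Note that the names passed in are distribution names as known to pip (for example `beautifulsoup4`), which can differ from the import name (`bs4`).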

mscarey avatar Dec 13 '17 05:12 mscarey

Thanks Matt. I'll check it out.


jpolache avatar Dec 13 '17 12:12 jpolache

Matt,

Python 3.6.3 (v3.6.3:2c5fed8, Oct 3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)]

beautifulsoup4==4.6.0
bs4==0.0.1
pandas==0.21.0

freeze.txt: https://github.com/open-austin/influence-texas/files/1561598/freeze.txt

jpolache avatar Dec 15 '17 04:12 jpolache

I will be at the meetup on Monday evening. Will Matt or John be there as well? If so, we should be able to easily resolve this.

-Michael


lazarus1331 avatar Dec 17 '17 01:12 lazarus1331

@lazarus1331 Yeah, I'll be there. There isn't usually a bunch of time to do project work at the library, but I'll bring my computer.

mscarey avatar Dec 18 '17 22:12 mscarey

Sorry I was unable to attend the meetup. I now have the code running in my environment and have done some analysis of the output.

Let me know if you would like to discuss next steps.

Jonathan

jpolache avatar Dec 20 '17 02:12 jpolache