AttributeError: 'NoneType' object has no attribute 'find_next'

Open random1717 opened this issue 6 years ago • 8 comments

Error when running your example:

pypatent.Search('TTL/(tennis AND (racquet OR racket))')

AttributeError                            Traceback (most recent call last)
<ipython-input-2-a7c0dc5b3207> in <module>
----> 1 pypatent.Search('TTL/(tennis AND (racquet OR racket))')

/usr/local/lib/python3.7/site-packages/pypatent/__init__.py in __init__(self, string, results_limit, get_patent_details, pn, isd, ttl, abst, aclm, spec, ccl, cpc, cpcl, icl, apn, apd, apt, govt, fmid, parn, rlap, rlfd, prir, prad, pct, ptad, pt3d, pppd, reis, rpaf, afff, afft, in_, ic, is_, icn, aanm, aaci, aast, aaco, aaat, lrep, an, ac, as_, acn, exp, exa, ref, fref, oref, cofc, reex, ptab, sec, ilrn, ilrd, ilpd, ilfd)
    245         r = requests.get(url, headers=Constants.request_header).text
    246         s = BeautifulSoup(r, 'html.parser')
--> 247         total_results = int(s.find(string=re.compile('out of')).find_next().text.strip())
    248 
    249         patents = self.get_patents_from_results_url(url, limit=results_limit)

AttributeError: 'NoneType' object has no attribute 'find_next'

random1717 avatar Feb 05 '19 03:02 random1717

Just ran into this issue as well. The problem lies in the URL formatting, specifically the replace call on line 232, which changes spaces to hyphens. An easy fix is to remove that replace call and make sure multi-word terms are wrapped in escaped quotes, for example: pypatent.Search(an="\"hoffmann la roche\"", spec="diagnostics", results_limit=1).as_list()
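
For illustration, here is a minimal sketch of how a quoted multi-word term and the query from the original report look once they are percent-encoded with the standard library, with spaces becoming `+` rather than hyphens. The helper name `encode_query` is hypothetical and not part of pypatent; how pypatent builds its URL internally is not shown in this thread.

```python
from urllib.parse import quote_plus


def encode_query(query: str) -> str:
    """Hypothetical helper (not pypatent code): percent-encode a search expression.

    quote_plus turns spaces into '+' and escapes quotes, slashes and parentheses.
    """
    return quote_plus(query)


# A multi-word term wrapped in double quotes, as suggested above.
print(encode_query('"hoffmann la roche"'))
# The query from the original report.
print(encode_query("TTL/(tennis AND (racquet OR racket))"))
```

The second encoded value is the same as the `Query=` parameter in the request URL quoted further down in this thread.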

codypilot avatar Feb 05 '19 04:02 codypilot

This is related to the issue I've been having as well. The problem is: Javascript is now enforced on the search site.

If you look at the failing requests and print the text of the results page, you will see this:

daneads avatar Mar 05 '19 02:03 daneads

Selenium may be a good alternative but it'd certainly be slower/have more overhead

codypilot avatar Mar 06 '19 00:03 codypilot

Hi there,

thanks @daneads for conceiving and maintaining this great library. I'm looking forward to using it from PatZilla, which might also spark your interest.

Introduction

Today, when trying to find an answer to https://github.com/ip-tools/uspto-opendata-python/issues/2, I gave pypatent a try and had the same issue:

>>> import pypatent
>>> pypatent.Search('TTL/(tennis AND (racquet OR racket))')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/dev/sources/uspto-pbd/.venv3/lib/python3.7/site-packages/pypatent/__init__.py", line 247, in __init__
    total_results = int(s.find(string=re.compile('out of')).find_next().text.strip())
AttributeError: 'NoneType' object has no attribute 'find_next'

Investigation

After investigating a bit, I found the response body of the request to http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=0&p=1&f=S&l=50&Query=TTL%2F%28tennis+AND+%28racquet+OR+racket%29%29&d=PTXT to be valid HTML without any Javascript obfuscation. As it contains the phrase "Hits 1 through 50 out of 378", it should actually be parseable.

I verified this detail by requesting the URL using non-Javascript capable clients like curl and HTTPie.

Runtime error

However, I can confirm the code

s = BeautifulSoup(r, 'html.parser')
total_results = int(s.find(string=re.compile('out of')).find_next().text.strip())

currently still fails on that response.
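
As a side note, the AttributeError itself only says that the phrase "out of" was not found in whatever page came back. A minimal sketch of a more defensive extraction, which returns None instead of raising, could look like this. It swaps BeautifulSoup for a plain regex over tag-stripped text so that it runs with the standard library only; the function name `extract_total_results` is hypothetical, and the sample markup in the usage lines is an assumption about what the results page roughly looks like.

```python
import re
from typing import Optional


def extract_total_results(html: str) -> Optional[int]:
    """Hypothetical helper: return the hit count after 'out of', or None if absent."""
    # Replace tags with spaces so the number can be matched as plain text.
    text = re.sub(r"<[^>]+>", " ", html)
    match = re.search(r"out of\s+(\d+)", text)
    return int(match.group(1)) if match else None


# Assumed shape of a results page fragment (the real markup may differ).
sample = "Hits <strong>1</strong> through <strong>50</strong> out of <strong>378</strong>"
print(extract_total_results(sample))
# A page without the phrase yields None instead of an AttributeError.
print(extract_total_results("<html><body>something else</body></html>"))
```

With a None result, the caller can print the response text and see what the server actually sent.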

With kind regards, Andreas.

Outlook

P.S.: If Javascript obfuscation like /TSPD/08a752ce24ab200072a9cd92ec33dd5eff668cb1017860a8b5fb68de1351a3b1958ef77169637fb8?type=7 is still an issue, please let me know, as I might be able to provide more detailed information about the specific obfuscation mechanism that might be in use there. Been there, seen that... ;]

Background:

The problem is: Javascript is now enforced on the search site.

This is obviously not always the case. It may look that way, but the respective Javascript obfuscation is in fact optional and depends on the country the request was issued from.

amotl avatar Mar 14 '19 00:03 amotl

Just wanted to let you know that running this code on the Python REPL prompt works perfectly fine for me

>>> import re
>>> import requests
>>> from bs4 import BeautifulSoup

>>> r = requests.get('http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=0&p=1&f=S&l=50&Query=TTL%2F%28tennis+AND+%28racquet+OR+racket%29%29&d=PTXT', headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'})
>>> s = BeautifulSoup(r.text, 'html.parser')
>>> int(s.find(string=re.compile('out of')).find_next().text.strip())
378

while

>>> import pypatent
>>> pypatent.Search('TTL/(tennis AND (racquet OR racket))')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/amo/dev/elmyra/sources/uspto-pbd/.venv3/lib/python3.7/site-packages/pypatent/__init__.py", line 247, in __init__
    total_results = int(s.find(string=re.compile('out of')).find_next().text.strip())
AttributeError: 'NoneType' object has no attribute 'find_next'

still fails.

Bummer. Currently, I'm clueless about the root cause of this as I was expecting to essentially run the same code through both variants here.

amotl avatar Mar 15 '19 13:03 amotl

@codypilot I'd say the best route now is to implement some sort of headless browser via Selenium. It's a pain/tricky to install, but it would get around this JS issue.

daneads avatar Mar 17 '19 22:03 daneads

Dear @daneads,

thanks for following up on this. I have some thoughts about this that I would like to share with you.

Investigating the problem further

Do you still hit the *wall the USPTO has apparently put in place recently? I still have flawless direct access from Germany. To investigate this further, may I ask you to run a curl command as outlined at [1] and tell me its output and the country your request originated from?

the respective Javascript obfuscation is in fact optional and depends on the origin (country) where the request has been issued from.

Been there, seen that

I have already been there with other resources published by organizations in the field of intellectual property, and found out many details about the protection mechanism that reveals itself through

<script type="text/javascript" src="/TSPD/08a752ce24ab200072a9cd92ec33dd5eff668cb1017860a8b5fb68de1351a3b1958ef77169637fb8?type=7"></script>
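
To check whether a given response is such a challenge page rather than a results page, a rough sketch using only the standard library could look like the following. The names `ChallengeScriptFinder` and `looks_like_js_challenge` are hypothetical, and treating any script whose `src` starts with `/TSPD/` as the marker is an assumption based solely on the tag quoted above.

```python
from html.parser import HTMLParser


class ChallengeScriptFinder(HTMLParser):
    """Collects the src of every <script> tag that points below /TSPD/."""

    def __init__(self):
        super().__init__()
        self.challenge_sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        # Attribute values can be None for valueless attributes.
        src = dict(attrs).get("src") or ""
        if src.startswith("/TSPD/"):  # assumed marker, taken from this thread
            self.challenge_sources.append(src)


def looks_like_js_challenge(html: str) -> bool:
    """Hypothetical check: True if the page references a /TSPD/ script."""
    finder = ChallengeScriptFinder()
    finder.feed(html)
    return bool(finder.challenge_sources)


challenge = (
    '<script type="text/javascript" '
    'src="/TSPD/08a752ce24ab200072a9cd92ec33dd5eff668cb1017860a8b5fb68de1351a3b1958ef77169637fb8?type=7">'
    "</script>"
)
print(looks_like_js_challenge(challenge))
print(looks_like_js_challenge("Hits 1 through 50 out of 378"))
```

Running this against the response text from an affected location would show whether the parsing failure comes from the challenge page or from something else.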

Solution

to implement some sort of headless browser via Selenium

Right. When I hit that wall elsewhere recently and analyzed some of its details, I figured that would be the only viable solution. As a result, there is now a Python implementation based on Marionette in my toolbox, which might be about 95% finished already. Please let me know if you would be interested in having it added to pypatent.

With kind regards, Andreas.

[1] https://gist.github.com/amotl/bc99f3a3b7cd77c19475f74cfcbee999

amotl avatar Mar 18 '19 18:03 amotl

Any update on this one?

random1717 avatar May 02 '19 00:05 random1717