Don't insist on answer component of URL

Open opoudjis opened this issue 8 years ago • 0 comments

crawler.py can be used to retrieve blog posts from Quora, not just answers. For that to work, the constraint that the fetched URL must contain an answer component (quora.com/<question>/answer/...) needs to be relaxed:

# Get the part of the URL indicating the question title; we will save under this name
# (raw strings, so that \. reaches the regex engine without an invalid-escape warning)
m1 = re.search(r'quora\.com/([^/]+)/answer', url)
# if there's a context topic
m2 = re.search(r'quora\.com/[^/]+/([^/]+)/answer', url)
filename = added_time + ' '
if m1 is not None:
    filename += m1.group(1)
elif m2 is not None:
    filename += m2.group(1)
else:
    print('[ERROR] Could not find question part of URL %s; skipping' % url, file=sys.stderr)
    continue

I changed the last two lines (the error message and the continue) to:

    # blog post: take the first path component after quora.com/
    m3 = re.search(r'quora\.com/([^/]+)', url)
    filename += m3.group(1)
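
For anyone wanting to try the relaxed matching outside the crawler loop, here is a minimal sketch. The function name `url_stem` and the example URLs are hypothetical and not part of crawler.py; the three patterns are the ones quoted above. Unlike the two-line change, it returns None when nothing matches rather than raising on `m3.group(1)`, so the caller can still print the error and skip.

```python
import re

# The three patterns from this issue, tried in the same order as in crawler.py.
ANSWER = re.compile(r'quora\.com/([^/]+)/answer')
TOPIC_ANSWER = re.compile(r'quora\.com/[^/]+/([^/]+)/answer')
FALLBACK = re.compile(r'quora\.com/([^/]+)')  # blog post or anything else


def url_stem(url):
    """Return the URL component to save under, or None if no pattern matches.

    Hypothetical helper for illustration only.
    """
    for pattern in (ANSWER, TOPIC_ANSWER, FALLBACK):
        m = pattern.search(url)
        if m is not None:
            return m.group(1)
    return None


if __name__ == '__main__':
    # Made-up URLs, only to exercise each branch.
    for u in ('https://www.quora.com/What-is-X/answer/Some-User',
              'https://www.quora.com/topic/What-is-X/answer/Some-User',
              'https://someblog.quora.com/Some-Post',
              'https://example.com/nothing-here'):
        print(u, '->', url_stem(u))
```

In the crawler loop this would be used as `stem = url_stem(url)`, followed by the existing error message and `continue` when `stem` is None.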

Mar 19 '17 03:03 opoudjis