corpuscrawler

Improve readme documentation on how to provide a new crawler

Open hugolpz opened this issue 3 years ago • 5 comments

This /CONTRIBUTING.md is a License Agreement / Code of Conduct to sign. As far as I can see, this very valuable project has no actual tutorial.

I don't have the Python or coding knowledge to fix this documentation issue myself, but I can map out the road so it becomes easier for the next person to do so.

Wanted

Suppose a user wants to add a language such as Catalan from Barcelona (ca, cat: currently missing). What do they need to jump in quickly? What should they provide?

  • What is the local structure?
    • util.py: stores functions used by the crawlers of multiple languages
    • main.py: stores the 1000+ crawler calls and runs them all
    • crawl_{iso}.py: stores a language-specific corpus's source URLs and processing functions
  • What tools are available?
  • What input(s)? A Python list of URLs?
  • What output format: raw text? Is HTML fine because an HTML tag stripper is applied afterwards?
  • What are the classic parts of a crawler function?
  • An example of easily hackable base code.
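To make the expected shape concrete, here is a self-contained sketch of the pattern a crawl_{iso}.py module seems to follow: fetch a page, strip the markup, write paragraphs to the language's output. Everything here (the StubCrawler, the example URL, and crawl_ca itself) is hypothetical illustration based only on the function names listed in this issue, not the project's actual code.

```python
import re

class StubCrawler:
    """Hypothetical stand-in for util.Crawler, serving canned HTML offline."""
    PAGES = {
        'http://example.org/ca/page1.html':
            '<html><body><p>Bon dia!</p><p>Com va tot?</p></body></html>',
    }

    def __init__(self):
        self._out = []

    def fetch_content(self, url):
        # The real fetch_content() presumably does an HTTP GET with caching.
        return self.PAGES[url]

    def get_output(self, language=None):
        return self._out

def cleantext(html):
    # Crude stand-in for util.cleantext(): drop markup, collapse whitespace.
    return re.sub(r'\s+', ' ', re.sub(r'<[^>]+>', ' ', html)).strip()

def crawl_ca(crawler):
    # Hypothetical per-language entry point, as a crawl_ca.py might define it.
    out = crawler.get_output(language='ca')
    for url in sorted(StubCrawler.PAGES):
        html = crawler.fetch_content(url)
        out.append('# Location: %s' % url)  # provenance marker
        for paragraph in re.findall(r'<p>(.*?)</p>', html):
            out.append(cleantext(paragraph))

crawler = StubCrawler()
crawl_ca(crawler)
print(crawler.get_output())
```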

API (to complete)

Defined functions within util.py, in order of appearance as of 2021/02/26. If you have relevant knowledge, please help with a sub-section or a single item.

Some tools

  • daterange(start, end): __
  • urlpath(url): __
  • urlencode(url): __
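The signatures alone suggest plausible behavior. For instance, daterange(start, end) presumably yields one date per day; the following is a guess at those semantics (an inclusive end date is an assumption), not the actual util.py code.

```python
from datetime import date, timedelta

def daterange(start, end):
    """Guess at util.daterange(): yield each date from start to end.

    Whether the end date is inclusive is an assumption."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)

days = list(daterange(date(2021, 2, 24), date(2021, 2, 26)))
print(days)
```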

Main element

  • class Crawler(object):
    • __init__(self, language, output_dir, cache_dir, crawldelay): __
    • get_output(self, language=None): __
    • close(self): __
    • fetch(self, url, redirections=None, fetch_encoding='utf-8'): __
    • fetch_content(self, url, allow_404=False): __
    • fetch_sitemap(self, url, processed=set(), subsitemap_filter=lambda x: True): __
    • is_fetch_allowed_by_robots_txt(self, url): __
    • crawl_pngscriptures_org(self, out, language): __
    • _find_urls_on_pngscriptures_org(self, language): __
    • crawl_abc_net_au(self, out, program_id): __
    • crawl_churchio(self, out, bible_id): __
    • crawl_aps_dz(self, out, prefix): __
    • crawl_sverigesradio(self, out, program_id): __
    • crawl_voice_of_america(self, out, host, ignore_ascii=False): __
    • set_context(self, context): __
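As an illustration of one of these methods: is_fetch_allowed_by_robots_txt presumably checks a target URL against the site's robots.txt rules. Here is an offline sketch using only the standard library; the real method likely fetches and caches robots.txt itself, and the function name, user agent, and rules below are hypothetical.

```python
from urllib import robotparser

def is_fetch_allowed(robots_txt, url, agent='CorpusCrawler'):
    # Offline sketch: parse robots.txt rules passed in as a string and
    # ask whether the given user agent may fetch the URL.
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

rules = 'User-agent: *\nDisallow: /private/\n'
print(is_fetch_allowed(rules, 'http://example.org/news/a.html'))     # True
print(is_fetch_allowed(rules, 'http://example.org/private/x.html'))  # False
```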

Some crawlers for multilingual sites

  • crawl_bbc_news(crawler, out, urlprefix): __
  • crawl_korero_html(crawler, out, project, genre, filepath): __
  • write_paragraphs(et, out): __
  • crawl_deutsche_welle(crawler, out, prefix, need_percent_in_url=False): __
  • crawl_radio_free_asia(crawler, out, edition, start_year=1998): __
  • crawl_sputnik_news(crawler, out, host): __
  • crawl_udhr(crawler, out, filename): __
  • crawl_voice_of_nigeria(crawler, out, urlprefix): __
  • crawl_bibleis(crawler, out, bible): __
  • crawl_tipitaka(crawler, out, script): __
  • find_wordpress_urls(crawler, site, **kwargs): __

Some cleaners

  • unichar(i): __
  • replace_html_entities(html): __
  • cleantext(html): __
  • clean_paragraphs(html): __
  • extract(before, after, html): __
  • fixquotes(s): __
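To show roughly what the cleaners must accomplish, here are standard-library sketches; each implementation is an assumption based only on the function names above, not the real util.py code.

```python
import html
import re

def replace_html_entities_sketch(text):
    # Guess at replace_html_entities(): decode &amp;, &#233;, &eacute;, etc.
    return html.unescape(text)

def cleantext_sketch(markup):
    # Guess at cleantext(): strip tags, decode entities, collapse whitespace.
    no_tags = re.sub(r'<[^>]+>', ' ', markup)
    return re.sub(r'\s+', ' ', html.unescape(no_tags)).strip()

print(cleantext_sketch('<p>Caf&eacute; &amp; t&egrave;</p>'))
```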

A shorter way to do this

In-code comments can do a lot, and so can pointing to wisely chosen sections. If you have the required know-how, please add comments to a chosen, existing crawler and point to it as an in-code tutorial.

@sffc, @brawer : anyone could help on that ?

hugolpz avatar Feb 25 '21 22:02 hugolpz

@hugolpz Can I help by adding steps to the README on how to add a new crawler, starting with the basics of installing Python?

Aayush-hub avatar Mar 14 '21 17:03 Aayush-hub

Hello Aayush, thank you for jumping in. I think we can assume the ability to install Python. README.md should just have a "Requirements" section with the Python version and the associated pip dependencies:

### Requirements
* python x.x+

### Dependencies
```
pip3 install {package1}
pip3 install {package2}
pip3 install {package3}
```

This would help, yes.

I did a broad review of this project, but I'm a JS dev, so I'm walking somewhat blind here. Still, I don't think this project is that hard to contribute to: the main obstacles are 1. how to get started and 2. what kind of output each crawler must provide, how, and where.

@brawer, would you temporarily grant me maintainer status so I could handle the incoming PRs? I would be happy to give that user right back as soon as a new, active Python dev emerges.

hugolpz avatar Mar 14 '21 23:03 hugolpz

@hugolpz Sure, looking to add required dependencies information in README :)

Aayush-hub avatar Mar 15 '21 05:03 Aayush-hub

@hugolpz I'm getting a `no module found: corpuscrawler` error when running main.py. Can you please help debug it?

Aayush-hub avatar Mar 15 '21 08:03 Aayush-hub

JS dev here; I try to help out, but I don't know Python. I can look for Python help, but it will take at least 5 days.

hugolpz avatar Mar 15 '21 21:03 hugolpz