corpuscrawler
Improve README documentation on how to contribute a new crawler
This /CONTRIBUTING.md is a License Agreement / Code of Conduct to sign. As far as I can see, this very valuable project has no actual tutorial.
I don't have the Python and coding knowledge to fix this documentation issue myself, but I can map the road so it becomes easier for the next person to do so.
Wanted
If a user wants to add a language such as Catalan from Barcelona (`ca`, `cat`: missing), what do they need to jump in quickly? What should they provide?
- What is the local structure?
- What tools do I have?
  - list of available modules
  - API of key functions
- What input(s)? A Python list of URLs?
- What are the classic parts of a crawler function?
- What output format? Raw text? Or is HTML fine because an HTML tag stripper is applied afterwards?
- An example of easily hackable base code.
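To make the last point concrete, here is a hypothetical sketch of what a per-language crawler could look like. The `fetch_content` method name and the `# Location:` header line follow the `util.py` API and output format discussed in this thread; everything else (the `FakeCrawler` stub, the example URL, the naive tag stripping) is a simplified stand-in, not the project's real implementation.

```python
import io
import re

class FakeCrawler:
    """Stand-in for util.Crawler, returning canned HTML instead of fetching."""
    def fetch_content(self, url, allow_404=False):
        return '<p>Bon dia.</p><p>Hola a tothom.</p>'

def crawl_example_site(crawler, out):
    urls = ['https://example.org/ca/news/1.html']  # hypothetical source list
    for url in urls:
        html = crawler.fetch_content(url)
        # Naive tag wiper; the real project has cleaners like cleantext().
        text = re.sub(r'<[^>]+>', '\n', html)
        out.write('# Location: %s\n' % url)  # corpus header line
        for paragraph in text.split('\n'):
            if paragraph.strip():
                out.write(paragraph.strip() + '\n')

out = io.StringIO()
crawl_example_site(FakeCrawler(), out)
result = out.getvalue()
```

The general shape (a function taking a crawler and an output handle, fetching URLs, cleaning HTML, and writing one paragraph per line under a `# Location:` header) matches the `crawl_*` signatures listed below.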
API (to complete)
Functions defined within `util.py`, in order of appearance as of 2021/02/26. If you have some relevant knowledge, please help with a sub-section or one item.
Some tools

- `daterange(start, end)`: __
- `urlpath(url)`: __
- `urlencode(url)`: __
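As an illustration of the first tool, here is a hypothetical re-implementation matching the `daterange(start, end)` signature. The semantics assumed here (one `datetime.date` per day, end inclusive) are a guess; the real `util.daterange` may differ, e.g. with an exclusive end.

```python
import datetime

def daterange(start, end):
    # Hypothetical sketch: yield one datetime.date per day,
    # from start up to end inclusive.
    day = start
    while day <= end:
        yield day
        day += datetime.timedelta(days=1)

days = list(daterange(datetime.date(2021, 2, 24), datetime.date(2021, 2, 26)))
```

A helper like this is handy for crawlers of news archives organized by date.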
Main element

- `class Crawler(object):`
  - `__init__(self, language, output_dir, cache_dir, crawldelay)`: __
  - `get_output(self, language=None)`: __
  - `close(self)`: __
  - `fetch(self, url, redirections=None, fetch_encoding='utf-8')`: __
  - `fetch_content(self, url, allow_404=False)`: __
  - `fetch_sitemap(self, url, processed=set(), subsitemap_filter=lambda x: True)`: __
  - `is_fetch_allowed_by_robots_txt(self, url)`: __
  - `crawl_pngscriptures_org(self, out, language)`: __
  - `_find_urls_on_pngscriptures_org(self, language)`: __
  - `crawl_abc_net_au(self, out, program_id)`: __
  - `crawl_churchio(self, out, bible_id)`: __
  - `crawl_aps_dz(self, out, prefix)`: __
  - `crawl_sverigesradio(self, out, program_id)`: __
  - `crawl_voice_of_america(self, out, host, ignore_ascii=False)`: __
  - `set_context(self, context)`: __
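One of the methods above, `is_fetch_allowed_by_robots_txt`, has a well-known stdlib counterpart that a description could build on. The sketch below shows what such a check could wrap; the robots.txt content is fed inline so the example is self-contained, whereas the real method presumably fetches and caches each host's robots.txt. This is an illustration of the general technique, not the project's actual code.

```python
import urllib.robotparser

# Parse an inline robots.txt that blocks /private/ for all user agents.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

allowed = rp.can_fetch('*', 'https://example.org/news/1.html')
blocked = rp.can_fetch('*', 'https://example.org/private/secret.html')
```

Respecting robots.txt (together with the `crawldelay` constructor argument) is what keeps a corpus crawler polite toward the sites it harvests.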
Some crawlers for multi-language sites

- `crawl_bbc_news(crawler, out, urlprefix)`: __
- `crawl_korero_html(crawler, out, project, genre, filepath)`: __
- `write_paragraphs(et, out)`: __
- `crawl_deutsche_welle(crawler, out, prefix, need_percent_in_url=False)`: __
- `crawl_radio_free_asia(crawler, out, edition, start_year=1998)`: __
- `crawl_sputnik_news(crawler, out, host)`: __
- `crawl_udhr(crawler, out, filename)`: __
- `crawl_voice_of_nigeria(crawler, out, urlprefix)`: __
- `crawl_bibleis(crawler, out, bible)`: __
- `crawl_tipitaka(crawler, out, script)`: __
- `find_wordpress_urls(crawler, site, **kwargs)`: __
Some cleaners

- `unichar(i)`: __
- `replace_html_entities(html)`: __
- `cleantext(html)`: __
- `clean_paragraphs(html)`: __
- `extract(before, after, html)`: __
- `fixquotes(s)`: __
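To give a feel for what the cleaners do, here is a hypothetical stand-in for two of them. The assumed semantics (`extract` returns the substring between two markers; the cleaner unescapes entities, strips tags, and collapses whitespace) are guesses from the names and from how the thread describes the output format; the real helpers' edge-case behavior may differ.

```python
import html
import re

def extract(before, after, text):
    # Hypothetical sketch of util.extract: return the substring of text
    # between the first `before` marker and the following `after` marker,
    # or None when either marker is absent.
    start = text.find(before)
    if start < 0:
        return None
    start += len(before)
    end = text.find(after, start)
    return text[start:end] if end >= 0 else None

def cleantext_sketch(markup):
    # Simplified idea of a cleaner: unescape HTML entities, strip tags,
    # collapse whitespace. Not the project's actual cleantext().
    text = re.sub(r'<[^>]+>', ' ', html.unescape(markup))
    return re.sub(r'\s+', ' ', text).strip()

body = extract('<article>', '</article>',
               '<html><article><p>Caf&eacute; &amp; te</p></article></html>')
clean = cleantext_sketch(body)
```

Chaining a marker-based extractor with a tag stripper like this is the classic way to turn a fetched page into plain corpus text.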
Shorter way to do so
In-code comments can do a lot. Pointing to wisely chosen sections can too. If you have the required know-how, please add comments to a chosen, existing crawler and point to it as an in-code tutorial.
@sffc, @brawer : anyone could help on that ?
@hugolpz Can I help by adding steps to the README about how to add a new crawler, starting with the basics of installing Python?
Hello Aayush, thank you for jumping in. I think we can assume the ability to install Python. README.md should just have a "Requirements" section with the Python version and the associated pip dependencies:
### Requirements
* python x.x+
### Dependencies
```
pip3 install {package1}
pip3 install {package2}
pip3 install {package3}
```
This would help, yes.
I made a large review of this project, but I'm a JS dev, so I walk quite blind here. Yet I think this project isn't that hard to contribute to: the main obstacles are 1. how to start, and 2. what kind of output each crawler must provide, how, and where.
@brawer, would you temporarily grant me maintainer status so I could handle the possible PRs? I would be happy to give that user right back as soon as a new, active Python dev emerges.
@hugolpz Sure, looking to add the required dependencies information in the README :)
@hugolpz I am getting an error `no module found: corpuscrawler` when running `main.py`. Can you please help debug it?
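A common cause of this error can be demonstrated in isolation. The package name and file layout below are illustrative, not the project's actual tree: running `main.py` from inside the package directory puts that directory, not the repository root, at the front of `sys.path`, so `import corpuscrawler` fails; running it as a module from the repository root works.

```python
import os
import subprocess
import sys
import tempfile

# Build a throwaway repo layout: <root>/corpuscrawler/{__init__.py, main.py}
root = tempfile.mkdtemp()
pkg = os.path.join(root, 'corpuscrawler')
os.makedirs(pkg)
open(os.path.join(pkg, '__init__.py'), 'w').close()
with open(os.path.join(pkg, 'main.py'), 'w') as f:
    f.write('import corpuscrawler\nprint("ok")\n')

# Fails: sys.path starts at the package dir, which contains no 'corpuscrawler'.
direct = subprocess.run([sys.executable, 'main.py'], cwd=pkg,
                        capture_output=True, text=True)

# Works: run as a module from the repository root instead.
as_module = subprocess.run([sys.executable, '-m', 'corpuscrawler.main'],
                           cwd=root, capture_output=True, text=True)
```

So, assuming the repository follows this usual package layout, invoking `python -m corpuscrawler.main` from the repository root is worth trying before deeper debugging.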
JS dev here; I try to help around, but I don't know Python. I can look for Python help, but it will need at least 5 days.