tiny-web-crawler
First Major Release v1.0.0
This is a placeholder issue for the first major release v1.0.0
Please feel free to create issues from this list
Scope and Features: First major version v1.0.0
Functional Requirements
- [x] Basic Crawling Functionality #1
- [x] Configurable options for maximum links to crawl #1
- [x] Handle both relative and absolute URLs #1
- [x] Save crawl results to a specified file #1
- [x] Configurable verbosity levels for logging #7
- [x] Concurrency and custom delay #7
- [x] Support regular expressions #16
- [x] Crawl internal / external links only #11
- [x] Return optional html in response #19
- [ ] Crawl depth per website/domain #37
- [x] Logging #38
- [x] Retry mechanism for transient errors #39
- [ ] Support JavaScript-heavy dynamic websites #10
- [x] (Optional) Respect robots.txt #42
- [ ] (Optional) User-Agent Customization
- [ ] (Optional) Proxy support
- [ ] (Optional) Use Asynchronous I/O
- [ ] (Optional) Crawl output to a database (maybe MongoDB)
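Several of the items above (retry mechanism for transient errors, custom delay) amount to wrapping the fetch step. As a hedged sketch of the retry idea only, not tiny-web-crawler's actual API, a backoff helper could look like this (`fetch_with_retry` and the injected `fetch` callable are hypothetical names):

```python
import time


def fetch_with_retry(fetch, url, max_retries=3, base_delay=0.1):
    """Retry transient failures with exponential backoff.

    `fetch` is any callable that raises on a transient error
    (a hypothetical stand-in for the crawler's HTTP layer).
    Delays grow as base_delay * 2**attempt between retries.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                # Out of retries: surface the last error to the caller.
                raise
            time.sleep(base_delay * (2 ** attempt))


# Example: a fetcher that fails twice before succeeding.
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError("transient network error")
    return "<html>ok</html>"

print(fetch_with_retry(flaky_fetch, "https://example.com", base_delay=0.01))
```

Separating the retry policy from the fetcher keeps the crawl loop testable without any network access, as the example shows.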
Non-Functional Requirements
- [x] Git workflow for CI/CD #4
- [ ] Documentation (API and Developer) #18
- [x] Test coverage above 80% #28
- [x] Git hooks #22
- [x] Modular and Extensible Architecture #17
- [ ] (Optional) Memory benchmark: monitor memory usage during the crawling process
- [ ] (Optional) Security considerations (e.g., handling of malicious content)
You forgot to check "Return optional html in response https://github.com/indrajithi/tiny-web-crawler/pull/19" ;)
@indrajithi On Git hooks maybe you should link my second pr on that feature (#25 ) so people also see the pre-commit install --hook-type pre-push command :)
@indrajithi you can check "Test coverage above 80%" now ;)