webmagic
A scalable web crawler framework for Java.
The PriorityScheduler source is shown in the screenshot (image). Question: why are three queues needed? Couldn't the queue in QueueScheduler simply be replaced with a PriorityBlockingQueue? Also, the count of requests left in the queue looks wrong, because it only counts one of the queues. Could the author please take a look? The QueueScheduler source is shown in the screenshot (image). Any guidance from the author would be appreciated, thanks!
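The single-queue idea raised in this question could be sketched roughly as follows. This is not webmagic's code: `PrioritizedUrl` and `SingleQueueScheduler` are hypothetical stand-ins written only for illustration. One detail to keep in mind is that `PriorityBlockingQueue` does not guarantee any particular order among elements of equal priority, so the sketch assumes an insertion counter is acceptable as a tie-breaker to keep first-in, first-out order within one priority level.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a crawl request; not webmagic's Request class. */
class PrioritizedUrl {
    final String url;
    final long priority;
    final long sequence; // insertion order, used only to break priority ties

    PrioritizedUrl(String url, long priority, long sequence) {
        this.url = url;
        this.priority = priority;
        this.sequence = sequence;
    }
}

/** Hypothetical scheduler backed by one PriorityBlockingQueue. */
class SingleQueueScheduler {
    private final AtomicLong counter = new AtomicLong();

    // Higher priority first; equal priorities fall back to insertion order.
    private final PriorityBlockingQueue<PrioritizedUrl> queue =
            new PriorityBlockingQueue<>(
                    11,
                    Comparator.comparingLong((PrioritizedUrl r) -> r.priority)
                            .reversed()
                            .thenComparingLong(r -> r.sequence));

    void push(String url, long priority) {
        queue.add(new PrioritizedUrl(url, priority, counter.getAndIncrement()));
    }

    /** Returns the next URL, or null when nothing is queued. */
    String poll() {
        PrioritizedUrl next = queue.poll();
        return next == null ? null : next.url;
    }

    /** With a single queue, the remaining count is simply the queue size. */
    int size() {
        return queue.size();
    }
}
```

With only one queue there is a single place to count pending requests, which sidesteps the counting concern described above. Whether this ordering matches what the existing three-queue design intends is for the maintainers to confirm.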
I can't find any crawler policy and/or property to restrict crawling depth. Is it missing, so that the only way to restrict depth is to choose a suitable selector in the PageProcessor?
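To make the request concrete, here is a minimal sketch of what a depth limit means in a breadth-first crawl. It uses no webmagic classes: `DepthLimitedCrawl`, its `crawl` method, and the in-memory `links` map are all hypothetical, and the map stands in for pages that would normally be downloaded. The idea is to carry each URL's distance from the seed and stop following links once that distance reaches `maxDepth`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of depth-limited crawling; not a webmagic API. */
class DepthLimitedCrawl {

    /** Pairs a URL with its distance from the seed. */
    private static final class Entry {
        final String url;
        final int depth;

        Entry(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    /**
     * Visits pages breadth-first starting at seed (depth 0) and never
     * follows links found on pages that are already at maxDepth.
     */
    static List<String> crawl(Map<String, List<String>> links, String seed, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<Entry> frontier = new ArrayDeque<>();
        frontier.add(new Entry(seed, 0));
        seen.add(seed);

        while (!frontier.isEmpty()) {
            Entry current = frontier.poll();
            visited.add(current.url);
            if (current.depth >= maxDepth) {
                continue; // depth limit reached: do not enqueue outgoing links
            }
            for (String next : links.getOrDefault(current.url, Collections.emptyList())) {
                if (seen.add(next)) { // add() returns false for URLs already queued
                    frontier.add(new Entry(next, current.depth + 1));
                }
            }
        }
        return visited;
    }
}
```

In a real crawler the depth would have to travel with each queued request, so a built-in policy would need some per-request metadata to hold it.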
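For example, some download URLs fail with a read timeout caused by network fluctuations. In that case, onError needs to receive the exception information so the failure can be handled promptly.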
Throws an exception when the waiting time has detectably elapsed before the method returns.
- diamond operator (available since Java 7) - naming conventions - duplication
Adding a try-catch-finally clause to properly close the configFileReader file
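A minimal sketch of the kind of cleanup this change describes is shown below. It is not the code from the pull request: `ConfigLoader` and `load` are hypothetical names, and the sketch swaps the explicit `finally` block for try-with-resources (Java 7+), which closes the reader automatically whether or not loading succeeds.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper; illustrates closing a config reader reliably. */
class ConfigLoader {

    /**
     * Loads key=value pairs from the given source and always closes it,
     * even if reading fails part-way through.
     */
    static Properties load(Reader source) {
        Properties props = new Properties();
        // The resource declared here is closed automatically on exit,
        // which is what a hand-written finally block would otherwise do.
        try (BufferedReader configFileReader = new BufferedReader(source)) {
            props.load(configFileReader);
        } catch (IOException e) {
            // Rethrow unchecked so callers are not forced to handle IOException.
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

Closing the `BufferedReader` also closes the underlying `Reader`, so the caller does not need a second cleanup step.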
Found a code smell caused by a missing decorator. This fixes the code smell; I am trying out pull requests for school.
1. Add the @Deprecated annotation together with the @deprecated Javadoc tag so that tools such as IDEs can warn about references to deprecated elements and alert the user when the element...
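The pairing described here could look like the sketch below. `LegacyApi`, `fetch`, and `fetchAll` are hypothetical names chosen for illustration and do not come from webmagic. The annotation is what the compiler and IDEs act on, while the Javadoc tag tells readers what to use instead.

```java
/** Hypothetical class showing the annotation and the Javadoc tag together. */
class LegacyApi {

    /**
     * Returns all items.
     *
     * @deprecated use {@link #fetchAll()} instead.
     */
    @Deprecated
    static String fetch() {
        // Delegates to the replacement so both entry points behave the same.
        return fetchAll();
    }

    /** Replacement for the deprecated fetch() method. */
    static String fetchAll() {
        return "all";
    }
}
```

Callers of `fetch()` still compile, but they get a deprecation warning that points them at the replacement.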
Could someone more experienced explain this?
Hello! In this pull request, I'm correcting and removing code smells, and also refactoring some methods for better readability.