Ken Krugler

27 issues by Ken Krugler

The robots.txt file at http://www.scotsman.com/robots.txt has a number of issues... 1. There's no blank line between rule sections 2. The "*" user agent rule section is obviously intended to be...

robots
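The first point above (no blank line between rule sections) can be tolerated by a parser that opens a new section at a `User-agent` line instead of relying on blank lines. The following is a minimal sketch of that idea only. `RobotsSectionSplitter`, `Section`, and `split` are hypothetical names, not crawler-commons APIs, and for simplicity every non-`User-agent` field is treated as a rule of the current section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of crawler-commons.
class RobotsSectionSplitter {

    /** One rule section: the user agents it applies to, plus its rule lines. */
    static class Section {
        final List<String> userAgents = new ArrayList<>();
        final List<String> rules = new ArrayList<>();
    }

    /** Splits robots.txt content into sections without depending on blank lines. */
    static List<Section> split(String robotsTxt) {
        List<Section> sections = new ArrayList<>();
        Section current = null;
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Drop comments and surrounding whitespace; blank lines carry no meaning here.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            if (line.isEmpty()) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "field: value" line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                // A user-agent line at the start, or after at least one rule,
                // opens a new section whether or not a blank line preceded it.
                if (current == null || !current.rules.isEmpty()) {
                    current = new Section();
                    sections.add(current);
                }
                current.userAgents.add(value);
            } else if (current != null) {
                current.rules.add(field + ": " + value);
            }
        }
        return sections;
    }
}
```

Consecutive `User-agent` lines with no rule between them end up in the same section, so grouped agents still share one rule set.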

Related to PR https://github.com/crawler-commons/crawler-commons/pull/220. See https://dzone.com/articles/currency-format-validation-and for details of using Apache Commons Validator to validate/parse a wide range of currency formats.

enhancement
Priority-Low
sitemaps

Currently a text column can be created without any forward index, which is useful when using the column only for filtering. In this situation, the raw (original) text data is...

In Progress

When processing a large file (> 2000 entries), I got this error: ``` Traceback (most recent call last): File "vcardtools.py", line 281, in main() File "vcardtools.py", line 214, in main...

Without this, trying to build the project results in:
```
ld: framework not found JavaVM
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```
...

From the Flink mailing list: > +1 to using reporters. > > You will have to explicitly pass a configuration with the reporter settings to the environment via StreamExecutionEnvironment#createLocalEnvironment(int, Configuration)....

We'd want to test with a big list of seed URLs (good to confirm it handles that anyway, especially from S3, with parallelism of 1). Ensure we force parallelism of...

in progress

Currently we've got: ``` java if (parsedSiteMap instanceof SiteMapIndex) { // Log this - so we can deal with this in the future LOGGER.info("Unexpected SiteMapIndex encountered while parsing sitemap url:...

enhancement

Since anything inflight is assumed lost, we need to restore state for any URLs with state == `FETCHING` to what it was before. We already keep the previous state around...

bug
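The restore step described in this issue could look roughly like the sketch below: on recovery, any record still marked `FETCHING` is rolled back to the state it held before the fetch started. Only `FETCHING` comes from the issue text. The class, the other enum values, and the `previousState` field name are assumptions made for illustration and do not reflect the project's actual types.

```java
import java.util.List;

// Hypothetical types for illustration; only the FETCHING state name is from the issue.
class FetchStateRestorer {

    enum State { UNFETCHED, QUEUED, FETCHING, FETCHED }

    static class UrlRecord {
        final String url;
        State state;
        State previousState; // assumed field holding the state before the fetch began

        UrlRecord(String url, State state, State previousState) {
            this.url = url;
            this.state = state;
            this.previousState = previousState;
        }
    }

    /**
     * Rolls every in-flight record back to its previous state.
     * Returns the number of records that were rolled back.
     */
    static int restoreInflight(List<UrlRecord> records) {
        int restored = 0;
        for (UrlRecord record : records) {
            if (record.state == State.FETCHING) {
                // Anything in flight is assumed lost, so undo the transition.
                record.state = record.previousState;
                restored++;
            }
        }
        return restored;
    }
}
```

Records in any other state are left untouched, so running the restore more than once has no further effect.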

When this is set, then require that a checkpoint dir is specified, and use the RocksDB state backend vs. the filesystem backend.

enhancement