Benjamin Estes
Sorry, does this stuff live in CactusExamples now? Happy to edit that instead.
Fixed with a generic try/except loop. Does anyone know what went wrong here?
Reference: https://ferd.ca/simhashing-hopefully-made-simple.html
Reference: http://benwhitmore.altervista.org/simhash-and-solving-the-hamming-distance-problem-explained/
@cstrouse I've read up and understand enough to see how this is useful for associating near-duplicate content. However, I'm not sure how to use it in practice. You'd have...
I'll have a think about this. INT64 values aren't supported in JS UDFs, so we would have to use a BYTES field.
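For anyone following along, here is a rough Go sketch of the idea from the two references: build a 64-bit simhash from tokens, compare fingerprints by Hamming distance, and pack the fingerprint into 8 bytes so it could sit in a BYTES field. The whitespace tokenizer, the FNV-1a hash, the big-endian layout, and all function names are my own assumptions for illustration, not anything the project has settled on.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// Simhash builds a 64-bit fingerprint from lowercased, whitespace-separated
// tokens. Tokenizer and hash choice (FNV-1a) are assumptions for this sketch.
func Simhash(text string) uint64 {
	var counts [64]int
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()
		// Each token votes +1 or -1 on every bit position.
		for i := 0; i < 64; i++ {
			if sum&(1<<uint(i)) != 0 {
				counts[i]++
			} else {
				counts[i]--
			}
		}
	}
	// A bit is set in the fingerprint when the votes for it are positive.
	var fp uint64
	for i, c := range counts {
		if c > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// HammingDistance counts the bit positions where two fingerprints differ.
func HammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// ToBytes packs a fingerprint into 8 big-endian bytes (assumed layout).
func ToBytes(fp uint64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, fp)
	return buf
}

// FromBytes reverses ToBytes; it expects at least 8 bytes.
func FromBytes(b []byte) uint64 {
	return binary.BigEndian.Uint64(b)
}

func main() {
	a := Simhash("the quick brown fox jumps over the lazy dog")
	b := Simhash("the quick brown fox jumped over the lazy dog")
	fmt.Printf("fingerprint a: %x\n", ToBytes(a))
	fmt.Printf("fingerprint b: %x\n", ToBytes(b))
	fmt.Println("hamming distance:", HammingDistance(a, b))
}
```

Documents whose fingerprints differ in only a few bits would be treated as near duplicates; the cutoff is a tuning choice and not something I have a number for.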
I think this is a good case for test-driven development. I'll start by creating some pathologically flawed config files, though really each can only trigger one error, so I...
Just experienced a good case that could use special handling: a config file with no `From` field, when in spider mode. Spider mode is useless without `From`, so crawl should...
Does Go's unmarshalling pattern allow for this to be easily checked?
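On the unmarshalling question: as far as I know, `encoding/json` leaves a missing field at its zero value and does not report it, so the check has to happen in a separate step after unmarshalling. A minimal sketch, assuming the config is JSON and using a hypothetical `Config` struct with `Mode` and `From` string fields (the real struct and file format may well differ):

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Config is a hypothetical stand-in for the crawler config; only the
// fields needed for this check are shown.
type Config struct {
	Mode string `json:"Mode"`
	From string `json:"From"`
}

// ErrMissingFrom is returned when spider mode has nowhere to start.
var ErrMissingFrom = errors.New("spider mode requires a From field")

// Validate runs semantic checks that unmarshalling alone cannot catch.
func (c *Config) Validate() error {
	if c.Mode == "spider" && c.From == "" {
		return ErrMissingFrom
	}
	return nil
}

// LoadConfig parses the raw config and rejects it before any crawl starts
// if it fails validation.
func LoadConfig(data []byte) (*Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, fmt.Errorf("parsing config: %w", err)
	}
	if err := c.Validate(); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	// A spider-mode config with no From field should be refused.
	if _, err := LoadConfig([]byte(`{"Mode": "spider"}`)); err != nil {
		fmt.Println("config rejected:", err)
	}
}
```

Keeping the check in a `Validate` method would also fit the test-driven approach: each flawed config file maps to one expected error value.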
I think CSS Selectors are the way to go. The content already has to be parsed once to do the scraping internal to the crawler. If we can use CSS...