Aécio Santos

Results: 54 comments by Aécio Santos

Hi @tpolo777, apologies for the late response. ACHE is able to _download_ PDF files when it finds links to them, but it won't necessarily prioritize downloading such files.

I believe it shouldn't be very hard. HTML parsing is done here: https://github.com/ViDA-NYU/ache/blob/17577ccc9a43121f722843ce914ab02f0538be41/src/main/java/focusedCrawler/crawler/async/FetchedResultHandler.java#L48-L62 For PDFs, we would basically need to detect PDF mime types, try to parse them, and...
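The first step mentioned above, detecting PDF mime types, could look roughly like the sketch below. It is only an illustration and not code from ACHE: the class `PdfSniffer` and its methods are hypothetical names, and it assumes the handler has access to the `Content-Type` header value and the raw response bytes. It falls back to the `%PDF-` file signature for servers that send a wrong or missing content type.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of ACHE: decides whether a fetched
// response looks like a PDF before handing it to a PDF parser.
class PdfSniffer {

    // PDF files begin with the ASCII signature "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the declared mime type is application/pdf,
    // or if the body starts with the PDF signature.
    static boolean isPdf(String contentType, byte[] body) {
        if (contentType != null) {
            // Drop parameters such as "; charset=binary".
            String mime = contentType.split(";", 2)[0]
                    .trim()
                    .toLowerCase(Locale.ROOT);
            if (mime.equals("application/pdf")) {
                return true;
            }
        }
        return startsWithMagic(body);
    }

    // Checks the first bytes of the body against the PDF signature.
    static boolean startsWithMagic(byte[] body) {
        if (body == null || body.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (body[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A handler could call `PdfSniffer.isPdf(contentType, bytes)` and route matching responses to a PDF text extractor instead of the HTML parser.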

Unfortunately, the current HTML parser and tokenization do not have great support for UTF-8 and other encodings and are not well tested with many languages. We started a full rewrite...

Sorry for the long delay in responding. I have just seen other people running into this same problem on Windows. This is related to the underlying RocksDB database engine that...

It seems that ACHE is not able to connect to Elasticsearch to check whether the index already exists. So the problem might be with your Elasticsearch instance.

We usually don't test or support running the crawler on Windows. Also, without more detailed error logs, it is hard to know what is happening.

Are we talking about a new link classifier? I assumed it was a target page classifier.

I was thinking of having a more generic classifier that can combine a list of any other existing classifiers. For example, it would be configured like this: ```yaml type: combiner...

What do you mean by complex boolean expressions? Could you give an example?

Got it. I think nesting `combiner` classifiers would enable complex boolean expressions and could be easily supported. It would also enable combining Weka models with arbitrary regex-based classifiers. For...
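To illustrate how nesting could express a boolean formula, here is a minimal sketch using plain Java predicates over page text. It is not ACHE's classifier API: `CombinerSketch`, `Mode`, `combiner`, and `regex` are hypothetical names, and a real implementation would wrap the crawler's own classifier interface (including Weka-based models) rather than `Predicate<String>`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch, not ACHE code: a combiner is itself a classifier,
// so combiners can be nested to build arbitrary AND/OR expressions.
class CombinerSketch {

    enum Mode { AND, OR }

    // Combines child classifiers: AND requires all children to match,
    // OR requires at least one child to match.
    static Predicate<String> combiner(Mode mode, List<Predicate<String>> children) {
        return page -> mode == Mode.AND
                ? children.stream().allMatch(c -> c.test(page))
                : children.stream().anyMatch(c -> c.test(page));
    }

    // A leaf classifier: matches if the regex is found anywhere in the page,
    // ignoring case.
    static Predicate<String> regex(String pattern) {
        Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
        return page -> p.matcher(page).find();
    }

    // Builds the expression (ebola OR zika) AND outbreak.
    static Predicate<String> example() {
        return combiner(Mode.AND, Arrays.asList(
                combiner(Mode.OR, Arrays.asList(regex("ebola"), regex("zika"))),
                regex("outbreak")));
    }
}
```

Because the result of `combiner` has the same type as its children, any other classifier that can be adapted to the same interface could take the place of a `regex` leaf.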