Tim Allison comments

Results: 93 comments of Tim Allison

[FEATURE] Reimplement BulkProcessor

Any updates on this? This is a blocker on https://issues.apache.org/jira/browse/NUTCH-2994. Let me know if I can help.

WACZ futurism: mimetype and Pronom ID

We're adding WACZ detection (maybe parsing?) over on Apache Tika now. As a temporary placeholder at least, is `application/wacz` appropriate? https://issues.apache.org/jira/browse/TIKA-3696

Memory leak with wildcard inside double quotes

Exciting! Ugh. We should probably create a MatchAllDocsQuery for that, like we do for `*:*`, if we aren't already?
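As a rough illustration of the idea, here is a minimal, self-contained sketch of short-circuiting a lone `*` term into a match-all query instead of a wildcard expansion. Everything in it is hypothetical: `WildcardRewriteSketch`, `ParsedQuery`, `Kind`, and `rewrite` are stand-ins invented for this example, not Lucene or Solr classes, and treating a `null` field as "no default field" is an assumption. It is not the parser's actual code.

```java
// Hypothetical stand-in types; these are NOT Lucene or Solr classes.
final class WildcardRewriteSketch {
    enum Kind { MATCH_ALL, WILDCARD }

    static final class ParsedQuery {
        final Kind kind;
        final String field;
        final String term;

        ParsedQuery(Kind kind, String field, String term) {
            this.kind = kind;
            this.field = field;
            this.term = term;
        }

        @Override
        public String toString() {
            return kind == Kind.MATCH_ALL ? "*:*" : field + ":" + term;
        }
    }

    // Assumption: a lone "*" term with no field (null) or a "*" field
    // becomes match-all, so it is never expanded term by term.
    // Any other input stays an ordinary wildcard query.
    static ParsedQuery rewrite(String field, String termStr) {
        boolean anyField = field == null || "*".equals(field);
        if ("*".equals(termStr) && anyField) {
            return new ParsedQuery(Kind.MATCH_ALL, null, null);
        }
        return new ParsedQuery(Kind.WILDCARD, field, termStr);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("*", "*"));        // prints *:*
        System.out.println(rewrite(null, "*"));       // prints *:*
        System.out.println(rewrite("text0", "*"));    // prints text0:*
        System.out.println(rewrite("text0", "foo*")); // prints text0:foo*
    }
}
```

The sketch deliberately leaves a `*` against a specific field as a wildcard query, since matching every document is not the same as matching every document that has a value in that field.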

Memory leak with wildcard inside double quotes

Y, this is Solr's behavior:

```
// called from parser
protected Query getWildcardQuery(String field, String termStr) throws SyntaxError {
  checkNullField(field);
  // *:* -> MatchAllDocsQuery
  if ("*".equals(termStr)) {
    if ("*".equals(field) ||...
```

Memory leak with wildcard inside double quotes

Can you do me a favor and see if the ComplexPhraseQueryParser dies on "foo *"? I'm happy enough converting * to a MatchAllDocsQuery when it is outside of a SpanQuery,...

Memory leak with wildcard inside double quotes

The other question is do we want to do this at the Lucene level or at the Solr level? My pref would be to do this at the Lucene level,...

Memory leak with wildcard inside double quotes

@sjwoodard, I may have some time to work on this soon. Let me know if you still care.

Memory leak with wildcard inside double quotes

If we fix it in Solr, how do these tests look:

```
public void testMatchAllDocs() throws Exception {
  assertJQ(req("defType", "span", "q", "*"), "/response/numFound==4");
  assertJQ(req("defType", "span", "q", "*:*"), "/response/numFound==4");
  assertJQ(req("df", "text0",...
```

Memory leak with wildcard inside double quotes

My one concern is:

```
assertJQ(req("df", "text0", "defType", "span", "q", "*"), "/response/numFound==3");
```

This does return the correct documents, but it returns the wildcard query: `text0:*`, which could still blow...

More information on co-occurrence collector

Sorry for my delay. The cooccurrence code, as you pointed out, is not optimized for performance. It does perform re-analysis. Even on corpora of a few million documents, Lucene is...