Mark Feblowitz

Results 13 comments of Mark Feblowitz

I was only able to make it go away by purging the apostrophes and/or the commas from the text.

OK - so it appears that there might be multiple reasons why this failure comes up: complex sentences, unhandled special characters in the text (including embedded apostrophes or commas, ......

Edits to the original problem statement were made after the comments above.

Question to @Anton-Velikodnyy RE #477: Does this fix address the situation above, or merely prevent a mid-processing crash? The latter is good, the former would also be good.

Um, no. Sorry to have not been clear. Updating the description. I'm pulling the pdfs from the web and extracting from them. Thus, I have no control of the production...

Interesting... The origin of the pdf document (linked above) was the product of saving that web page to a pdf file. The contents are (mostly) binary. And pdftotext indeed revealed...

Now, if only there were a way to be alerted when the ligature substitution _might have occurred_, so that excruciating manual examination of all processed documents would not be required...

That's the rub. To know whether it has the characters, you'd need a good extraction to compare against. Or you'd need a comprehensive (huge) set of patterns to look for...
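As a rough illustration of the "patterns to look for" idea, here is a minimal Python sketch that flags characters hinting at ligature trouble in already-extracted text. It assumes the ligatures survive extraction as Unicode presentation-form code points (U+FB00 to U+FB06) or show up as the replacement character U+FFFD. The function name `flag_suspect_chars` is hypothetical and is not part of any tool discussed here. It would not catch the harder case where the glyphs were silently dropped or mapped to ordinary letters.

```python
import unicodedata

# Latin ligatures in the Unicode "Alphabetic Presentation Forms" block
# (ff, fi, fl, ffi, ffl, long s-t, st): U+FB00 through U+FB06.
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}

# U+FFFD often appears where a decoder could not map a glyph.
REPLACEMENT = "\ufffd"


def flag_suspect_chars(text):
    """Return (index, character, Unicode name) for each suspect character.

    Hypothetical helper: it only detects ligature code points that are
    still present in the text, not ligatures that were lost entirely.
    """
    hits = []
    for index, ch in enumerate(text):
        if ch in LIGATURES or ch == REPLACEMENT:
            hits.append((index, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits
```

An empty result does not prove the extraction was clean; it only means none of these particular code points were found, which is the limitation described above.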

Ok - I have one, and a comment about the error handling. First, the query. Submit this query to _http://live.dbpedia.org/sparql_ : ``` PREFIX rdfs: PREFIX : PREFIX d: PREFIX do:...

I'm under a time crunch. I'll try the upgrade again when I get a chance and will send the diagnostics.