ERROR: missing data for column "sentence_index"
Hi,
I am doing the Adding NLP markups step on my own corpus, following the spouse example in the tutorials. However, after running for quite a while the program failed with the error: missing data for column "sentence_index". I guess the DeepDive parser might have had trouble parsing one of the documents in my corpus, but I don't know the exact reason. I checked my corpus and found nothing unusual in it.
P.S., I have successfully run it on another corpus without having this issue.
Any help would be highly appreciated!
Check if it's related to string literals (tabs etc.). A tab in your content may break it if you don't escape it properly.
Hi zian92,
Can you elaborate on your explanation with an example? Many of us would benefit from it.
Thanks, Bala
In my experience, DD can be a little sloppy with encoding (may be related to Python 2.7). E.g.
@tsv_extractor
@returns(lambda
        doc_id="text"
    : [])
def extract(
        id="text",
    ):
    yield "\t"
produces "ERROR: extra data after last expected column" as the string is not encoded properly and DD detects a 2nd column (which it doesn't expect). It took me some time to get this.
I am fairly sure I ran into this problem too, but I am unable to reproduce it. @hugochan, can you identify the text that produces the error? And at which step?
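For illustration, here is a minimal sketch of the kind of escaping meant above. The clean() helper and the exact escape sequences are my own guess at what the TSV loader expects, not something taken from DeepDive's documentation, so treat it as a starting point only.

from deepdive import *   # ddlib decorators, as in the spouse tutorial UDFs

def clean(value):
    # Hypothetical helper: escape backslashes, tabs, and newlines so a stray
    # tab inside the content is not mistaken for a column separator.
    return value.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

@tsv_extractor
@returns(lambda
        doc_id="text",
    : [])
def extract(
        doc_id="text",
    ):
    # Escaping before yielding keeps the single-column row intact even if
    # the input text itself contains tab characters.
    yield [clean(doc_id)]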
Maybe I was a little wrong here: I have the following UDF
@tsv_extractor
@returns(lambda
        doc_id="text",
        feature="text",
    : [])
def extract(
        doc_id="text",
        feature="text",
        counter="int",
    ):
    # (1)
    print sys.stderr, doc_id, feature, counter
    for _ in range(counter):
        yield [doc_id, feature]
    # (2)
    yield [doc_id, feature + " " + str(counter)]
If (1) is used ((2) commented out), then the UDF fails with ERROR: missing data for column "feature".
If (2) is used, it works. I don't know why it behaves like that and I don't see a difference.
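One possible explanation, just a guess on my part: in Python 2, the statement print sys.stderr, ... does not write to stderr at all; it writes the repr of sys.stderr plus the values to stdout, and stdout is exactly the stream the @tsv_extractor decorator emits TSV rows on, so that extra space-separated line can look like a malformed row and trigger "missing data for column". Variant (2) never executes that print, which would explain the difference. A small sketch of sending the debug output to stderr instead (dummy values, Python 2 syntax as in the thread):

import sys

doc_id, feature, counter = "doc-1", "some feature", 2   # dummy values for illustration

# Python 2: the ">>" redirection sends the output to stderr, so it never
# touches the stdout stream that DeepDive reads the TSV rows from.
print >>sys.stderr, doc_id, feature, counter

# For comparison, this variant prints the repr of sys.stderr followed by the
# values to STDOUT, which is what (1) above does:
# print sys.stderr, doc_id, feature, counter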
Hi Zian92,
Thanks for your explanation.
I have one more issue, with Python character encoding and a BOM (byte order mark).
I have Python code that extracts the contents from the documents and writes them into a TSV file, and at this stage everything goes fine.
But when I process the same TSV file with DeepDive, DeepDive picks up the characters (1yQ11CQEAP1X ) from the TSV file and deepdive do sentences fails. I am not sure, though, whether this special character is really what causes the issue.
Could you please help me get rid of this issue?
@Balachandar-R: I don't see a relation to the original topic of this ticket ;)
It may be necessary to decode the rows coming from the database and to encode your results before they are stored in the DB. I am not familiar with BOMs, but the web should provide an answer to your problem.
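Not a DeepDive-specific answer, but as a minimal sketch of one way to handle a UTF-8 BOM in Python: the standard "utf-8-sig" codec silently drops a leading byte order mark when reading. The file name below is made up; adapt it to your own corpus.

import codecs

# "input.tsv" is a placeholder for your own TSV file.
with codecs.open("input.tsv", "r", encoding="utf-8-sig") as f:
    for line in f:
        # utf-8-sig strips a leading BOM if one is present, so the first
        # column of the first row does not start with the BOM bytes.
        columns = line.rstrip("\n").split("\t")
        # process columns here...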
Hi Zian92,
Thanks for your answers.
I run into the following issue while doing "deepdive do sentences":
user@Azmachine:~/pedia$ deepdive do sentences
‘run/RUNNING’ -> ‘20170817/042914.419451517’
2017-08-17 04:29:14.710491 process/ext_sentences_by_nlp_markup/run.sh
unloading: 0:00:00 715KiB [2.29MiB/s] ([2.29MiB/s])
unloading: 0:00:00 2 [6.55 /s] ([6.55 /s])
loading dd_tmp_sentences: 1:33:29 277 B [50.6miB/s] ([ 0 B/s])
loading dd_tmp_sentences: 1:33:29 5 [ 891u/s] ([ 0 /s])
along with
2017-08-17 04:31:10.411673 Loading parser from serialized file edu/stanford/nlp/models/srparser/englishSR.ser.gz ...
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000072b980000, 113770496, 0) failed; error='Cannot allocate memory' (errno=12)
I am using Deepdive 0.8 stable version.
Thanks in advance, Bala
@Balachandar-R can you try the following? In run.sh under udf/bazaar/parser, try changing -Xmx4g to -Xmx2g. Basically, you are telling Stanford CoreNLP to use a maximum of 2 GB of RAM instead of 4. Maybe this will help fix your problem.