
Mlcp incorrectly shows number of failed documents [BUG]

Open plackowk opened this issue 3 years ago • 2 comments

The steps to reproduce the issue:

  1. Create a mock ingestion flow that refers to some step definition.
  2. In the module defined in the step definition, throw an error (a minimal sketch of such a module is shown after these steps).
  3. Run an mlcp command like this:
mlcp.sh import \
  -mode "local" -host "localhost" -port "..." \
  -username "..." -password "..." \
  -input_file_path "<some input path>" \
  -input_file_type "documents" \
  -output_collections "test" \
  -output_permissions "...,read,...,update" \
  -document_type "json" \
  -generate_uri "true" \
  -transform_module '/data-hub/5/transforms/mlcp-flow-transform.sjs' \
  -transform_param "flow-name=Test,step=1" \
  -thread_count "4" \
  -batch_size "100"
  4. Here is the mlcp output logged to the console:
21/05/19 14:38:37 INFO contentpump.LocalJobRunner: Content type: JSON
21/05/19 14:38:38 INFO contentpump.ContentPump: Job name: local_1866826114_1
21/05/19 14:38:38 INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process : 1
May 19, 2021 2:38:38 PM com.marklogic.xcc.impl.ContentSourceImpl loadLoggingPropertiesFromResource
WARNING: property file not found: <something>
21/05/19 14:38:38 INFO contentpump.LocalJobRunner:  completed 100%
21/05/19 14:38:38 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
21/05/19 14:38:38 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 1
21/05/19 14:38:38 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 1
21/05/19 14:38:38 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 1
21/05/19 14:38:38 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
21/05/19 14:38:38 INFO contentpump.LocalJobRunner: Total execution time: 0 sec
  5. Nothing will be written to the database (as expected, because an error was thrown). If you check the jobs database related to the database you tried to write to, you will find these two files:

  • jobsmlcp-3808065635738816326json.txt (this file is also pretty strange by itself: why is timeEnded = N/A?)
  • jobsbatches25430a21-d66f-449b-891e-12a39f80dc5bjson.txt

You can see in them that the error was caught.
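
For step 2 above, a minimal sketch of a step module that always throws could look like the following. This is only an illustration (not code from the issue), assuming a DHF 5 custom step whose step definition points at the module; the path and names are hypothetical:

// Hypothetical module referenced from the mock step definition's modulePath,
// e.g. /custom-modules/mock-step/main.sjs.
'use strict';

function main(content, options) {
  // Fail every document; with correct reporting, each one should be counted as OUTPUT_RECORDS_FAILED.
  throw Error('Deliberate failure from mock step for URI: ' + content.uri);
}

module.exports = { main };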

What behavior were you expecting?

I was expecting info about the 1 failed document to be printed to the console, like this (it worked like that with the DHF 4 transform):

21/05/19 14:38:38 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 1
21/05/19 14:38:38 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 1
21/05/19 14:38:38 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 0
21/05/19 14:38:38 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 1

The error message should also be printed there.

Please provide the following version information:

  • Operating System = Windows 10
  • MarkLogic = 9.0-13.1
  • Data Hub = 5.4.2

plackowk avatar May 19 '21 12:05 plackowk

The issue here is with the MLCP error reporting mechanism, and it is a known bug for the MLCP team. If you change the thread_count parameter for MLCP, it shows different counts for committed and failed records.
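
For example (an illustrative re-run, not taken from the issue), running the same import with different thread counts and comparing the OUTPUT_RECORDS_COMMITTED / OUTPUT_RECORDS_FAILED counters makes the discrepancy visible:

mlcp.sh import ... -thread_count "1" -batch_size "100"
mlcp.sh import ... -thread_count "4" -batch_size "100"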

The team is aware that the job document has timeEnded = N/A and jobStatus = started. The issue here is that DHF has no way of knowing when the last batch was processed by MLCP, so DHF cannot go and update the job document, and the job document remains stuck in the started state. The only reason we decided to keep the job document was that the batch document provides some value, as it contains the processed URIs and the errors that were thrown.
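
For reference, here is a sketch (not from the comment above) of how to pull those batch documents up in Query Console against the jobs database. The /jobs/batches/ directory matches the URI of the batch document attached earlier; the database name data-hub-JOBS is the DHF default, so adjust if your project differs:

'use strict';
// Run against the DHF jobs database (data-hub-JOBS by default).
// Each batch document records the URIs it processed and any error that was thrown.
const batchDocs = cts.search(cts.directoryQuery('/jobs/batches/', 'infinity'));
batchDocs.toArray().map(doc => doc.toObject());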

akshaysonvane avatar May 21 '21 16:05 akshaysonvane

I discovered that errors are probably not being caught by mlcp only when the transformation module is in SJS, and you can make a temporary fix to DHF by adding a module like this:

xquery version "1.0-ml";

module namespace mlcpFlow = "http://marklogic.com/data-hub/mlcp-flow-transform";

declare function mlcpFlow:transform(
  $content as map:map,
  $context as map:map
) as map:map*
{
  let $s := 'var content, context; const transform = require("/data-hub/5/transforms/mlcp-flow-transform.sjs"); transform.transform(content, context)'
  return xdmp:javascript-eval($s,
    map:new((
      map:entry("content", $content),
      map:entry("context", $context)
    )),
    map:entry("isolation", "same-statement")
  )
};

As long as the isolation option is set to "same-statement" it seems to work for me and catches errors properly.
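
If you want to try this workaround, note that for an XQuery transform mlcp also needs the module namespace (the function can stay at mlcp's default name, transform); the module path below is only a placeholder for wherever you deploy the wrapper:

  -transform_module "/custom-modules/mlcp-flow-transform-wrapper.xqy" \
  -transform_namespace "http://marklogic.com/data-hub/mlcp-flow-transform" \
  -transform_param "flow-name=Test,step=1"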

Edit: after some checks, sometimes this "solution" catches an error that is not caught with the default DHF 5 transform, and sometimes it's the opposite.

plackowk avatar May 24 '21 12:05 plackowk

This is covered by: https://github.com/marklogic/marklogic-contentpump/issues/180

ollsage avatar Sep 06 '23 18:09 ollsage