
ParallelRunStep output file rows clobber each other

Open vla6 opened this issue 2 years ago • 5 comments

Very recently, a process that had been fine for months started failing at a pandas read_csv step when we try to import the dataset created from a ParallelRunStep. The issue seems to be that two rows are merged in the output file, so a row is longer than expected.

We expect 70 columns in our data. When this happens, one row in the file has more than 70. The number varies; in the 2 failures we've seen, one bad row had 137 columns and the other 100.

In the bad row, it's clear that the next row's data begins in the middle of the previous row: the first field of the next row is squished onto a field partway through the previous one. It looks like the following (if our dataset had 3 fields):

we expect:

1 blue balloon
2 yellow car
3 red shirt

but we see instead:

1 blue ballo2 yellow car
3 red shirt

In both of our 2 failures so far, the row position and the specific data differed (i.e., not the same rows each time). But we saw only one bad row per run (and we had very large files).

We can work around this by ignoring bad rows in the read_csv call, but this is definitely an Azure bug.
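For reference, a minimal sketch of that workaround (the output path and delimiter are illustrative, not our exact values):

```python
import pandas as pd

# Path and delimiter are illustrative; the real job writes the append_row
# output into the step's output directory.
OUTPUT_FILE = "parallel_run_output/parallel_run_step.txt"

# Skip any row with the wrong number of fields (pandas >= 1.3);
# older pandas versions use error_bad_lines=False instead.
df = pd.read_csv(OUTPUT_FILE, sep=" ", header=None, on_bad_lines="skip")

assert df.shape[1] == 70, "expected 70 columns in the ParallelRunStep output"
```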

vla6 avatar Apr 26 '22 19:04 vla6

I want to add that we saw back-to-back failures on April 24 and April 25 when we ran this process, which had never failed before.

vla6 avatar Apr 26 '22 19:04 vla6

Hi, thanks for reporting it. I assume you use the append_row output action and return your results from the run() function, and sometimes two of your results are written on one line. Is this right?

shift202 avatar Apr 27 '22 06:04 shift202

Yes, that is correct. We use the append_row output and return rows from run(mini_batch). We expect the same number of input and output rows if all goes well, but very recently two rows are being merged into one.
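For reference, the pattern our entry script follows is roughly this (a minimal sketch with illustrative field handling, not our actual script):

```python
# Entry script used by the ParallelRunStep (names and field handling illustrative).

def init():
    # Load the model or other resources once per worker process.
    pass

def run(mini_batch):
    # With a tabular input dataset, mini_batch arrives as a pandas DataFrame.
    results = []
    for _, row in mini_batch.iterrows():
        # Our real script emits 70 space-separated fields per input row.
        results.append(" ".join(str(value) for value in row.values))
    # With output_action="append_row", each returned item should be appended as
    # one line to the aggregated output file (parallel_run_step.txt by default).
    return results
```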

vla6 avatar Apr 27 '22 09:04 vla6

Do you use a Windows or Linux Docker image?

shift202 avatar Apr 27 '22 16:04 shift202

Linux.

FYI I also have an official support ticket open for this issue. Case 2204260040007488

vla6 avatar Apr 28 '22 00:04 vla6

This problem appears to be worsening. We have a greater proportion of dropped rows. We tried to work around it with error_bad_lines=False, which was OK for a while, but now lines are being split, so data types don't match and we get very short rows that are the tail end of a previous row. We are losing a lot of data to these errors.

vla6 avatar Dec 13 '22 16:12 vla6

It looks like adding engine='python' to our read_csv call works as a workaround, but this is a serious Azure issue. We are losing about 10% of our rows. The new problem is a bit different: instead of clobbering rows, the process seems to be splitting some rows.
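A sketch of the updated workaround we're using (path and delimiter are illustrative):

```python
import pandas as pd

# engine="python" tolerates the malformed lines better than the default C parser;
# path and delimiter are illustrative, not our exact values.
df = pd.read_csv(
    "parallel_run_output/parallel_run_step.txt",
    sep=" ",
    header=None,
    engine="python",
    on_bad_lines="skip",  # still drops the split/clobbered rows, so data is lost
)
```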

The original ticket (2204260040007488) was resolved; that was a rare clobber. In our new process, about 9% of our rows are clobbered.

vla6 avatar Dec 13 '22 21:12 vla6

@vla6 Sorry, this is a bug that occurs when a single mini-batch's output exceeds 32 KB. We have deployed the fix to the production regions; please try submitting the job again and check the output.
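If you want to reduce exposure while you verify the fix, one illustrative mitigation (parameter values are assumptions, not a prescription) is to keep each mini-batch small enough that its output stays well under 32 KB, e.g.:

```python
from azureml.core import Environment
from azureml.core.compute import ComputeTarget
from azureml.pipeline.steps import ParallelRunConfig

def make_parallel_run_config(environment: Environment,
                             compute_target: ComputeTarget) -> ParallelRunConfig:
    # Illustrative only: a smaller mini_batch_size keeps each run() call's
    # output well below the 32 KB per-mini-batch limit mentioned above.
    return ParallelRunConfig(
        source_directory="scripts",           # assumed project layout
        entry_script="score.py",              # assumed entry script name
        mini_batch_size="100KB",              # tabular input: data per mini-batch (assumed value)
        output_action="append_row",
        append_row_file_name="parallel_run_step.txt",
        error_threshold=10,
        environment=environment,
        compute_target=compute_target,
        node_count=2,
    )
```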

bupt-wenxiaole avatar Dec 14 '22 01:12 bupt-wenxiaole