FlatFiles

Ignore bad data found

Open dowmeister opened this issue 5 years ago • 4 comments

Hi,

I'm trying to import a CSV file with quotes. It works with CsvHelper, but FlatFiles is much better and I would like to use it.

A field has this value: "LICEO ARTISTICO "GAUDENZIO FERRARI"" (the inner quotes are not escaped).

SeparatedValueReader throws the error "SeparatedValueSyntaxException: A syntax error was encountered: Unmatched quote."

I would like to ignore this error and import the field anyway.

Is there a way to achieve this? Maybe with Preprocessor? I cannot figure out how.

Thank you

dowmeister, Jan 23 '19 10:01

I'm not ignoring you, btw. I just haven't had time to respond yet. It would be really useful if you could write a unit test that demonstrates this not working. That would save me some time when I get a chance to look into this feature. Technically this is not valid CSV, but CSV very rarely follows any sort of standard.

It makes sense to support this, though. As long as a quote isn't followed by a comma, I can see it being treated as part of the token. I am curious how I handle this today -- I am guessing I just throw an exception -- I am not sure. A unit test would be really useful.
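
To make the rule above concrete, here is a minimal standalone sketch of lenient quote handling. It is not FlatFiles code: `LenientQuoteSketch` and `SplitRecord` are hypothetical names invented for this illustration, and it assumes a single-line record with single-character separator and quote. A quote ends a quoted field only when the separator or the end of the record follows it; every other quote is kept as part of the value. Doubled quotes are deliberately not unescaped, so this is a different rule from strict RFC 4180 parsing.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LenientQuoteSketch
{
    // Splits one record into fields. Inside a quoted field, a quote closes the
    // field only if it is followed by the separator or the end of the record;
    // any other quote is appended to the value as a literal character.
    public static List<string> SplitRecord(string record, char separator = ',', char quote = '"')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        int i = 0;
        while (true)
        {
            current.Clear();
            if (i < record.Length && record[i] == quote)
            {
                i++; // skip the opening quote
                bool closed = false;
                while (i < record.Length)
                {
                    char c = record[i];
                    bool atBoundary = i + 1 == record.Length || record[i + 1] == separator;
                    if (c == quote && atBoundary)
                    {
                        i++; // consume the closing quote
                        closed = true;
                        break;
                    }
                    current.Append(c);
                    i++;
                }
                if (!closed)
                {
                    throw new FormatException("Unmatched quote.");
                }
            }
            else
            {
                // Unquoted field: quotes in the middle are ordinary characters.
                while (i < record.Length && record[i] != separator)
                {
                    current.Append(record[i]);
                    i++;
                }
            }
            fields.Add(current.ToString());
            if (i >= record.Length)
            {
                break;
            }
            i++; // skip the separator
        }
        return fields;
    }
}
```

With the field from the original report, `SplitRecord("1,\"LICEO ARTISTICO \"GAUDENZIO FERRARI\"\",x")` yields three fields, the second being `LICEO ARTISTICO "GAUDENZIO FERRARI"`. The trade-off is that a literal quote sitting directly before a separator inside a quoted value would be misread as the closing quote.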

jehugaleahsa, Jan 30 '19 13:01

I ran into this as well, with a tab-delimited file whose values may contain quote characters as part of the value. For example, in the record below, the final value contains an unterminated quote that is part of the value. In reality, there should be a closing quote after noncash, but either way, I know this file does not quote values but instead includes quotes within values.

DebtConversionConvertedInstrumentAmount1	us-gaap/2019	0	0	monetary	D	C	Debt Conversion, Converted Instrument, Amount	The value of the financial instrument(s) that the original debt is being converted into in a noncash (or part noncash) transaction. "Part noncash refers to that portion of the transaction not resulting in cash receipts or cash payments in the period.

Within SeparatedValueRecordParser.cs (line 86) in function GetNextToken on line 101 this block exists:

            if (reader.IsMatch1(Options.Quote))
            {
                return GetQuotedToken();
            }

In this case, it is encountering a quote " character and treats it as a quoted value. When it does not find the closing quote it throws an error within GetQuotedToken(). It would be great if an option existed that allowed quote characters to be ignored and treated as normal characters. Then the block above would be:

            if (Options.ProcessQuotes && reader.IsMatch1(Options.Quote))
            {
                return GetQuotedToken();
            }

Without this option, I have worked around the problem by specifying a quote character that I know will not exist in the file, which turns actual quotes " into normal characters and bypasses this check:

var options = new SeparatedValueOptions()
{
    IsFirstRecordSchema = true,
    Separator = "\t",
    PreserveWhiteSpace = false,
    QuoteBehavior = QuoteBehavior.Never,
    // Hack to have parser treat quotes " as regular tokens
    Quote = '\u0000'
};

mlesk, Apr 23 '21 19:04

I'll take a look. The example data you posted seems a bit odd, being just one line. Can you provide some more background on the schema? I'm just going to operate under the assumption it's just tsv and go from there.

jehugaleahsa, Apr 24 '21 19:04

I created a test with the data you posted above. Specifically, I tested how my library handles embedded quotes. The only time my library should care about quotes is if they are the first character at the start of a value. In your example, the quote seems to be in the middle of a value.

This is different from what @shardick is running into, where a value starts with a quote and an embedded quote is not a terminating quote.

I am going to ask you to create a new ticket, providing expected vs actual data because, so far, I am not sure what issue you are reporting. Here is the test I wrote and it is passing:

            string source = "DebtConversionConvertedInstrumentAmount1\tus-gaap/2019" +
                "\t0\t0\tmonetary\tD\tC\tDebt Conversion, Converted Instrument, Amount" +
                "\tThe value of the financial instrument(s) that the original debt is being converted into in a noncash (or part noncash) transaction. \"Part noncash refers to that portion of the transaction not resulting in cash receipts or cash payments in the period.";
            StringReader stringReader = new StringReader(source);
            SeparatedValueOptions options = new SeparatedValueOptions()
            {
                IsFirstRecordSchema = false,
                Separator = "\t"
            };
            SeparatedValueReader reader = new SeparatedValueReader(stringReader, options);
            object[][] expected = new object[][]
            {
                new object[] 
                { 
                    "DebtConversionConvertedInstrumentAmount1",
                    "us-gaap/2019", 
                    "0", 
                    "0",
                    "monetary",
                    "D",
                    "C",
                    "Debt Conversion, Converted Instrument, Amount",
                    "The value of the financial instrument(s) that the original debt is being converted into in a noncash (or part noncash) transaction. \"Part noncash refers to that portion of the transaction not resulting in cash receipts or cash payments in the period."
                }
            };
            assertRecords(expected, reader);

Also please include any version information to help me diagnose this. I just tried with your exact SeparatedValueOptions (minus the special quote character of course) and the test is passing there too.

Btw, I think using \u0000 is totally fine and is an interesting way to achieve your needs. 👍

jehugaleahsa, Apr 25 '21 01:04