Inconsistent CSV import error/invalid data handling
Prerequisites

- [ ] Put an X between the brackets on this line if you have done all of the following:
  - Running the latest version of Constellation
  - Attached the Support Package via Help > Support Package
  - Checked the FAQs: https://github.com/constellation-app/constellation/wiki/FAQ
  - Checked that your issue isn't already filed: https://github.com/constellation-app/constellation/issues
  - Checked that there is not already a module that provides the described functionality: https://github.com/constellation-app/constellation/wiki/Catalogue-of-Repositories
-
Description
Invalid data for CSV import can result in inconsistent "Import From File" error dialogs.
Steps to Reproduce
1. Select File > Import > From File...
2. Import file sample1.csv
3. Drag Source Identifier to the 'from' column and Destination Identifier to the 'to' column
4. Drag Transaction Datetime to the 'date' column
5. Confirm that the column is marked red and an error is displayed, preventing the mapping (as per image)
6. Exit out of the import and clear the import file
7. Select File > Import > From File...
8. Import file sample2.csv
9. Drag Source Identifier to the 'from' column and Destination Identifier to the 'to' column
10. Drag Transaction Datetime to the 'date' column
11. Note that the mapping is allowed, despite the format not matching (as per image)
12. Question: why is this now allowed?
13. Hit import and note the error importing
14. Acknowledge the error
15. Right-click on the Datetime mapping on the 'date' column and set a default value, noting the errors are now removed
16. Hit import and note the same import error as per step 13
Behaviour
Expected behaviour:
- Consistency in behaviour. If the importer is smart enough to detect invalid Datetime data when all rows are present and prevent the mapping, removing data from one of those rows shouldn't allow the import. Steps 13 & 16 concern me in that 'detectable' errors result in exception-type dialogs.
- I note that if there is invalid data in a column it is highlighted red (see below), which suggests it's possible to detect it ahead of time (albeit only in preview rows).

Actual behaviour: Somewhat inconsistent behaviour, and in some cases an exception dialog is raised.
Reproduces how often: 100%, given data that triggers it.
Additional Information
While the error message seen at step 13 has changed slightly, the fundamental issue noted in this ticket is still occurring. In cases where all rows have datetime data, a nice clean error dialog is presented; where that is not the case, less attractive errors including stack traces are shown.
Line 475 of RunPane.java is the line that decides to show the initial 'neater' error message at step 5.
Investigation showed that the column.validate method is incorrectly determining validity; refer to the code below. The value of columnFailed is updated on each row iteration, so as long as the last row is valid, the entire dataset is treated as valid.
This should be modified, as a minimum, to use an or condition to 'or' the previous columnFailed value together with the next check. To be confirmed: whether the validator only runs over preview rows or all rows.
Just a thought, but assuming the column fails validation as soon as one row fails, you could potentially write it such that as soon as one row fails validation it breaks out of the loop. That would save a bunch of unnecessary iterating afterwards (given true || anything is true).
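To make the bug and the suggested fix concrete, here is a minimal standalone sketch of the pattern being described. It is not Constellation's actual code: `isValid` is a hypothetical stand-in for the real per-cell datetime check, and the method names are illustrative only.

```java
import java.util.List;

public class ColumnValidationSketch {

    // Hypothetical stand-in for the real per-cell format check.
    static boolean isValid(final String value) {
        return value != null && value.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    // Buggy pattern: columnFailed is overwritten on each iteration,
    // so only the LAST row decides whether the column "failed".
    static boolean columnFailedBuggy(final List<String> rows) {
        boolean columnFailed = false;
        for (final String row : rows) {
            columnFailed = !isValid(row);
        }
        return columnFailed;
    }

    // Fixed pattern: accumulate with |= and break early, since a single
    // invalid row is enough to fail the whole column.
    static boolean columnFailedFixed(final List<String> rows) {
        boolean columnFailed = false;
        for (final String row : rows) {
            columnFailed |= !isValid(row);
            if (columnFailed) {
                break; // true || anything is true, so stop iterating
            }
        }
        return columnFailed;
    }

    public static void main(final String[] args) {
        // First row invalid, last row valid: the buggy check reports the
        // column as fine, the fixed check reports it as failed.
        final List<String> rows = List.of("not-a-date", "2023-01-02");
        System.out.println(columnFailedBuggy(rows)); // false (the bug)
        System.out.println(columnFailedFixed(rows)); // true
    }
}
```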
My implementation is as per the antares comment above.
This still leaves us with a problem:
Only preview rows (100) are validated. So let's assume we have 1000 rows of data and the first 100 are valid: it will get through the preview validation and then except-out once you hit import.
Are we happy for exceptions to be thrown, or should we consider handling things more cleanly? ImportDelimitedPlugin handles the data import and already has code in place to allow rows to be skipped over (due to filtering). We could potentially also skip over invalid rows and NOT add them, but log how many invalid rows were skipped.
If we did this, then you sort of have to ask why bother validating in the preview in the first place. Who cares if the first 100 rows are all valid if you then either reject the import entirely or skip over/filter out the next 500 rows because they are invalid?
Some options are (IMO):
1. Use the change I called out above and fix the validation bug, then allow import errors to throw exceptions, resulting in the import failing.
2. Remove the validation of sample data and just skip over any rows with invalid data; log/show an import summary, possibly highlighting invalid rows/cells in the sample data.
3. Modify the validator to validate ALL rows, not just sample rows (slowest performance).
4. Add a toggle giving users the option to cancel the import if an error is discovered, or to exclude invalid rows; i.e. let them choose between options 1 and 2 on the fly.
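Option 2 (skip invalid rows and report a count) could be sketched roughly as follows. This is a hypothetical simplification, not the actual ImportDelimitedPlugin API: `isValidRow` stands in for the real datetime parse, and rows are plain strings rather than parsed records.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipInvalidRowsSketch {

    // Hypothetical stand-in for the real datetime format check.
    static boolean isValidRow(final String value) {
        return value != null && value.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    /**
     * Imports only the valid rows and records how many were skipped,
     * rather than aborting the whole import on the first bad row.
     * The skipped count is returned via the single-element array.
     */
    static List<String> importSkippingInvalid(final List<String> rows,
                                              final int[] skippedCount) {
        final List<String> imported = new ArrayList<>();
        int skipped = 0;
        for (final String row : rows) {
            if (isValidRow(row)) {
                imported.add(row);
            } else {
                skipped++; // skip the row instead of throwing an exception
            }
        }
        skippedCount[0] = skipped;
        return imported;
    }

    public static void main(final String[] args) {
        final int[] skipped = new int[1];
        final List<String> result = importSkippingInvalid(
                List.of("2023-01-01", "bad", "2023-02-03"), skipped);
        // 2 rows imported, 1 row skipped: surface the count in the summary
        System.out.println(result.size() + " imported, " + skipped[0] + " skipped");
    }
}
```

The key design point is that invalid rows are counted rather than thrown, so the import summary can report "N rows skipped" instead of presenting a stack trace.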
In summary, even after making the fix we still have code that validates 100 rows cleanly, then uses stack traces/exceptions to abort the import if any other row is invalid.
Thoughts - @antares1470 & @arcturus2
@serpens24 It would seem, as you've identified, that the main source of the problem is that only the preview rows are being validated. This would be OK if validating the preview implied the whole file was validated, but clearly that isn't the case. Therefore I think, as you've alluded to, we either have to validate the whole file or remove validation altogether. The answer to that will depend on whether performance or reliability is considered more important.
I think I lean towards options 3 or 4 from your suggestions, but which one to go with will likely depend on being able to answer the above question. If we went with 3, you'd want to investigate how significant the performance decrease is (especially on larger files). If we went with 4, I would guess there isn't much point in keeping the validation besides being able to inform users before the import starts that not everything will be imported (which may not actually be a bad thing from a UX perspective).
I agree with you; I am leaning towards 4, which would allow users to decide if they want to fail the import or just import valid rows. I will probably forward this ticket to @spica198 to take forward and will walk through it with them, given I'll be working on other stuff.
Option 4 I think is good for the reason you describe @serpens24. There are a couple of things I have thought of regarding implementation that should be considered:

- When implementing option 2, it's important to be smart about how you log/highlight invalid rows. If you log/highlight each individual row that failed to import, this will be fine with a small file, but with a large file (say a million rows) you could potentially crash the application if they all fail to import.
- It's worth noting that options 1 and 2 won't just handle things differently after pressing Import; the UI itself may also need to behave differently per option. Currently the UI is designed for option 1. Namely, if you try to apply a datetime formatter to a column of data that doesn't match that format, the UI doesn't allow you to apply the attribute to that column at all. For option 2, this would need to change, as you need to be able to apply the attribute regardless of whether errors are detected (since you still want to be able to import the valid rows). While the option 2 workflow would still work under option 1, the changes made to the current workflow were to improve UX (because if the import is going to fail anyway, it's better not to let the user apply a mapping guaranteed to fail than to tell them after they try to import). So ideally you'd want to keep the current workflow for option 1 and change it for option 2. I don't know how easy that would be to do though.
Thank you @spica198 for making the improvements here; they are great. I have been trying this out with lots of my files that have date issues, and the bad rows can be skipped over and the rest imported, which is handy as I don't need to correct the original rows in the dataset.
I've raised an issue for the runtime error that is displayed when you don't select Skip for the invalid rows, as it would be good if this error was captured and not presented to end users. Not sure how easy that is to fix.
As an enhancement, it would be great to adjust the wording of the message pop-up to make it clearer that the number refers to the count of skipped rows, not the row number with the issue.
I have also noticed the Excel date time formatter is not working anymore, which I have made a new issue for.