Inconsistent CSV import error/invalid data handling
Prerequisites

- [ ] Put an X between the brackets on this line if you have done all of the following:
  - Running the latest version of Constellation
  - Attached the Support Package via Help > Support Package
  - Checked the FAQs: https://github.com/constellation-app/constellation/wiki/FAQ
  - Checked that your issue isn't already filed: https://github.com/constellation-app/constellation/issues
  - Checked that there is not already a module that provides the described functionality: https://github.com/constellation-app/constellation/wiki/Catalogue-of-Repositories
-
Description
Invalid data for CSV import can result in inconsistent "Import From File" error dialogs.
Steps to Reproduce
1. Select File > Import > From File...
2. Import file sample1.csv
3. Drag Source Identifier to the 'from' column and Destination Identifier to the 'to' column
4. Drag Transaction Datetime to the 'date' column
5. Confirm that the column is marked red and an error is displayed, preventing the mapping (as per image)
6. Exit out of the import and clear the import file
7. Select File > Import > From File...
8. Import file sample2.csv
9. Drag Source Identifier to the 'from' column and Destination Identifier to the 'to' column
10. Drag Transaction Datetime to the 'date' column
11. Note that the mapping is allowed, despite the format not matching (as per image)
12. Question: why is this now allowed?
13. Hit import and note the error importing
14. Acknowledge the error
15. Right-click on the Datetime mapping on the 'date' column and set a default value, noting the errors are now removed
16. Hit import and note the same import error as per step 13
Behaviour
Expected behaviour:
- Consistency in behaviour. If the importer is smart enough to detect invalid Datetime data when all rows are present and prevent the mapping, removing data from one of those rows shouldn't allow the import. Steps 13 & 16 concern me in that 'detectable' errors result in exception-type dialogs.
- I note that if there is invalid data in a column it is highlighted red (see below), which suggests it's possible to detect it ahead of time (albeit only in preview rows).

Actual behaviour: Somewhat inconsistent behaviour, and in some cases an exception dialog is raised.
Reproduces how often: 100%, given data that triggers it.
Additional Information
While the error message seen at step 13 has changed slightly, the fundamental issue noted in this ticket is still occurring. In cases where all rows have datetime data, a nice clean error dialog is presented; where that is not the case, less attractive errors including stack traces are shown.
Line 475 of RunPane.java is the line that decides to show the initial 'neater' error message at step 5.
Investigation showed that the column.validate method is incorrectly determining validity; refer to the code below. The value of columnFailed is updated on each row iteration, so as long as the last row is valid, the entire dataset is treated as valid.
This should be modified, as a minimum, to use an or condition to 'or' the previous columnFailed value together with the next check. To be confirmed: whether the validator only runs over preview rows or all rows.
Just a thought, but assuming the column fails validation as soon as one row fails, you could potentially write it such that as soon as one row fails validation it breaks out of the loop. That would save a bunch of unnecessary iterating afterwards (given true || anything is true).
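To make the bug and the suggested fix concrete, here is a minimal standalone sketch of the pattern being described. It is not Constellation's actual code: `isValid` is a hypothetical stand-in for the real per-cell datetime check, and the method names are illustrative only.

```java
import java.util.List;

public class ColumnValidationSketch {

    // Hypothetical stand-in for the real per-cell format check.
    static boolean isValid(final String value) {
        return value != null && value.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    // Buggy pattern: columnFailed is overwritten on each iteration,
    // so only the LAST row decides whether the column "failed".
    static boolean columnFailedBuggy(final List<String> rows) {
        boolean columnFailed = false;
        for (final String row : rows) {
            columnFailed = !isValid(row);
        }
        return columnFailed;
    }

    // Fixed pattern: accumulate with |= and break early, since a single
    // invalid row is enough to fail the whole column.
    static boolean columnFailedFixed(final List<String> rows) {
        boolean columnFailed = false;
        for (final String row : rows) {
            columnFailed |= !isValid(row);
            if (columnFailed) {
                break; // true || anything is true, so stop iterating
            }
        }
        return columnFailed;
    }

    public static void main(final String[] args) {
        // First row invalid, last row valid: the buggy check reports the
        // column as fine, the fixed check reports it as failed.
        final List<String> rows = List.of("not-a-date", "2023-01-02");
        System.out.println(columnFailedBuggy(rows)); // false (the bug)
        System.out.println(columnFailedFixed(rows)); // true
    }
}
```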
My implementation is as per the antares comment above.
This still leaves us with a problem:
Only preview rows (100) are validated. So let's assume we have 1000 rows of data and the first 100 are valid: it will get through the preview validation and then except-out once you hit import.
Are we happy for exceptions to be thrown, or should we consider handling things more cleanly? ImportDelimitedPlugin handles the data import and already has code in place to allow rows to be skipped over (due to filtering). We could potentially also skip over invalid rows and NOT add them, but log how many invalid rows were skipped.
If we did this, then you sort of have to ask why bother validating in the preview in the first place. Who cares if the first 100 rows are all valid if you then either reject the import entirely or skip over/filter out the next 500 rows because they are invalid?
Some options are (IMO):
1. Use the change I called out above and fix the validation bug, then allow import errors to throw exceptions, resulting in the import failing.
2. Remove the validation of sample data and just skip over any rows with invalid data; log/show an import summary, possibly highlighting invalid rows/cells in the sample data.
3. Modify the validator to validate ALL rows, not just sample rows (slowest performance).
4. Add a toggle giving users the option to cancel the import if an error is discovered, or to exclude invalid rows; i.e. let them choose between options 1 and 2 on the fly.
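Option 2 (skip invalid rows and report a count) could be sketched roughly as follows. This is a hypothetical simplification, not the actual ImportDelimitedPlugin API: `isValidRow` stands in for the real datetime parse, and rows are plain strings rather than parsed records.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipInvalidRowsSketch {

    // Hypothetical stand-in for the real datetime format check.
    static boolean isValidRow(final String value) {
        return value != null && value.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    /**
     * Imports only the valid rows and records how many were skipped,
     * rather than aborting the whole import on the first bad row.
     * The skipped count is returned via the single-element array.
     */
    static List<String> importSkippingInvalid(final List<String> rows,
                                              final int[] skippedCount) {
        final List<String> imported = new ArrayList<>();
        int skipped = 0;
        for (final String row : rows) {
            if (isValidRow(row)) {
                imported.add(row);
            } else {
                skipped++; // skip the row instead of throwing an exception
            }
        }
        skippedCount[0] = skipped;
        return imported;
    }

    public static void main(final String[] args) {
        final int[] skipped = new int[1];
        final List<String> result = importSkippingInvalid(
                List.of("2023-01-01", "bad", "2023-02-03"), skipped);
        // 2 rows imported, 1 row skipped: surface the count in the summary
        System.out.println(result.size() + " imported, " + skipped[0] + " skipped");
    }
}
```

The key design point is that invalid rows are counted rather than thrown, so the import summary can report "N rows skipped" instead of presenting a stack trace.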
In summary, even after making the fix we still have code that validates 100 rows cleanly, then uses stack traces/exceptions to abort the import if any other row is invalid.
Thoughts - @antares1470 & @arcturus2
@serpens24 It would seem, as you've identified, that the main source of the problem is that only the preview rows are being validated. This would be OK if validating the preview implied the whole file was validated, but clearly that isn't the case. Therefore I think, as you've alluded to, we either have to validate the whole file or remove validation altogether. The answer to that will depend on whether performance or reliability is considered more important.
I think I lean towards options 3 or 4 from your suggestions, but which one to go with will likely depend on being able to answer the above question. If we went with 3, you'd want to investigate how significant the performance decrease is (especially on larger files). If we went with 4, I would guess there isn't much point in keeping the validation besides being able to inform users before the import starts that not everything will be imported (which may not actually be a bad thing from a UX perspective).
I agree with you; I am leaning towards 4, which would allow users to decide if they want to fail the import or just import valid rows. I will probably forward this ticket to @spica198 to take forward and will walk through it with them, given I'll be working on other stuff.
Option 4 I think is good for the reason you describe @serpens24. There are a couple of things I have thought of regarding implementation that should be considered:

- When implementing option 2, it's important to be smart about how you log/highlight invalid rows. If you log/highlight each individual row that failed to import, this will be fine with a small file, but with a large file (say a million rows) you could potentially crash the application if they all fail to import.
- It's worth noting that options 1 and 2 won't just handle things differently after pressing Import; the UI itself may also need to behave differently per option. Currently the UI is designed for option 1. Namely, if you try to apply a datetime formatter to a column of data that doesn't match that format, the UI doesn't allow you to apply the attribute to that column at all. For option 2, this would need to change, as you need to be able to apply the attribute regardless of whether errors are detected (since you still want to be able to import the valid rows). While the option 2 workflow would still work under option 1, the changes made to the current workflow were to improve UX (because if the import is going to fail anyway, it's better not to let the user apply a mapping guaranteed to fail than to tell them after they try to import). So ideally you'd want to keep the current workflow for option 1 and change it for option 2. I don't know how easy that would be to do though.
Thank you @spica198 for making the improvements here; they are great. I have been trying this out with lots of my files that have date issues, and the bad rows can be skipped over and the rest imported, which is handy as I don't need to correct the original rows in the dataset.
I've raised an issue for the runtime error that is displayed when you don't select Skip for the invalid rows, as it would be good if this error was captured and not presented to end users. Not sure how easy that is to fix.
As an enhancement, it would be great to adjust the wording of the message pop-up to make it clearer that the number refers to the count of skipped rows, not the row number with the issue.
I have also noticed the Excel date time formatter is not working anymore, which I have made a new issue for.