ckanext-xloader icon indicating copy to clipboard operation
ckanext-xloader copied to clipboard

Table remains in Datastore after failed load due to column mismatch

Open dpuertolas opened this issue 9 months ago • 2 comments

Table remains in Datastore after failed load due to column mismatch

Description

Hi,

I'm experiencing an issue when uploading an XLS file with ckanext-xloader. The file is processed correctly, and a table is created in the Datastore. However, if the file has structural issues (e.g., more data columns than headers), an error occurs during loading:

Error: ('Found data in column %s but resource only has %s header(s)', 20, 19)

This behavior is expected, as the file format (XLS) is supported, but its internal structure makes it incompatible with Datastore.

The problem is that, even after the error, the table remains in the Datastore, and CKAN continues to offer the available formats (CSV, TSV, JSON, etc.). The only way to fix this is to manually delete the table. However, when xloader retries, it creates the table again and fails again, keeping the incorrect state.

Suggested Fix

Would it make sense to modify xloader so that it automatically deletes the table when such an error occurs? Something like:

 delete_datastore_resource(resource_id)  # Remove table if an error occurs
 raise LoaderError("Found data in column %s but resource only has %s header(s)",
                                    index + 1, header_count)

This would prevent CKAN from exposing an incomplete or invalid Datastore resource.

Let me know if this is the expected behavior or if there's a better approach to handle this case.

Thanks!

dpuertolas avatar Mar 07 '25 09:03 dpuertolas

When the datastore is used in a highly visible app, its better to keep the old datastore table then to have a broken table/datastore breaking downstream usage.

The second reason is that it does not delete is that the current 'column' types is stored as comments against said columns. If the datastore is table is 'purged' then that data is lost when you do load a valid file and the next upload may not set timestamps or number fields correcrtly (leaving them as strings breaking sql queries)

You can plug in the ckanext-validation plugin which allows you to set the schema and provide validation error reports at upload time or after the fact (with blocking xloader till its valid).

Usually if you wish to delete a datastore i.e. blanking, then you can upload a header only csv. A datastore action to 'purge' datastore on the datastore page may be more useful.

duttonw avatar Mar 07 '25 11:03 duttonw

However, when xloader retries, it creates the table again and fails again, keeping the incorrect state.

Well, if something has triggered XLoader to retry, then that probably means that the resource has changed in such a way that it might theoretically work the second time. So, disabling that behaviour doesn't seem ideal.

ThrawnCA avatar Jul 17 '25 03:07 ThrawnCA