mathesar
mathesar copied to clipboard
Reduce time for preview while toggling 'Use first row as header' during import
Problem
Currently while importing any CSV or TSV in mathesar and toggling the 'Use first row as header' the Preview takes a lot of time to load.
Proposed solution
To solve this, we should:
- explore ways to make the type inference and loading pipeline quicker in general.
- Figure out a way to handle the first row of a table separately when loading an imported table from a file, and inferring the type.
- Consider adding some intelligence to our initial import that determines whether the first row looks like headers or data.
Additional context
Here's a video showing the issue:
@Anish9901 Configuration of the first row being a header should be a property of the data_file and not the imported table. I believe the APIs are structured in such a way to be REST compliant. Also, the client cannot handle patching the column names since:
- If the user configures the first row to be the header, that particular row needs to be removed from the table.
- If the user then toggles and the row is not a header, that row needs to be added back to the table.
I'll leave this to @kgodey or @mathemancer to take a look once. I've marked the issue as a draft
.
Overall, @pavish is correct. The biggest problems is that the data in the first row may affect the inferred column data type, which needs to be verified by the user during import. (This inference is what takes so long).
- If we default to having the first row included as data, and it happens to be a header, we'll likely infer every column to be
TEXT
type, regardless of the actual column data. Then changing to using the first row as the header requires re-inferring the data types. - If we default to using the first row as a header, and it happens to be data, we'd need to check that the first row string doesn't require a different data type. In the best case, we can do this as a one-off, i.e., we check the type and make sure it complies with the type inferred for the rest of the column. In non-best cases, however, we'd have to re-run the entire inference algorithm for the whole column, since failure of the first row to comply with the type inferred for the rest of the column doesn't give enough information to choose the correct type.
- It is technically possible to search back up the inference DAG and try to find a node with which the whole column including the first row complies, but that would add a lot of complexity for probably not much benefit.
The best way to reduce time when toggling that would be to default to using the first row as a header, and then hope for the best case when toggling. Toggling in the reverse direction would require re-running the inference algo in any case, though, so I'm not sure this should be worked on at this point.
@pavish @mathemancer Can we have the preview just as the frontend, so the frontend has 2 column names for each column
- Default column name
column_n
with typeTEXT
. - Using the first row as headers, with the inferred types.
This is a problem in the case of a headless CSV, since then we would never do the type inference step on the full column, no?
This is a problem in the case of a headless CSV, since then we would never do the type inference step on the full column, no?
No, the inference step is executed the first time the user imports the data.
Maybe we can just add a check for the type of the first row if it matches the types we inferred by default when we first inferred types then we save the data with inferred types else we use TEXT
.
@mathemancer I assigned this to you to follow up on and get this out of draft stage (or close if unnecessary). Please unassign yourself once complete.
@kgodey This needs some research, as the proposed solution didn't really make sense. I've added some steps that need to be taken (at least one needs to be workable) in order to get this out of draft status. I don't think it's high enough priority to do the research myself at the moment, but I'm open to seeing other proposed solutions from contributors.
A few thoughts after working on issue #1014
Description:
- a user has a file
headless_data.csv
- the user knows that the file is HEADLESS
Problem: the user is forced to wait twice for a data type guess with current implementation.
Possible solution: move (or add) check box Use first row as header
to the 1st step of the Import Data
process.
@A1O We can explore that option when we redesign the import preview page, see https://github.com/centerofci/mathesar/issues/934.
Users may not always know whether their CSV files have headers until they look at their data, though. Non-technical users may not know enough about CSV files to even know what that question means. So we'd still need a way to toggle it at the preview stage.
This issue has not been updated in 90 days and is being marked as stale.
Closing this, we can reassess performance improvements after upcoming architecture work has been completed.