mathesar icon indicating copy to clipboard operation
mathesar copied to clipboard

Reduce time for preview while toggling 'Use first row as header' during import

Open Anish9901 opened this issue 2 years ago • 10 comments

Problem

Currently while importing any CSV or TSV in mathesar and toggling the 'Use first row as header' the Preview takes a lot of time to load.

Proposed solution

To solve this, we should:

  • explore ways to make the type inference and loading pipeline quicker in general.
  • Figure out a way to handle the first row of a table separately when loading an imported table from a file, and inferring the type.
  • Consider adding some intelligence to our initial import that determines whether the first row looks like headers or data.

Additional context

Here's a video showing the issue:

Untitled design

Anish9901 avatar Apr 01 '22 15:04 Anish9901

@Anish9901 Configuration of the first row being a header should be a property of the data_file and not the imported table. I believe the APIs are structured in such a way to be REST compliant. Also, the client cannot handle patching the column names since:

  • If the user configures the first row to be the header, that particular row needs to be removed from the table.
  • If the user then toggles and the row is not a header, that row needs to be added back to the table.

I'll leave this to @kgodey or @mathemancer to take a look once. I've marked the issue as a draft.

pavish avatar Apr 01 '22 17:04 pavish

Overall, @pavish is correct. The biggest problems is that the data in the first row may affect the inferred column data type, which needs to be verified by the user during import. (This inference is what takes so long).

  • If we default to having the first row included as data, and it happens to be a header, we'll likely infer every column to be TEXT type, regardless of the actual column data. Then changing to using the first row as the header requires re-inferring the data types.
  • If we default to using the first row as a header, and it happens to be data, we'd need to check that the first row string doesn't require a different data type. In the best case, we can do this as a one-off, i.e., we check the type and make sure it complies with the type inferred for the rest of the column. In non-best cases, however, we'd have to re-run the entire inference algorithm for the whole column, since failure of the first row to comply with the type inferred for the rest of the column doesn't give enough information to choose the correct type.
    • It is technically possible to search back up the inference DAG and try to find a node with which the whole column including the first row complies, but that would add a lot of complexity for probably not much benefit.

mathemancer avatar Apr 04 '22 03:04 mathemancer

The best way to reduce time when toggling that would be to default to using the first row as a header, and then hope for the best case when toggling. Toggling in the reverse direction would require re-running the inference algo in any case, though, so I'm not sure this should be worked on at this point.

mathemancer avatar Apr 04 '22 03:04 mathemancer

@pavish @mathemancer Can we have the preview just as the frontend, so the frontend has 2 column names for each column

  • Default column name column_n with type TEXT.
  • Using the first row as headers, with the inferred types.

Anish9901 avatar Apr 04 '22 06:04 Anish9901

This is a problem in the case of a headless CSV, since then we would never do the type inference step on the full column, no?

mathemancer avatar Apr 04 '22 06:04 mathemancer

This is a problem in the case of a headless CSV, since then we would never do the type inference step on the full column, no?

No, the inference step is executed the first time the user imports the data. Maybe we can just add a check for the type of the first row if it matches the types we inferred by default when we first inferred types then we save the data with inferred types else we use TEXT.

Anish9901 avatar Apr 04 '22 07:04 Anish9901

@mathemancer I assigned this to you to follow up on and get this out of draft stage (or close if unnecessary). Please unassign yourself once complete.

kgodey avatar Apr 04 '22 20:04 kgodey

@kgodey This needs some research, as the proposed solution didn't really make sense. I've added some steps that need to be taken (at least one needs to be workable) in order to get this out of draft status. I don't think it's high enough priority to do the research myself at the moment, but I'm open to seeing other proposed solutions from contributors.

mathemancer avatar Apr 07 '22 08:04 mathemancer

A few thoughts after working on issue #1014

Description:

  • a user has a file headless_data.csv
  • the user knows that the file is HEADLESS

Problem: the user is forced to wait twice for a data type guess with current implementation. Untitled-2

Possible solution: move (or add) check box Use first row as header to the 1st step of the Import Data process. Untitled-1

A1O avatar Apr 07 '22 12:04 A1O

@A1O We can explore that option when we redesign the import preview page, see https://github.com/centerofci/mathesar/issues/934.

Users may not always know whether their CSV files have headers until they look at their data, though. Non-technical users may not know enough about CSV files to even know what that question means. So we'd still need a way to toggle it at the preview stage.

kgodey avatar Apr 07 '22 21:04 kgodey

This issue has not been updated in 90 days and is being marked as stale.

github-actions[bot] avatar Apr 24 '23 21:04 github-actions[bot]

Closing this, we can reassess performance improvements after upcoming architecture work has been completed.

kgodey avatar Feb 02 '24 17:02 kgodey