astro-sdk icon indicating copy to clipboard operation
astro-sdk copied to clipboard

Improve load_file so users can customize table column types

Open tatiana opened this issue 3 years ago • 0 comments
trafficstars

Please describe the feature you'd like to see At the moment (Astro 0.7.0), load_file automagically detects the types of data loaded using Pandas and creates a SQL table based on this. It also assumes users want to import all the original file(s) data. While this approach works in many cases, it can bring challenges in other cases, such as #104. The task load_file should also allow users who want to have the control, to define the SQL column types of the target table.

Describe the solution you'd like Have an additional parameter in the Table definition, metadata, to define the destination table columns and their types. The metadata can be declared using SQLAlchemy metadata.

An example of how this could look like:

from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base

   Base = declarative_base()

   class Movie(Base):
        __tablename__ = 'imdb_movies'
        X = Column(Integer, Sequence('id_seq'), primary_key=True)
        Title = Column(String)
        Rating = Column(Float)
        Foreign = Column(Integer)

    imdb_movies = aql.load_file(
        path="https://raw.githubusercontent.com/astro-projects/astro/main/tests/data/imdb.csv",
        task_id="load_csv",
        output_table=Table(
            table_name="imdb_movies", database="sqlite", conn_id="sqlite_default", metadata=Movie
        ),
    )

The expected output of this DAG would be a SQL table named imdb_movies which only contains the columns of interest, with the given column types.

Acceptance Criteria

  • [ ] If the target Table does not contain metadata, we should keep the current behaviour o load_file
  • [ ] If the target table has metadata, we should use that information to create the SQL table
  • [ ] All checks and tests in the CI should pass
  • [ ] Unit tests (90% code coverage or more, once available)
  • [ ] Integration tests (if the feature relates to a new database or external service)
  • [ ] Example DAG
  • [ ] Docstrings in reStructuredText for each of methods, classes, functions and module-level attributes (including Example DAG on how it should be used)
  • [ ] Exception handling in case of errors
  • [ ] Logging (are we exposing useful information to the user? e.g. source and destination)
  • [ ] Improve the documentation (README, Sphinx, and any other relevant)
  • [ ] How to use Guide for the feature (example)
  • [ ] Typehints in functions/methods which were updated or added

tatiana avatar Mar 23 '22 14:03 tatiana