astro-sdk
astro-sdk copied to clipboard
Improve load_file so users can customize table column types
Please describe the feature you'd like to see
At the moment (Astro 0.7.0), load_file automagically detects the types of data loaded using Pandas and creates a SQL table based on this. It also assumes users want to import all the original file(s) data. While this approach works in many cases, it can bring challenges in other cases, such as #104. The task load_file should also allow users who want to have the control, to define the SQL column types of the target table.
Describe the solution you'd like
Have an additional parameter in the Table definition, metadata, to define the destination table columns and their types. The metadata can be declared using SQLAlchemy metadata.
An example of how this could look like:
from sqlalchemy import Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base
Base = declarative_base()
class Movie(Base):
__tablename__ = 'imdb_movies'
X = Column(Integer, Sequence('id_seq'), primary_key=True)
Title = Column(String)
Rating = Column(Float)
Foreign = Column(Integer)
imdb_movies = aql.load_file(
path="https://raw.githubusercontent.com/astro-projects/astro/main/tests/data/imdb.csv",
task_id="load_csv",
output_table=Table(
table_name="imdb_movies", database="sqlite", conn_id="sqlite_default", metadata=Movie
),
)
The expected output of this DAG would be a SQL table named imdb_movies which only contains the columns of interest, with the given column types.
Acceptance Criteria
- [ ] If the target Table does not contain
metadata, we should keep the current behaviour oload_file - [ ] If the target table has
metadata, we should use that information to create the SQL table - [ ] All checks and tests in the CI should pass
- [ ] Unit tests (90% code coverage or more, once available)
- [ ] Integration tests (if the feature relates to a new database or external service)
- [ ] Example DAG
- [ ] Docstrings in reStructuredText for each of methods, classes, functions and module-level attributes (including Example DAG on how it should be used)
- [ ] Exception handling in case of errors
- [ ] Logging (are we exposing useful information to the user? e.g. source and destination)
- [ ] Improve the documentation (README, Sphinx, and any other relevant)
- [ ] How to use Guide for the feature (example)
- [ ] Typehints in functions/methods which were updated or added