D.C. Hess
@zkagin Can we prioritize this piece of logic before moving forward more on the syncing logic?
@zkagin Yeah. I think we'd want to maintain a list of data that has previously come through so the truncation of the main table doesn't result in data loss. @denglender...
@zkagin Thinking about this more, I'm wondering if the deleted column is necessary? Shouldn't IDs be unique? Couldn't we append then remove duplicate IDs preserving the matching ID with the...
@zkagin That makes sense to me (minimizing db writes) but I'm also concerned about memory limitations by keeping the diffing in local memory. I've been hitting some memory thresholds on...
@zkagin I'm not sure it's a delete in all cases. In some it's more about preserving previously pulled data (i.e. when the start date changes). Memory has been an issue...
@zkagin Check the docs for sqlsorcery; I have some examples there of how to do updates/deletes by dipping into sqlalchemy functions.
@zkagin Updates: https://sqlsorcery.readthedocs.io/en/latest/cookbook/etl.html#update-table-values Deletes: https://sqlsorcery.readthedocs.io/en/latest/cookbook/etl.html#delete-specific-records
@zkagin One approach would be to:
- append new records to the table
- query all records with duplicate IDs and their updateTime
- delete records with the MIN...
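A minimal sketch of the append-then-dedupe steps above, not the project's actual implementation. It uses an in-memory SQLite database from the standard library rather than sqlsorcery, and the table name `CourseWork` and the `title` column are hypothetical; only `id` and `updateTime` come from the discussion. It keeps the row with the latest `updateTime` per ID, which for a pair of duplicates is the same as deleting the one with the minimum `updateTime`. It assumes `updateTime` is stored as ISO 8601 text in a single consistent format, so that string comparison matches chronological order.

```python
import sqlite3

TABLE = "CourseWork"  # hypothetical table name, for illustration only


def append_and_dedupe(conn, rows):
    """Append rows, then delete the older copies of any duplicated id."""
    # Step 1: append the newly pulled records without touching existing rows.
    conn.executemany(
        f"INSERT INTO {TABLE} (id, title, updateTime) VALUES (?, ?, ?)", rows
    )
    # Steps 2 and 3: delete every row that has a newer sibling with the same id.
    # The rowid comparison breaks ties between exact duplicates.
    conn.execute(
        f"""
        DELETE FROM {TABLE}
        WHERE EXISTS (
            SELECT 1 FROM {TABLE} AS newer
            WHERE newer.id = {TABLE}.id
              AND (newer.updateTime > {TABLE}.updateTime
                   OR (newer.updateTime = {TABLE}.updateTime
                       AND newer.rowid > {TABLE}.rowid))
        )
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {TABLE} (id TEXT, title TEXT, updateTime TEXT)")

# First sync: two records.
append_and_dedupe(conn, [
    ("1", "Essay", "2020-01-01T00:00:00Z"),
    ("2", "Quiz", "2020-01-02T00:00:00Z"),
])
# Second sync: record 1 comes back with a later updateTime.
append_and_dedupe(conn, [("1", "Essay v2", "2020-02-01T00:00:00Z")])

rows = conn.execute(f"SELECT id, title FROM {TABLE} ORDER BY id").fetchall()
```

Because the dedupe runs inside the database, nothing has to be diffed in local memory, which speaks to the memory concern raised earlier in the thread. Records that are absent from a later pull are left in place, so previously pulled data is preserved.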
@zkagin I think we can accomplish this simply by creating a new method for Courses:
```python
def get_recent_course_ids(self):
    try:
        coursework = pd.read_sql_table("GoogleClassroom_CourseWork", con=self.sql.engine, schema=self.sql.schema)
        courses = coursework.loc[coursework.creationTime >= self.config.SCHOOL_YEAR_START]
        return...
```
Actually, we may want to use a different env var for this date.