improve `progress` in normalize and load steps

Open rudolfix opened this issue 7 months ago • 0 comments

Background

Progress reporting in the normalize and load steps is far from perfect.

  1. In normalize we report progress at the file level, but it is only updated when a worker process finishes.
  2. In load, the reported metrics do not survive restarts (see #853).

Tasks

Step 1. Fix normalize:

  • Use the metrics collected in extract (per job and resource) to correctly report processed rows per resource (where we also have the total number of records).
  • Right now there is no communication between the worker and the main process, but we need to start reporting metrics back, so we need to update
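A minimal sketch of how workers could report row counts back to the main process, assuming a queue shared between processes. The names `RowProgress`, `report_rows`, `drain_progress`, and `format_progress` are hypothetical and not part of dlt; the per-resource totals are assumed to come from the extract metrics.

```python
import queue
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RowProgress:
    """Hypothetical message a normalize worker sends to the main process."""
    resource: str
    rows: int


def report_rows(progress_queue, resource: str, rows: int) -> None:
    # Called inside the worker after each processed batch,
    # not only when the whole file is finished.
    progress_queue.put(RowProgress(resource, rows))


def drain_progress(progress_queue, processed: Dict[str, int]) -> Dict[str, int]:
    # Called periodically in the main process; never blocks.
    while True:
        try:
            msg = progress_queue.get_nowait()
        except queue.Empty:
            return processed
        processed[msg.resource] = processed.get(msg.resource, 0) + msg.rows


def format_progress(processed: Dict[str, int], totals: Dict[str, int]) -> Dict[str, str]:
    # Resources without a known total are shown as a bare row count.
    out = {}
    for resource, done in processed.items():
        total = totals.get(resource)
        out[resource] = f"{done}/{total}" if total else f"{done}"
    return out
```

With real worker processes the queue would have to be shareable across processes, for example one created with `multiprocessing.Manager().Queue()`; a plain `queue.Queue` is enough to try the logic in a single process.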

Step 2. Fix load:

  • See #853: use package state to track the elapsed times (task created, start and stop of job).
  • We are interested in displaying the following metrics: jobs processed, average elapsed time, average lag (from job created to job started).
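The three load metrics can be derived from per-job timestamps. The sketch below assumes each job records when it was created, started, and finished; `JobTimes` and `summarize_jobs` are hypothetical names, not dlt's API, and how the timestamps are stored in the package state is left open.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Optional


@dataclass
class JobTimes:
    """Hypothetical per-job timestamps in seconds."""
    created_at: float
    started_at: Optional[float] = None
    finished_at: Optional[float] = None


def summarize_jobs(jobs: Iterable[JobTimes]) -> dict:
    jobs = list(jobs)
    started = [j for j in jobs if j.started_at is not None]
    finished = [j for j in started if j.finished_at is not None]
    return {
        # A job counts as processed once it has both start and stop times.
        "jobs_processed": len(finished),
        # Elapsed time: from job start to job stop.
        "avg_elapsed": mean(j.finished_at - j.started_at for j in finished) if finished else None,
        # Lag: from job created to job started.
        "avg_lag": mean(j.started_at - j.created_at for j in started) if started else None,
    }
```

Because the summary is computed only from stored timestamps, it can be rebuilt after a restart as long as those timestamps were persisted.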

Implementation

  1. You'll need to use package state to store extract metrics (ExtractInfo) and normalize metrics.
  2. If those elements are not present in the state, you must fall back gracefully, i.e. report only the progress of the files. The job processing must stay plain: if there are files, they will be processed even if the state is not present.
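The graceful fallback in point 2 could look roughly like this. The state layout and the key names `extract_metrics` and `rows_per_resource` are assumptions made for illustration; they are not dlt's actual package state schema.

```python
from typing import Any, Dict, Optional


def normalize_progress(
    state: Optional[Dict[str, Any]], files_done: int, files_total: int
) -> Dict[str, Any]:
    # File-level progress is always reported, with or without state.
    progress: Dict[str, Any] = {"files": (files_done, files_total)}
    # Hypothetical keys: row totals are added only when the state has them.
    rows = ((state or {}).get("extract_metrics") or {}).get("rows_per_resource")
    if rows:
        progress["rows_total"] = dict(rows)
    return progress
```

A missing or empty state only reduces what is displayed; it never decides whether files get processed.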

ADDITIONAL THOUGHTS (@IlyaFaer): There are two different cases:

  • We extract and then normalize data: in this case we can take the row counts from ExtractInfo.
  • We normalize data that was extracted earlier.

rudolfix, Jul 09 '24 09:07