incubator-devlake [Refactor][helper] Declarative ApiCollector

What and why to refactor

Currently, we have 2 types of ApiCollectors:

ApiCollector - Stateless API collection helper
ApiCollectorStateManager - Stateful API collection helper on top of ApiCollector which tracks the LastSuceededTime

Together they work in 4 modes:

FullSync from ApiCollector: ApiCollector offers no help with incremental collection, so the developer must implement that feature on their own.
IncrementalSync by updatedAt from ApiCollectorStateManager: helps when the API supports filtering records by the updatedAt field.
IncrementalSync by createdAt from ApiCollectorStateManager: helps when the API supports filtering records by createdAt, or when the returned records are sorted by createdAt. Useful for collecting Pipelines/Jobs that can be finalized (cannot be re-opened) and automated, short-lived entities (no human operation involved, and closed within days).
IncrementalSync by createdAt plus refresh unfinished records from ApiCollectorStateManager: builds on the previous mode and additionally refreshes unfinished records. Useful for collecting PRs when the API doesn't support filtering by updatedAt.

The problem: Developers must figure out how they work, the details/differences of each mode, and which one to use with what parameters.

Take the jenkins builds collector as an example:

func CollectApiBuilds(taskCtx plugin.SubTaskContext) errors.Error {
	data := taskCtx.GetData().(*JenkinsTaskData)
	db := taskCtx.GetDal()
	collector, err := helper.NewStatefulApiCollectorForFinalizableEntity(helper.FinalizableApiCollectorArgs{
		RawDataSubTaskArgs: helper.RawDataSubTaskArgs{
			...
		},
		ApiClient: data.ApiClient,
		CollectNewRecordsByList: helper.FinalizableApiCollectorListArgs{
			PageSize:    100,
			Concurrency: 10,
			FinalizableApiCollectorCommonArgs: helper.FinalizableApiCollectorCommonArgs{
				UrlTemplate: fmt.Sprintf("%sjob/%s/api/json", data.Options.JobPath, data.Options.JobName),
				Query: func(reqData *helper.RequestData, createdAfter *time.Time) (url.Values, errors.Error) {
					...
				},
				ResponseParser: func(res *http.Response) ([]json.RawMessage, errors.Error) {
					...
				},
			},
			GetCreated: func(item json.RawMessage) (time.Time, errors.Error) {
				...
			},
		},
		CollectUnfinishedDetails: &helper.FinalizableApiCollectorDetailArgs{
			BuildInputIterator: func() (helper.Iterator, errors.Error) {
				...
			},
			FinalizableApiCollectorCommonArgs: helper.FinalizableApiCollectorCommonArgs{
				UrlTemplate: fmt.Sprintf("%sjob/%s/{{ .Input.Number }}/api/json?tree=number,url,result,timestamp,id,duration,estimatedDuration,building",
					data.Options.JobPath, data.Options.JobName),
				ResponseParser: func(res *http.Response) ([]json.RawMessage, errors.Error) {
					...
				},
			},
		},
	})

	if err != nil {
		return err
	}

	return collector.Execute()
}

It is hard to copy this code and produce a correct new collector:

One would need to understand all collectors and all modes
One would need to understand what the API endpoint can offer

Describe the solution you'd like

The problem could be solved by offering a document with detailed descriptions/tutorials of how to use the collectors against different APIs, but that would be a huge effort for both authors and readers.

I believe a better solution is to refactor the ApiCollector and make it Declarative:

collector := &DeclarativeApiCollector{
  RawDataSubTaskArgs: ...,
  ApiClient: ...,
  TimeAfterFiltering: {
    ByUpdatedAt: {
      Supported: true/false,
      Via: QUERY_STRING,
      KeyName: "updated_at_after",
    },
    ByCreatedAt: {
      RecordFinalizable: true,  // panic if false was given
      Strategy: COLLECT_FINALIZED_RECORDS_ONLY | REFRESH_UNFINALIZED,
      Via: QUERY_STRING | SORTED_RECORDS,  // SORTED_RECORDS: returned records are sorted by `createdAt`
      KeyName: "",
      GetCreated: func(record json.RawMessage) {...},  // extract createdAt from the json
    },
  },
}

return collector.Execute()
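
As a rough illustration of how such a declaration could be mapped onto the four existing modes, here is a minimal sketch. Every type, constant, and function name in it is hypothetical and does not exist in the helper package. It also assumes an extra `Supported` flag on the createdAt block, and it returns an error where the proposal suggests a panic.

```go
package main

import (
	"errors"
	"fmt"
)

// All names below are hypothetical illustrations of the proposal.
type Via string

const (
	ViaQueryString   Via = "QUERY_STRING"
	ViaSortedRecords Via = "SORTED_RECORDS"
)

type Strategy string

const (
	CollectFinalizedRecordsOnly Strategy = "COLLECT_FINALIZED_RECORDS_ONLY"
	RefreshUnfinalized          Strategy = "REFRESH_UNFINALIZED"
)

type ByUpdatedAt struct {
	Supported bool
	Via       Via
	KeyName   string
}

type ByCreatedAt struct {
	Supported         bool // assumption: not part of the original proposal
	RecordFinalizable bool
	Strategy          Strategy
	Via               Via
	KeyName           string
}

type TimeAfterFiltering struct {
	ByUpdatedAt ByUpdatedAt
	ByCreatedAt ByCreatedAt
}

// ResolveMode validates the declaration and names the collection mode it implies.
func ResolveMode(f TimeAfterFiltering) (string, error) {
	switch {
	case f.ByUpdatedAt.Supported:
		if f.ByUpdatedAt.KeyName == "" {
			return "", errors.New("ByUpdatedAt.KeyName is required")
		}
		return "IncrementalSync by updatedAt", nil
	case f.ByCreatedAt.Supported:
		if !f.ByCreatedAt.RecordFinalizable {
			return "", errors.New("ByCreatedAt requires finalizable records")
		}
		if f.ByCreatedAt.Via == ViaQueryString && f.ByCreatedAt.KeyName == "" {
			return "", errors.New("ByCreatedAt.KeyName is required for QUERY_STRING")
		}
		if f.ByCreatedAt.Strategy == RefreshUnfinalized {
			return "IncrementalSync by createdAt plus refresh unfinished records", nil
		}
		return "IncrementalSync by createdAt", nil
	default:
		return "FullSync", nil
	}
}

func main() {
	mode, err := ResolveMode(TimeAfterFiltering{
		ByCreatedAt: ByCreatedAt{
			Supported:         true,
			RecordFinalizable: true,
			Strategy:          RefreshUnfinalized,
			Via:               ViaSortedRecords,
		},
	})
	fmt.Println(mode, err)
}
```

The point of the sketch is that the developer only states what the API supports, and the choice of mode (with its validation) happens in one place inside the helper.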

Oct 11 '23 07:10 klesh

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.

Dec 11 '23 00:12 github-actions[bot]

This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter the similar problem in the future.

Dec 18 '23 00:12 github-actions[bot]
