feat: Add TuShare data collector (incremental update & resume support)
Description
This PR introduces a new data collector for TuShare (daily frequency) under qlib/scripts/data_collector/tushare/collector.py.
It provides a robust ETL pipeline similar to the Yahoo collector but tailored for TuShare's API and A-share market features.
Key features:
- Incremental Update & Resume:
  - Supports "resume from breakpoint" by checking existing CSVs and only downloading data newer than the local max date.
  - `update_data_to_bin` dumps only the newly added dates (using a temporary directory) to improve performance, instead of a full redump.
- Data Consistency:
  - Includes listed, delisted, and paused stocks (status `L`, `D`, `P`) to avoid survivorship bias.
  - Normalizes output to the Qlib standard: date, open, high, low, close, volume, [amount], factor, symbol; `amount` is optional.
  - Explicitly handles duplicates and ensures monotonic dates.
- Robustness:
  - Prefers TuShare's `trade_cal` for calendar acquisition, with a fallback to Qlib's default.
  - Enforces a baseline requirement for incremental updates (raises an error instead of auto-downloading incompatible Yahoo sample data).
Motivation and Context
The existing collectors (Yahoo) are less stable for CN market data. Users often need a production-ready TuShare collector that supports large-scale historical fetch (with rate limits) and daily incremental updates without redownloading entire history. This implementation fills that gap with a structure consistent with Qlib's existing collectors.
How Has This Been Tested?
- [ ] Pass the test by running: `pytest qlib/tests/test_all_pipeline.py` under the upper directory of `qlib`.
- [x] If you are adding a new feature, test on your own test scripts.
Test Details:
Verified with local unit/integration tests (`pytest tests/test_tushare_collector.py`; note: the test file is not included in this PR to keep it minimal, but the logic has been verified):
- Normalization: Validated against fixed CSV fixtures (ensuring correct column mapping, date parsing).
- Incremental Logic: Verified `update_data_to_bin` correctly identifies the incremental window and creates temp storage.
- Baseline Check: Confirmed it raises `RuntimeError` if `qlib_data_1d_dir` is missing/invalid during update.
Screenshots of Test Results (if appropriate):
- Pipeline test: (Skipped as strict environment required)
- Your own tests: all passed locally (`tests/test_tushare_collector.py ..... [100%]`, 5 passed in 2.43s).
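The normalization behavior exercised by the tests above (column mapping, date parsing, duplicate handling, monotonic dates) can be illustrated with a minimal sketch. The raw column names `ts_code`, `trade_date`, and `vol` match TuShare's daily-bar API; the `normalize` function itself is a simplified stand-in for the PR's normalizer and omits `factor` handling.

```python
import pandas as pd

# Hypothetical raw TuShare daily bars: unsorted, with one duplicated row.
raw = pd.DataFrame({
    "ts_code": ["000001.SZ", "000001.SZ", "000001.SZ"],
    "trade_date": ["20240103", "20240102", "20240102"],
    "open": [10.1, 10.0, 10.0],
    "high": [10.3, 10.2, 10.2],
    "low": [9.9, 9.8, 9.8],
    "close": [10.2, 10.1, 10.1],
    "vol": [1200.0, 1000.0, 1000.0],
    "amount": [12240.0, 10100.0, 10100.0],
})

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    # Map TuShare fields onto the Qlib standard column names.
    out = df.rename(columns={"ts_code": "symbol", "trade_date": "date", "vol": "volume"})
    out["date"] = pd.to_datetime(out["date"], format="%Y%m%d")
    # Drop duplicates and enforce monotonically increasing dates per symbol.
    out = out.drop_duplicates(subset=["symbol", "date"]).sort_values("date")
    return out.reset_index(drop=True)

norm = normalize(raw)
```

After normalization, `norm` holds one row per trading day in ascending date order with Qlib-style column names.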
Types of changes
- [ ] Fix bugs
- [x] Add new feature
- [ ] Update documentation
Hi, @JakobWong , First of all thank you for contributing the code, I see that the description of this pull request implements a lot of functionality. However, it doesn't seem to be clear how to use these features, and I was wondering if it would be possible to add documentation or a docstring to help people understand how to use them.
Hi @SunsetWolf, thanks for reviewing! The data collector I added lets users collect data easily via the TuShare API (currently supporting CN daily data only).
I added documentation for the TuShare daily collector in qlib/scripts/data_collector/tushare/README.md, covering prerequisites (TUSHARE_TOKEN), a one-shot pipeline command, step-by-step download/normalize/dump, incremental updates, and validation. I also listed the TuShare collector in the data_collector overview.
Please let me know if you’d like further details or more examples.