feat: Add TuShare data collector (incremental update & resume support)

Open JakobWong opened this issue 2 weeks ago • 1 comments

Description

This PR introduces a new daily-frequency data collector for TuShare at qlib/scripts/data_collector/tushare/collector.py. It provides an ETL pipeline similar to the Yahoo collector, but tailored to TuShare's API and to A-share market features.

Key features:

  1. Incremental Update & Resume:
    • Supports "resume from breakpoint" by checking existing CSVs and only downloading data newer than the local max date.
    • update_data_to_bin dumps only newly added dates (using a temporary directory) instead of performing a full re-dump, which improves performance.
  2. Data Consistency:
    • Includes listed, delisted, and paused stocks (status L,D,P) to avoid survivorship bias.
    • Normalizes output to Qlib standard: date, open, high, low, close, volume, [amount], factor, symbol. amount is optional.
    • Explicitly handles duplicates and ensures monotonic dates.
  3. Robustness:
    • Prefers TuShare's trade_cal for calendar acquisition, with a fallback to Qlib's default.
    • Enforces a baseline requirement for incremental updates (raises an error instead of auto-downloading incompatible Yahoo sample data).
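
The resume and de-duplication behavior described above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: the helper names `resume_start` and `merge_rows`, the `date` column name, and the `YYYY-MM-DD` on-disk format are assumptions made for illustration.

```python
import csv
from datetime import date, datetime, timedelta
from pathlib import Path
from typing import Dict, List

DATE_FORMAT = "%Y-%m-%d"  # assumed on-disk date format


def resume_start(csv_path: Path, default_start: date) -> date:
    """Return the first date still missing from one symbol's CSV (hypothetical helper)."""
    if not csv_path.exists():
        return default_start
    with csv_path.open(newline="") as handle:
        dates = [
            datetime.strptime(row["date"], DATE_FORMAT).date()
            for row in csv.DictReader(handle)
            if row.get("date")
        ]
    # Resume one day after the newest local row; start over if the file has no rows.
    return max(dates) + timedelta(days=1) if dates else default_start


def merge_rows(
    existing: List[Dict[str, str]], fetched: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Combine old and new rows: one row per date, sorted ascending (hypothetical helper)."""
    by_date = {row["date"]: row for row in existing}
    # Newly fetched rows win when a date appears in both inputs.
    by_date.update({row["date"]: row for row in fetched})
    # ISO-formatted date strings sort chronologically when sorted as text.
    return [by_date[key] for key in sorted(by_date)]
```

Keying rows by date removes duplicates, and sorting the keys keeps the output dates strictly increasing, so both properties hold regardless of the order in which rows arrive.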

Motivation and Context

The existing collectors (Yahoo) are less stable for CN market data. Users often need a production-ready TuShare collector that supports large-scale historical fetches under API rate limits, plus daily incremental updates that do not redownload the entire history. This implementation fills that gap with a structure consistent with Qlib's existing collectors.

How Has This Been Tested?

  • [ ] Pass the test by running: pytest qlib/tests/test_all_pipeline.py from the parent directory of qlib.
  • [x] If you are adding a new feature, test on your own test scripts.

Test Details: Verified with local unit/integration tests (pytest tests/test_tushare_collector.py). Note: the test file is not included in this PR to keep it minimal, but the logic below was verified:

  1. Normalization: Validated against fixed CSV fixtures (ensuring correct column mapping, date parsing).
  2. Incremental Logic: Verified update_data_to_bin correctly identifies the incremental window and creates temp storage.
  3. Baseline Check: Confirmed that it raises RuntimeError if qlib_data_1d_dir is missing or invalid during an update.
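
As a rough illustration of the baseline check in item 3, the sketch below raises RuntimeError when the target directory does not look like a dumped Qlib dataset. The function name `require_baseline` is hypothetical, and treating the presence of the `calendars`, `features`, and `instruments` sub-directories as the validity test is an assumption; the check actually performed in this PR may differ.

```python
from pathlib import Path

# Sub-directories of a dumped Qlib dataset; using them as the validity test is an assumption.
REQUIRED_SUBDIRS = ("calendars", "features", "instruments")


def require_baseline(qlib_data_1d_dir: str) -> Path:
    """Return the dataset path, or raise RuntimeError if no usable baseline exists."""
    root = Path(qlib_data_1d_dir).expanduser()
    missing = [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]
    if missing:
        raise RuntimeError(
            f"{root} is not a valid Qlib baseline (missing: {', '.join(missing)}); "
            "run a full download/normalize/dump before an incremental update."
        )
    return root
```

Failing fast here keeps an incremental update from silently building on top of data produced by a different source.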

Screenshots of Test Results (if appropriate):

  1. Pipeline test: (Skipped; it requires a strict environment)
  2. Your own tests: (All passed locally)
    tests/test_tushare_collector.py ..... [100%]
    5 passed in 2.43s
    

Types of changes

  • [ ] Fix bugs
  • [x] Add new feature
  • [ ] Update documentation

JakobWong, Dec 09 '25 04:12

Hi @JakobWong, first of all, thank you for contributing the code. From the description, this pull request implements a lot of functionality, but it is not clear how to use these features. Would it be possible to add documentation or docstrings to help people understand how to use them?

SunsetWolf, Dec 10 '25 07:12

Hi @SunsetWolf, thanks for reviewing! I added this data collector so that users can easily collect data via the TuShare API (it currently supports only daily CN data).

I added documentation for the TuShare daily collector in qlib/scripts/data_collector/tushare/README.md, covering prerequisites (TUSHARE_TOKEN), a one-shot pipeline command, step-by-step download/normalize/dump, incremental updates, and validation. I also listed the TuShare collector in the data_collector overview.
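
For the TUSHARE_TOKEN prerequisite, a minimal sketch of reading the token from the environment might look like the following. The helper name `load_tushare_token` is hypothetical and is not taken from the PR or its README.

```python
import os


def load_tushare_token(env_var: str = "TUSHARE_TOKEN") -> str:
    """Read the TuShare API token from the environment (hypothetical helper)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the collector."
        )
    return token


# The returned token would then be handed to the TuShare client,
# for example tushare.pro_api(token).
```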

Please let me know if you’d like further details or more examples.

JakobWong, Dec 10 '25 15:12