feat: Add TuShare data collector (incremental update & resume support)
Description
This PR introduces a new data collector for TuShare (daily frequency) under qlib/scripts/data_collector/tushare/collector.py.
It provides a robust ETL pipeline similar to the Yahoo collector but tailored for TuShare's API and A-share market features.
Key features:
- Incremental Update & Resume:
  - Supports "resume from breakpoint" by checking existing CSVs and only downloading data newer than the local max date.
  - `update_data_to_bin` dumps only the newly added dates (using a temporary directory) to improve performance, instead of a full redump.
- Data Consistency:
  - Includes listed, delisted, and paused stocks (status `L`, `D`, `P`) to avoid survivorship bias.
  - Normalizes output to the Qlib standard: date, open, high, low, close, volume, [amount], factor, symbol; `amount` is optional.
  - Explicitly handles duplicates and ensures monotonic dates.
- Robustness:
  - Prefers TuShare's `trade_cal` for calendar acquisition, with a fallback to Qlib's default.
  - Enforces a baseline requirement for incremental updates (raises an error instead of auto-downloading incompatible Yahoo sample data).
Motivation and Context
The existing collectors (Yahoo) are less stable for CN market data. Users often need a production-ready TuShare collector that supports large-scale historical fetch (with rate limits) and daily incremental updates without redownloading entire history. This implementation fills that gap with a structure consistent with Qlib's existing collectors.
How Has This Been Tested?
- [ ] Pass the test by running: `pytest qlib/tests/test_all_pipeline.py` under the upper directory of `qlib`.
- [x] If you are adding a new feature, test on your own test scripts.
Test Details:
Verified with local unit/integration tests (`pytest tests/test_tushare_collector.py`; note: the test file is not included in this PR to keep it minimal, but the logic has been verified):
- Normalization: Validated against fixed CSV fixtures (ensuring correct column mapping, date parsing).
- Incremental Logic: Verified `update_data_to_bin` correctly identifies the incremental window and creates temp storage.
- Baseline Check: Confirmed it raises `RuntimeError` if `qlib_data_1d_dir` is missing/invalid during update.
Screenshots of Test Results (if appropriate):
- Pipeline test: (Skipped as strict environment required)
- Your own tests: all passed locally (`tests/test_tushare_collector.py ..... [100%]`, 5 passed in 2.43s).
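The normalization behavior exercised by the tests above (column mapping, date parsing, duplicate handling, monotonic dates) can be illustrated with a minimal sketch. The raw column names `ts_code`, `trade_date`, and `vol` match TuShare's daily-bar API; the `normalize` function itself is a simplified stand-in for the PR's normalizer and omits `factor` handling.

```python
import pandas as pd

# Hypothetical raw TuShare daily bars: unsorted, with one duplicated row.
raw = pd.DataFrame({
    "ts_code": ["000001.SZ", "000001.SZ", "000001.SZ"],
    "trade_date": ["20240103", "20240102", "20240102"],
    "open": [10.1, 10.0, 10.0],
    "high": [10.3, 10.2, 10.2],
    "low": [9.9, 9.8, 9.8],
    "close": [10.2, 10.1, 10.1],
    "vol": [1200.0, 1000.0, 1000.0],
    "amount": [12240.0, 10100.0, 10100.0],
})

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    # Map TuShare fields onto the Qlib standard column names.
    out = df.rename(columns={"ts_code": "symbol", "trade_date": "date", "vol": "volume"})
    out["date"] = pd.to_datetime(out["date"], format="%Y%m%d")
    # Drop duplicates and enforce monotonically increasing dates per symbol.
    out = out.drop_duplicates(subset=["symbol", "date"]).sort_values("date")
    return out.reset_index(drop=True)

norm = normalize(raw)
```

After normalization, `norm` holds one row per trading day in ascending date order with Qlib-style column names.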
Types of changes
- [ ] Fix bugs
- [x] Add new feature
- [ ] Update documentation
Hi, @JakobWong , First of all thank you for contributing the code, I see that the description of this pull request implements a lot of functionality. However, it doesn't seem to be clear how to use these features, and I was wondering if it would be possible to add documentation or a docstring to help people understand how to use them.
Hi @SunsetWolf, thanks for reviewing! The data collector I added lets users collect data easily via the TuShare API (currently supporting CN daily data only).
I added documentation for the TuShare daily collector in qlib/scripts/data_collector/tushare/README.md, covering prerequisites (TUSHARE_TOKEN), a one-shot pipeline command, step-by-step download/normalize/dump, incremental updates, and validation. I also listed the TuShare collector in the data_collector overview.
Please let me know if you’d like further details or more examples.