Add bulk import mode
The bulk import mode allows administrators and developers to set up a new cluster and import existing hashes without being bottlenecked by rate limiting or Solr commits. It increases import speed significantly. This feature is spread across multiple PRs.
Changes:
Documentation:
- Added environment flag and usage documentation.
API:
- Disable rate limiting when import mode is enabled (see the sketch after this list).
- Disable search with an error message to prevent accidental production deploys.
Loader:
- Disable committing when import mode is enabled.
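To make the API and loader changes concrete, here is a minimal sketch of gating the rate limiter and the search route behind an environment flag. It assumes an Express app with express-rate-limit; the `IMPORT_MODE` flag name, route, and limits are illustrative assumptions, not the actual trace.moe-api code.

```ts
import express from "express";
import rateLimit from "express-rate-limit";

const app = express();
// Assumed flag name; enables bulk import mode when set to "1".
const importMode = process.env.IMPORT_MODE === "1";

// Skip rate limiting entirely while import mode is on.
if (!importMode) {
  app.use(rateLimit({ windowMs: 60_000, max: 60 }));
}

// Refuse searches so a node left in import mode is never mistaken
// for a production deployment.
app.all("/search", (req, res, next) => {
  if (importMode) {
    return res
      .status(503)
      .json({ error: "search is disabled: server is in bulk import mode" });
  }
  next();
});
```

On the loader side, the same flag would skip per-batch Solr commits, deferring a single commit to the end of the import.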
The import function is useful, but I want a more complete implementation of this feature. For example: how does it scan for hashes in the file system? Can it import from another server directly? What about importing only part of the database? And how can a server export its database in a standard format? Before settling the details, I'd prefer to avoid a half-made import function. But exposing the rate limit option as an env var makes sense; I think server owners want that configurable depending on their needs.
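On that last point, a minimal sketch of a configurable limit, again assuming an Express app with express-rate-limit; the `RATE_LIMIT` variable name is an assumption, not an existing trace.moe option:

```ts
import express from "express";
import rateLimit from "express-rate-limit";

const app = express();
// Requests per minute per client; falls back to 60 when the variable is unset.
const limit = Number(process.env.RATE_LIMIT ?? "60");
app.use(rateLimit({ windowMs: 60_000, max: limit }));
```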
Adding a dedicated import tool would allow administrators to select which series and episodes they would like to import into their database without providing the media files. It would be a separate web application with its own API and web pages, exposed as long as the import tool instance is running.
The import tool can have two import modes:
- Importing the provided database dumps without media
- Importing database dumps from their own acquired media
Importing provided database dumps without media
The administrators should download one of the latest database dumps and copy it to the import folder. The import tool then queries the contents and displays the available data on a web page, where administrators can decide which series and/or episodes they want to import.
The tool is allowed to access the API without rate limiting. It generates the required database entries to trigger the loader and tells it to import the data without committing it. After starting an import, the tool keeps track of the progress, and once the import process is complete, it notifies all the Solr instances to commit the imported data.
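For that final step, here is a sketch of the commit fan-out, assuming Node 18+ (global `fetch`) and a `SOLR_URLS` env var listing the core URLs (both assumptions); the `{"commit": {}}` command sent to the update handler is standard Solr:

```ts
// Hypothetical sketch: ask every Solr core to commit once all imports finish.
// SOLR_URLS is an assumed comma-separated list of core URLs.
const solrCores = (
  process.env.SOLR_URLS ?? "http://127.0.0.1:8983/solr/hashes"
).split(",");

async function commitAll(): Promise<void> {
  await Promise.all(
    solrCores.map(async (core) => {
      const res = await fetch(`${core}/update`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ commit: {} }), // standard Solr commit command
      });
      if (!res.ok) throw new Error(`commit failed for ${core}: ${res.status}`);
    }),
  );
}
```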
Importing database dumps with media
I would implement something like Riot Games has done with their patching system to quickly transform an existing library into the desired state: Supercharging Data Delivery: The New League Patcher
It indexes the current state of the media and then computes hashes for the files based on content-defined chunking (CDC). This allows them to deduplicate similar data and reduce the amount of data that needs to be downloaded.
The import tool would do something similar and could expose a temporary API to import the data from. As in the first approach, administrators can pick the series and episodes, and the tool transforms the data into the desired state.
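For illustration, here is a minimal content-defined chunking sketch using a gear rolling hash, one common CDC construction (Riot's actual chunker differs in detail; all sizes and names below are assumptions):

```ts
import { createHash, randomBytes } from "crypto";

// 256-entry gear table; a real implementation would hard-code this table so
// chunk boundaries stay stable across runs and machines.
const GEAR: number[] = Array.from({ length: 256 }, () =>
  randomBytes(4).readUInt32LE(0),
);

const MIN = 16 * 1024; // never cut before 16 KiB
const MAX = 256 * 1024; // always cut by 256 KiB
const MASK = 0x1fff; // ~8 KiB expected distance to a cut once past MIN

export function chunk(data: Buffer): { hash: string; bytes: Buffer }[] {
  const chunks: { hash: string; bytes: Buffer }[] = [];
  let start = 0;
  let h = 0;
  for (let i = 0; i < data.length; i++) {
    h = ((h << 1) + GEAR[data[i]]) >>> 0; // rolling gear hash
    const len = i - start + 1;
    if ((len >= MIN && (h & MASK) === 0) || len >= MAX || i === data.length - 1) {
      const bytes = data.subarray(start, i + 1);
      chunks.push({
        hash: createHash("sha256").update(bytes).digest("hex"),
        bytes,
      });
      start = i + 1;
      h = 0;
    }
  }
  return chunks;
}
```

Because cut points depend only on the bytes themselves, an episode that two libraries share produces identical chunk hashes, so the temporary API would only need to transfer the chunks the receiver is missing.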
Importing provided database dumps without media
I would include an SQL file in the next database dump, so admins don't need to run another script to generate it.
As you know, trace.moe doesn't have an admin UI to manage files and DB entries, so admins have to figure out their own way to include/exclude particular series. I'm still hesitant about whether this should be part of trace.moe-www or a new thing.
Importing database dumps with media
You mean the system can somehow detect whether there are media files that are the same as those hashed by trace.moe?
Regarding the rate limit issue, I think worker instances should not be rate limited at all, regardless of import mode. Occasionally I see hashers (not just loaders) hit the rate limit when a lot of short videos are uploaded at once. But then I'd have to figure out a way to authenticate whether each request comes from an authenticated worker, while making sure the worker instances can still run on a separate computer.
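One possible direction, sketched under the assumption of an Express app with express-rate-limit and a shared secret distributed to worker machines (the `WORKER_SECRET` name and `x-worker-token` header are made up for illustration):

```ts
import express from "express";
import rateLimit from "express-rate-limit";
import { timingSafeEqual } from "crypto";

const app = express();
const secret = process.env.WORKER_SECRET ?? "";

// Constant-time comparison of the worker token against the shared secret.
const isWorker = (req: express.Request): boolean => {
  const a = Buffer.from(String(req.headers["x-worker-token"] ?? ""));
  const b = Buffer.from(secret);
  return secret.length > 0 && a.length === b.length && timingSafeEqual(a, b);
};

app.use(
  rateLimit({
    windowMs: 60_000,
    max: 60,
    skip: isWorker, // authenticated hashers/loaders are never rate limited
  }),
);
```

A shared secret keeps the check independent of where the worker runs, which matches the requirement that workers may live on a separate computer.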
Since revision https://github.com/soruly/trace.moe-api/commit/16b5847669a7d4a099b92618636e2d68634820cc, the worker process is now spawned directly in trace.moe-api. The rate limit no longer affects hashing/loading, so bulk import mode is no longer needed. However, some features you suggested, like maintenance mode and skipping Solr commits, would be implemented separately, which can be tracked in the link above. I'll revise this docker-compose later, thank you.