[Misc] Use persistent thread pool
Previously, each function call of MultiThreadCompileGrammar created its own thread pool. This could be misleading, as the total number of active worker threads might significantly exceed the configured max_threads — potentially reaching up to $n \times \text{max-threads}$ for $n$ concurrent compilation tasks.
This PR changes the implementation to use a shared global thread pool across all compilation tasks in one compiler, ensuring that the number of worker threads stays within the specified limit. For different grammar compilers, they still have their own thread pool.
Note: This change may introduce performance regressions in scenarios where the old behavior implicitly allowed over-subscription of threads, as thread usage is now strictly bounded.
Updated cc @Ubospica @Seven-Streams . The rate limit policy should be refined later to achieve a balance between fairness(FIFO) and shortest-first (greedy-execution). FIFO may cause head-of-line blocking, while shortest-first may lead to starvation (worse average latency), deteriorating grammars that needs longer compilation (worse tail latency).
The old implementation creates a new thread pool per-compilation, which is closer to the latter I guess.