BentoML
                                
feat: implement batching strategies
This adds a new configuration value, runner.batching.target_latency_ms, which controls how long the dispatcher will wait before beginning to execute requests.
This could probably do with a little testing to see how setting it to 0 performs versus leaving it as ~ (null), but for now adding more knobs users can tweak is probably a good thing. I suspect there will be at least a few people who want the behavior of infinite max latency but not long wait times for requests after a burst.
EDIT: This PR has now been updated to provide a strategy option in the configuration, which allows a user to define which strategy they would like to use.
/cc @timliubentoml
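For readers following along, the wait decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `should_dispatch` and its parameters are hypothetical names, and the handling of an unset (`~`) value is a guess made only so the sketch is complete.

```python
from typing import Optional


def should_dispatch(
    queue_len: int,
    oldest_wait_ms: float,
    max_batch_size: int,
    target_latency_ms: Optional[float],
) -> bool:
    """Decide whether a hypothetical dispatcher should stop waiting and run a batch."""
    if queue_len == 0:
        return False  # nothing queued, nothing to execute
    if queue_len >= max_batch_size:
        return True  # batch is full, so waiting longer cannot add requests
    if target_latency_ms is None:
        # Assumption: an unset value defers to some other heuristic in the
        # dispatcher. This sketch simply keeps waiting for a full batch.
        return False
    # The oldest queued request has waited at least the target: start executing.
    return oldest_wait_ms >= target_latency_ms
```

Under this sketch, a target of 0 dispatches as soon as any request is queued, which is the case the description suggests comparing against the unset default.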
Codecov Report
Merging #3630 (9db629e) into main (33c8440) will increase coverage by 31.85%. Report is 112 commits behind head on main. The diff coverage is 9.09%.
:exclamation: Current head 9db629e differs from pull request most recent head 56088fe. Consider uploading reports for the commit 56088fe to get more accurate results
@@            Coverage Diff             @@
##            main    #3630       +/-   ##
==========================================
+ Coverage   0.00%   31.85%   +31.85%     
==========================================
  Files        166      146       -20     
  Lines      15286    12038     -3248     
  Branches       0     1989     +1989     
==========================================
+ Hits           0     3835     +3835     
+ Misses     15286     7928     -7358     
- Partials       0      275      +275     
| Files Changed | Coverage Δ | |
|---|---|---|
| src/bentoml/_internal/configuration/v1/__init__.py | 48.83% <ø> (+48.83%) | :arrow_up: | 
| src/bentoml/_internal/marshal/dispatcher.py | 0.00% <0.00%> (ø) | |
| src/bentoml/_internal/models/model.py | 77.59% <ø> (+77.59%) | :arrow_up: | 
| src/bentoml/_internal/server/runner_app.py | 0.00% <ø> (ø) | |
| src/bentoml/triton.py | 0.00% <ø> (ø) | |
| src/bentoml/_internal/runner/runner.py | 56.61% <66.66%> (+56.61%) | :arrow_up: | 
Oh, I'd forgotten about ruff. Man, it checks fast :sweat_smile:
This one is waiting on me to change some naming around, need to get to that.
Had some discussion about this PR with Sauyon. These are the decisions:
- add back pressure handling logic to the new strategy
- adjust the refactoring: move statistical regression into the Intelligent Wait strategy
- move max_batch_size and max_latency into strategy_options
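One way the decisions above could fit together is sketched below. Everything here is hypothetical and for discussion only: the class names, the `"target_latency"` key, the `should_shed` back pressure rule, and the `max_latency_ms` spelling are assumptions, not what the PR implements.

```python
from typing import Any, Dict, Type


class BatchingStrategy:
    """Hypothetical base class; every strategy receives the shared limits."""

    def __init__(self, max_batch_size: int, max_latency_ms: float) -> None:
        self.max_batch_size = max_batch_size
        self.max_latency_ms = max_latency_ms

    def should_shed(self, oldest_wait_ms: float) -> bool:
        # Back pressure (assumed rule): a request that has already waited
        # past max_latency_ms is rejected instead of being executed late.
        return oldest_wait_ms > self.max_latency_ms

    def should_dispatch(self, queue_len: int, oldest_wait_ms: float) -> bool:
        raise NotImplementedError


class TargetLatencyStrategy(BatchingStrategy):
    def __init__(
        self,
        max_batch_size: int,
        max_latency_ms: float,
        target_latency_ms: float = 0.0,
    ) -> None:
        super().__init__(max_batch_size, max_latency_ms)
        self.target_latency_ms = target_latency_ms

    def should_dispatch(self, queue_len: int, oldest_wait_ms: float) -> bool:
        if queue_len == 0:
            return False
        full = queue_len >= self.max_batch_size
        return full or oldest_wait_ms >= self.target_latency_ms


# Registry keyed by the value of the `strategy` option.
STRATEGIES: Dict[str, Type[BatchingStrategy]] = {
    "target_latency": TargetLatencyStrategy,
}


def build_strategy(batching_config: Dict[str, Any]) -> BatchingStrategy:
    """Instantiate the configured strategy from its strategy_options."""
    name = batching_config["strategy"]
    options = batching_config.get("strategy_options", {})
    if name not in STRATEGIES:
        raise ValueError(f"unknown batching strategy: {name!r}")
    return STRATEGIES[name](**options)
```

Keeping max_batch_size and max_latency inside strategy_options means each strategy validates its own arguments, and a strategy that does not need one of them can simply omit it from its constructor.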
@bojiang this should be ok to look at for now, broad strokes.
I think this should be ready for review now if anybody wants to take a look (@bojiang I implemented wait time).
Once I add some tests I'll probably factor this into separate commits.
status: We probably want a load test before merging this one in.
Is this likely to be reviewed and merged?
Hello @sauyon! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file src/bentoml/_internal/marshal/dispatcher.py:
  - Line 72:80: E501 line too long (83 > 79 characters)
  - Line 103:80: E501 line too long (101 > 79 characters)
  - Line 131:80: E501 line too long (85 > 79 characters)
  - Line 162:80: E501 line too long (84 > 79 characters)
  - Line 213:80: E501 line too long (92 > 79 characters)
  - Line 319:80: E501 line too long (88 > 79 characters)
  - Line 476:80: E501 line too long (87 > 79 characters)
  - Line 541:80: E501 line too long (82 > 79 characters)
  - Line 558:80: E501 line too long (81 > 79 characters)
- In the file src/bentoml/_internal/runner/runner.py:
  - Line 202:80: E501 line too long (107 > 79 characters)
  - Line 203:80: E501 line too long (111 > 79 characters)
  - Line 205:80: E501 line too long (110 > 79 characters)
  - Line 267:80: E501 line too long (125 > 79 characters)
  - Line 271:80: E501 line too long (86 > 79 characters)
  - Line 274:80: E501 line too long (122 > 79 characters)
  - Line 284:80: E501 line too long (80 > 79 characters)
  - Line 286:80: E501 line too long (158 > 79 characters)
  - Line 300:80: E501 line too long (83 > 79 characters)