serving
serving copied to clipboard
ShardedReadPtrs can overflow shards_ array in unusual situations
Bug Report
If this is a bug report, please fill out the following form in full:
System information
- **OS Platform and Distribution: Linux Ubuntu 16.04
- **TensorFlow Serving installed from source
- TensorFlow Serving version: 2.2.0, git commit d22fc192c7ad7b48d9a81346224aff637b8988f1
Describe the problem
In the initialization of num_shards_ in class ShardedReadPtrs, num_shards_ is initialized to port::NumTotalCPUs() if that function succeeds, otherwise it is initialized to kRandomShards (which is 16). GetShards() then uses port::GetCurrentCPU() if that function works, or a random shard from 0-15 if GetCurrentCPU() fails.
This can result in an array overflow in the shards_ array in two cases.
One, where the server supports physical CPU hotplug, I don't think NumTotalCPUs() is guaranteed to return the total possible number of CPUs in the system. It might return the number of CPUs currently present. If CPUs are added later, GetCurrentCPU() could return a number higher than reported by NumTotalCPUs() when initialization was done, and the array index of shards_ in ShardedReadPtrs::get() will overflow.
The second case is when NumTotalCPUs() succeeds, and returns a value less than 16, but GetCurrentCPU() fails. In this case, a random shard between 0-15 will be returned by the code in GetShard(), but the value returned may be larger than the length of shards_, leading to an array index out of bounds. This probably isn't a common problem, but I work on an odd environment where NumTotalCPUs() was able to get answer, but GetCurrentCPU() failed, which exposed this problem.
A simple fix for both problems might be to have GetShards() enforce that whatever value it returns should be <= num_shards_.
Exact Steps to Reproduce
Unless you have access to a system with physical CPU hotplug, you would probably have to modify tensorflow's port::NumTotalCPUs() or port::GetCurrentCPU() so that NumTotalCPUs() returns a value less than 16, and GetCurrentCPU() fails, or GetCurrentCPU() returns a value less than NumTotalCPUs(). Then run a request against the server.
Source code / logs
All of the relevant code is in class ShardedReadPtrs in fast_read_dynamic_ptr.h.
Thanks for the bug report. Happy to review a PR to fix this problem.
@arai-fortanix,
Did you got a chance to submit a PR for fixing the issue. Kindly let us know if the issue persists. Thank you!
I believe this bug still exists, but I do not have time to work on a fix.