serving icon indicating copy to clipboard operation
serving copied to clipboard

ShardedReadPtrs can overflow shards_ array in unusual situations

Open arai-fortanix opened this issue 5 years ago • 3 comments

Bug Report

If this is a bug report, please fill out the following form in full:

System information

  • **OS Platform and Distribution: Linux Ubuntu 16.04
  • **TensorFlow Serving installed from source
  • TensorFlow Serving version: 2.2.0, git commit d22fc192c7ad7b48d9a81346224aff637b8988f1

Describe the problem

In the initialization of num_shards_ in class ShardedReadPtrs, num_shards_ is initialized to port::NumTotalCPUs() if that function succeeds, otherwise it is initialized to kRandomShards (which is 16). GetShards() then uses port::GetCurrentCPU() if that function works, or a random shard from 0-15 if GetCurrentCPU() fails.

This can result in an array overflow in the shards_ array in two cases.

One, where the server supports physical CPU hotplug, I don't think NumTotalCPUs() is guaranteed to return the total possible number of CPUs in the system. It might return the number of CPUs currently present. If CPUs are added later, GetCurrentCPU() could return a number higher than reported by NumTotalCPUs() when initialization was done, and the array index of shards_ in ShardedReadPtrs::get() will overflow.

The second case is when NumTotalCPUs() succeeds, and returns a value less than 16, but GetCurrentCPU() fails. In this case, a random shard between 0-15 will be returned by the code in GetShard(), but the value returned may be larger than the length of shards_, leading to an array index out of bounds. This probably isn't a common problem, but I work on an odd environment where NumTotalCPUs() was able to get answer, but GetCurrentCPU() failed, which exposed this problem.

A simple fix for both problems might be to have GetShards() enforce that whatever value it returns should be <= num_shards_.

Exact Steps to Reproduce

Unless you have access to a system with physical CPU hotplug, you would probably have to modify tensorflow's port::NumTotalCPUs() or port::GetCurrentCPU() so that NumTotalCPUs() returns a value less than 16, and GetCurrentCPU() fails, or GetCurrentCPU() returns a value less than NumTotalCPUs(). Then run a request against the server.

Source code / logs

All of the relevant code is in class ShardedReadPtrs in fast_read_dynamic_ptr.h.

arai-fortanix avatar Aug 10 '20 17:08 arai-fortanix

Thanks for the bug report. Happy to review a PR to fix this problem.

netfs avatar May 06 '21 22:05 netfs

@arai-fortanix,

Did you got a chance to submit a PR for fixing the issue. Kindly let us know if the issue persists. Thank you!

singhniraj08 avatar Apr 12 '23 10:04 singhniraj08

I believe this bug still exists, but I do not have time to work on a fix.

arai-fortanix avatar Apr 17 '23 22:04 arai-fortanix