
Why is the .NET version slower than the Python version?

Open Mghobadid opened this issue 4 years ago • 10 comments

Hi guys, I have an SSD-Lite model .pb file. It was trained with Python.

In Python (Anaconda) with the CPU version of TensorFlow, CPU usage stays under 50%, but with TensorFlow.NET on CPU it rises up to 80%.

Where is the problem?

Mghobadid avatar Jan 28 '20 13:01 Mghobadid

Can you provide an example? It should not be slower than the Python version.

Oceania2018 avatar Jan 28 '20 19:01 Oceania2018

I also noticed this. However, the slowdown is not coming from TensorFlow.NET itself. It is coming from NumSharp and from data preparation costs. If data prep and NumSharp operations run in series with training on a single thread, then you are not going to be utilizing the GPU at 100% of its capability.

Also, the .NET implementation of NumSharp, for obvious reasons, cannot do some of the mind_boggle.not_type_safe.cast_pig_into_bird tricks that Python allows, after which debugging any non-trivial Python code becomes brain_fried.developer.kill_me_now.

If you add a System.Diagnostics.Stopwatch around the call to sess.run:

`var results = sess.run(outTensorArr, new FeedItem(image_tensor, image_np));`

and another one around:

`image_np = image_np.reshape(1, frame.shape[0], frame.shape[1], 3);`

you will see that a significant amount of time is spent on data preparation, almost equal to the time spent on the actual TensorFlow call. What I see, for example: to fully feed (100% CUDA utilization) a variational autoencoder running on a GTX 1080 Ti, the overhead of the NumSharp slice, resize, and shuffle operations, with a batch size of 64 and a data size of 12288 floats (a 64 x 64 pixel RGB image), easily consumes 100% (all 8 cores) of an i7 4790K overclocked to 4.4 GHz.
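To make that split measurable, the two stopwatches could be arranged like this. This is only a sketch; it assumes `sess`, `outTensorArr`, `image_tensor`, `frame`, and `image_np` already exist as in the snippets above.

```csharp
// Assumes sess, outTensorArr, image_tensor, frame and image_np exist as in the snippets above.
var prepWatch = System.Diagnostics.Stopwatch.StartNew();
image_np = image_np.reshape(1, frame.shape[0], frame.shape[1], 3); // data preparation
prepWatch.Stop();

var runWatch = System.Diagnostics.Stopwatch.StartNew();
var results = sess.run(outTensorArr, new FeedItem(image_tensor, image_np)); // the actual TensorFlow call
runWatch.Stop();

Console.WriteLine($"prep: {prepWatch.ElapsedMilliseconds} ms, run: {runWatch.ElapsedMilliseconds} ms");
```

Comparing the two elapsed times over a few hundred iterations gives a much better picture than a single measurement, since the first sess.run call includes one-off graph initialization costs.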

However, the power of having all this TensorFlow in .NET lies in utilizing the full power of a programming language like C#. For example, multi-threading anything non-trivial in Python is just a nightmare, and I have previously burned months trying to get it to work decently.

Try moving the data preparation into a separate thread that queues the prepared data into a thread-safe queue. Then, in your training thread, just dequeue the prepared data and feed it into TensorFlow:

This is what I use for a thread-safe Queue&lt;NDArray&gt;, where you can have one or more threads writing to it and a separate thread (or threads) reading from it and feeding TensorFlow.

Note: I've used ReaderWriterLockSlim rather than Monitor or lock because this is called from tight loops and performance is important. Do not use ReaderWriterLock, because that is slower than Monitor or lock.

Note: The reason for Queue&lt;T&gt; is that, for what we are doing, it is much faster than List&lt;T&gt;. The only thing faster (and not by that much) would be to implement a custom circular buffer array (NDArray[] Buffer).

```csharp
private System.Threading.ReaderWriterLockSlim SyncRootTrainBatch { get; } = new System.Threading.ReaderWriterLockSlim();
private Queue<NDArray> TrainBatch { get; } = new Queue<NDArray>();

protected NDArray TrainBuffer_Get(out bool GotData)
{
    NDArray Result = null;
    GotData = false;
    try
    {
        SyncRootTrainBatch.EnterUpgradeableReadLock();
        if (TrainBatch.Count == 0)
        {
            // Silly, but NDArray cannot be used with the != null operator
            Result = null;
        }
        else
        {
            try
            {
                SyncRootTrainBatch.EnterWriteLock();
                Result = TrainBatch.Dequeue();
                GotData = true;
            }
            finally
            {
                SyncRootTrainBatch.ExitWriteLock();
            }
        }
    }
    finally
    {
        SyncRootTrainBatch.ExitUpgradeableReadLock();
    }
    return Result;
}

protected void TrainBuffer_Set(NDArray Data)
{
    int NumSamples = Data.shape[0];
    if (NumSamples < 1)
    {
        return;
    }
    try
    {
        SyncRootTrainBatch.EnterWriteLock();
        TrainBatch.Enqueue(Data);
    }
    finally
    {
        SyncRootTrainBatch.ExitWriteLock();
    }
}

protected int TrainBuffer_HasData()
{
    try
    {
        SyncRootTrainBatch.EnterReadLock();
        return TrainBatch.Count;
    }
    finally
    {
        SyncRootTrainBatch.ExitReadLock();
    }
}
```
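As an aside (not part of the original suggestion): on .NET Framework 4+ and .NET Core, the same producer/consumer pattern can also be expressed with the built-in System.Collections.Concurrent.BlockingCollection&lt;T&gt;, which handles the locking and the back-pressure internally. A minimal sketch, where `PrepareBatch` and `NumBatches` are hypothetical placeholders for your own data prep:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;
using NumSharp;

// Bounded capacity gives back-pressure for free: Add() blocks when the buffer
// is full, which replaces the MaxBufferSize / Thread.Sleep polling loop.
var trainBatch = new BlockingCollection<NDArray>(boundedCapacity: 50);

// Producer thread(s): prepare batches and add them.
var producer = Task.Run(() =>
{
    for (int i = 0; i < NumBatches; i++) // NumBatches: placeholder
    {
        NDArray batch = PrepareBatch(i); // PrepareBatch: your data prep, as above
        trainBatch.Add(batch);
    }
    trainBatch.CompleteAdding(); // signal the consumer that no more data is coming
});

// Consumer (training) thread: GetConsumingEnumerable blocks until data is
// available and ends cleanly once CompleteAdding has been called.
foreach (NDArray batch in trainBatch.GetConsumingEnumerable())
{
    // sess.run((Optimizer, Loss), (Input, batch));
}
```

Whether this beats the hand-rolled ReaderWriterLockSlim queue in a given tight loop would need benchmarking, but it removes a fair amount of locking code.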

Your data preparation thread code could be some appropriate variant of this. Note: this example is for infinite-duration training, and Epochs is used to feed the same set of batches multiple times. Adjust this to fit your scenario.

```csharp
// Data preparation thread example

int BatchSize;
List<NDArray> DataBuffer = new List<NDArray>();
int NumBatches = 0;
int Epochs; // Set this to whatever is appropriate

while (IsTraining)
{
    try
    {
        int MaxBufferSize = Config.ServerTensorFlowThreads * 25;
        if (MaxBufferSize < 50)
        {
            MaxBufferSize = 50;
        }
        while (IsTraining && (TrainBuffer_HasData() > MaxBufferSize))
        {
            System.Threading.Thread.Sleep(1);
        }
        if (IsTraining)
        {
            // Do all your data prep here and add the resulting NDArray batches to DataBuffer.
            var frame = cv2.resize(rawFrame, (800, 600)); // rawFrame: your captured source image
            NDArray image_np = frame.reshape(1, frame.shape[0], frame.shape[1], 3);
            DataBuffer.Add(image_np);
            // etc...

            for (int I = 0; I < Epochs; I++)
            {
                foreach (NDArray DataBatch in DataBuffer)
                {
                    TrainBuffer_Set(DataBatch);
                }
            }
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
        System.Threading.Thread.Sleep(1000);
    }
}
```

Your TensorFlow training thread could be some variant of this:

```csharp
// Example tensorflow training thread method

public Operation Optimizer { get; set; }
public Tensor Loss { get; set; }
public Tensor Input { get; set; } // = tf.placeholder(tf.float32, shape: new int[2] { -1, datasize }, name: "Input");

private void TensorflowTrainingThread()
{
    try
    {
        bool GotData;
        NDArray DataBatch = TrainBuffer_Get(out GotData);
        while (!GotData)
        {
            System.Threading.Thread.Sleep(1); // A thread sleep of less than 1 millisecond starts becoming like a spinwait.
            DataBatch = TrainBuffer_Get(out GotData);
        }

        Sess.run((Optimizer, Loss), (Input, DataBatch));
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
}
```

Spinning up the threads for the data preparation could be something like this:

```csharp
// Spinning up the threads for preparing data

m_IsTraining = true;
for (int I = 0; I < 4; I++) // 4 threads - tune this to the number of cores, the data preparation CPU cost, etc...
{
    var TS = new System.Threading.ThreadStart(TrainPrepareData);
    var TrainPrepareThread = new System.Threading.Thread(TS);
    TrainPrepareThread.Start();
}
```

If your GPU is fast enough, or if you are running multiple GPUs, then you could do something like this to feed TensorFlow using multiple threads.

```csharp
// Feed tensorflow using multiple threads or not...
// Depending on your model, feeding it from multiple threads may produce slightly different training results.
// However, unless you are running multiple GPUs, a single feeding thread is usually more than enough.

List<Task> tasks = new List<Task>();
Task Runner;
int MaxRunners = 2; // Concurrent task count - tune this to your setup
if (MaxRunners > 1)
{
    Runner = Task.Run(() => TrainProcessBatch());
    tasks.Add(Runner);
    if (tasks.Count > MaxRunners)
    {
        Task.WaitAll(tasks.ToArray());
        tasks.Clear();
    }
}
else
{
    TrainProcessBatch();
}
```

Hope this helps.

tcwicks avatar Feb 19 '20 10:02 tcwicks

@tcwicks It definitely helps us and other people who are using tf.net. NumSharp should be optimized in terms of performance. Thank you for the complete code sample. It would be great if you could push this code into the example project.

Oceania2018 avatar Feb 19 '20 11:02 Oceania2018

@Oceania2018 I am new to GitHub. Do I have access to push this to the example project?

Also, what I would really like to do is create a modular building-blocks project with various fully functional building blocks like this. I get stuck trying to help with the core TensorFlow internals, but I am quite good at writing this kind of thing instead.

Actually, what I'm currently working on is a replacement for Unity ML-Agents using SciSharp TensorFlow.NET: fully multi-threaded, distributed, and allowing for modular custom brain designs, etc. I ended up here after 5 months of pure frustration with Python.

Also, a request: it would be really nice if we could have an overload of Sess.Run which takes an array of FeedItems but does NOT cast the return result to an NDArray, and instead returns just a float[] or a T[].

The reason is that this way we can completely skip the overhead of NumSharp where NumSharp is not needed or performance is critical.

Sorry, I never said thanks for freeing us non-Python people from Python.

tcwicks avatar Feb 19 '20 11:02 tcwicks

I've invited you to join the SciSharp STACK members. You can fork or create a new branch on tf.net.

Oceania2018 avatar Feb 19 '20 11:02 Oceania2018

I forgot to mention that I use an Anaconda environment, and Anaconda uses the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN): https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide. What about this? Could this be the cause of the speed difference?

Mghobadid avatar Feb 22 '20 01:02 Mghobadid

@Mghobadid were you able to solve your problem? I seem to be experiencing a difference in performance as well.

solarflarefx avatar Mar 05 '20 01:03 solarflarefx

The performance difference is actually quite big: running a rather deep model (~200 layers) can make the compute time go from seconds (Python) to minutes (.NET). Also, using SciSharp.TensorFlow.Redist-Windows-GPU with a GeForce 3080 is a couple of seconds slower than just running SciSharp.TensorFlow.Redist on an overclocked i7.

svenrog avatar Aug 15 '21 19:08 svenrog

@svenrog It will help us if you can narrow down the root cause with sample code provided.

Oceania2018 avatar Aug 15 '21 20:08 Oceania2018