
Sweep time calculation accuracy


I have implemented an rtl_power_fftw backend in QSpectrumAnalyzer (it's in the rtl_power_fftw branch; if you want to try it, you have to select it in Settings, because rtl_power is still the default option). It works great, but one thing is worse than with the original rtl_power: when I specify the time a sweep should take, the real time it takes is always considerably longer than the requested time. The original rtl_power is much more spot on (it only takes longer if I request very long sweeps with many hops).

I have reimplemented your hop-count algorithm in QSpectrumAnalyzer to calculate the per-hop time (in the original rtl_power, the specified time applies to the whole multi-hop sweep, not to a single hop): if the user chooses e.g. 2 s and the sweep is 4 hops long, I simply set the -t option to 0.5 s (specified_time / number_of_hops). Maybe I am doing something wrong with that calculation, but it seems to give the same numbers as your original calculation in rtl_power_fftw.
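
For reference, a minimal sketch of the calculation described above; the hop-count formula and all names are my own illustration, not actual QSpectrumAnalyzer or rtl_power_fftw code:

```cpp
// Hypothetical per-hop time calculation: assumes the number of hops is
// the requested span divided by the usable bandwidth per hop, rounded up.
#include <cmath>
#include <cstdio>

int main() {
    double span_hz      = 8.0e6;  // requested frequency span
    double usable_bw_hz = 2.0e6;  // usable bandwidth per hop (~sample rate)
    double total_time_s = 2.0;    // whole-sweep time requested by the user

    int    hops           = static_cast<int>(std::ceil(span_hz / usable_bw_hz));
    double per_hop_time_s = total_time_s / hops;  // value passed via -t

    std::printf("%d hops, -t %.3f s per hop\n", hops, per_hop_time_s);
}
```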

The best option would be if you could add a way to specify the requested whole-sweep time (like in rtl_power), not only the per-hop time. But I fully understand if you don't want to do that.

xmikos avatar Dec 07 '15 10:12 xmikos

It is true that rtl_power_fftw tends to take longer to finish than specified with the -t option. This is because we use quite large buffers and make sure that we capture at least the specified amount of time, which means we often overrun. The most obvious thing to do is to toggle the -T option too, but this doesn't solve all the delays. You can try playing with smaller buffers - I would like to hear your findings. It does show in the design that we were mostly interested in "minutes", not "seconds". I will think about making changes so that the program finishes slightly early rather than introducing an overrun, and about how to best implement that.

KlemenBlokar avatar Dec 07 '15 17:12 KlemenBlokar

Yeah, it quite shows here that we were thinking of radio astronomy when designing the program. But that does not mean that no improvements can be made. Let me try to enumerate the current "hot spots", i.e., the points that cause the execution time to be longer than the integration time:

  1. Setting up the device. This happens only once in the beginning and does not really add much overhead. This delay is constant per-execution. Not much can be done about it.

  2. Changing the frequency. The PLL takes a while to stabilize and the more frequency hops we have, the longer the cumulative delay will be. This factor is linear with respect to the number of hops (and therefore to the frequency span). This is probably the main reason for exceeding the specified time in the original rtl_power. It is still relatively insignificant compared to other delays.

  3. Reading whole bufferfuls of data. The buffer length must be a multiple of 16384 and the default buffer length is 1638400 (~1.6 M), which amounts to about 0.8 seconds at a sample rate of 2 MHz. So if you choose, for example, an integration time of 1 second, two whole buffers will be filled and the actual acquisition will take ~1.6 seconds. For single-shot and/or long measurements this does not matter; an additional delay of 0.8 seconds in a 30-second measurement is negligible. However, this changes drastically when doing sweeps: for a 100-hop sweep with a nominal integration time of one second per hop, the program will take a quite unexpected 160 seconds (see the sketch below). This could be mitigated by a more careful selection of the buffer length. I think that we should rethink and improve the algorithm for the automatic selection of the buffer length.

  4. Waiting for the FFT thread between frequency hops. At the end of one measurement, the program currently waits for the FFT thread to finish its job before proceeding to the next frequency. So data acquisition is stopped for at least as long as it takes to do the FFT on one whole buffer, and possibly more. In principle, this is not necessary. The data acquisition thread could immediately proceed and feed the FFT thread with the data from the next frequency. In order to do this, we could introduce an additional buffer collection (i.e., another Datastore object) and carefully juggle the two, much in the same way as we currently juggle the buffers within one Datastore. Perhaps it would be best to move the acquisition to its own thread, too, so the main thread only coordinates the efforts, reaping the results when they are available and writing them to the output.

  5. Waiting for the FFT thread to finish the job at the end of the program. This is, again, a more-or-less fixed delay and not much can be done about it.

Points 3 and 4 are certainly the greatest culprits for the difference between the expected duration and the actual run time.
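
To make point 3 concrete, here is a back-of-the-envelope sketch of the whole-buffer rounding. The variable names and the rounding rule are assumptions based on the description above (with the buffer length treated as a sample count, to match the arithmetic in point 3), not the actual rtl_power_fftw logic:

```cpp
// Rounding the requested integration time up to whole buffers.
#include <cmath>
#include <cstdio>

int main() {
    double sample_rate = 2.0e6;    // samples per second
    long   buffer_len  = 1638400;  // default buffer length
    double requested_s = 1.0;      // integration time per hop
    int    hops        = 100;

    double requested_samples = requested_s * sample_rate;
    long   buffers  = static_cast<long>(std::ceil(requested_samples / buffer_len));
    double actual_s = buffers * buffer_len / sample_rate;  // ~1.6 s per hop

    std::printf("per hop: %.2f s instead of %.2f s; whole sweep: %.0f s\n",
                actual_s, requested_s, hops * actual_s);
}
```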

alajovic avatar Dec 08 '15 20:12 alajovic

Thinking again,

  3. Reading whole bufferfuls of data. [...] I think that we should rethink and improve the algorithm for the automatic selection of the buffer length.

this approach again goes down the path of greatest common divisors and/or similarly complicated solutions. We've been there and we threw it out for a reason. Perhaps we could solve this better by allowing the final buffer to be only partially filled. We could even simply resize the final buffer; the current program logic should handle that with no additional modifications. If I recall correctly, std::vector::resize() does not change the capacity of the vector, i.e., the memory is not reclaimed, and no reallocation is necessary if the vector is later resized back to its previous size. So this should not even represent a performance hit.
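
To double-check that recollection, a standalone demonstration (not rtl_power_fftw code) of how std::vector::resize() interacts with capacity():

```cpp
// Shrinking a vector does not reduce its capacity, so growing it back
// to the previous size reuses the same storage with no reallocation.
#include <cstdio>
#include <vector>

int main() {
    std::vector<short> buf(1638400);   // full-size buffer
    std::printf("capacity after construction: %zu\n", buf.capacity());

    buf.resize(100000);                // short final buffer
    std::printf("capacity after shrinking:    %zu\n", buf.capacity());

    const short* before = buf.data();
    buf.resize(1638400);               // back to full size
    std::printf("same storage reused: %s\n",
                buf.data() == before ? "yes" : "no");
}
```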

alajovic avatar Dec 09 '15 21:12 alajovic

I was thinking in that direction too, but I also think that modifying our current strategy could bring improvements. For short measurements (up to 1.6 MB of data), we currently have an essentially sequential workflow: read everything, then do the FFT. As we wait for the FFT before retuning, we get a huge delay (less so on really beefy machines with great single-thread performance). If we rethink our approach, treat as "short" sessions everything up to ~4.8 MB or so, and divide those into 2-3 buffers, we might be more efficient. Going even further, we could complicate the threading: why not use more threads for the FFT if more cores are available? Even raspberries grow in bunches these days... ;-) But it would be a difficult thing to get those threads to cooperate nicely. An "oversight committee" thread would be necessary: it would command the rtlsdr thread and the FFT threads, most probably condition the data for the FFT (and divide it among the threads), then collect the results, combine them and write the output. It would basically be a complete rewrite of the program logic, but hey, it wouldn't be the first time that's needed for a program to evolve.
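
Just to sketch that idea (all names invented for illustration; the FFT itself is stubbed out, and this is not how rtl_power_fftw is currently structured): one acquisition thread feeds a shared queue, several worker threads drain it, and the main thread only starts and joins everything:

```cpp
#include <algorithm>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<std::vector<short>> work_queue;  // buffers awaiting the FFT
std::mutex              queue_mutex;
std::condition_variable queue_cv;
bool acquisition_done = false;

void fft_worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(queue_mutex);
        queue_cv.wait(lock, [] { return !work_queue.empty() || acquisition_done; });
        if (work_queue.empty())
            return;                          // acquisition finished, queue drained
        std::vector<short> buffer = std::move(work_queue.front());
        work_queue.pop();
        lock.unlock();
        // ... run the FFT on `buffer` and accumulate the spectrum here ...
    }
}

void acquire(int buffers) {
    for (int i = 0; i < buffers; ++i) {
        std::vector<short> buffer(16384);    // stand-in for one rtlsdr read
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            work_queue.push(std::move(buffer));
        }
        queue_cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(queue_mutex);
        acquisition_done = true;
    }
    queue_cv.notify_all();
}

int main() {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back(fft_worker);
    acquire(100);
    for (auto& w : workers) w.join();
}
```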

KlemenBlokar avatar Dec 10 '15 10:12 KlemenBlokar

Hi all. If you decide to go with the multi-threaded approach, this info could be interesting:

http://www.fftw.org/doc/Multi_002dthreaded-FFTW.html

http://www.fftw.org/fftw2_doc/fftw_4.html

mariocannistra avatar Dec 10 '15 10:12 mariocannistra

Thank you very much for this info. However, if I understand it correctly, we might be better off with our own solution for threading: our problem usually consists of a lot of small FFTs whose results we then average, rather than one big FFT. This means we can simultaneously launch multiple single-threaded FFTs that have no parallelization overhead and then combine the results. I sense that this is the faster approach here, but thank you very much for the pointer anyway - I did not know about this, and it is good to know the option exists.
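
For illustration, a minimal sketch of that per-thread-plan idea (assumed, not actual rtl_power_fftw code). One FFTW detail matters here: plan creation is not thread-safe, so all plans are created up front in the main thread, while fftw_execute() on distinct plans may safely run concurrently:

```cpp
// Build with: g++ fft_pool.cpp -lfftw3 -lpthread
#include <fftw3.h>
#include <thread>
#include <vector>

int main() {
    const int N         = 4096;  // FFT length (number of bins)
    const int n_threads = 4;

    struct Job {
        fftw_complex* in;
        fftw_complex* out;
        fftw_plan     plan;
    };
    std::vector<Job> jobs(n_threads);

    // Serial section: allocate buffers and create one plan per thread.
    for (Job& j : jobs) {
        j.in   = fftw_alloc_complex(N);
        j.out  = fftw_alloc_complex(N);
        j.plan = fftw_plan_dft_1d(N, j.in, j.out, FFTW_FORWARD, FFTW_ESTIMATE);
    }

    // Parallel section: each thread executes only its own plan.
    std::vector<std::thread> workers;
    for (Job& j : jobs)
        workers.emplace_back([&j] {
            // ... fill j.in with the next chunk of samples ...
            fftw_execute(j.plan);  // thread-safe on distinct plans
            // ... accumulate |j.out|^2 into the averaged spectrum ...
        });
    for (auto& w : workers) w.join();

    for (Job& j : jobs) {
        fftw_destroy_plan(j.plan);
        fftw_free(j.in);
        fftw_free(j.out);
    }
}
```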

KlemenBlokar avatar Dec 10 '15 12:12 KlemenBlokar

I have looked into this a bit. The situation with my point no. 3 (above) is not as bad as I recalled. Still, I think that the solution I proposed (the short-final-buffer approach) is viable, and I am implementing it right now. This will also allow us to simplify the buffer length heuristic.

@xmikos: do you have a particular case (frequency span + integration time + sample rate) where the sweep is much longer than you would expect? If yes, could you please paste here the corresponding command line parameters? This would help a lot with profiling/testing.

alajovic avatar Dec 13 '15 22:12 alajovic

@xmikos: is this issue still relevant? I never heard back from you after my last question, and now literally years have gone by, so I reckon that it might be time to close it.

alajovic avatar Aug 25 '18 23:08 alajovic