mimic1
Request: Test on a raspberry pi unit needed (with faster compilation)
Hi,
One of the most common complaints about mimic is the long compile time it requires. The main cause is that the mycroft voice is compiled into the mimic binary instead of being loaded at runtime from a file.
We don't load the voice from a file at runtime because it is too slow. However, if we were able to improve the voice loading functions, we could stop embedding it at compilation time. So far, @forslund has made some improvements in #85, but there is still room for improvement.
I need someone to test a command that loads the mycroft voice from file. Then that person needs to compile mimic with a patch that may improve voice loading performance slightly and then check if there is a significant improvement or not.
- Download and compile the development version
We will disable all the embedded voices (with --disable-voices-all) to make compilation much faster:
git clone https://github.com/MycroftAI/mimic.git
cd mimic
git checkout development
./configure --disable-voices-all
make
- Test the timings (copy the output of this command)
time ./mimic -voice voices/mycroft_voice_4.0.flitevox -t "" a.wav
- Clean up:
make distclean
- Try the patched version:
git remote add zeehio https://github.com/zeehio/mimic.git
git checkout zeehio/cg_maybe_faster_load
./configure --disable-voices-all
make
- Test the patched version (copy the output of this command)
time ./mimic -voice voices/mycroft_voice_4.0.flitevox -t "" a.wav
Thanks to anyone who can help with this.
Cool, I'll see if I can find my raspberry Pi.
I'm working on a memory pool allocator for all the small allocs, but I believe more in this approach.
What sort of time improvements would be expected? I assume you tested on your own PC?
The number of calls to cst_safe_alloc decreased from 797956 calls to 294710. So after this commit we do only 37% of the allocations previously needed to load the mycroft voice.
I will check the timings. I believe it was something like 0.2s before and 0.15s after, but I am not sure.
That's a decent improvement. I seem to have my Pi packed away still so it'll take me a while to find it...
I was worried about the disk cache skewing the results, so I tried again. The first run of each case shows the performance without any cache, and you can see (at the end of this message) that the timing drops from 0.74s to 0.67s, only about a 9% improvement.
Thinking a bit more, my expectations for the raspberry pi are not that high: reading the rpi forums, it seems the SD read speed on the pi is about 40-50MB/s. The mycroft_voice_4.0.flitevox file is 69MB, so there are 69/40 ≈ 1.7 seconds we won't be able to avoid in any way.
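The estimate above is just file size divided by read throughput. As a rough sketch (the helper name min_read_seconds is made up for illustration, and the 40-50MB/s range is the forum figure quoted above, not something measured here):

```shell
# min_read_seconds SIZE_MB SPEED_MB_PER_S
# Hypothetical helper: lower bound on the time needed to read a file of
# SIZE_MB megabytes from storage that delivers SPEED_MB_PER_S.
min_read_seconds() {
    LC_ALL=C awk -v size="$1" -v speed="$2" \
        'BEGIN { printf "%.1f\n", size / speed }'
}

min_read_seconds 69 40   # slow end of the quoted range
min_read_seconds 69 50   # fast end of the quoted range
```

Even at the fast end of that range, the uncached read alone costs well over a second on the pi, before any parsing of the voice file starts.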
If disk reading is the limiting factor, then using ./mimic with all the voices embedded should be even slower, because the mimic binary needs to be read from disk to RAM. The fairest comparison would be:
- All voices embedded, no flitevox file loaded (this is how things are now):
./configure && make && time ./mimic -t "" a.wav
- No voices embedded, use flitevox file (this is how things would be):
./configure --disable-voices-all && make && time ./mimic -voice voices/mycroft_voice_4.0.flitevox -t "" a.wav
I am afraid some previous benchmarks may have been done with both embedded voices and loading voices from file, and that is the most unfair situation as we have both a flitevox file and a large mimic binary to read.
Real numbers will be very welcome.
Without optimization:
1st run:
- real 0m0.742s
- user 0m0.272s
- sys 0m0.024s
2nd run:
- real 0m0.275s
- user 0m0.240s
- sys 0m0.032s
3rd run:
- real 0m0.274s
- user 0m0.248s
- sys 0m0.024s
With optimization:
1st run:
- real 0m0.674s
- user 0m0.108s
- sys 0m0.048s
2nd run:
- real 0m0.151s
- user 0m0.112s
- sys 0m0.036s
3rd run:
- real 0m0.150s
- user 0m0.120s
- sys 0m0.028s
I'm rebuilding my raspberry pi image at the moment. I think my conclusion when I checked into this was the same. Disk I/O was the large issue.
That said, this seems to be an improvement in any case.
If all else fails, pymimic runs OK keeping the voice file in memory.
Oh yes, pymimic should be the way to go.
Has anyone considered compiling mimic using a "Unity Build" (single compilation unit)? The compilation speed benefits are pretty large, because it reduces redundant compilation. Being IO bound is rarely the problem; mostly it is compiling all of those redundant includes, and if your files are as big as some of the mimic voice data files are, I can see why that could add up pretty quickly. Some say they have cut their times by 90%. My guess is that mileage may vary based on project, but from what others have said and my own experience with unity builds, 50% seems to be the lower bound.
I have noticed that there seem to be mixed feelings about unity builds online. Some people love them, some people think they are a hack. (Seems no more of a hack to me than using bash scripts or makefiles.) It looks like most of the naysaying comes from people who ran into problems on C++ codebases that use a lot of C++ features (namespaces, templates, etc.), as the conventional methods of building C++ seem to assume the use of multiple compilation units. I haven't really heard anything bad about using it with C, other than that some crazy macros will be even more crazy if you're not careful. I've heard some people say it makes your code unmaintainable (lol, I have also heard that functions with more than 10 lines of code are unmaintainable, that's the internet for you), but I have seen codebases that put everything into one compilation unit and are totally fine (although I think it helps if the project starts development that way, since switching to it later means touching quite a few places). I have heard of people running out of memory when compiling this way. I've never seen it happen, but as rare as it may be on a desktop, I thought I would mention it since compiling on a Raspberry Pi is a requirement in this case.
Your thoughts? Are there any other promising methods that you know of for reducing the total build time?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/MycroftAI/mimic/issues/122#issuecomment-307821085, or mute the thread https://github.com/notifications/unsubscribe-auth/ACp7HOooNu7oUoW2MbdMAOf7DWVrEbDqks5sDVZZgaJpZM4N29qY .
Yes, I considered unity builds after learning about them in the meson build system.
Mimic compilation issues on the raspberry pi are:
- high ram usage due to very large compilation units
- slow compilation of the Mycroft voice due to large compilation units.
The large compilation units are caused by huge structures that are compiled once.
Using a unity build on a raspberry pi for building mimic will likely make the pi system run out of memory because we would make one super huge compilation unit combining all the huge structures.
The best way to reduce the compilation time is to avoid compiling those large structures in the voice files and make mimic use .flitevox files. In order to do that, we must ensure that loading the voice from a file is competitive in speed with embedding the voice at compile time. That is what we are working on right now.
Sorry for the delay, my Pi was on the fritz and these tests have been run on a rPi 2 instead of a 3. The results look good, with a noticeable speed increase. It will have a bit less effect on the Raspberry Pi 3 since it has a faster CPU; I have no idea how much, though.
Results:
High load cached
Zeehio faster
real 0m8.496s
user 0m1.820s
sys 0m0.680s
Development
real 0m12.882s
user 0m2.890s
sys 0m0.730s
Low Load Cached
Zeehio faster
real 0m2.693s
user 0m1.850s
sys 0m0.650s
Development
real 0m3.881s
user 0m2.670s
sys 0m0.730s
Low Load Uncached
Zeehio faster
real 0m5.159s
user 0m1.780s
sys 0m0.980s
Development
real 0m6.025s
user 0m2.590s
sys 0m1.120s
I cleared the cache with the following snippet: free && sync && echo 3 > /proc/sys/vm/drop_caches && free
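One practical note on the snippet above: writing to /proc/sys/vm/drop_caches requires root, and `sudo echo 3 > /proc/sys/vm/drop_caches` does not work because the redirection is performed by the unprivileged shell. A minimal sketch of a wrapper follows; the function name drop_page_cache and its optional target argument are hypothetical (the argument exists only so the function can be tried against a scratch file).

```shell
# drop_page_cache [TARGET]
# Hypothetical helper: flush dirty pages, then ask the kernel to drop the
# page cache, dentries and inodes by writing 3 to TARGET
# (default: /proc/sys/vm/drop_caches). Must run as root for the default.
drop_page_cache() {
    target="${1:-/proc/sys/vm/drop_caches}"
    sync    # write dirty pages to disk first so they can be dropped
    if [ -w "$target" ]; then
        echo 3 > "$target"
    else
        echo "cannot write $target: run this as root" >&2
        return 1
    fi
}

# Intended use from a root shell, before each "uncached" timing run:
#   drop_page_cache && time ./mimic -voice voices/mycroft_voice_4.0.flitevox -t "" a.wav
```

From a normal user account, `sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'` does the same thing in one line.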
That's great!
I think I may be able to shave a few more seconds by changing how the CART trees are serialized. But it will take a while.
Just to know the details... did you compile mimic with --disable-voices-all in any of those cases?
./configure --disable-voices-all was the command line I used.
Here are some uncached times for the Pi3. Not as big a boost but noticeable!
Pi3: Zeehio go faster
real 0m4.553s
user 0m0.990s
sys 0m0.360s
Pi3: development
real 0m5.578s
user 0m1.500s
sys 0m0.370s
Good to know that the change is noticeable. There is still room for improvement in the flite serialization of CART trees. I hope to be able to improve it in the future.
By the way, talking about improvements... has anyone explored different CFLAGS? The combination of -O3 with -ffast-math gave a 27% improvement in speech synthesis time on my workstation.
This test may take a while on a pi, because synthesizing doc/alice gives a 1h long wav file. You may want to use only 10% of the doc/alice document for testing on a pi (10% of doc/alice should be long enough to measure the differences).
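One way to cut the document down is to keep only its first lines. This is a sketch: the helper name head_fraction and the output file name alice_10pct.txt are made up for illustration, and cutting by line count may end mid-sentence, which should not matter for a timing test.

```shell
# head_fraction FILE PERCENT
# Hypothetical helper: print the first PERCENT percent of FILE's lines.
head_fraction() {
    total=$(wc -l < "$1")            # number of lines in the file
    keep=$(( total * $2 / 100 ))     # integer number of lines to keep
    head -n "$keep" "$1"
}

# Intended use on the pi (file names other than doc/alice are placeholders):
#   head_fraction doc/alice 10 > alice_10pct.txt
#   time ./mimic -voice voices/mycroft_voice_4.0.flitevox -f alice_10pct.txt test.wav
```

Using the same shortened file for every build keeps the timings comparable with each other, although not with the full doc/alice numbers.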
Before O3 fast-math
./configure --disable-voices-all
make
time ./mimic -voice voices/mycroft_voice_4.0.flitevox -f doc/alice test_no_ffastmath_O3.wav
real 1m25.348s
user 1m25.132s
sys 0m0.212s
After O3 fast-math
./configure --disable-voices-all CFLAGS="-O3 -ffast-math"
make
time ./mimic -voice voices/mycroft_voice_4.0.flitevox -f doc/alice test_ffastmath_O3.wav
real 1m2.789s
user 1m1.976s
sys 0m0.280s
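To turn the two `time` outputs above into a percentage, the values need converting from the XmY.YYYs format to seconds first. A small sketch (the helper names to_seconds and pct_reduction are made up for illustration):

```shell
# to_seconds "1m25.348s" -> 85.348
# Hypothetical helper: convert the XmY.YYYs format printed by `time`
# into plain seconds.
to_seconds() {
    echo "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# pct_reduction BEFORE AFTER
# Hypothetical helper: percentage of BEFORE saved by AFTER.
pct_reduction() {
    LC_ALL=C awk -v b="$(to_seconds "$1")" -v a="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", (b - a) / b * 100 }'
}

pct_reduction 1m25.132s 1m1.976s   # user time: prints 27.2
pct_reduction 1m25.348s 1m2.789s   # real time: prints 26.4
```

The 27% figure quoted above matches the reduction in user time; measured on wall-clock time it is slightly lower.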
Extra possible optimization (only on the pi3, not pi2, not pi1, not pi0), flags from here:
./configure --disable-voices-all CFLAGS="-O3 -ffast-math -mcpu=cortex-a53 -mfpu=neon-fp-armv8"
make
time ./mimic -voice voices/mycroft_voice_4.0.flitevox -f doc/alice test_ffastmath_O3_cpu_fpu.wav
@forslund Can you still do some testing on your pi3? I don't have one.
Sure thing, now I have mine set up in a good way for testing. It might have to wait until tomorrow morning.
I just edited the post to add --disable-voices-all, otherwise you will spend an awful lot of time compiling mimic... Oh, and with --disable-voices-all it is safe to use make -j4.
yeah, I never build the voices if I can help it =) -j4 was a good tip though
I don't know if as of today -march=native includes the fpu/CPU specific optimizations. As you can see from the link I gave, it was not the case in April 2016.
Anyway, proper CFLAGS are something that usually is handled by the distribution packagers and not us, because there are thousands of combinations of compilers, architectures, cross-compilation scenarios and use cases.
Let's see if that has any kind of impact first and document it later. If it does have an impact, we can suggest that mycroft-core and other packagers change their build flags.
On 28 June 2017 at 5:12 a.m., "el-tocino" [email protected] wrote:
does -march=native include the f/cpu-specific optimizations (other than ffast-math)? Would be slightly more compatible for other systems than just pi ARM types that way. Elsewise, putting a cpu check in the build script might be possible if they're explicit.
pi3, picroft .8.16
time ./mimic -voice voices/mycroft_voice_4.0.flitevox -f doc/alice test_ffastmath_O3.wav
cg_maybe_faster_load:
real 12m33.637s
user 12m28.250s
sys 0m3.530s
Same branch as above with pi3 copts in place (./configure --disable-voices-all CFLAGS="-O3 -ffast-math -mcpu=cortex-a53 -mfpu=neon-fp-armv8"):
real 10m48.887s
user 10m42.190s
sys 0m4.180s
Default mycroftai mimic with pi3 copts:
real 11m54.623s
user 11m45.990s
sys 0m6.410s
My (development branch)
normal
real 17m6.581s
user 17m3.220s
sys 0m3.100s
-O3 -ffast-math
real 16m18.952s
user 16m15.620s
sys 0m3.260s
-O3 -mcpu=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard -funsafe-math-optimizations
real 16m23.169s
user 16m19.900s
sys 0m3.210s
-O3 -ffast-math -mcpu=cortex-a53 -mfpu=neon-fp-armv8
real 16m36.796s
user 16m33.260s
sys 0m3.420s
Tried -Ofast as well:
real 16m38.909s
user 16m35.770s
sys 0m3.070s
Ran again with SLT voice and writing to ramdisk:
pi@picroft:~/Build/mimic $ time ./mimic -voice voices/cmu_us_slt.flitevox -f doc/alice /ram/test.wav
cg_maybe_faster_load
real 4m55.212s
user 4m53.290s
sys 0m1.090s
cg_maybe w/copts
real 4m56.311s
user 4m54.660s
sys 0m1.040s
mycroft w/copts
real 4m42.239s
user 4m39.870s
sys 0m1.120s
The difference between the voices is kind of expected. The sampling rate of the Mycroft voice is 44100Hz while the sampling rate of slt is 16000Hz, so we synthesize about three times more samples with the Mycroft voice. In any case, all those scenarios are several times faster than real-time synthesis.
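A quick check of both claims, as a sketch: the helper name ratio is made up, the one-hour duration is the approximate figure given earlier in the thread for doc/alice with the Mycroft voice, and 753.637 s is the 12m33.637s pi3 run converted to seconds.

```shell
# ratio A B
# Hypothetical helper: print A / B with two decimals.
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 44100 16000    # samples per second of audio, Mycroft voice vs slt
ratio 3600 753.637   # approx. audio seconds per second of synthesis on the pi3
```

The first number is the "about three times" factor; the second is a rough real-time factor that is only as accurate as the one-hour estimate.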
Didn't want the sd card or io to be the limiting factor. Also I use slt normally. :)
I'm currently installing gcc-6 (.2 I think) to see if any improvements have been made since 4.9. I don't have very high hopes but it might be worth a try.
Did a quick profiling run on my PC. These are the top 4 CPU-hogging functions according to gprof:
64.48 66.43 66.43 62197740 0.00 0.00 mlsadf
11.79 78.58 12.15 251 48.41 360.97 synthesis_body
11.67 90.60 12.02 565936 0.02 0.02 b2en
4.77 95.51 4.91 14525906 0.00 0.00 internal_ff.constprop.1
I think mlsadf1 and mlsadf2 are inlined by the compiler and are hence not shown separately.