NeMo-text-processing
Sparrowhawk slower than Python implementation
Describe the bug
As suggested, I exported the grammars and ran the normalizer with Sparrowhawk, but it actually takes even longer than the Python implementation. Wall-clock time by number of utterances (n_utts) for Sparrowhawk:
100: 0m27.430s
50: 0m15.666s
10: 0m4.690s
For Python:
100: 11s
50: 5.6s
10: 0.85s
The time taken by Sparrowhawk also seems a bit non-linear in the number of utterances.
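To make the comparison easier to read, here is a small sketch that turns the timings reported above into per-utterance cost and a slowdown ratio. The numbers are copied from this report; the function and variable names are only illustrative.

```python
# Timings copied from the report: utterance count -> wall-clock seconds.
sparrowhawk = {10: 4.690, 50: 15.666, 100: 27.430}
python_impl = {10: 0.85, 50: 5.6, 100: 11.0}


def per_utterance(timings):
    """Average seconds per utterance for each batch size."""
    return {n: t / n for n, t in timings.items()}


def slowdown(slow, fast):
    """How many times longer `slow` took than `fast` at each batch size."""
    return {n: slow[n] / fast[n] for n in slow}


sh_cost = per_utterance(sparrowhawk)
py_cost = per_utterance(python_impl)
ratio = slowdown(sparrowhawk, python_impl)

for n in sorted(sparrowhawk):
    print(
        f"{n:>4} utts: sparrowhawk {sh_cost[n]:.3f} s/utt, "
        f"python {py_cost[n]:.3f} s/utt, ratio {ratio[n]:.2f}x"
    )
```

The Sparrowhawk per-utterance cost drops from roughly 0.47 s at 10 utterances to roughly 0.27 s at 100. That would be consistent with a fixed startup cost being amortized over more input, although three data points are not enough to confirm it.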
Steps/Code to reproduce bug
Exported my custom grammar and ran the Sparrowhawk docker. There was another issue reporting this slowdown: https://github.com/NVIDIA/NeMo-text-processing/issues/82
Expected behavior
C++ is supposed to be faster.
Environment overview (please complete the following information)
- Environment location: GCP
- Method of NeMo install: poetry
Environment details
Additional context
@anand-nv could you please comment on this?
Can you provide the steps you are following to evaluate? Providing the Python scripts and Sparrowhawk code snippets used for benchmarking and performing ITN/TN would be useful.
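For Python, I'm simply initializing the text normalizer and running it in a for loop:
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case='cased',
    lang='en',
    whitelist='path/to/whitelist.tsv',
    overwrite_cache=False,
    cache_dir='./assets/'
)
# 'path/to/input.txt' is a placeholder for the file of utterances, one per line
with open('path/to/input.txt') as f:
    for line in f:
        line = normalizer.normalize(line.strip(), punct_pre_process=True, punct_post_process=True, verbose=True)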
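For a fairer comparison, it may help to time only the normalization calls and keep initialization out of the measurement. Below is a minimal sketch of such a harness. `time_calls` and `fake_normalize` are hypothetical names, and `fake_normalize` is only a stand-in so the example runs without NeMo installed.

```python
import time


def time_calls(fn, lines):
    """Apply fn to each line; return (outputs, total_seconds, per_call_seconds)."""
    outputs, per_call = [], []
    for line in lines:
        start = time.perf_counter()
        outputs.append(fn(line))
        per_call.append(time.perf_counter() - start)
    return outputs, sum(per_call), per_call


def fake_normalize(text):
    """Stand-in for normalizer.normalize; it does no real normalization."""
    return text.strip().lower()


outputs, total, per_call = time_calls(fake_normalize, ["Hello World\n", "It costs $5\n"])
print(f"{len(per_call)} lines in {total:.6f} s")
```

With the real normalizer, the callable could be something like `lambda s: normalizer.normalize(s, punct_pre_process=True, punct_post_process=True)`. The per-call list also makes it possible to spot a few slow utterances dominating the total.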
Sparrowhawk
bash export_grammars.sh --GRAMMARS=tn_grammars --LANGUAGE=en --OVERWRITE_CACHE=true --WHITELIST path/to/whitelist.tsv --INPUT_CASE=cased --MODE=interactive
Then, in the Docker container, I replaced test.txt with my own text and ran:
time normalizer_main --config=sparrowhawk_configuration.ascii_proto --multi_line_text < test.txt > results.txt
I also modified normalizer_main.cc to print out the actual time taken in the loop
const auto normalize_start = std::chrono::steady_clock::now();
for (const auto& sentence : sentences) {
  string output;
  normalizer->Normalize(sentence, &output);
  std::cout << output << std::endl;
}
const auto normalize_end = std::chrono::steady_clock::now();
const auto normalize_time = std::chrono::duration_cast<std::chrono::milliseconds>(
    normalize_end - normalize_start).count();
std::cerr << "Time taken to normalize: " << normalize_time << " milliseconds" << std::endl;
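The line printed to stderr by the modified loop can be combined with the wall-clock time from `time` to estimate how much is spent outside the loop (loading grammars, process startup). A rough sketch, assuming the stderr format shown in the snippet above; the helper names are hypothetical, and the sample values are the approximate figures quoted in this thread for 100 utterances.

```python
import re

# Matches the message emitted by the modified normalizer_main.cc loop.
LOOP_TIME_RE = re.compile(r"Time taken to normalize: (\d+) milliseconds")


def parse_loop_ms(stderr_text):
    """Return the loop time in milliseconds, or None if the line is absent."""
    match = LOOP_TIME_RE.search(stderr_text)
    return int(match.group(1)) if match else None


def startup_overhead_ms(wall_seconds, stderr_text):
    """Wall-clock time minus loop time, in milliseconds (None if unparsable)."""
    loop_ms = parse_loop_ms(stderr_text)
    if loop_ms is None:
        return None
    return wall_seconds * 1000.0 - loop_ms


# Approximate values from the thread: 27.430 s wall clock, about 26.8 s in the loop.
sample_stderr = "Time taken to normalize: 26800 milliseconds\n"
overhead = startup_overhead_ms(27.430, sample_stderr)
print(f"estimated time outside the loop: {overhead:.0f} ms")
```

If the overhead stays small while the loop time dominates, the slowdown is in normalization itself rather than in loading the grammars.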
Do you have the "actual time estimates" for the C++ implementation normalizer_main.cc?
I don't have the numbers or the Docker container open anymore, but like I said, it's always around 600 ms less than the time reported by bash. So it was about 100: 26.8s, 50: 15s, 10: 4s, which is why I assume the init time was around 600 ms.
Are you using the Dockerfile provided here for building Sparrowhawk? If so, can you try adding 'CXXFLAGS' and 'CFLAGS' to ./configure and rebuilding the Docker image: ./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w'
I see, let me try. Thanks.
I got this error trying to compile:
79.22 libtool: link: g++ -g -O2 -w -std=c++11 -o .libs/normalizer_main normalizer_main.o ../lib/.libs/libsparrowhawk.so -L/usr/local/lib/fst -lthrax -lfstfar -lfst -lm -ldl -lprotobuf -lre2
79.29 ../lib/.libs/libsparrowhawk.so: undefined reference to `fst::internal::DenseSymbolMap::Find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
79.29 collect2: error: ld returned 1 exit status
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.
The Docker container used for Sparrowhawk is not optimized for production. You can build your own Docker container and compile openfst-1.7.9, thrax-1.3.4 and Sparrowhawk (https://github.com/anand-nv/sparrowhawk/tree/nemo_tests) with CXXFLAGS='-g -O2'.
The time taken to run my custom test set with the existing (non-optimized) Docker container is:
real 0m14.396s
user 0m14.355s
sys 0m0.020s
Time taken to run in a container with openfst/thrax/sparrowhawk compiled with -g -O2 is
real 0m1.191s
user 0m1.179s
sys 0m0.012
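The two `real` figures above can be turned into a speedup factor. A minimal sketch, assuming the `XmY.YYYs` format printed by bash's `time`; `to_seconds` is a hypothetical helper, and it tolerates a missing trailing `s`.

```python
import re

# Minutes, then seconds, with an optional trailing "s" (e.g. "0m14.396s").
TIME_RE = re.compile(r"(\d+)m([\d.]+)s?")


def to_seconds(stamp):
    """Convert a bash `time` value such as '0m14.396s' to seconds."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


unoptimized = to_seconds("0m14.396s")  # existing container
optimized = to_seconds("0m1.191s")     # built with -g -O2
speedup = unoptimized / optimized
print(f"speedup: {speedup:.2f}x")
```

On these numbers the optimized build is roughly 12 times faster on the same test set.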
Thanks, I will try this again
I thought you meant you built the Docker container with different versions, but it turns out they are the same versions as in your Dockerfile? As before, I'm running into build issues, so I can't even test the speed. I then tried specifying the versions of thrax and openfst, but still got the same error. This is the Dockerfile:
# set base image (host OS)
FROM conda/miniconda3
# set the working directory in the container
WORKDIR /workspace
# install dependencies
RUN echo "deb http://archive.debian.org/debian stretch main contrib non-free" > /etc/apt/sources.list
RUN conda install conda-build -y
RUN apt-get update && apt-get install -y --reinstall build-essential pkg-config && apt-get upgrade -y && apt-get install -y git && apt-get install make
RUN git clone https://github.com/google/re2
RUN cd re2 && git checkout tags/2022-02-01 && make && make install
RUN apt-get install build-essential -y && apt-get install wget -y
RUN wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
RUN tar xzvf protobuf-2.5.0.tar.gz
RUN cd protobuf-2.5.0 && ./configure && make && make install && ldconfig
RUN conda install -c conda-forge thrax=1.3.4 openfst=1.7.9 -y
RUN git clone https://github.com/anand-nv/sparrowhawk.git && cd sparrowhawk && git checkout nemo_tests && apt-get install -y autoconf && bash autoreconf && ./configure CXXFLAGS='-g -O2' && make && make install && ldconfig
RUN git clone https://github.com/kward/shunit2.git
RUN echo "DONE"
79.35 make[2]: Entering directory '/workspace/sparrowhawk/src/bin'
79.35 g++ -DPACKAGE_NAME=\"Sparrowhawk\" -DPACKAGE_TARNAME=\"sparrowhawk\" -DPACKAGE_VERSION=\"1.0.0\" -DPACKAGE_STRING=\"Sparrowhawk\ 1.0.0\" -DPACKAGE_BUGREPORT=\"[email protected]\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sparrowhawk\" -DVERSION=\"1.0.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -I./../include -funsigned-char -g -O2 -std=c++11 -MT normalizer_main.o -MD -MP -MF .deps/normalizer_main.Tpo -c -o normalizer_main.o normalizer_main.cc
81.54 mv -f .deps/normalizer_main.Tpo .deps/normalizer_main.Po
81.54 /bin/bash ../../libtool --tag=CXX --mode=link g++ -g -O2 -std=c++11 -o normalizer_main normalizer_main.o ../lib/libsparrowhawk.la -L/usr/local/lib/fst -lthrax -lfstfar -lfst -lm -ldl -lprotobuf -lre2
81.63 libtool: link: g++ -g -O2 -std=c++11 -o .libs/normalizer_main normalizer_main.o ../lib/.libs/libsparrowhawk.so -L/usr/local/lib/fst -lthrax -lfstfar -lfst -lm -ldl -lprotobuf -lre2
81.72 ../lib/.libs/libsparrowhawk.so: undefined reference to `fst::internal::DenseSymbolMap::Find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
81.72 collect2: error: ld returned 1 exit status
81.72 make[2]: *** [normalizer_main] Error 1
81.72 Makefile:386: recipe for target 'normalizer_main' failed
81.72 make[2]: Leaving directory '/workspace/sparrowhawk/src/bin'
81.72 Makefile:352: recipe for target 'all-recursive' failed
81.72 make[1]: Leaving directory '/workspace/sparrowhawk/src'
81.72 make[1]: *** [all-recursive] Error 1
81.72 Makefile:383: recipe for target 'all-recursive' failed
81.72 make: *** [all-recursive] Error 1
which happens in this step:
RUN git clone https://github.com/anand-nv/sparrowhawk.git && cd sparrowhawk && git checkout nemo_tests && apt-get install -y autoconf && bash autoreconf && ./configure CXXFLAGS='-g -O2' && make && make install && ldconfig
I also tried building thrax from source but that led to another error:
# install dependencies
RUN echo "deb http://archive.debian.org/debian stretch main contrib non-free" > /etc/apt/sources.list
RUN conda install conda-build -y
RUN apt-get update && apt-get install -y --reinstall build-essential pkg-config && apt-get upgrade -y && apt-get install -y git && apt-get install make
RUN git clone https://github.com/google/re2
RUN cd re2 && git checkout tags/2022-02-01 && make && make install
RUN apt-get install build-essential -y && apt-get install wget -y
RUN wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
RUN tar xzvf protobuf-2.5.0.tar.gz
RUN cd protobuf-2.5.0 && ./configure && make && make install && ldconfig
# RUN conda install -c conda-forge thrax=1.3.4 openfst=1.7.9 -y
RUN wget https://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.7.9.tar.gz && tar xzvf openfst-1.7.9.tar.gz
RUN cd openfst-1.7.9 && ./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w' && make && make install
RUN wget https://www.openfst.org/twiki/pub/GRM/ThraxDownload/thrax-1.3.4.tar.gz && tar xzvf thrax-1.3.4.tar.gz
RUN cd thrax-1.3.4 && ./configure CXXFLAGS='-g -O2' && make && make install
RUN git clone https://github.com/anand-nv/sparrowhawk.git && cd sparrowhawk && git checkout nemo_tests && apt-get install -y autoconf && bash autoreconf && ./configure CXXFLAGS='-g -O2' && make && make install && ldconfig
RUN git clone https://github.com/kward/shunit2.git
RUN echo "DONE"
3.475 checking fst/extensions/far/far.h presence... no
3.493 checking for fst/extensions/far/far.h... no
3.493 configure: error: fst/extensions/far/far.h header not found
------
Dockerfile:38
--------------------
36 | RUN cd openfst-1.7.9 && ./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w' && make && make install
37 | RUN wget https://www.openfst.org/twiki/pub/GRM/ThraxDownload/thrax-1.3.4.tar.gz && tar xzvf thrax-1.3.4.tar.gz
38 | >>> RUN cd thrax-1.3.4 && ./configure CXXFLAGS='-g -O2' && make && make install
39 | RUN git clone https://github.com/anand-nv/sparrowhawk.git && cd sparrowhawk && git checkout nemo_tests && apt-get install -y autoconf && bash autoreconf && ./configure CXXFLAGS='-g -O2' && make && make install && ldconfig
40 | RUN git clone https://github.com/kward/shunit2.git
Hey, I searched around for these issues and resolved them with some extra configure flags. I'll start testing the speed now. Sorry, I'm not well versed in C and make...
It worked! With everything properly optimized and compiled, I got double the speed of the Python implementation.
The Dockerfile modifications for future reference:
RUN cd protobuf-2.5.0 && ./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w' && make && make install && ldconfig
RUN wget https://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.7.9.tar.gz && tar xzvf openfst-1.7.9.tar.gz
RUN cd openfst-1.7.9 && ./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w' --enable-far --enable-grm && make && make install && ldconfig
RUN wget https://www.openfst.org/twiki/pub/GRM/ThraxDownload/thrax-1.3.4.tar.gz && tar xzvf thrax-1.3.4.tar.gz
RUN cd thrax-1.3.4 && ./configure CXXFLAGS='-g -O2 -I/workspace/openfst-1.7.9/src/include' LDFLAGS="-L/workspace/openfst-1.7.9/src/lib" && make && make install && ldconfig
RUN git clone https://github.com/anand-nv/sparrowhawk.git && cd sparrowhawk && git checkout nemo_tests && apt-get install -y autoconf && bash autoreconf && ./configure CXXFLAGS='-g -O2 -I/workspace/openfst-1.7.9/src/include' LDFLAGS="-L/workspace/openfst-1.7.9/src/lib" && make && make install && ldconfig
RUN git clone https://github.com/kward/shunit2.git
RUN echo "DONE"