
Sparrowhawk slower than Python implementation

Open riqiang-dp opened this issue 1 year ago • 2 comments

Describe the bug

As suggested, I tried exporting the grammars and running the normalizer with Sparrowhawk, but it actually takes even longer than Python. Time taken by number of utterances (n_utts) for Sparrowhawk:

  • 100: 0m27.430s
  • 50: 0m15.666s
  • 10: 0m4.690s

For Python:

  • 100: 11s
  • 50: 5.6s
  • 10: 0.85s

The time taken for Sparrowhawk seems a bit non-linear.
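To make the non-linearity concrete, here is a minimal sketch that divides each reported total by the number of utterances. The dictionary and function names are illustrative and not part of NeMo or Sparrowhawk; the numbers are the ones reported above.

```python
# Wall-clock totals copied from the report above (seconds).
sparrowhawk = {10: 4.690, 50: 15.666, 100: 27.430}
python_impl = {10: 0.85, 50: 5.6, 100: 11.0}


def per_utterance(timings):
    """Return seconds per utterance for each batch size."""
    return {n: total / n for n, total in timings.items()}


if __name__ == "__main__":
    for name, timings in (("sparrowhawk", sparrowhawk), ("python", python_impl)):
        for n, cost in sorted(per_utterance(timings).items()):
            print(f"{name:12s} n={n:3d}  {cost:.3f} s/utt")
```

For Sparrowhawk the per-utterance cost falls from about 0.47 s at 10 utterances to about 0.27 s at 100, which would be consistent with a fixed startup cost being amortized over larger batches. The Python figures stay near 0.09 to 0.11 s per utterance.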

Steps/Code to reproduce bug

I exported my custom grammar and ran it in the Sparrowhawk Docker container. Another issue reports the same slowdown: https://github.com/NVIDIA/NeMo-text-processing/issues/82

Expected behavior

The C++ implementation is supposed to be faster than the Python one.

Environment overview (please complete the following information)

  • Environment location: GCP
  • Method of NeMo install: poetry

Environment details

Additional context

riqiang-dp avatar May 23 '24 18:05 riqiang-dp

@anand-nv could you please comment on this?

ekmb avatar May 23 '24 19:05 ekmb

Can you provide the steps you are following to evaluate? Providing the Python scripts and Sparrowhawk code snippets used for benchmarking and performing ITN/TN would be useful.

anand-nv avatar May 24 '24 03:05 anand-nv

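For Python, I'm simply initializing the text normalizer and running it in a for loop over each line of text in a file:

from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(
    input_case='cased',
    lang='en',
    whitelist='path/to/whitelist.tsv',
    overwrite_cache=False,
    cache_dir='./assets/',
)

# 'path/to/input.txt' is a placeholder: one utterance per line
with open('path/to/input.txt') as f:
    for line in f:
        line = normalizer.normalize(line.strip(), punct_pre_process=True, punct_post_process=True, verbose=True)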

Sparrowhawk

bash export_grammars.sh --GRAMMARS=tn_grammars --LANGUAGE=en --OVERWRITE_CACHE=true --WHITELIST path/to/whitelist.tsv --INPUT_CASE=cased --MODE=interactive

Then, in the Docker container, I replaced test.txt with my own text and ran:

time normalizer_main --config=sparrowhawk_configuration.ascii_proto --multi_line_text < test.txt > results.txt

I also modified normalizer_main.cc to print out the actual time taken in the loop:

  const auto normalize_start = std::chrono::steady_clock::now();
  for (const auto& sentence : sentences) {
    string output;
    normalizer->Normalize(sentence, &output);
    std::cout << output << std::endl;
  }
  const auto normalize_end = std::chrono::steady_clock::now();
  const auto normalize_time = std::chrono::duration_cast<std::chrono::milliseconds>(
    normalize_end - normalize_start).count();
  std::cerr << "Time taken to normalize: " << normalize_time << " milliseconds" << std::endl;
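For a like-for-like comparison, the Python side can be timed the same way, measuring only the normalization loop and excluding grammar loading. The following is a minimal sketch: `time_normalization` is a hypothetical helper, not part of NeMo, and `str.upper` stands in for the real normalizer so the snippet runs without NeMo installed.

```python
import time


def time_normalization(normalize, sentences):
    """Time only the normalization loop, excluding model/grammar loading.

    `normalize` is any callable taking one string and returning one string.
    Returns (outputs, elapsed_milliseconds).
    """
    start = time.perf_counter()
    outputs = [normalize(sentence) for sentence in sentences]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return outputs, elapsed_ms


if __name__ == "__main__":
    # Stand-in for normalizer.normalize so the sketch runs without NeMo installed.
    outputs, elapsed_ms = time_normalization(str.upper, ["hello world", "it costs $5"])
    print(f"Time taken to normalize: {elapsed_ms:.1f} milliseconds")
```

With the real normalizer, the callable could be something like `lambda s: normalizer.normalize(s, punct_pre_process=True, punct_post_process=True)`. Passing `verbose=True` prints extra output on each call, which may inflate the measured time.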

riqiang-dp avatar May 24 '24 21:05 riqiang-dp

Do you have the "actual time estimates" for the C++ implementation normalizer_main.cc?

anand-nv avatar May 24 '24 21:05 anand-nv

I don't have the exact numbers or the Docker container open anymore, but the in-loop time was always around 600 ms less than the time reported by bash. So it was about 26.8s for 100 utterances, 15s for 50, and 4s for 10, which is why I assume the initialization time was around 600 ms.
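As a rough check of the 600 ms estimate, this sketch subtracts the recalled in-loop totals from the shell `time` totals in the original report. The in-loop figures are approximate recollections, so the result is only indicative.

```python
# Shell `time` totals from the original report (seconds).
wall = {10: 4.690, 50: 15.666, 100: 27.430}
# Approximate in-loop totals recalled in the comment above (seconds).
loop = {10: 4.0, 50: 15.0, 100: 26.8}

# Time spent outside the normalization loop (startup, grammar loading, I/O).
overhead = {n: wall[n] - loop[n] for n in wall}

if __name__ == "__main__":
    for n in sorted(overhead):
        print(f"n={n:3d}  wall - loop = {overhead[n]:.3f} s")
```

The differences land between roughly 0.63 s and 0.69 s for all three batch sizes, so nearly all of the measured time is spent inside the normalization loop itself.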

riqiang-dp avatar May 27 '24 22:05 riqiang-dp

Are you using the Dockerfile provided here to build Sparrowhawk? If so, can you try adding CXXFLAGS and CFLAGS to ./configure and rebuilding the Docker image?

./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w'

anand-nv avatar May 29 '24 06:05 anand-nv

I see let me try, thanks

riqiang-dp avatar May 29 '24 22:05 riqiang-dp

I got this error trying to compile:

79.22 libtool: link: g++ -g -O2 -w -std=c++11 -o .libs/normalizer_main normalizer_main.o  ../lib/.libs/libsparrowhawk.so -L/usr/local/lib/fst -lthrax -lfstfar -lfst -lm -ldl -lprotobuf -lre2
79.29 ../lib/.libs/libsparrowhawk.so: undefined reference to `fst::internal::DenseSymbolMap::Find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
79.29 collect2: error: ld returned 1 exit status

riqiang-dp avatar May 30 '24 22:05 riqiang-dp

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jun 30 '24 01:06 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Jul 07 '24 01:07 github-actions[bot]

The Docker container used for Sparrowhawk is not optimized for production. You can build your own Docker container and compile openfst-1.7.9, thrax-1.3.4 and sparrowhawk (https://github.com/anand-nv/sparrowhawk/tree/nemo_tests) with CXXFLAGS='-g -O2'. The time taken to run my custom test set with the existing (non-optimized) Docker container is

real    0m14.396s
user    0m14.355s
sys     0m0.020s

Time taken to run in a container with openfst/thrax/sparrowhawk compiled with -g -O2 is

real    0m1.191s
user    0m1.179s
sys     0m0.012s
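The two `real` values above can be turned into a speedup factor. This is a small sketch; `parse_real_time` is a hypothetical helper that handles the `XmY.YYYs` format printed by bash `time`.

```python
import re


def parse_real_time(text):
    """Convert a bash `time` value such as '0m14.396s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# `real` values quoted above for the two containers.
unoptimized = parse_real_time("0m14.396s")
optimized = parse_real_time("0m1.191s")
speedup = unoptimized / optimized

if __name__ == "__main__":
    print(f"speedup: {speedup:.1f}x")
```

On these figures the optimized build is roughly 12 times faster than the stock container for this particular test set.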

anand-nv avatar Jul 09 '24 02:07 anand-nv

Thanks, I will try this again

riqiang-dp avatar Jul 09 '24 23:07 riqiang-dp

I thought you meant you built the Docker container with different versions, but it turns out they are the same versions as in your Dockerfile? As before, I'm running into build issues, so I can't even test the speed. I then tried specifying the versions of thrax and openfst, but I still get the same error. This is the Dockerfile:


# set base image (host OS)
FROM conda/miniconda3

# set the working directory in the container
WORKDIR /workspace

# install dependencies
RUN echo "deb http://archive.debian.org/debian stretch main contrib non-free" > /etc/apt/sources.list
RUN conda install conda-build -y
RUN apt-get update &&     apt-get install -y --reinstall build-essential pkg-config &&     apt-get upgrade -y &&     apt-get install -y git &&     apt-get install make
RUN git clone https://github.com/google/re2 
RUN cd re2 && git checkout tags/2022-02-01 && make && make install
RUN apt-get install build-essential -y && apt-get install wget -y
RUN wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
RUN tar xzvf protobuf-2.5.0.tar.gz
RUN cd protobuf-2.5.0 && ./configure && make && make install && ldconfig
RUN conda install -c conda-forge thrax=1.3.4 openfst=1.7.9 -y
RUN git clone https://github.com/anand-nv/sparrowhawk.git && cd sparrowhawk &&  git checkout nemo_tests &&   apt-get install -y autoconf && bash autoreconf && ./configure CXXFLAGS='-g -O2' && make && make install && ldconfig
RUN git clone https://github.com/kward/shunit2.git
RUN echo "DONE"
79.35 make[2]: Entering directory '/workspace/sparrowhawk/src/bin'
79.35 g++ -DPACKAGE_NAME=\"Sparrowhawk\" -DPACKAGE_TARNAME=\"sparrowhawk\" -DPACKAGE_VERSION=\"1.0.0\" -DPACKAGE_STRING=\"Sparrowhawk\ 1.0.0\" -DPACKAGE_BUGREPORT=\"[email protected]\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sparrowhawk\" -DVERSION=\"1.0.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\"   -I./../include -funsigned-char  -g -O2 -std=c++11 -MT normalizer_main.o -MD -MP -MF .deps/normalizer_main.Tpo -c -o normalizer_main.o normalizer_main.cc
81.54 mv -f .deps/normalizer_main.Tpo .deps/normalizer_main.Po
81.54 /bin/bash ../../libtool  --tag=CXX   --mode=link g++  -g -O2 -std=c++11   -o normalizer_main normalizer_main.o ../lib/libsparrowhawk.la -L/usr/local/lib/fst -lthrax -lfstfar -lfst -lm -ldl -lprotobuf -lre2
81.63 libtool: link: g++ -g -O2 -std=c++11 -o .libs/normalizer_main normalizer_main.o  ../lib/.libs/libsparrowhawk.so -L/usr/local/lib/fst -lthrax -lfstfar -lfst -lm -ldl -lprotobuf -lre2
81.72 ../lib/.libs/libsparrowhawk.so: undefined reference to `fst::internal::DenseSymbolMap::Find(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const'
81.72 collect2: error: ld returned 1 exit status
81.72 make[2]: *** [normalizer_main] Error 1
81.72 Makefile:386: recipe for target 'normalizer_main' failed
81.72 make[2]: Leaving directory '/workspace/sparrowhawk/src/bin'
81.72 Makefile:352: recipe for target 'all-recursive' failed
81.72 make[1]: Leaving directory '/workspace/sparrowhawk/src'
81.72 make[1]: *** [all-recursive] Error 1
81.72 Makefile:383: recipe for target 'all-recursive' failed
81.72 make: *** [all-recursive] Error 1

which happens in this step:

RUN git clone https://github.com/anand-nv/sparrowhawk.git && cd sparrowhawk &&  git checkout nemo_tests &&   apt-get install -y autoconf && bash autoreconf && ./configure CXXFLAGS='-g -O2' && make && make install && ldconfig

I also tried building thrax from source but that led to another error:

# install dependencies
RUN echo "deb http://archive.debian.org/debian stretch main contrib non-free" > /etc/apt/sources.list
RUN conda install conda-build -y
RUN apt-get update &&     apt-get install -y --reinstall build-essential pkg-config &&     apt-get upgrade -y &&     apt-get install -y git &&     apt-get install make
RUN git clone https://github.com/google/re2 
RUN cd re2 && git checkout tags/2022-02-01 && make && make install
RUN apt-get install build-essential -y && apt-get install wget -y
RUN wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
RUN tar xzvf protobuf-2.5.0.tar.gz
RUN cd protobuf-2.5.0 && ./configure && make && make install && ldconfig
# RUN conda install -c conda-forge thrax=1.3.4 openfst=1.7.9 -y
RUN wget https://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.7.9.tar.gz && tar xzvf openfst-1.7.9.tar.gz
RUN cd openfst-1.7.9 && ./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w' && make && make install
RUN wget https://www.openfst.org/twiki/pub/GRM/ThraxDownload/thrax-1.3.4.tar.gz && tar xzvf thrax-1.3.4.tar.gz
RUN cd thrax-1.3.4 && ./configure CXXFLAGS='-g -O2' && make && make install
RUN git clone https://github.com/anand-nv/sparrowhawk.git && cd sparrowhawk &&  git checkout nemo_tests &&   apt-get install -y autoconf && bash autoreconf && ./configure CXXFLAGS='-g -O2' && make && make install && ldconfig
RUN git clone https://github.com/kward/shunit2.git
RUN echo "DONE"
3.475 checking fst/extensions/far/far.h presence... no
3.493 checking for fst/extensions/far/far.h... no
3.493 configure: error: fst/extensions/far/far.h header not found
------
Dockerfile:38
--------------------
  36 |     RUN cd openfst-1.7.9 && ./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w' && make && make install
  37 |     RUN wget https://www.openfst.org/twiki/pub/GRM/ThraxDownload/thrax-1.3.4.tar.gz && tar xzvf thrax-1.3.4.tar.gz
  38 | >>> RUN cd thrax-1.3.4 && ./configure CXXFLAGS='-g -O2' && make && make install
  39 |     RUN git clone https://github.com/anand-nv/sparrowhawk.git && cd sparrowhawk &&  git checkout nemo_tests &&   apt-get install -y autoconf && bash autoreconf && ./configure CXXFLAGS='-g -O2' && make && make install && ldconfig
  40 |     RUN git clone https://github.com/kward/shunit2.git
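The configure error above means the Thrax build could not find the OpenFst FAR extension header, which OpenFst installs only when the FAR extension is enabled at configure time. A minimal sketch for checking this before building Thrax follows. `missing_headers` is a hypothetical helper, and `/usr/local/include` is an assumed install prefix, not something stated in the thread.

```python
from pathlib import Path

# Header that the Thrax configure step reported as missing in the log above.
REQUIRED_HEADERS = ["fst/extensions/far/far.h"]


def missing_headers(include_dir, headers=REQUIRED_HEADERS):
    """Return the headers from `headers` that are absent under `include_dir`."""
    root = Path(include_dir)
    return [h for h in headers if not (root / h).is_file()]


if __name__ == "__main__":
    # /usr/local/include is an assumed install prefix, not taken from the thread.
    missing = missing_headers("/usr/local/include")
    print("missing:", missing or "none")
```

If the header is reported missing, OpenFst was most likely configured without its FAR extension or installed under a different prefix.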

riqiang-dp avatar Jul 10 '24 17:07 riqiang-dp

Hey, I searched around for these issues and resolved them with some extra configure flags. I'll start testing the speed now. Sorry, I'm not well versed in C and make...

riqiang-dp avatar Jul 10 '24 19:07 riqiang-dp

It worked! With everything properly optimized and compiled, I got double the speed of the Python implementation.

The Dockerfile modification for future reference:

RUN cd protobuf-2.5.0 && ./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w' && make && make install && ldconfig
RUN wget https://www.openfst.org/twiki/pub/FST/FstDownload/openfst-1.7.9.tar.gz && tar xzvf openfst-1.7.9.tar.gz
RUN cd openfst-1.7.9 && ./configure CFLAGS='-g -O2 -w' CXXFLAGS='-g -O2 -w' --enable-far --enable-grm && make && make install && ldconfig
RUN wget https://www.openfst.org/twiki/pub/GRM/ThraxDownload/thrax-1.3.4.tar.gz && tar xzvf thrax-1.3.4.tar.gz
RUN cd thrax-1.3.4 && ./configure CXXFLAGS='-g -O2 -I/workspace/openfst-1.7.9/src/include' LDFLAGS="-L/workspace/openfst-1.7.9/src/lib" && make && make install && ldconfig
RUN git clone https://github.com/anand-nv/sparrowhawk.git && cd sparrowhawk &&  git checkout nemo_tests &&   apt-get install -y autoconf && bash autoreconf && ./configure CXXFLAGS='-g -O2 -I/workspace/openfst-1.7.9/src/include' LDFLAGS="-L/workspace/openfst-1.7.9/src/lib" && make && make install && ldconfig
RUN git clone https://github.com/kward/shunit2.git
RUN echo "DONE"

riqiang-dp avatar Jul 10 '24 23:07 riqiang-dp