
Performance of MST-Parser

Open akolonin opened this issue 5 years ago • 39 comments

Need to see how to make MST-parses on large corpora work faster; below is the discussion:

@glicerico : It now takes around 100 hours to perform MST-parses using the DNN files for Gutenberg Children. It's true that it's not using a lot of CPU, and the main problem is that it is done sequentially. The current experiments are not performing any I/O with the Postgres DB, although it does open the database. I circumvented the long run time by splitting the GC files into 7 different directories and processing each of them in a separate docker container, so it's a pedestrian way of processing in parallel. I wasn't able to use more than 7 different processes because of Postgres limits on the number of connections.

@akolonin : I have two existential questions to be answered somehow by someone: A) why is OpenCog not allocating CPU while spending so much time on a computationally intensive process that involves no I/O? B) is there a way to let the MST-Parser use multiple threads? @linas, @AmeBel , @alexei-gl - any clues?
Actually, I was making observations on the server where @glicerico was running these parses - a few Python processes doing clustering at the same time were allocating 100-600% CPU, while OpenCog doing MST-parses was consuming just a few percent, appearing semi-idle.

I see a few directions of work: 1) explore how to make the MST-Parser algorithm implemented in Guile Scheme consume the CPU; 2) explore whether Guile Scheme makes it possible to distribute the load across multiple CPUs like Python; 3) port the MST-Parser to Python with native multi-processing support; 4) port the MST-Parser to C/C++ with native multi-processing support.

akolonin avatar Mar 01 '19 09:03 akolonin

more than 7 different processes because of Postgres limits on the number of connections.

There are three distinct solutions for this:

  1. don't open postgres, until you actually really need to, and close it when done. (I'm not sure if this is possible in the current code design or not)

  2. By default, the atomspace opencog backend opens 4 threads in parallel, devoted to database i/o. You can change this to 2 or 1. It probably will not affect performance, and will allow you to have 2x or 4x more connections to postgres.

  3. postgres has a tunable parameter for "max number of open connections". You can set this higher.
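For option 3, the relevant knob is the standard PostgreSQL max_connections parameter. A minimal sketch of the change, assuming a stock postgresql.conf (the value 400 here is just an example, not a recommendation):

# in postgresql.conf -- raise the connection cap (requires a server restart)
max_connections = 400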

linas avatar Mar 05 '19 21:03 linas

a way to let the MST-Parser use multiple threads?

In principle, yes: just start new threads. Everything is thread safe, this should "just work". (there are unit tests for this)

In practice, there is some deep, unfixed live-locking bug in guile that is blocking this. Nothing crashes, but the live-lock means it runs very slowly. When I run 2 threads, it usually runs about twice as fast. When I run three threads, it goes 3x faster. Usually. For a while. When I run 4 threads, it goes 4x faster... for a few minutes, and then it drops to 0.5x single-threaded speed. When I run 8 threads, it goes at 0.01x the single-threaded speed. A total disaster.

The root cause of this needs to be identified and fixed. I'm 95% sure that the bug is in the guile code-base, but it could be elsewhere (e.g. bdw-gc). I think I probably could fix this but I don't have the time. I think that it is "probably not too hard" to fix -- i.e. it probably requires less than 50-100 lines of code changes. Probably. But it might be deeper/harder than that. There is no one on our software team who has any experience at all with bug-fixing multi-threaded environments, except maybe Vitaly. Who probably does not want to do this. I know someone who could fix this, but he's not on our payroll, and I can't really ask this as a favor.

linas avatar Mar 05 '19 21:03 linas

while OpenCog doing MST-parses was consuming just a few percent, appearing semi-idle.

I'm fairly sure that this means that you are bottle-necked on postgres, trying to store updated word-counts. Open a new connection, and type in (sql-stats) -- this will print a dense, cryptic report on how the atomspace is using the postgres backend. About half-way down the screen, if it says "stalled", that means that it is waiting for postgres to finish writing. It will also tell you that it hit the high-water-mark (i.e. the queue is full, and it is waiting for the queue to drain). It will tell you the longest drain time ever seen.

From personal experience, I have gotten major performance improvements by using SSDs instead of spinning disks. The Atomspace README also has some detailed instructions on performance-tuning postgres. If you haven't done those, you should. They make a big difference.

I don't know if this will actually make a difference for you. I'm assuming it should. If opencog is not running at 100% CPU (or more), that means that the operating system is preventing it from running at 100% for some reason. The only reason for this is that it is waiting on I/O. There are two reasons for waiting on I/O: RAM is 100% full and it is hitting the swap disk, or the process is doing legit I/O. The only legit I/O that the atomspace does is to talk to the postgres server, so waiting on postgres should be the only reason why it's not going at 100% CPU.

It is possible that the atomspace is doing something stupid: maybe it is asking postgres for data that it doesn't actually need. I dunno. Again -- we would need someone with multi-threading, multi-processing experience on staff, but we do not have anyone with that skill set.

linas avatar Mar 05 '19 22:03 linas

So -- for example: basic guile multi-threading works fine. Here's a mock-up of MST parsing in ten threads. It runs at 900% CPU on my machine. (Ideally, it would be 1000%)

(use-modules (opencog) (opencog exec))
(use-modules (ice-9 threads))

; Do it 50 million times
(define NITER (* 50 1000 1000))

(define (fun thread-label)
   ; Some atom ... different atom in each thread....
   (define some-atom (ConceptNode (number->string thread-label)))

   ; loop that increments the count truth value by one each
   ; time in loop.
   (define (loopy cnt)
      (if (> cnt 0) (begin (cog-inc-count! some-atom 1) (loopy (- cnt 1)))))

   ; Set the count to zero
   (cog-set-tv! some-atom (CountTruthValue 1 0 0))

   ; run the loop
   (loopy NITER)
)

; Run function `fun` in ten threads.
(par-for-each fun (iota 10))

So this basic example of mixed atomspace+guile multi-threading works fine.

linas avatar Mar 05 '19 22:03 linas

The clustering code is here: lines 140-148 of gram-agglo.scm -- you will see, commented out in that code, (cls (par-find merge-pred CLS-LST)) -- this function is attempting to perform the clustering in parallel. It's commented out because, for me, it runs slower than single-threaded clustering. I don't know why.

linas avatar Mar 05 '19 22:03 linas

The MST parsing code is here: lines 237-250 of mst-parser.scm Let me cut-n-paste it here, for you:

(define-public (observe-mst plain-text)
"
  observe-mst -- update pseudo-disjunct counts by observing raw text.

  This is the second part of the learning algo: simply count how
  often pseudo-disjuncts show up.
"
   ; The count-one-atom function fetches from the SQL database,
   ; increments the count by one, and stores the result back
   (for-each
      (lambda (dj) (if (not (is-oversize? dj)) (count-one-atom dj)))
      (make-sections (mst-parse-text plain-text))
   )
)

Note the comment: "fetches from the SQL database, increments the count by one, and stores the result back" -- that is probably your bottleneck. The fetch is not really necessary, if everything fits in RAM for you. The store is not really necessary, if you remember to store everything before you exit (and if you don't crash in the meanwhile).

linas avatar Mar 05 '19 22:03 linas

And so one more test -- this one fetches an atom, increments the count by one, and stores the atom, each time through the loop -- in ten threads. On my machine, guile uses 300% CPU time, and there are 14 different postgres processes, each using about 30% CPU time.

(use-modules (opencog) (opencog exec))
(use-modules (ice-9 threads))
(use-modules (opencog persist) (opencog persist-sql))
(use-modules (system repl server))

(spawn-server (make-tcp-server-socket #:port 8888))

; Do it 50 million times
(define NITER (* 50 1000 1000))
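; Override the above: actually run just 100 thousand iterations per
; thread, since every iteration now does a database round-trip.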
(set! NITER (* 100 1000))

(define (fun-sql thread-label)
   ; Some atom ... different atom in each thread....
   (define some-atom (ConceptNode (number->string thread-label)))
   ; loop that increments the count truth value by one each
   ; time in loop.
   (define (loopy cnt)
      (if (> cnt 0) (begin
         (fetch-atom some-atom)
         (cog-inc-count! some-atom 1)
         (store-atom some-atom)
         (loopy (- cnt 1)))))

   ; Set the count to zero
   (cog-set-tv! some-atom (CountTruthValue 1 0 0))

   ; run the loop
   (loopy NITER)
)

(sql-open "postgres://opencog_tester@localhost/opencog_test")

; Run function `fun` in ten threads.
(par-for-each fun-sql (iota 10))

You can monitor the progress of this by using the following to get a guile REPL over the network:

$ rlwrap telnet localhost 8888

linas avatar Mar 05 '19 23:03 linas

@linas - thank you, that is all very useful. However, @glicerico has eliminated the dependency on Postgres now, and the problem is still apparent - I am biased to consider Docker the culprit: https://docs.google.com/document/d/1WB6GrVDsy9uzkJFzN3wMIRMGBI8APjxdQcFk7CbPsl8/

akolonin avatar Mar 06 '19 06:03 akolonin

Cut-n-paste of that googledoc document:

Understanding the reasons for the low run-time performance of OpenCog MST-parsing
https://github.com/singnet/language-learning/issues/179

Anton:
Below is the activity of the OpenCog MST-Parser running 7 parsing processes.

Here is what I see:
CPU usage (sudo top):
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26620 akolonin  20   0 3249040 2.375g  36344 S   8.6  0.9  74:29.95 guile
26663 akolonin  20   0 3264436 2.372g  36760 S   7.6  0.9  73:57.57 guile
26745 akolonin  20   0 3346472 2.439g  36488 S   7.6  1.0  74:36.30 guile
26482 akolonin  20   0 3201956 2.294g  32344 S   7.3  0.9  79:13.54 guile
26527 akolonin  20   0 3198784 2.328g  36340 S   7.0  0.9  74:12.69 guile
28570 root      20   0 1333756  83992  37392 S   6.6  0.0   1343:01 dockerd
26571 akolonin  20   0 3249108 2.345g  36504 S   6.3  0.9  74:28.60 guile
26705 akolonin  20   0 3281064 2.406g  36624 S   6.3  1.0  74:49.19 guile
23080 root      20   0    9060   3676   3128 S   2.0  0.0 132:12.00 docker-containe
1393 root      20   0    9060   3216   2940 S   1.7  0.0 128:56.47 docker-containe
7234 root      20   0    9060   3592   3128 S   1.7  0.0 131:09.46 docker-containe
10575 root      20   0    9060   3348   3128 S   1.7  0.0 131:46.90 docker-containe
12383 root      20   0    9060   3696   3128 S   1.7  0.0 129:22.87 docker-containe
13186 root      20   0    9060   3576   3128 S   1.7  0.0 132:15.01 docker-containe
20576 root      20   0    9060   3520   3128 S   1.7  0.0 129:28.74 docker-containe
…
IO usage (sudo iotop):
 TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 726 be/4 syslog      0.00 B/s    7.64 K/s  0.00 %  0.00 % rsyslogd -n [rs:main Q:Reg]
28828 be/4 root        0.00 B/s   11.46 K/s  0.00 %  0.00 % dockerd -H fd://
9279 be/4 root        0.00 B/s    3.82 K/s  0.00 %  0.00 % sshd: unknown [priv]
28870 be/4 root        0.00 B/s   19.10 K/s  0.00 %  0.00 % dockerd -H fd://
29023 be/4 root        0.00 B/s  129.90 K/s  0.00 %  0.00 % dockerd -H fd://
28618 be/4 root        0.00 B/s   61.13 K/s  0.00 %  0.00 % dockerd -H fd://
26624 be/4 akolonin    0.00 B/s    0.00 B/s  0.00 %  0.00 % guile -l launch-cogserver.scm -- --lang en --mode mst --db test4
   1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % systemd --system --deserialize 21
   2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
26627 be/4 akolonin    0.00 B/s    0.00 B/s  0.00 %  0.00 % guile -l launch-cogserver.scm -- --lang en --mode mst --db test4
   4 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker/0:0H]
26629 be/4 akolonin    0.00 B/s    0.00 B/s  0.00 %  0.00 % guile -l launch-cogserver.scm -- --lang en --mode mst --db test4
   6 be/0 root        0.00 B/s    0.00 B/s
Postgres activity (sudo ps -A -o pid,fname,pcpu,args | grep postgres):
13283 postgres  0.0 postgres
13346 postgres  0.0 postgres: checkpointer process  
13347 postgres  0.0 postgres: writer process  
13348 postgres  0.0 postgres: wal writer process  
13349 postgres  0.0 postgres: autovacuum launcher process  
13350 postgres  0.0 postgres: stats collector process  
16057 grep      0.0 grep --color=auto postgres
26507 postgres  0.0 postgres: opencog_user test1 172.17.0.5(42906) idle
...
26520 postgres  0.0 postgres: opencog_user test1 172.17.0.5(42932) idle
26551 postgres  0.0 postgres: opencog_user test2 172.17.0.7(52722) idle
...
26780 postgres  0.0 postgres: opencog_user test7 172.17.0.6(55302) idle
26781 postgres  0.0 postgres: opencog_user test7 172.17.0.6(55304) idle
26782 postgres  0.0 postgres: opencog_user test7 172.17.0.6(55306) idle
Docker activity (sudo docker stats):
CONTAINER ID        NAME                 CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
b6ec36b84672        opencog_dev_run_10   0.00%               12.46MiB / 251.8GiB   0.00%               149kB / 62kB        0B / 0B             6
3131d096d343        opencog_dev_run_9    0.00%               14.08MiB / 251.8GiB   0.01%               149kB / 62kB        0B / 0B             6
94620d1e9d54        opencog_dev_run_8    0.00%               13.63MiB / 251.8GiB   0.01%               157kB / 69.2kB      26.9MB / 115kB      6
767228e46f41        opencog_dev_run_7    14.49%              2.69GiB / 251.8GiB    1.07%               450kB / 365kB       4.3MB / 618kB       42
9e4da9b4d8af        opencog_dev_run_6    15.66%              2.673GiB / 251.8GiB   1.06%               456kB / 370kB       3.28MB / 655kB      43
3ff7911613d0        opencog_dev_run_5    17.14%              2.71GiB / 251.8GiB    1.08%               449kB / 366kB       3.08MB / 598kB      47
208101cdd76a        opencog_dev_run_4    13.49%              2.635GiB / 251.8GiB   1.05%               457kB / 368kB       13.3MB / 2.52MB     43
075167f225b0        opencog_dev_run_3    14.33%              2.633GiB / 251.8GiB   1.05%               421kB / 347kB       12.9MB / 32.7MB     43
cf3e7daf6ed4        opencog_dev_run_2    13.60%              2.756GiB / 251.8GiB   1.09%               447kB / 361kB       267MB / 34.1MB      43
39dcd8c869b8        opencog_dev_run_1    16.57%              2.587GiB / 251.8GiB   1.03%               979kB / 1.01MB      61.9MB / 4.25MB     48
67c28f38474d        opencog_relex_1      0.09%               194.3MiB / 251.8GiB   0.08%               133kB / 434B        1.27MB / 0B         44
96f293ec5799        shadowbox            0.35%               102.7MiB / 251.8GiB   0.04%               0B / 0B             390MB / 1.8GB       81
e064ea43f2d5        watchtower           0.00%               8.363MiB / 251.8GiB   0.00%               133kB / 0B          18.3MB / 0B         18
6de086b1048a        opencog_postgres_1   0.00%               187.9MiB / 251.8GiB   0.07%               68.5GB / 82.1GB     58.6MB / 6.53TB     105


There are a LOT of disk writes made by docker, and they are seemingly non-buffered outputs that need rotating mechanical cylinders to be synchronized with the MST-parsing process.
Which particular part of the pipeline may need this operation, and can it be worked around?

Andres:
I'm not very familiar with the way disk writes and buffering work with respect to resource consumption. I can tell you that for each processed file, the pipeline is doing:
1) cat file | perl_script (https://github.com/singnet/learn/blob/master/run-poc/submit-one.pl) 
2) the perl_script sends every sentence through netcat to the cogserver
3) the cogserver receives a line, checks whether it's a sentence or a word-pair
4)     if it's a sentence, it stores it in a variable
5)     if it's a word-pair, it creates a link between the two words, with its weight attached as a value
6)     if it's an empty line, then it parses the sentence using the links created previously, and writes the newly created parse to a file
7)  repeat from 3) for next line of the file

Anton:
Can Docker itself be the reason for the run-time performance bottleneck?
https://www.weave.works/blog/i-quadrupled-the-speed-of-our-container-test-rig-in-a-day/
Possible hints:
https://docs.docker.com/config/containers/resource_constraints/
https://docs.docker.com/engine/reference/run/#runtime-constraints-on-resources

linas avatar Mar 06 '19 21:03 linas

So, multiple comments.

29023 be/4 root 0.00 B/s 129.90 K/s 0.00 % 0.00 % dockerd -H fd://

There are a LOT of disk writes made by docker, and they are seemingly non-buffered outputs that need rotating mechanical cylinders

  1. Maybe Docker has logging or tracing turned on? Maybe it is just writing to some trace/log-file (syslog or elsewhere) ? You can use lsof and fuser to find out (lsof==list open files, fuser==file user)

  2. 129.90 K/s is a lot, but not huge. All modern disk drives have at least 3, if not 32, MBytes or more of RAM in them to buffer up writes while the disk spins into place. You would have to be very unlucky to get tangled up with the elevator algorithm. (Make sure you have deadline I/O block scheduling turned off. It is always off by default, but maybe it got turned on?) Check with cat /sys/block/sda/queue/scheduler (replace sda with your disk, etc.)

The list that you gave below does almost-very-nearly zero disk I/O (esp. if you have postgres disabled):

1) cat file | perl_script (https://github.com/singnet/learn/blob/master/run-poc/submit-one.pl) 
2) the perl_script sends every sentence through netcat to the cogserver
3) the cogserver receives a line, checks whether it's a sentence or a word-pair
4)     if it's a sentence, it stores it in a variable
5)     if it's a word-pair, it creates a link between the two words, with its weight attached as a value
6)     if it's an empty line, then it parses the sentence using the links created previously, and writes the newly created parse to a file
7)  repeat from 3) for next line of the file

linas avatar Mar 06 '19 21:03 linas

Next: LXC is a LOT nicer than docker for data processing! Docker is great for web-servers but totally sucks for data processing. Why? Because, with docker, you cannot stop/start containers! When you start a docker container, it wipes out all the memory! You start fresh, clean -- you've lost all config, all data, everything. That's because starting a docker container is like rebooting the OS -- you lose everything when you reboot.

LXC also handles sockets and multi-processing more nicely than Docker. And you can log into it, and fiddle around, look at things, adjust things. Docker was meant for cloud deployments of webservers. LXC is meant for movable, migratable data processing (on the cloud or elsewhere).

Basically, it's easier to use -- simpler, better, nicer than Docker. Less cranky, less balky.

linas avatar Mar 06 '19 22:03 linas

You did not include the header for top. It should include a line, similar to this:

%Cpu(s):  3.6 us,  0.7 sy,  0.0 ni, 95.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

  • us: % time running non-nice user code
  • sy: % time in operating system kernel
  • ni: % time running nice user code
  • id: % time idle
  • wa: % time blocked waiting for i/o to complete
  • hi: hardware interrupts
  • si: software interrupts
  • st: time stolen away from you by hypervisor

Normally st is zero, because most desktop systems do not have a hypervisor (most cloud instances are running on a hypervisor). For you, wa, hi, si, and st should all be zero.

linas avatar Mar 06 '19 22:03 linas

This:

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
26620 akolonin  20   0 3249040 2.375g  36344 S   8.6  0.9  74:29.95 guile

Notice the column called S, and the value under that column, which is S. The S column is the process state: it can be R (running), S (sleeping), or Z (zombie). Since it's S, the process is sleeping -- as confirmed by a %CPU of less than 100%.

If guile is NOT waiting on disk, then maybe it is waiting on perl to give it the next sentence. It is possible that perl is not getting a chance to run often enough. For example, let's assume 2 or 4 CPUs and a clock tick rate of 100Hz (the clock ticks every ten milliseconds):

  • perl runs and in a microsecond, it gives guile one sentence.
  • guile does MST parse in less than a millisecond, and then waits for next sentence.
  • linux kernel schedules some other processes to run for a while.
  • ten millisecs later, clock ticks, and perl sends one sentence

The above would be weird and pathological, but could happen. As a result, perl issues 100 sentences/sec (exactly one for each time it gets scheduled), which guile handles almost immediately, and then guile does nothing, sleeping, waiting for another sentence. Meanwhile, perl is waiting to get a chance to run again (which it never gets, because some other process is hogging all the CPU)

Fixes include:

  • Changing clock tick rate to 1000Hz
  • Giving perl a negative nice
  • Redesigning the software to send over lots of sentences, instead of one at a time. Or maybe redesigning the guile handler to read a bunch of sentences in one thread, and MST-process them in another thread, so that the socket-reading thread can suck in sentences as fast as possible. Or something like that.

My guess is that giving perl a negative nice is a simple hack that you can check instantly, without any work. Maybe you'll get lucky.

linas avatar Mar 06 '19 22:03 linas

If the above crazy scenario of waiting on the OS to schedule perl to deliver one sentence is happening, it can be diagnosed by getting a microsecond-accurate clock, sticking it in the perl script, and also in guile, in four spots: when guile gets a new sentence, when it finishes with the sentence, and immediately before and after the cogserver socket read (where it reads the sentence). The timestamps should help clarify if this is happening. The cogutils logger prints micro-sec time-stamps.
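On the perl side, a minimal sketch of such instrumentation, assuming the standard Time::HiRes core module. The NC handle stands in for the cogserver socket that submit-one.pl writes to, and the surrounding loop is hypothetical:

use Time::HiRes qw(gettimeofday);

# Print a label with a microsecond-resolution timestamp on stderr.
sub stamp {
   my ($label) = @_;
   my ($sec, $usec) = gettimeofday();
   printf STDERR "%d.%06d %s\n", $sec, $usec, $label;
}

# Wrap each sentence submission with timestamps:
stamp("before-send");
print NC "$sentence\n";   # hypothetical: one sentence per write, as in submit-one.pl
stamp("after-send");

Comparing the gap between one after-send and the next before-send against the guile-side stamps should show where the ten-millisecond waits accumulate.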

linas avatar Mar 07 '19 02:03 linas

@linas , thanks a lot for all the ideas. After experimenting for a while, I finally found that the bottleneck indeed seems to be passing the sentences to the cogserver through netcat, and in particular the change of terminal that comes with it (I'm not sure if that's the proper term; see below). In the test, for each sentence in the input file, instead of processing it I just send a simple command to the cogserver:

        { print NC "(define a 1) \n"; }

and it takes the same time as if I were processing the whole sentences and MST-parsing. The terminal looks like this:

guile-mst-en> mst-en> Entering scheme shell; use ^D or a single . on a line by itself to exit.
guile-mst-en> 
guile-mst-en> mst-en> Entering scheme shell; use ^D or a single . on a line by itself to exit.
guile-mst-en> 

Instead, if I change the command to send to just a newline:

        { print NC "\n"; }

then the terminal doesn't swap between guile-mst-en> and mst-en>, and the processing takes 1/30th of the time. So, I also ran a test where I define my sentences inside the same scm file and don't need to netcat them, and the process takes around 1/8th of the time of the slower one, and the CPU does get utilized close to 100%.

So, the course of action, @akolonin, would be to redesign the parser like @linas suggested, to send blocks of text containing a sentence and all its weights, instead of single lines (see the sketch below). I think that could considerably help.
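For illustration, a minimal perl sketch of that batching on the submitting side. It assumes (hypothetically) an input layout of a sentence followed by its weight lines and a terminating blank line, and NC is again the cogserver socket handle from submit-one.pl:

use strict;
use warnings;

# Accumulate a sentence plus all of its weight lines, then submit
# the whole block in one write, instead of one write per line.
my $block = "";
while (my $line = <STDIN>) {
   $block .= $line;
   if ($line =~ /^\s*$/) {       # a blank line closes a sentence block
      print NC $block;           # hypothetical NC: the cogserver socket handle
      $block = "";
   }
}
print NC $block if $block ne ""; # flush a trailing block with no final blank line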

glicerico avatar Mar 10 '19 06:03 glicerico

I modified the workflow to netcat blocks of sentence + weights lines, and the MST-parses are being processed at least 50 times faster. Running now and expecting to confirm that parses are the same as before.

glicerico avatar Mar 11 '19 19:03 glicerico

Ahh. netcat. There are two fundamentally incompatible versions of netcat out there. One of them waits for the remote end of a connection to close the socket, the other does not. If you have the version that waits, and the socket doesn't close, then it waits for "timeout" seconds. During which everything is idle.

Why are there two versions, and why are they incompatible? Why do they have the same name? I don't know. It's a major screw-up. One is from BSD, one is from elsewhere. One expects the nc -q seconds flag; the other has a different meaning for -q. Also: the -N flag. My manpage says:

     -N      shutdown(2) the network socket after EOF on the input.  Some
             servers require this to finish their work.

     -q seconds
             after EOF on stdin, wait the specified number of seconds and then
             quit. If seconds is negative, wait forever (default).  Specifying
             a non-negative seconds implies -N.

Try adding -q 0 to your NC command. Or maybe -N -q0 just to be safe.

Also, perhaps DNS is slow:

     -n      Do not do any DNS or service lookups on any specified addresses,
             hostnames or ports.

Use a dotted-decimal address, e.g. 192.168.0.1, instead of a hostname .. it might go faster. On localhost, use 127.0.0.1

There are lots of additional complexities in the cogserver+network+guile interaction, but my first guess is that it's netcat misbehaving.

linas avatar Mar 11 '19 21:03 linas

Created a super-simple benchmark, here: https://github.com/opencog/benchmark/pull/21 It's a super-simple perl script to send commands to the cogserver.

It reproduces all the report results for this bug: cogserver runs at about 10% CPU time. I don't know why -- I expect 100% CPU time, or even better.

Changing nc -q 0 localhost 17001 to nc -q 0 -n 127.0.0.1 17001 has no effect at all on performance.

Running sudo nice -n -10 ./shell-bench.pl has no effect on performance.

I cannot tell if this is a netcat artifact, or if this is a cogserver bug, or if there's some other explanation.

linas avatar Mar 11 '19 23:03 linas

the MST-parses are being processed at least 50 times faster

You are THE MAN. The CPU util is close to 100% now, I guess?

akolonin avatar Mar 12 '19 02:03 akolonin

OK, found it (more or less). See https://github.com/opencog/benchmark/pull/22

Instead of using netcat, I created a native perl socket, and just sent data on that. I ran it in two modes: wait, and no-wait. In wait-mode:

  • open socket, send ascii text to cogserver, wait for complete reply, close socket.

In no-wait mode:

  • open socket, send ascii text to cogserver, close socket immediately.

In "wait" mode, the cogserver can handle about ten requests/second. The cogserver is 97% idle.

In "no-wait" mode, the cogserver handles 160 requests/second, the cogserver is 200% to 300% busy, and is running 30 to 80 threads in parallel (20 of these threads are cog-logger threads) (so its 16 times faster, but using bucket-loads of CPU to do this. I think the CPU is being wasted on creating and deleting threads, creating and exiting guile for each thread. Each actual work request was tiny. Each actual work request was a write-to-cog-logger so I know that the cogserver received and handled the request, because its in the log-file.)

Strangely, the netcat version is part-way between the two. So, netcat is not waiting for the full reply, but it's also not closing the socket immediately...

Conclusion: it's some kind of networking plus operating-system-scheduling effect. The wait-for-reply is a real killer, it seems. I don't understand why.

linas avatar Mar 12 '19 02:03 linas

Here's how to write TCP socket code in perl. It's super-low-brow and minimal.

use Socket;
my $port = 17001;
my $server = "127.0.0.1";

# Don't use netcat; just send directly
# argument 1: what to send (ascii string)
# argument 2: bool flag, wait for response, or not.
sub send_stuff {
   socket(SOCKET, PF_INET, SOCK_STREAM, (getprotobyname('tcp'))[2])
      or die "Can't create a socket $!\n";

   connect(SOCKET, pack_sockaddr_in($port, inet_aton($server)))
      or die "Can't connect to port $port! \n";

   my $text = $_[0];

   # if subroutine has second argument, use it.
   my $wait_for_response = 0;
   if (1 < scalar @_) {
      $wait_for_response = $_[1];
   }
   # print "Will wait for response: $wait_for_response\n";

   if ($wait_for_response) {
      SOCKET->autoflush(1);
      print SOCKET "$text\n.\n.\n";
      my $line;
      while ($line = <SOCKET>) {
         # print "Drained: $line";
      }
      close SOCKET;
   } else {
      print SOCKET "$text\n.\n.\n";
      close SOCKET;
   }
}

Example usages:

send_stuff('(use-modules (opencog logger))',);
send_stuff('(cog-logger-set-stdout! #t)');
send_stuff('(cog-logger-info "cbgb omfug")', 0);   # don't wait 
send_stuff('(cog-logger-info "cbgb omfug")', 1);   # wait for reply.

Look at /tmp/cogserver.log to see the log messages. That is, you know that the cogserver got this stuff, because it shows up in the log.

linas avatar Mar 12 '19 02:03 linas

Again: moral of the story: it's netcat/network craziness. You have two options:

  1. continue to use netcat, but send across large blocks of text
  2. replace netcat by above perl code, suitably cleaned up.

(I'm offering perl only because the rest of the NLP pipeline has perl in it. Writing socket code in the language of your choice (c++, scheme, python, typescript, ML) is just as easy, if ... if you avoid the temptation to use someone-else's visual-glitter rock-star API to sockets. All the languages have 3 or more rock-star APIs to sockets, and pretty much all of them are harder, not easier, to use, and are also slower and buggier. Just saying. Don't be fooled -- I've been there, done that -- it's hard to improve on the original, which, strange as it seems, was nearly perfect already.)

I recommend option 2) although it seems @glicerico is already sending across large blocks. Maybe it doesn't matter which - they're both OK options.

linas avatar Mar 12 '19 03:03 linas

@akolonin yes, the block-based approach with netcat is taking around 150% of my CPU (I have 4 cores in my laptop). @linas I tested the socket approach, and it is indeed some 6 times faster than the block-based approach and uses around 750% of my CPU. I think that should be the way to go. However, there's a catch: the threads finish at different times, so it's not processing the sentences in order (which matters to our current process, since we're printing the parses to file, and they should be ordered). @akolonin , @alexei-gl , we could solve this by re-ordering the file later on, or by changing our parse-evaluator approach... ideas? Anyhow, I think this issue can be closed now.

glicerico avatar Mar 12 '19 08:03 glicerico

@glicerico - that is outstanding! :-) Still, I suggest closing once the ordering problem is solved. The re-ordering should be done after the completion, otherwise too many things will get broken and we will lose more time on them. So until then, we'd better run in slow, ordered mode. From an architectural standpoint, there should be some "collector" process that brings things from different threads/processes together. Alternatively, you can have some intermediate storage: feed the inputs with line numbers and store the outputs with line numbers in the intermediate storage. After the completion of all processes/threads, the outputs can be taken out of intermediate storage, sorted by line number, and saved again, without the line numbers, to a clean file (see the sketch below).
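A minimal perl sketch of that final sort step, under the assumption (hypothetical file names and format) that each worker wrote its parses as <line-number><TAB><parse>:

use strict;
use warnings;

# Read the unordered, line-number-tagged parses, sort numerically,
# and write the parses back out without the line-number tags.
open(my $in, '<', 'parses-unordered.txt') or die "Can't open input: $!";
my @tagged = map { chomp; [ split /\t/, $_, 2 ] } <$in>;
close($in);

open(my $out, '>', 'parses-ordered.txt') or die "Can't open output: $!";
print $out "$_->[1]\n" for sort { $a->[0] <=> $b->[0] } @tagged;
close($out);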

akolonin avatar Mar 12 '19 08:03 akolonin

@glicerico + @alexei-gl - re: @linas "replace netcat by above perl code, suitably cleaned up." I suggest Python, with the code residing in the language-learning project, because the rest of the ULL pipeline is being written in Python, and that way it could integrate seamlessly with the Python DNN code and the Grammar Learner and Grammar Tester code.

akolonin avatar Mar 12 '19 09:03 akolonin

I already implemented it in Perl based on @linas's example, but of course feel free to create a different python version.


glicerico avatar Mar 12 '19 10:03 glicerico

@glicerico , Re: "I already implemented it in Perl based on @linas's example, but of course feel free to create a different python version" - just in case, I mean the version that syncs up the outputs after the unsynchronized batch processing. If you have it in Perl already - great. We can close this then :-)

akolonin avatar Mar 12 '19 12:03 akolonin

I picked perl only because the scripts that grab files and feed them are in perl.

Why perl and not python? For mangling text files, perl is actually simpler (shorter/easier) than python. Maybe half as many lines of code, or even less. So, like 10-20 lines of perl code instead of 50-100 lines of python code. Just the fact that it's so short makes it easier to debug, maintain, and understand.

The flip side: when programs get big, complicated, then python is better, because python forces you to use "structured coding style" -- forces you to indent. Forces you to declare subroutine arguments. Forces you to open and close files, and use RAII and all that wholesome white-bread stuff that polite and well-mannered programmers are supposed to do. (and on big projects, this is a good thing).

But on small projects, like "I'm lazy and I want to do it in 5 minutes" projects -- perl actually allows you to do it in 5 minutes. Instead of 3 hours. And that is a HUGE win.

And also -- again, for text processing, perl is a huge win, because it has regex built in and automatic. You can reformat files in literally three lines of code. In python, you can't even finish saying import foobar in three lines of code.

linas avatar Mar 19 '19 02:03 linas

@glicerico I merged my variant of the fix here: https://github.com/opencog/learn/pull/3 If your solution is more complex than that, you may want to review and simplify. Notice that in the final fix, I removed the $wait_for_response flag - thus the resulting socket code is just 5 lines of code total (not counting white-space and comments).

@akolonin Take a look at either the old or the new perl script, and you'll see what I mean by "very short program". Here's the old one: https://github.com/opencog/learn/blob/master/run/submit-one-netcat.pl It's 43 lines total, of which 28 are comments/whitespace, and 15 actually do something, including (1) verifying the arguments, (2) verifying that the server is alive, (3) measuring the elapsed time, (4) printing the results, and (5) actually sending the data. That's five things in 15 lines of code. You can't even begin to touch that in python.

The new script is here: https://github.com/opencog/learn/blob/master/run/submit-one.pl It's a little longer and more complicated: it has to waste two whole lines of code to open a socket, and a third to close it! It has to waste one line of code to declare a subroutine. And a second line of code to call the subroutine! (what a waste!) Again, this is impossible in python.

Moral of the story: different languages are good for different things, and restricting yourself to only one language is a strong guarantee that you will be using the wrong programming language most of the time. There is no "one ring to rule them all".

linas avatar Mar 19 '19 03:03 linas

Thanks Linas. I'm not able to use the socket version yet, because we need ordered submission of the sentences the way our pipeline works now. We'll modify it sometime soon to take advantage of this faster way of processing the files :)

glicerico avatar Mar 19 '19 04:03 glicerico