
JSON response bandwidth usage via WebSocket server

Open oddjobz opened this issue 1 year ago • 17 comments

So, I've just noticed that the WebSocket update seems to be a complete data refresh every couple of seconds .. this is generating ~120 KB per refresh, so maybe 3-4 MB per minute, which is something like 200 MB per hour, or around 5 GB per day, per connected client. Which is kinda huge in the context of scaling to many users (which is what I'm looking at)

My own WebSocket client/server code just transfers deltas, so I was wondering whether there is any scope in the code for outputting to a local file / key-value store rather than a WebSocket, in order to hook in a more efficient WS mechanism?

(or alternatively, a way of cutting down the data packets ... other than disabling a lot of charts? or maybe compress the data?)

oddjobz avatar Oct 25 '24 21:10 oddjobz

Good point! With a small tweak, you could probably read the named pipe that GoAccess uses to get the data directly. The --stdout option was added to gwsocket, but it hasn’t made it to GoAccess yet. Merging that change should be pretty straightforward, though. I recall there’s a request for real-time JSON format too, but for now, grabbing the output from the pipe seems like the easiest option.

allinurl avatar Oct 25 '24 22:10 allinurl

Also, take a look at mod_deflate, I think it can handle application/json content types.
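
For reference, the usual mod_deflate directive for that would be something along these lines (just a sketch; it assumes Apache with mod_deflate enabled and that the JSON in question is served over plain HTTP):

# Compress JSON (plus HTML and plain text) responses on the fly
AddOutputFilterByType DEFLATE application/json text/html text/plain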

allinurl avatar Oct 25 '24 22:10 allinurl

Ok, the bandwidth usage is a little crazy .. is there any way to limit the frequency of response, so no more than once every 5s? I know the whole idea is "live", but this will chew up my monthly bandwidth allowance in a matter of days ..?

oddjobz avatar Oct 25 '24 23:10 oddjobz

--html-refresh=<seconds> Refresh the HTML report every X seconds. The value has to be between 1 and 60 seconds. The default is set to refresh the HTML report every 1 second.
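
For instance, a real-time report refreshed every 5 seconds could be launched like this (the paths and log format are only illustrative):

goaccess /var/log/nginx/access.log --log-format=COMBINED \
    --real-time-html --html-refresh=5 -o /var/www/html/report.html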

allinurl avatar Oct 25 '24 23:10 allinurl

It would appear html-refresh sends the first WS packet then stops .. and for some reason my persist and restore seem not to be working: logrotate just ran, and after restarting the goaccess instances I'm seeing blank charts. Very odd .. the cache folders are not populated .. although I have an old cache folder that is. Might have to call it a night, will look again tomorrow.

oddjobz avatar Oct 26 '24 00:10 oddjobz

Mmm.. works when I launch it from the command line, but when I launch goaccess from a python script, persist doesn't write the database file in the cache folder .. no error. Will investigate tomorrow, I guess maybe it's a pty issue.

oddjobz avatar Oct 26 '24 00:10 oddjobz

Just a quick heads up: if you're piping data or running it from a script, be sure to include -. e.g.,

# cat access.log | goaccess - --log-format=COMBINED

allinurl avatar Oct 26 '24 01:10 allinurl

Ok, so I seem to have resolved a number of issues. I'd not appreciated that the cache is only written on a clean exit and my sub-process shutdown was obviously a little too severe. I'm now doing a SIGINT and that seems to be writing cache files on exit.

However, what happens if the application (or server) has a hard crash? That seems to imply that logging information could be lost if there has been a logrotate since GoAccess was last restarted. Should I be restarting all goaccess instances following a logrotate (to ensure the cache is updated)?

html-refresh now seems to work for me, I was obviously doing something wrong here. On bandwidth: I've reduced rows to 24, removed a couple of the less important tables, and turned compression up to 9 .. which has left me with a throughput of ~13 KB per second .. a lot better, but possibly still 10x what it could be. I need to generate you an html file for the mouse-over issue, then I'll take a look at the alternative WS transport.

oddjobz avatar Oct 26 '24 16:10 oddjobz

Hi @oddjobz It's great to talk to you here. I think a few things are getting mixed up here, though.

What do you mean by "cache is only written ..."? You are running in real-time mode, right?

Hmm... I see. You want to run in real-time mode and also generate persistent data storage. Yes, GoAccess can do that, but there is a price to pay. In real-time mode GoAccess only needs (or only worries about) generating the JSON data it sends to the client over the WebSocket. I believe (and correct me if I'm wrong, @allinurl) that persistent data is only saved to disk once log processing has finished, as in normal (non-real-time) mode.

Well... I use GoAccess day to day in real-time mode, and I believe some tricks I've learned can be useful for you too.

I also use scripts to start and stop it; in fact, I created a systemd service for that. I use the timeout tool to stop the service at the end of the day; I prefer that over letting it run indefinitely. You only need to run timeout --foreground TIME-TO-END goaccess ... (you can find the details with man timeout). This works perfectly: it sends a TERM signal to GoAccess, which then finishes cleanly by itself.
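
For example, a real-time instance that ends itself after 18 hours might look like this (the duration, paths and log format are only illustrative; the persistence flags are optional and only needed if you also want the on-disk database written on the clean exit):

# Run the real-time instance for 18 hours, then send it TERM so it exits cleanly
timeout --foreground 18h \
    goaccess /var/log/nginx/access.log --log-format=COMBINED \
        --real-time-html -o /var/www/html/report.html \
        --persist --restore --db-path=/var/lib/goaccess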

But the price is that the service gets stopped. Of course, you then need to restart it, and you need to refresh the page in the browser to get the live view again, because the WebSocket connection went away. For me, the advantage is that I get a frozen report (the state of the data at the moment GoAccess was stopped). For you, the benefit is that the persistent data is saved to disk!

That is also the short answer to your worry about losing persistent data: make a copy every time before starting GoAccess, and if something does happen, you can reprocess the logs in normal mode at the end of the day.

I hope that's clear and that it helps you. Feel free to keep talking.

0bi-w6n-K3nobi avatar Nov 05 '24 14:11 0bi-w6n-K3nobi

" cache is only written ... "

Ok, so this would be a feature request :-)

Please can we have an option for goaccess to flush its persistent storage cache to disk periodically .. say every minute .. so the most data that could be lost would be one minute's worth?

In the meantime I'm going to set an hourly restart :) Just as a matter of interest, this is how I'm using it, embedded within a Vue Application ... so it automatically creates and maintains a live stats instance for every site it tracks. Another useful feature would be the ability to manipulate the side-bar a little more easily .. ;-)

[screenshot: GoAccess embedded in the MMS Vue application]

oddjobz avatar Nov 05 '24 14:11 oddjobz

Hi @oddjobz Again, it's good to talk to you here.

Well... in practice this is impossible, and I will explain why. I believe the best way to achieve what you want is to run 2 instances: one for real-time and another for offline log processing (at end-of-day, for example).

But why is this impractical?

  • The nature of GoAccess, first of all (@allinurl, please correct me if I'm wrong): this great tool was designed around 2 modes, real-time and offline log processing. The first is for real-time monitoring of your sites, and the second is for reporting and persistent data storage (of course). The second mode saves its data only at the end, no more and no less.
  • Big overhead: if you are already complaining about the large JSON transfers over the WebSocket, imagine what it would mean to save all the data every minute. And note that the WebSocket only sends the data needed for the graphics and tables, which is not all of the data GoAccess retains. Fast storage such as an M.2 NVMe drive would help, but GoAccess would need to save all of its data, not just a delta or a snapshot. Remember that it can retain millions of requests over several days, and this operation would slow it down.
  • Operating system and file system constraints: neither Windows nor Linux can really be trusted with your data file here. We could talk at length about safety; only a real-time OS and a very robust file system can truly guarantee your data, and even ZFS/OpenZFS cannot. All file systems have an interval between data flushes. For example, if your machine powers off at the exact moment the data file is being saved, the file can be lost or corrupted; and even if not, the data on disk will be old data, not the last minute's. Of course, I can assume you are running a UPS, but that can fail too. About the only safe setup is a RAID controller in the server, with a battery backup of course!
  • The right mechanism for resuming from the stopping point: how would you resume from where processing stopped? That is, which mechanism/strategy would you adopt to resume, supposing log processing was interrupted before it finished? If you already have an answer to that, why not use it for offline processing at regular intervals, e.g. every hour, every half hour, etc.?

Hmm... what I propose is that you use GoAccess for what it does well, without creating expectations where it may fail. I use it day to day, processing roughly 20 sites and 6 million hits. Processing logs offline/in batch and persisting the data is the way GoAccess works well. That way you can split your processing into regular intervals and back up your storage data files before continuing with the next set of logs, so you should not lose any data. And if some error does happen, you can continue from the point of error by reprocessing the logs again.

Well, I hope that's clear. Again, feel free to make your point.

0bi-w6n-K3nobi avatar Nov 08 '24 16:11 0bi-w6n-K3nobi

Ok, so the statistics need to be relatively accurate, but losing a small percentage of the information is not an issue. So when you talk about data safety, I think you are missing the point.

Here is the operational scenario:

  1. (n) GoAccess processes run on (n) log files (say, for example, n=300)
  2. These processes run in real time and start from the beginning of the current log file
  3. Every night "logrotate" moves each log to "log.1" and starts a new log file
  4. After 3 days, the GoAccess processes (all 300) need to restart

Problems:

  1. Each GoAccess process will restart with the current .log file, losing 2 days of history
  2. If there is no persistence, this is a lot of concentrated processing in one hit, not great for the server
  3. If this repeats, there will never be any realistic history available for any of the virtual servers
  4. If this is not "live", then most of the point of GoAccess is lost

What I do to try to mitigate this:

0 0 * * * /usr/local/bin/mms_weblogs --restart

Which restarts all my GoAccess instances and forces them onto new logs. This works, and in the event of a crash the current logs would still be available so little or no information would be lost.

So, I'm not saying the issue can't be solved; I've already solved it. What I'm suggesting is that having to do this with "cron" is a bit of a messy / poor solution, and it would be a lot "cleaner" if GoAccess had this ability itself.

oddjobz avatar Nov 08 '24 16:11 oddjobz

@oddjobz Ok, continuing...

Well, as I already said above, if you separate this into 2 processes, all the problems you quoted can be solved! Below is one suggestion for how you could solve them:

  • Real-time processing does not need to be stopped. You can use the --keep-last option (see the parse options section of the manual for more detail) to keep, for example, only the last 3 days; when a new day arrives, the data from the first day is cleaned out. And if you use it for offline processing, with a backup of the persistent data files, you can use different values, i.e. generate one report with the last 3 days, another with 7 days, another with 30 days, and so on (see the sketch just after this list).
  • logrotate can use a date suffix instead of the numbers 1, 2, 3, etc. That way you can be sure you are processing the correct logs by date rather than by number. Also, keep more than 3 days of logs around in case some processing error happens. If storage space is a problem, you can compress the older ones; your script just needs to be smart enough to detect compressed and uncompressed logs.
  • In real-time mode, if an error happens, you can reprocess the old logs up to the current log (today). Of course this requires a more elaborate solution, and for that you cannot use the current log itself, only a clone of it kept in real time. I use this solution myself, not least because I have a lot of servers rather than just one. Either way, if an error happens, that clone can retain (at least) the last 3 days, so you can start again from zero (with --keep-last active) and wait until it catches up to now.
  • Oh, and my real-time instance cannot stop: several people depend on it. So why not have 2 instances? If an error happens, the second instance restarts and reprocesses from the old logs up to today. Some HTTP front ends, like NGINX, can redirect between them transparently. Of course, the WebSocket connection will be lost and you will need to refresh the page!
  • The server is corrupted, lost, damaged? Well, that's what the second server is for! With the suggestion described here, you can simply "rewind and fast-forward" as if it had never happened.
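
As a minimal sketch of the first point (paths, database locations and the COMBINED format are only illustrative; --keep-last, --persist, --restore and --db-path are existing GoAccess options):

# Real-time instance, keeping only the last 3 days of data
goaccess /var/log/nginx/access.log --log-format=COMBINED \
    --real-time-html -o /var/www/html/live.html --keep-last=3

# One-off offline run over the rotated logs, building a 30-day report
# from a persistent on-disk database
zcat -f /var/log/nginx/access.log.* | \
    goaccess - --log-format=COMBINED --keep-last=30 \
        --persist --restore --db-path=/var/lib/goaccess/30d/ \
        -o /var/www/html/monthly.html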

Well, I think the opposite of you. I believe more in the Unix philosophy: do one thing and do it well. GoAccess works well, with 2 modes, each with its own distinct objective. Again, you can ask me: why should I separate the processes? GoAccess can process logs multi-threaded in the latest versions, so a big data volume is not a problem anymore. And you can apply different filters, different intervals for the statistics, inhibit some panels, etc.

Well, again, I hope that's clear.

0bi-w6n-K3nobi avatar Nov 08 '24 17:11 0bi-w6n-K3nobi

  • I fail to see the point of having a second instance. If one instance is live and one is not, the one that is not live won't get used.

  • I already use "keep-last", but I don't see how that's relevant either way.

  • With logrotate, compression is an issue. If GoAccess could process all the log files for a virtual host regardless of extension and whether they were compressed, that would help. So if it started without persistence and automatically processed 30 days' worth of logs, that would be a start, but 30 days * 300 virtual hosts is a lot.

  • From my timings, saving a snapshot of the database takes almost no time, doing this every hour should not be a significant overhead.

If you're happy with it the way it is, that's great. For me, although it's all there and looks great, it's operationally problematic. While I accept the bandwidth issue is complex and not easy to solve (and something I can probably do myself), simply storing the data in a way that doesn't involve excessive reprocessing or data loss seems to be a fundamental issue.

oddjobz avatar Nov 08 '24 17:11 oddjobz

Hi @oddjobz .

Well... for me, a backup exists in order never to be used, but if something does happen, it's amazing to have one! That is exactly why RAID 1, RAID 10 and so on exist... I really hope you never need them, but...

GoAccess can process from standard input (STDIN on UNIX systems). You can use:

(bunzip2 -c LOG1.bz2; gunzip -c LOG2.gz; cat LOG3) | goaccess - SOME-MORE-OPTIONS-HERE

No, I did not say that. You do not need to reprocess 30 days again. If you have a backup of the persistent data storage files (PDSF), you only need to reprocess from the point of failure. For example:

  • Each day you process only that day's logs...
  • Say you find that an error happened 3 days back;
  • Well, just take the PDSF backup from 3 days back, reprocess only the logs from the last 3 days, and so on.

What I propose is offline log processing at the end of the day, or at some shorter interval. It is not excessive, and GoAccess now has multi-threaded processing, so it will be fast enough for that. You should save copies of the PDSF before each log-processing run, so you never lose any data. And you can then keep different report statistics: 3 days, 7 days, 30 days and so on, or different filters (with a different PDSF for each case, of course, which means reprocessing the logs for each one).
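
A rough sketch of that backup-then-reprocess cycle (all paths, log names and the COMBINED format are placeholders; --persist, --restore and --db-path are the existing GoAccess persistence options):

#!/bin/sh
# End-of-day batch run: snapshot the persistent database first, then fold the
# just-rotated log into it and regenerate the report.
DB=/var/lib/goaccess/db
cp -a "$DB" "$DB.bak-$(date +%F)"

zcat -f /var/log/nginx/access.log.1 | \
    goaccess - --log-format=COMBINED \
        --persist --restore --db-path="$DB" \
        -o /var/www/html/report.html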

Well, it is true that a database server can take a snapshot in a few seconds, but it has dedicated mechanisms for that; high availability and fault tolerance are intrinsic to the nature of a database.

GoAccess has no snapshots, transactions or atomicity for that. Everything happens in memory as its storage, and to save the data it needs to walk the entire hash tree.

0bi-w6n-K3nobi avatar Nov 08 '24 18:11 0bi-w6n-K3nobi

I just recently started to use goaccess, and discovered the same issues with persistence and crash-safety. My solution was to:

  • Run goaccess on log rotation, and feed it the now-inactive logfile.
  • Make goaccess run in the background (do NOT forget to use the undocumented single dash as an argument, otherwise goaccess gets all confused about not being able to find the terminal on standard input), and have it produce a static HTML file only (after processing the new entries).
  • This effectively runs goaccess every 20 minutes on average, in order to create/overwrite the static HTML page that displays the results.
  • Yes, that means the view is not exactly real-time; it lags behind by about 20 minutes, which is not an issue.
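
A sketch of that approach as a logrotate rule (everything here is illustrative: the log path, the database location and the COMBINED format; --persist, --restore and --db-path are existing GoAccess options):

/var/log/nginx/access.log {
    # usual rotation criteria (size/hourly/daily) go here
    rotate 30
    compress
    delaycompress
    postrotate
        # Fold the just-rotated (still uncompressed) log into the persistent
        # database and regenerate the static HTML report, in the background.
        cat /var/log/nginx/access.log.1 | \
            goaccess - --log-format=COMBINED \
                --persist --restore --db-path=/var/lib/goaccess \
                -o /var/www/html/report.html >/dev/null 2>&1 &
    endscript
}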

BuGlessRB avatar Apr 03 '25 23:04 BuGlessRB

Hi there! I know this might take some time, but I’ll try to contribute a bit. I’ve seen several topics being discussed:

  • WebSocket JSON payload is too large
  • GoAccess Real-Time cache grows excessively
  • Vue-based design considerations

I'll try to offer a more complete suggestion:

  • I’ll assume you have full control over the server (root access, etc.), so you can create your own systemd services. In that case:
  1. Create a script in Node, Python, or Bash—whichever works best for you.
  2. Since the script runs locally, it can read directly from the WebSocket.
  3. It can write the data as JSON to TMPFS or a Unix socket.
  4. Another thread or process can expose a new WebSocket (on a different port) that only sends the relevant data you actually need.
  5. You can then use proxy_pass in your web server (e.g., Nginx) if you need SSL.
  6. I assume you’re storing logs with daily or weekly rotation—this next part is key:
  7. Since you already have historical logs, your Vue frontend could offer a date selector for historical data. It can request those logs on demand and generate static HTML.
  8. Another idea: the generated HTML could first be parsed by this script, which could extract the specific modules or data you need and return a new HTML layout for your web UI.
  9. The real-time stream can be restarted periodically by the script to clear the cache. There’s really no point in letting it run forever.

Just to be clear—GoAccess is the engine here, but many of these extra tasks (streaming, formatting, caching) can be handled by separate processes in Node or Python.
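
As a purely hypothetical sketch of steps 2-3 (none of this is part of GoAccess itself: it assumes the third-party websocat client, GoAccess's default WebSocket port 7890, and an illustrative tmpfs path):

# Tap the local GoAccess WebSocket and keep only the newest payload on tmpfs;
# a separate process can then trim, diff or re-expose it on another port.
websocat ws://127.0.0.1:7890 | while read -r payload; do
    printf '%s\n' "$payload" > /dev/shm/goaccess-latest.json.tmp
    mv /dev/shm/goaccess-latest.json.tmp /dev/shm/goaccess-latest.json
done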

SergioDG-YCC avatar Apr 24 '25 17:04 SergioDG-YCC