Trade-Dangerous
SEVERE performance issues doing EDDBlink import when DB is large.
Not sure what the problem is, but there's a huge difference in processing time when doing an import using the EDDBlink plugin when the database file is large vs. small. To wit:
TradeDangerous.db file size: 6.5 GiB
listings-live.csv file size: 10.3 MiB
Time to completion: ~24 minutes
NOTE: Processing market data from listings-live.csv: Start time = 2024-04-23 11:14:51.958359
#Getting total number of entries in listings-live.csv...
#Getting list of commodities...
#Getting list of stations...
#Processing entries...
NOTE: Finished processing market data. End time = 2024-04-23 11:38:17.200988
versus:
TradeDangerous.db file size: 43.7 MiB (empty StationItem table, otherwise identical to above database)
listings-live.csv file size: 10.3 MiB (same file as above)
Time to completion: ~7 seconds
NOTE: Processing market data from listings-live.csv: Start time = 2024-04-23 12:20:00.816731
#Getting total number of entries in listings-live.csv...
#Getting list of commodities...
#Getting list of stations...
#Processing entries...
NOTE: Finished processing market data. End time = 2024-04-23 12:20:07.871285
Everything is exactly the same in both cases except for the size of the StationItem table in the database being imported to.
The problem is cumulative: during testing, importing listings.csv (a 2.5 GiB file) into an initially empty database, getting from 10% to 12% took under a minute, while getting from 70% to 72% took just under 10 minutes in the same import.
Can I suggest we expand this concept:
The server side works. The client side works (apart from one warning, because a Rares station is in lockdown and doesn't appear in market data).
It works.
So I suggest concentrating on optimization rather than adding any new features: memory use first, wherever that remains an issue, and then speed. Draw a line under it and call it feature complete until we have something which can process imports and output runs in a more reasonable timeframe.
I saw we recently enabled incremental vacuuming and some other SQLite options to improve performance? It's probably worth turning those off locally and investigating SQLite's features for reporting on what the vacuum will do (keywords: sqlite3, pragma, vacuum, analyze, optimize), which may explain the issue - maybe there's an index that needs tweaking or scrapping.
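For anyone digging into that, a minimal sketch of the kind of inspection I mean (assuming the usual data/TradeDangerous.db path; adjust to taste):

import sqlite3

con = sqlite3.connect("data/TradeDangerous.db")

# What the database is currently configured to do:
for pragma in ("journal_mode", "auto_vacuum", "page_size", "freelist_count"):
    print(pragma, "=", con.execute(f"PRAGMA {pragma}").fetchone()[0])

# What indexes exist on the hot table - one of these may need tweaking or scrapping:
for row in con.execute("PRAGMA index_list('StationItem')"):
    print(row)

# Refresh planner statistics, then let SQLite do any cheap housekeeping it thinks is worthwhile:
con.execute("ANALYZE")
con.execute("PRAGMA optimize")
con.close()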
You also made it more aggressive about flushing the cache (closing the db, even) yesterday, to counter the memory bloat that my simple executemany was causing. So this makes it smell like a transaction issue - either an absence of them at some point or an unexpected nesting of them.
Look for manual transactions (execute "BEGIN"/"COMMIT") and switch those to context-managed cursors so that the sqlite3 lib isn't taken by surprise.
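For illustration, a minimal sketch of what I mean, using a toy cut-down table rather than TD's real schema - in the stdlib sqlite3 module the connection itself is the transaction context manager (commit on success, rollback on exception):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE StationItem (station_id INTEGER, item_id INTEGER, demand_price INTEGER)")

rows = [(1, 10, 500), (1, 11, 750)]

# Instead of cur.execute("BEGIN") ... cur.execute("COMMIT"):
with con:  # a transaction opens implicitly and commits (or rolls back) when the block exits
    con.executemany(
        "INSERT OR IGNORE INTO StationItem (station_id, item_id, demand_price) VALUES (?, ?, ?)",
        rows,
    )
con.close()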
I also strongly recommend looking at the apsw module; it's designed as a drop-in replacement for sqlite3, but less of it is implemented in Python, so it has the potential to be faster - though I haven't benchmarked it with our particular use case.
This is intriguing: https://stackoverflow.com/a/17794805/257645 - it would potentially give us a great way to get the data into the database fast and let SQLite worry about finalization, making it actually earn some of its keep: https://stackoverflow.com/a/3167052/257645
I strongly recommend using executemany whenever there's a loop, if possible. If you have to cap it, that's fine, but you save a bunch of overhead.
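Something along these lines, if you do need a cap - just a sketch, batch size plucked out of the air:

import itertools
import sqlite3

def batched_executemany(con: sqlite3.Connection, stmt: str, rows, batch_size: int = 50_000):
    """Feed rows to executemany in capped batches: memory stays bounded,
    but we still skip most of the per-row execute() overhead."""
    rows = iter(rows)
    while True:
        batch = list(itertools.islice(rows, batch_size))
        if not batch:
            break
        with con:  # one transaction per batch
            con.executemany(stmt, batch)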
Did you try running the imports with pyinstrument?
python -m venv venv
. venv/Scripts/activate.ps1 # or . venv/bin/activate on mac/linux
pip install -r requirements/dev.txt
pip install pyinstrument
pip install -e .
cd tradedangerous
pyinstrument ./trade.py import ...
at the end it will give you a bunch of blurb and you can customize how it does that, but I tend to run it like the above and then grab the session from the last two lines:
To view this report with different options, run:
pyinstrument --load-prev 2024-04-24T10-19-42 [options]
and run something like:
> pyinstrument --load-prev 2024-04-24T10-19-42 -r html >speed.html # local output
> start speed.html
# or
> pyinstrument --load-prev 2024-04-24T10-19-42 -r speedscope >speed.scope
# and upload that file here: https://www.speedscope.app/
Sample speed.scope from an eddblink -O solo run: speedscope.zip. Unzip it, hit https://www.speedscope.app/, and drag the file onto the browse button. Use the buttons in the top left to change the type of view.
The only transaction being performed when starting from an empty database is:
listingStmt = """INSERT OR IGNORE INTO StationItem
(station_id, item_id, modified,
demand_price, demand_units, demand_level,
supply_price, supply_units, supply_level, from_live)
VALUES ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )"""
Removing the "OR IGNORE" part doesn't have any effect.
eddblink used to use the executemany function after it finished gathering the list of all changes to be made: it appended the values for each change to the list listingsList within the for listing in listings loop, and then ran executemany(listingStmt, listingsList) after exiting the loop, which resulted in about 6 hours (on my machine) of nothing appearing to be happening while the database was being updated.
Changing it to what it is now, where each listing is inserted during processing with execute(listingStmt, {values}), still took about 6 hours, but at least there was some indication of progress.
The pragmas that were introduced knocked it down to a little over 2 hours.
I've been working on TD for a week straight now, during any free time I had outside of work and sleep. I'm taking a break for a while.
If it ain't a bug, it can wait or someone else can tackle it, is my current mode.
The pragmas were me. The WAL journal has a massive speed benefit on Linux, though it seems less so on Windows, and the vacuum keeps the db size to a minimum. The server is set up to vacuum every couple of days and I do want it to do incremental, as a full vacuum takes a chunk of time. That said, the first vacuum and optimise on the server's db reduced it from 6 GB and change to 4 GB and change, so it's definitely worth it.
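For reference, the kind of settings being described look roughly like this (a sketch; the exact values TD uses may differ):

import sqlite3

con = sqlite3.connect("data/TradeDangerous.db")
con.execute("PRAGMA journal_mode = WAL")          # big win on Linux, less dramatic on Windows
con.execute("PRAGMA auto_vacuum = INCREMENTAL")   # only takes effect after one full VACUUM
# Reclaim up to N free pages without paying for a full vacuum:
con.execute("PRAGMA incremental_vacuum(2000)")
con.close()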
I hope I didn't sound like I was handing out assignments :( If you can run the pyinstrument command on the server and snag a .scope file, it might be really interesting.
I hope I didn't sound like I was feeling put upon. I'm just so tired from working on this for a week solid and need a break for awhile.
I'm running pyinstrument; I'll post the results for you if you want to have a look at it.
It definitely seems to be more of an issue on my *nix box than on Tromador's Windows one.
Jonathon first, trade calculator for online game later, man.
TD is a vague distant memory for me - I honestly thought one of you added the plugin system after I'd ambled off, and I was kinda wondering wtf the saitek crap was until I looked at one of the files and saw the first line comment. cough
Point: I don't know what the priorities are or the full details of what's wear and what's tear. If you think I can productively be of use somewhere specific, you'll need to point me, and while I'm human and will get my knickers in a twist or my butt hurt in the moment when it comes up that I'm responsible for any particular cretinous decision or demonstration of derp, I'm not a diva: my first job was for a UK "national" contract software company as an embedded engineer, and when the company went belly up I got offers from each customer I'd worked for to go finish development. But I either didn't really "get" the companies or feel adequate, except for one, in my home town, on the fishing docks. So I accepted their offer, which came with a caveat: they weren't big enough to pay for a programmer - they were a cold-store/blast-freezer/fish-packing plant - so if it got busy, I worked packing fish, loading/unloading 50lb boxes of frozen shark bellies or halibut. Yeah, I've worked some fancy places since, but I didn't stay.
In some countries, TD is almost old enough to start putting bread on the table :)
| It definitely seems to be more of an issue on my *nix box than on Tromador's Windows one.
We aren't on identical hardware though, so although I agree from our testing that Linux and Windows do seem to behave differently, don't forget I have lots of CPU while you have more RAM than me. It's hard to simply draw comparisons based on OS alone. The server is another game entirely, running an archaic version of CentOS, such that I have had to build TD (from source) its own special environment to run in, with its very own sqlite3 and OpenSSL libraries and a super special Python just for it, along with all the environment shenanigans to make it not use the outdated system versions. Not to mention it's kinda short on RAM, but then again it's a free vhost, and server hosting isn't cheap.
I do tend to waffle when I am tired; point being, there are many variables outside of OS, particularly hardware.
| In some countries, TD is almost old enough to start putting bread on the table :)
I quite deliberately put Est 2015 at the start of the thread on E:D forums. Other tools may be quicker, flashier whatever, but nothing else can do what we can and nothing else has our stubborn will to survive.
| I don't know what the priorities are or the full details of what's wear and what's tear. If you think I can productively be of use somewhere specific, you'll need to point me,
This is my list of things needing doing at some point. The Niceness value is how important I consider each item, with lower being more important just like *nix CPU prioritizing.
1) N = -20
Anything you can do to improve import performance is welcome. On my box, with the current TD, both spansh and eddblink take a bit over 3 hours to do an import. Obviously different machines will take different times, but ideally anything but a potato should have sub 1 hour import duration. My box has an 8-core CPU with 64GB RAM, so it's not a potato. (Also I checked the drive, and it has 45MB/s+ write speeds, but neither import ever went above 5MB/s, so I highly doubt that's the bottleneck for me.)
2) N = 5
The spicing up of the visuals is certainly welcome, I like what you've done there and would be pleased to see the visual improvements you've done to spansh done to eddblink (and the rest of TD) as well.
3) N=-10
TD needs to be able to build the other Tables. We have two reliable sources, those being the daily dumps from Spansh, and the EDCD lists, such as rare_items.csv
We currently use the Spansh data (and EDDN via the listener) to fill out System, Station, Item, and StationItem, and could potentially use it to fill out Ship, ShipVendor, Upgrade, and UpgradeVendor.
We currently use EDCD to fill out FDevOutfitting and FDevShipyard, and could potentially use it to fill out RareItem, Item, and Category.
There's obviously some overlap, so in the case where we can use either, I think EDCD is the better option.
Spansh (and listener): System, Station, StationItem, ShipVendor, UpgradeVendor, Ship, Upgrade
EDCD: FDevShipyard, FDevOutfitting, RareItem, Item, Category
That leaves Added as the only one that can't be easily updated, but as that's only used in System, to indicate (I think?) what version of Elite Dangerous a system was added to the galaxy, I feel like it should individually get N=20
Getting everything populated via Spansh means updating to utilize the information provided by the dump that is currently being ignored, as well as pulling from the EDCD sources when building the DB from scratch or they've been updated.
Getting everything populated via the TD listener means listening to the other EDDN messages besides just the commodity ones, and processing them accordingly. Since the listener uses Spansh, this only matters for things which change more often than once a day. The eddblink plugin pulls from Trom's server, so getting this into the listener automatically gets it into eddblink as well.
4) N=-18
Rescue Ships are EVIL to TD, because not only do they move, they also change their name, as they are named after whatever station they are currently rescuing. This results in duplicate entries in the prices file whenever two different rescue ships have visited the same station. For example, "Rescue Ship - Arc's Faith" has two entries with different station_ids because it's been visited by two different rescue ships. The FDev IDs for the ships are different, and searching EDSM, I found one of them is listed as being at HIP 17694 \ Hudson Observatory and the other at Didio \ Laumer Orbital. The next Spansh dump should have those updates, but even with that, both of the rescue ships are now named differently, since neither one is at Arc's Faith anymore. Renaming the station needs to be done so that users don't go looking for the wrong station, and maybe adding the FDev ID to the name to prevent the duplicate station error is a good idea?
5) N=12
Working GUI, as in feature complete with TDHelper.
If you have TD installed via pip, you can run tradegui
and see the awful attempt I've got so far.
6) N=20
Update the wiki to account for changes made to TD since the last time it was updated.
I can probably get all that done myself, but I know it'll take a long time and be coded poorly, because I'm entirely self-taught on this stuff. I'm currently in college for computer science, so I'm better at this than I was when I picked up the torch, but I'm still learning, and most of it I'm learning the hard way.
Which reminds me, I set this to run before I left for work. speed.tar.gz
I'd venture to add. Import performance is the biggest problem as eyeonus says. No doubt. But...
I reinstalled E:D, seeing as we had a working TD. I guess I was feeling nostalgic. So I have been using it in anger, and there are in fact two things which leave me twiddling my thumbs. The first is absolutely importing data. The second is exporting data, by which I mean getting query results from trade run. The size of the dataset has grown in just the same way for making queries, so the tree of possibilities which must be examined has become vast. This can be mitigated by filtering down - excluding by age for example, or no Odyssey is a good one - but if you want to do a full check with a huge hold, enough money to fill it with whatever, and a jump range sufficient to cross the bubble? It's not usable. I have taken to lying about my real jump range to constrain the tree size.
So yes. Import speed does need optimization, but so does the output.
@eyeonus The visual stuff would ultimately be about code reduction - if rich had existed 10 years ago I'd not have written progressbar etc. I was looking at the transfers code, and it feels like half that module is just about trying to get "requests" imported. That's covered now by the requests.txt file, right?
GUI I always figured that would end up being web-based but the web-app concepts at the time weren't really much use and I had reasons not to really want to get too much into html. But perhaps 'gradio' would be an option.
So that brings us to the meaningful work, and I guess I always suspected TD would find itself struggling for trying to be a cold-runner rather than some kind of service.
Just thinking out loud here: we pay an entry fee for storing our data: every operation we need to do has to be expressed textually as SQL, parsed by the SQLite code, and then SQLite has to plan what to do and execute on it. That's a lot of CPU cycles just to start reading. The ordinary payoff is a sort of storage-JIT: tell the engine what you want in the abstract and it can figure out the fastest way to do it. But we're not doing elaborate queries with sophisticated multi-table joins, groups, sorts, wheres, etc.
I mean, SELECT * FROM System is actually better for us than SELECT * FROM System ORDER BY .... We don't have geospatial indexing turned on, and the SQLite options there would hurt us anyway.
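A quick way to eyeball that difference (a sketch, assuming System has a name column to sort on):

import sqlite3

con = sqlite3.connect("data/TradeDangerous.db")
for (_, _, _, detail) in con.execute("EXPLAIN QUERY PLAN SELECT * FROM System"):
    print(detail)   # typically just a single scan of the table
for (_, _, _, detail) in con.execute("EXPLAIN QUERY PLAN SELECT * FROM System ORDER BY name"):
    print(detail)   # adds a temp B-tree for the sort unless an index already covers it
con.close()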
Spitballs, consider each individually not as a sequence:
- Incremental imports
for system in systems:
    INSERT system INTO SystemIncoming
    for station in stations:
        INSERT station INTO StationIncoming
        executemany( INSERT item INTO ItemIncoming FOR item in station.items )
    if thread: thread.join
    thread = Thread( move *incoming -> * )
Using the temp table means we can avoid either building our own list or the cost of per-item queries or database indexes while we're doing this work.
We can also parallelize the work here; we fully populate the temporary table without index overheads etc, and then have sqlite move things over. Now we're making SQL pay for itself because the query planner/executor can use its own internal data about quantities and stuff to do the transfer.
That's the theory - I don't know if it's true for SQLite.
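A rough sketch of the shape of it, with a toy table standing in for TD's real schema:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE StationItem (
        station_id INTEGER, item_id INTEGER, demand_price INTEGER,
        PRIMARY KEY (station_id, item_id)
    );
    -- pristine, unindexed staging table:
    CREATE TEMP TABLE StationItemIncoming (station_id INTEGER, item_id INTEGER, demand_price INTEGER);
""")

incoming = [(1, 10, 500), (1, 11, 750), (2, 10, 480)]
con.executemany("INSERT INTO StationItemIncoming VALUES (?, ?, ?)", incoming)

# One indexed write pass, planned and executed by SQLite, instead of one per row:
with con:
    con.execute("""
        INSERT OR REPLACE INTO StationItem (station_id, item_id, demand_price)
        SELECT station_id, item_id, demand_price FROM StationItemIncoming
    """)
    con.execute("DELETE FROM StationItemIncoming")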
- Persistent database
We intuitively get why a movie ticket costs $25 when renting the film costs $5, but TD is paying $25 to rent. SQLite's benefit is that you don't need a server; its con is that you don't get a server doing all the heavy lifting in another thread/process or possibly on a different CPU. That's a fundamental part of the hitch here. We've turned on indexes and joins. We've started doing vacuums, and the problem is they can only happen while there's a process running, and we're it.
We could:
a) have TD split itself in two: a TradeAdder that implements TradeDB and soaks up the costs of keeping the database fresh, and a TradeGecko that shuttles stuff back and forth between the user and the backend; not entirely implausible, mostly separating concerns and building facades/bridges
b) make it possible, or even a default, to have a persistent database spun up one of several ways - docker, podman, vm, local or remote url - and use that, with sqlite as a naive fallback
This is where we might look to something like SQLAlchemy, PeeWee or pony. It's also where we might actually change some of the operations encumbered on TradeDB so that it doesn't have to in-memory everything.
- Persistent td/REPL
A variant of (a) in the above: we literally just make TD keep itself alive, probably over a non-SQL persistent store (either embedded or a second process; key-value or NoSQL: leveldb, rocksdb, memcache), and then have a very simple API for talking to it that it can use on itself - either through JSON-RPC with something like flask/eel, or literally have it read stdin, accept/process trade commands, and pass them to the existing command-line arg parser:
> trade.py repl
trade@SOL$ import -P eddb -O solo
via pseudo code:
import shlex
import sys

for line in sys.stdin:
    args = ["trade.py"] + shlex.split(line)  # deals with quotes etc
    main(args)  # <-- the old main, which can't tell they didn't actually come from the command line
- async or threading
I don't know off the top of my head where we can leverage this right now to any good use; I created an experimental Rust-based python-package providing a rust version of the "count lines in file" method. That's a hard one to optimize because cpu isn't the main bottleneck. I did a version where it would on startup create a number of counting workers, and then you simply gave them byte-regions to count from.
And there was no version where it wasn't slower with concurrency. I wasn't exhaustive by any means, but you're trying to apply a very simple set of logic and rush thru memory, so there's not a lot of juggle room.
| Which reminds me, I set this to run before I left for work. speed.tar.gz
that is uniquely ... lacking info lol. I've never seen it give that little information. what are the specs on the server?
That's not the server, that's my machine.
Tromador would have to provide you with any server runs.
| @eyeonus The visual stuff would ultimately be about code reduction - if rich had existed 10 years ago I'd not have written progressbar etc. I was looking at the transfers code, and it feels like half that module is just about trying to get "requests" imported. That's covered now by the requests.txt file, right?
??? What requests.txt file?
| GUI I always figured that would end up being web-based but the web-app concepts at the time weren't really much use and I had reasons not to really want to get too much into html. But perhaps 'gradio' would be an option.
I was considering using pySimpleGUI, I'll have to look at gradio, never heard of it.
| So that brings us to the meaningful work, and I guess I always suspected TD would find itself struggling for trying to be a cold-runner rather than some kind of service.
| Spitballs, consider each individually not as a sequence:
| * Incremental imports
Since the speed of the import goes down as the table gets bigger, this sounds like a great idea. The vast majority of the information goes into StationItem: all the other tables, completely filled with the spansh data, don't amount to even 0.5 GB, but add in the market data and it grows to ~7 GB, so while I doubt we'd see much improvement doing this on any of the other tables, it might be a big help for that particular one.
| * Persistent database
| * Persistent td/REPL
I"m sorry, that's all over my head. I defer to your expertise.
| I"m sorry, that's all over my head. I defer to your expertise.
Oh, just to make you regret saying that:
| ??? What requests.txt file?
... was a typo of "requirements.txt". What I meant was: in order for anyone to pip install Trade-Dangerous, the install already pulls in requirements.txt on its own, right?
Duh. Gonna check that real quick, dum-me:
so it should be safe to remove all that crap screwing around to make sure requests is installed (in transfers.py), haha
Oh. Yes.
| I was considering using pySimpleGUI, I'll have to look at gradio, never heard of it.
I don't have useful input on GUIs really; 'gradio' is - I believe - what people are using to build web-based GUI apps for things in Python, so it's not sexy, but you spend a lot more time on what the GUI elements do than on the screen itself. So, for instance, "StableDiffusion" has a gradio-based UI, as does the popular "run your own local GPT" https://github.com/oobabooga/text-generation-webui... https://www.gradio.app/ Yeah, I know it says machine learning, but they're pitching to their new market because of those last two.
I've used flask for building UIs before but then you need to build an entire front-end application in angular or react or cut-my-wrists-for-me-now-please.
If you're going to be old school about it and do a desktopy app; wxWidgets is shit, but you can keep it relatively simple and use 'wxFormBuilder' to design the app - I've done that a few times lately: https://github.com/wxFormBuilder/wxFormBuilder.
So pySimpleGui might be rockstar - check that out before wxWidgets, because I hate it as much as I use it.
You could conceivably do something with Dear imgui - I found it easy enough to use I wrote wrappers for it in C++ (https://github.com/kfsone/imguiwrap) when I used it a few years ago, the python version is ok, but it's best use case imho is for in-game debug screens/menus etc.
So I'll look at doing some experiments on the item import and how it is currently working.
How often have frontier been adding items?
I wonder if it would make sense to actually alter the row format of StationItem so that there are two rows per station: one for selling, one for buying.
Right now I'm trying a little experiment that reintroduces a per-station cursor; we're not building a list of items for executemany, but we're also not paying the full cost of adding each and every item in its own mini-transaction. Also breaking up the code a little bit where we're not invoking methods in loops and we can afford the extra 40-50ns it takes to call a method in Python lol.
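In sketch form, roughly this shape (toy column set, not the actual patch):

import sqlite3

def import_listings(con: sqlite3.Connection, stations):
    """stations: iterable of (station_id, [(item_id, demand_price), ...]) pairs."""
    stmt = "INSERT OR IGNORE INTO StationItem (station_id, item_id, demand_price) VALUES (?, ?, ?)"
    for station_id, items in stations:
        with con:  # one commit per station: no giant in-memory list, no per-item transaction
            con.executemany(stmt, ((station_id, item_id, price) for item_id, price in items))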
So for right now I think it would be useful for us to look at:
- apsw instead of sqlite3; it pulls some tricks like automatic stored-procedures that we might benefit from where we're not using executemany. Replacing sqlite3 across the whole program will take a bit of work, but just replacing it in one plugin might be fairly easy for benchmark purposes (see the sketch after this list).
- incremental import, i.e. write all the new entries into a pristine, unindexed table, and then let sqlite transfer them. Could either be radically worse, since it's doing "more" work, or radically faster because it's the database doing that work and the overheads we pay are supposed to make exactly that kind of thing faster...
- alternative storage backend; perhaps literally take eddblink_plug and replace the sqlite3 calls with something that talks to postgres or mysql and see what it's like having it dump the data into that? Not a full transition or anything, just a naive "what if".
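To make the first bullet concrete (the sketch referenced above), a naive what-if with apsw against a toy table - assumes pip install apsw and hasn't been benchmarked:

import apsw

con = apsw.Connection(":memory:")
con.cursor().execute("CREATE TABLE StationItem (station_id INTEGER, item_id INTEGER, demand_price INTEGER)")

rows = [(1, 10, 500), (1, 11, 750), (2, 10, 480)]

with con:  # apsw connections also work as a transaction context manager
    con.cursor().executemany(
        "INSERT INTO StationItem (station_id, item_id, demand_price) VALUES (?, ?, ?)",
        rows,
    )

for row in con.cursor().execute("SELECT count(*) FROM StationItem"):
    print(row)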
Win|PS> python -m apsw .\data\TradeDangerous.db
SQLite version 3.45.3 (APSW 3.45.3.0)
Enter ".help" for instructions
sqlite> select count(*) from item;
┌──────────┐
│ count(*) │
│ 395 │
└──────────┘
sqlite>
forget the row-shape idea then lol
Just so you know - I'm not remotely above figuring out how to get an AI to help me, but I'm also not expecting it to be actual help. However... Giving it the ability to read the repository definitely makes it more useful: https://github.com/kfsone/ggama
$ ls -1sh /dev/sda /dev/sdb
/dev/sda 1PB
/dev/sdb 15MB
$ rm -rf pride # surplus to requirements
$ mv /sdb/shame /sdb # storage upgrade required
Do we care about the added column any more?
| Do we care about the added column any more?
I gave the Added TABLE an N=20
I never cared about that column. :D
Just want to say, the recent PR is definitely an improvement. Even with the additional 'optimization' step, it finishes processing both listings and listings-live faster than the old way could get through just listings.
I'm happy, and extremely grateful.