Path forward on IO operations
This issue serves to document plans that @bob-carpenter, @mitzimorris, and I have discussed in person.
Issues:
- the existing IO code re-uses very little between the different services.
- the existing IO code is slow.
- the existing IO code depends on things like the order of comments in the files.
Steps:
- [ ] Split IO code into two passes. One pass should ignore all comments and use a fast reader like numpy's loadtxt to read in the draws; this can be used for all services (a minimal sketch of this first pass appears after this list). The second pass should deal with the comments and will be service-specific. This change can be done at any time
- [ ] Replace the second pass of the above with the newer JSON outputs rather than CSV comments. This requires a minimum supported cmdstan version: #685
- [ ] The first step could later be replaced with some sort of binary output. This would require upstream Stan changes
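As a rough illustration of the first pass (a sketch only; the function name and the exact handling of the header row are mine, not a settled design), the comment-ignoring read could look like:
import numpy as np

def read_draws_fast(path):
    # Pass 1 sketch: skip '#' comment lines and let numpy parse the draws.
    with open(path, "r") as f:
        # grab the header row (first non-comment line) by hand...
        header = None
        for line in f:
            if not line.startswith("#"):
                header = line.strip().split(",")
                break
        # ...then parse the remaining numeric rows in compiled code,
        # still skipping any interleaved comment lines (e.g. adaptation info)
        draws = np.loadtxt(f, delimiter=",", comments="#")
    return header, draws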
Thanks for creating the issue!
When I spoke to Mitzi the other day, I had something like that in mind, but I was not thinking of using numpy; rather, I'd navigate through the file with seek() [1]. I cannot imagine this would be slower than numpy.loadtxt().
with open("bernoulli-20250416172900_1.csv", "r") as f:
# Move to byte 1678 in my test file
# which is the start of the chain data
f.seek(1678) # probably NOT fixed across files though
beg_pos = f.tell()
# Read in the length of the file
f.seek(0, 2)
end_pos = f.tell()
# Move back to exclude the tail information about timing
offset = 129 # probably fixed across files
remaining = end_pos - 1678 - offset
f.seek(1678)
data = f.read(remaining)
print(data)
This is very fast. The only annoying thing about this solution is that the beg_pos where the chain data starts is possibly not the same across files, depending on the number of characters in the commented preamble. This is not the case for the tail of the CSV, because the formatting, I believe, forces a dedicated structure that is likely always the same size in bytes.
Arguably, we could undershoot to roughly where we think the data starts (as in my example), then filter out the few remaining comment lines before the data (see the sketch below). That's very likely to be faster than the current solution, which is to read each line, check whether it starts with "#", and filter it out, line after line.
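For instance (a sketch only: the preamble guess is made up, and the resync-by-readline trick is an addition of mine):
def read_data_lines(path, approx_preamble=1024):
    # Undershoot: jump to a byte offset we expect to be before the draws,
    # discard the (possibly partial) line we landed in, then drop whatever
    # comment lines remain instead of scanning the whole preamble.
    with open(path, "rb") as f:
        f.seek(approx_preamble)
        f.readline()  # resync to the next full line
        lines = [line for line in f if not line.startswith(b"#")]
    return lines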
That would work, I think, but it would be cleaner to separate the preamble from the data directly in Stan, adding a new flag like save_cmdstan_config to save only the data of the chains.
The other solution I was thinking about was using grep, which is quite optimised already. :) .. but I haven't tested it.
We probably need a way to evaluate the time that these steps in cmdstanpy actually take. What do you use?
Have a great Easter break!
References:
I'm reasonably certain that what is slow is not seeking around the file or skipping the # rows, but actually parsing the numbers into floats. Right now, sampling does this in an all-Python loop:
https://github.com/stan-dev/cmdstanpy/blob/0c25ec41bbd91d6aedb351eaaedf3c10a77205ee/cmdstanpy/stanfit/mcmc.py#L465-L468
Replacing this with np.loadtxt (probably combining it with the extra logic in that file that loads the warmup draws if they're present) should be both straightforward and a good bit faster, because the equivalent of the double loop we're doing there ends up being done in C code.
I did some rough timings on a fake CSV file with 1000 rows and 45 columns
import numpy as np

def read_manual(filename):
    # similar to current cmdstanpy code
    draws = np.empty((1000, 45), dtype=float)
    with open(filename, 'r') as f:
        for i in range(1000):
            line = f.readline().strip()
            xs = line.split(',')
            draws[i, :] = [float(x) for x in xs]
    return draws
15.5 ms ± 526 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
def read_np(filename):
    with open(filename, 'r') as f:
        return np.loadtxt(f, delimiter=',', comments='#')
11.9 ms ± 215 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
import pandas as pd

def read_pandas(filename):
    with open(filename, 'r') as f:
        return pd.read_csv(f, engine='c', comment='#').to_numpy()
8.39 ms ± 314 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So using pandas seems to be the fastest, even if we ultimately want the numpy data out. And we can probably get roughly twice the speed of our current code just by switching.
engine='pyarrow' for pandas is actually the fastest (3.8 ms ± 246 μs), but it doesn't allow the comment argument, so we can't use it with the current Stan CSV files. And it would be an extra dependency, at which point we could also consider options like dask etc. that also claim to have fast CSV readers.
Nice. You are right, pandas is probably a good-enough route that minimises hassle. Shall I have a go at incorporating it? @mitzimorris, any other ideas? Have a great rest of the week!
thanks @eroesch and looking forward to the PR.
Hey, you are worrying about the wrong thing.
CSV files created by the Stan MCMC process are wide (hundreds to 100k columns) and usually not that long (<1M rows). (This is because nD arrays are flattened into a 1D representation.)
It should not matter whether you read a small file in 250ms or 500ms. What should matter is whether you can read 100k columns with 100 rows.
Pandas will choke, and np.loadtxt is not your friend either.
I think the best option is to read the file line by line and then profile what takes the least amount of time to convert one line to floats.
np.frombuffer / one of the other np.from* functions (I don't remember which was the best) I think has the best performance when the file is read with the "rb" flag.
Also, you can profile how long it takes to iterate through files line by line and check the first non-whitespace char.
Edit: when I say loadtxt is slow, I mean for a full file.
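For reference, a rough sketch of that line-by-line idea (a hypothetical helper, not cmdstanpy code; for text lines the applicable variant is np.fromstring with a sep argument, since np.frombuffer expects raw binary):
import numpy as np

def parse_lines(path):
    rows = []
    header = None
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"#"):
                continue  # skip comment lines
            if header is None:
                header = line.decode().strip().split(",")
                continue
            # parse one CSV row of floats in C code rather than a Python loop
            rows.append(np.fromstring(line, dtype=float, sep=","))
    return header, np.vstack(rows)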
Here's a simple test of default I/O in pandas for data frames with headers. There aren't any comment lines, so I don't know if that would actually slow this down.
$ python3 ptest2.py
100 × 1000 write: 0.1s read: 0.0s
100 × 10000 write: 1.2s read: 0.4s
100 × 100000 write: 25.3s read: 11.2s
1000 × 1000 write: 1.1s read: 0.2s
1000 × 10000 write: 13.7s read: 2.9s
1000 × 100000 write: 242.2s read: 102.0s
10000 × 1000 write: 11.2s read: 1.8s
10000 × 10000 write: 136.3s read: 27.9s
I think this is a lot faster than whatever we have currently. A C++ program should be able to read data like this very quickly. 10K x 10K is less than 1GB of data at double precision. I ran out of patience before the 10K x 100K timing completed.
Here's the code.
import time
import numpy as np
import pandas as pd

def eval(rows, cols):
    df = pd.DataFrame(np.random.randn(rows, cols))
    fn = f"data_{rows}x{cols}.csv"
    with open(fn, "w", newline="") as f:
        f.write(",".join(f'"{c}"' for c in df.columns) + "\n")
        t0 = time.perf_counter()
        df.to_csv(f, index=False, header=False)
        t1 = time.perf_counter()
    df2 = pd.read_csv(fn, engine="c", comment="#")
    t2 = time.perf_counter()
    print(f"{rows:5d} × {cols:6d} write: {(t1 - t0):5.1f}s read: {(t2 - t1):5.1f}s")

for rows in [100, 1000, 10_000]:
    for cols in [1000, 10_000, 100_000]:
        eval(rows, cols)
Nice to see that pandas IO has been improving. I did this test a couple of years ago (or more), and I think 100k columns never finished.
I have done more tests this weekend, and I am getting a 2-4x speedup. I'll get cracking on it. There are quite a few places in the codebase using the original algorithm, and I suspect a few edge cases. I'll proceed with caution!
Sounds like pandas is a reasonable direction here, but I wanted to chime in and say that you can get around the engine='pyarrow' comment limitation by filtering out the comment lines in Python and feeding a file-like object back to these optimized readers.
I created a Stan CSV file with 2000 sampling iterations and 2000 parameters. First, I tested the pandas C engine with conversion to numpy:
import pandas as pd

def read_pandas_c(file):
    return pd.read_csv(file, engine="c", comment="#").to_numpy()
804 ms ± 63.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Compared to manually parsing out the comments and using the pyarrow engine + conversion to numpy:
import io
import pandas as pd

def read_pandas_pyarrow(file):
    with open(file, "rb") as f:
        no_comments = (line for line in f if not line.startswith(b"#"))
        no_comment_filelike = io.BytesIO(b"".join(no_comments))
    return pd.read_csv(no_comment_filelike, engine="pyarrow").to_numpy()
312 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So you don't pay much of a real performance cost by filtering out the comments manually and are able to get the pyarrow benefits. The main difference between the C and pyarrow engines seems to be that the latter is multithreaded. You'd still have to pay the cost of adding the dependency, but it's a pretty noticeable performance jump for large files.
Nice find @amas0. I suspect you might be able to get even faster if we wrote our own little TextIO wrapper to avoid needing to .join the entire no_comments iterator, but that would depend on how they actually use it under the hood.
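Something like this might work (purely a sketch: CommentSkippingReader and the filename are made up, and whether pandas' pyarrow engine is happy with a non-seekable stream like this would need checking):
import io

import pandas as pd

class CommentSkippingReader(io.RawIOBase):
    """Lazily yields the bytes of a file with '#' comment lines removed."""

    def __init__(self, raw, comment=b"#"):
        self._lines = (line for line in raw if not line.startswith(comment))
        self._pending = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Pull filtered lines until we can satisfy (part of) the request.
        while len(self._pending) < len(b):
            line = next(self._lines, None)
            if line is None:
                break
            self._pending += line
        n = min(len(b), len(self._pending))
        b[:n] = self._pending[:n]
        self._pending = self._pending[n:]
        return n

with open("output_1.csv", "rb") as f:
    reader = io.BufferedReader(CommentSkippingReader(f))
    df = pd.read_csv(reader, engine="pyarrow")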
@eroesch are you still interested in working on this issue?
I am, yes, but haven't done much yet.
That's no problem, just wanted to check!
I think a good starting place is a function like:
import os
from io import BytesIO
from typing import Tuple

def split_csv_comments(path: os.PathLike) -> Tuple[str, BytesIO]:
    comments = b""
    data = b""
    with open(path, 'rb') as f:
        for line in f:
            if line.startswith(b'#'):
                comments += line
            else:
                data += line
    return (comments.decode('utf-8'), BytesIO(data))
This would let us do some algorithm-specific processing on the comments and also gives us the ability to use fast readers even if they don't support comments, while not committing to any one specific reader up front.
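For example, usage could look like this (a sketch; the filename is made up, and np.loadtxt stands in for whichever fast reader ends up being chosen):
import numpy as np

comments, data = split_csv_comments("output_1.csv")
# `comments` keeps the config/adaptation/timing text for service-specific parsing;
# `data` is header + draws with no '#' lines, ready for any fast CSV reader.
header = data.readline().decode().strip().split(",")
draws = np.loadtxt(data, delimiter=",")
print(len(header), draws.shape)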
No desire to step on anyone's toes, but if the folks working this issue can't find the time, I'd be happy to take a look at implementing. These changes would be a big win for anyone dealing with larger models!
No desire to step on anyone's toes
I think it's fine for whoever has time to work on something to do so. We're not going to run out of things to do. Please be sure to communicate to others that you're doing that so we don't duplicate effort.
Everything has to get code reviewed so at least two people understand it, so at the very least, there will be time for @eroesch to jump in there.
In the interest of avoiding toe-stepping, is that also your take, @WardBrian?
Yes, I was about to say, we can both work together and two pairs of eyes is usually better! @amas0, if you are interested, would you be keen to jump on a quick call?
Sure! I'd be happy to chat
For completeness, @amas0 and I met and we'll work on this together!
Wanted to follow up on this thread to get some thoughts on where the scope of this change should land. After chatting with @eroesch, I did some initial work on this and was able to get an implementation that significantly speeds up reading large Stan CSV files. In some sense, this was the easy part -- the approach is more or less:
- Do a single line-by-line pass of a Stan CSV file and partition the bytes of each line into a list corresponding to a particular "section". In the MCMC output, I referred to these sections as config, warmup, adaptation, sampling, timing.
- For any sections that represent draws/samples/parameter values, convert the list of bytes to a numpy array via:
import io
from typing import List

import numpy as np
import numpy.typing as npt
import pandas as pd

def csv_bytes_list_to_numpy(
    csv_bytes_list: List[bytes],
) -> npt.NDArray[np.float32]:
    """Efficiently converts a list of bytes whose concatenation
    represents a CSV file into a numpy array."""
    return pd.read_csv(
        io.BytesIO(b"".join(csv_bytes_list)), engine="pyarrow"
    ).to_numpy()
I don't know if this can be improved much more at the Python level; the bytes parsing is very fast, and the multithreaded pyarrow engine is also very fast. I have implemented a version of this approach for each of the different Stan CSV file variations from the different inference methods.
That's all well and good. The thing I'm struggling a bit with is scoping how big a change like this should be. The smallest impactful change would be to implement this kind of parsing in the CmdStanMCMC._assemble_draws() method (which seems to be the worst offender by far). This could be done without really modifying much other code in the library and would address the most acute performance issues.
A more general difficulty in the library, however, is the inconsistency across the different inference methods in how they handle interacting with Stan CSV files. MCMC implements its own more or less pure Python parsing logic, Pathfinder and Laplace handle parsing draws with np.loadtxt, MLE and VB both use helper functions in utils.stancsv that parse out the values/draws. This variation happens despite the Stan CSV files being structurally similar across the different inference methods. Surgically implementing a more performant draws parser wouldn't really address these difficulties.
I started looking at abstracting the parsing of the various Stan CSV files away from the Stan fit objects. This has already partially happened with the functions in utils/stancsv.py. This would, I believe, lead to a cleaner overall library structure (and certainly more consistency), but it would be a pretty significant change, as it would require modifying each of the fit objects. I'm unsure if a change of that size is desired?
The right move probably depends on what the future plans are for the library. There have been mentions of a 2.0 version in various issues. If that update is intended to be a more ground-up redesign of the library, then a large scale change here may not be worthwhile and a more surgical approach would address short-term issues? On the other hand, if it's anticipated that the design will be more or less the same and cleaning up the organization of interacting with the Stan CSV files would be useful regardless, then maybe the larger change is worthwhile? Is there a long-term plan to do away with the Stan CSV files themselves? That would also play into this, I think.
@WardBrian and @bob-carpenter, since you guys probably have the clearest vision on where this should go, I'd appreciate hearing your thoughts on this.
I think it would be a really nice benefit of this if all of the code ended up sharing more pieces in the end.
That being said, if you wanted to open two PRs, one that just does the change in MCMC, and after that is reviewed and merged, a second that gets everybody else on that same page, I think that would be a nice way to proceed in terms of getting to that eventual goal
I started working on fleshing this out more completely and found that the pyarrow solution wasn't scaling as I expected. This prompted a round of more careful performance testing, where I isolated just the parsing of list[bytes] -> np.array, where the list[bytes] were the lines corresponding to the draws in the Stan CSV file. In particular, I found that the pandas solution scaled very poorly as you push the number of columns higher and higher. @ahartikainen's comment was quite prescient.
See the figure below, where I compare five different methods: parsing with pure Python (the current implementation), np.loadtxt, pandas read_csv with engine='c', pandas with the pyarrow engine, and polars, a separate package I tested.
You can see that both of the Pandas approaches exhibit some non-ideal scaling in the number of columns. The problem is exacerbated when both the rows and columns are pushed quite high. Since I imagine it's likely many people will stick to the default iter_sampling=1000 when fitting models, the scaling with number of columns (parameters) is likely more representative. If we look at a fixed rows = 1024, we have:
We see that the pandas methods quickly deteriorate to worse than the existing implementation once we get to ~4000 parameters in a model. More realistically, the only viable options in this set seem to be the current implementation, a pure numpy solution, or introducing polars, which scales very well. From my trial runs, I computed which method had the fastest parsing run at each (n_row, n_col) pair, which gives:
We see some distinct regions as the size of the input varies. For smaller number of rows, pure numpy wins out, but once we get to larger models, polars emerges as the winner (and also you see a little island of pyarrow which was deceptive in initial testing).
Looking at this, I think the question comes down to whether it's worthwhile to introduce a new dependency like polars to the library just for the fastest parsing, or to stick with numpy's solution and keep the dependencies the same. I took a look at how the numpy vs polars solutions compare across input sizes:
Under ~2k rows (samples per chain) the numpy solution is no more than twice as slow across all the numbers of columns, but it scales worse as the number of rows increases. This high-row, high-column scenario is, I imagine, fairly unlikely in most use cases so it may be ignorable.
In summary, for improved parsing, I think the question comes down to whether there's interest in including polars as a dependency to get the best performance possible. Otherwise, falling back to pure numpy seems to be the way. Welcome any thoughts.
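For reference, a polars-based reader along these lines (an approximation; the exact benchmark code isn't shown in this thread) would look like:
import io

import polars as pl

def read_polars(file):
    # strip comment lines first, since we feed polars the raw CSV bytes
    with open(file, "rb") as f:
        data = b"".join(line for line in f if not line.startswith(b"#"))
    return pl.read_csv(io.BytesIO(data)).to_numpy()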
Nice analysis.
Does polars need any dependencies other than itself? (I tested, and I think it does not.)
https://pypi.org/project/polars/#files
The size is only ~30MB, so I would say not too bad.
def split_csv_comments(path: os.PathLike) -> Tuple[str, BytesIO]: ...
Addition to b"" is quite fast but I think in iterative case bytearray is even faster. Ofc not sure how much does this affect the total time.
Thanks a ton @amas0, those are all super helpful to see laid out like that.
I think the question comes down to whether there's interest in including polars as a dependency to get the best performance possible. Otherwise, falling back to pure numpy seems to be the way. Welcome any thoughts.
I think one option might be to do what we do with ujson/json in the stanio package, which is to use the faster option if it is installed but fall back to one we know will be available. So I wouldn't mind if the code had a try: import polars / except ImportError: block at the top which changed the logic we use later on; a sketch of that pattern is below.
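A sketch of that pattern (the function name and the assumption that `data` is comment-free CSV bytes with a header row are mine, not existing cmdstanpy code):
import io

import numpy as np

try:
    import polars as pl

    def parse_draw_bytes(data: bytes) -> np.ndarray:
        # fast path: polars parses the bytes directly, multithreaded
        return pl.read_csv(data).to_numpy()

except ImportError:

    def parse_draw_bytes(data: bytes) -> np.ndarray:
        # fallback: numpy only, skipping the header row
        return np.loadtxt(io.BytesIO(data), delimiter=",", skiprows=1)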
Speaking of Stanio, that might be a reasonable place for some of this code to eventually end up. There is already a minimal, np.loadtxt-based CSV reader in csv.py. It currently just ignores all comments, but we could easily expand it to a function that returns (header, comments, data) using the ideas discussed here.
CmdStanPy would still contain the logic for handling those comments, but if we're going to write a generic function it might make sense to expose it in a place other packages in the Stan ecosystem could consider using.
Polars has no other dependencies, so it seems it would be relatively lightweight. It's also easy enough to include it optionally like Brian suggested. Personally, I lean towards including it as a dependency.
As a note: iterative concatenation onto b"" is actually quite slow compared to a b"".join(...). This, I believe, is because bytes in Python are immutable, so a new object has to be allocated and the contents copied over at each concatenation. When you perform the b"".join(...) over a list, it only creates one new object and scales much better. Quick testing with bytearray suggests that it scales the same as a b"".join(...), but in my testing it's trivially slower than the join.
Some of the changes I've been prototyping seem like they would make sense to eventually end up in stanio. It's mainly just more general logic/functions for parsing/interacting with Stan CSV outputs. I'm pretty close to having a PR ready where I implement an example of this for the hmc sampling output CSV (with the faster draws parsing), so I'll get that together for some feedback on the overall structure and we can figure out where this stuff should land.
I've got a draft PR #799 that implements the main ideas here for the MCMC outputs. Since it mostly focuses on abstracting out the CSV interaction, eventual inclusion in stanio does seem sensible.
Very nice! We had seen it could get quite complicated, but we hadn't counted on pandas not scaling nicely! Impressive analysis. I'll ping you on our side channel.
Hi @amas0 @eroesch — are either of you interested in the next steps here (first spreading the new CSV reading to other methods, later updating all the existing comment scanning methods)? No worries if not, just planning my own personal to-do list
I started working on this last night briefly - replacing existing CSV reading for draws should be quick. Pathfinder and laplace are like 3 line changes each (also gq). MLE and variational have their parsing more intertwined with the existing scan_* methods, so that will take a bit more work. Doesn't seem too bad.
I'd be happy to get something together, shouldn't take more than a few days if I can find a few minutes here and there.
After #806 the first bullet point here is officially done (thanks again @amas0!)
It sounds like you're also interested in bullet point 2 (the JSON output of save_cmdstan_config), so I encourage you to explore what that would look like. That work would probably not be merged until we are ready to do a 2.0 release, since it would probably break compatibility with cmdstan pre-2.34 (which is fine, I'd say)