Very high incremental compile time and binary size using basic polars CsvReader

Open mr-pascal opened this issue 1 year ago • 6 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

use polars::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file_name = "mycsv.csv";
    polars::io::csv::CsvReader::from_path(file_name)?
        .infer_schema(Some(10))
        .with_try_parse_dates(true)
        .has_header(true)
        .finish()?;

    Ok(())
}

[package]
name = "rust-datascience"
version = "0.1.0"
edition = "2021"

[dependencies]
polars = { version = "0.37.0", features = [] }

Log output

No response

Issue description

EDIT: I am talking about incremental compile time here, NOT about a clean one. And the main issue is the high incremental compile time, not the binary size

Hey guys,

Using the minimal example provided above, the binary produced by cargo build is 500 MB in size, and the compile time is 10-20 seconds (CPU: AMD® Ryzen 7 5700U with Radeon Graphics × 16, so not that slow). While this could also be a question, it looks like a bug to me: I follow the documentation and suddenly end up with this tremendous binary just for reading a CSV file.

When I don't add the use polars::prelude::*; and remove the .finish()?; line of code, the binary is still 430 MB.

Expected behavior

Rebuilding the application (incrementally) after adding a print statement should take at most a second or so, not 10-20 seconds. Using only a CsvReader should add at most a couple of megabytes to the binary, not 500 MB.

Installed versions

rustc: rustc 1.78.0-nightly (3406ada96 2024-02-21) Host: x86_64-unknown-linux-gnu Target: x86_64-unknown-linux-gnu

polars = { version = "0.37.0", features = [] }

mr-pascal avatar Feb 22 '24 20:02 mr-pascal

Try stripping the binary, it's got all the debug symbols probably.

You can try some of the stuff here: https://github.com/johnthagen/min-sized-rust

cargo build by default isn't meant to make minimal binary sizes.
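
As a rough illustration of the stripping suggestion, a minimal sketch of a dev profile that drops debug info might look like the following. `debug` and `strip` are standard Cargo profile keys; the chosen values are an assumption about what suits this project, not something taken from the Polars documentation.

```toml
# Cargo.toml
[profile.dev]
# Assumption: debugger-quality symbols are not needed for everyday dev builds.
debug = false        # do not generate debug info for this workspace's crates
strip = "debuginfo"  # also strip debug info that comes from precompiled dependencies such as std
```

With less debug info there is usually less data for the linker to process as well, so this can affect rebuild time and not only binary size.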

kszlim avatar Feb 22 '24 20:02 kszlim

@kszlim thanks for the tip, but especially for development, the binary size isn't my main concern, but having a 10-20 sec compile time for 5 LoC that does almost nothing is not acceptable

mr-pascal avatar Feb 22 '24 20:02 mr-pascal

If a clean compile time of 10-20s is unacceptable, I think Rust, or at minimum Polars, just isn't ever going to be acceptable for you. If you strictly want CSV parsing, you could go with the csv crate (https://docs.rs/csv/latest/csv/) for faster compile times.

Incremental compiles should be much faster. Keep in mind that Rust does a lot at compile time, so compiling 100 dependencies from source won't ever be instantaneous.

You can try using https://github.com/mozilla/sccache, which will cache artifacts across projects.
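
A minimal sketch of wiring sccache into Cargo, assuming the `sccache` binary is already installed and on `PATH`:

```toml
# .cargo/config.toml (in the project directory or in the Cargo home directory)
[build]
# Assumption: `sccache` resolves via PATH; otherwise use the full path to the binary.
rustc-wrapper = "sccache"
```

This mainly helps clean builds of dependencies. sccache does not cache incrementally compiled crates, so it is not expected to shorten the rebuild of the local crate after a small edit.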

But anyways, this definitely isn't a bug.

kszlim avatar Feb 22 '24 21:02 kszlim

@kszlim thanks for the tip, but especially for development, the binary size isn't my main concern, but having a 10-20 sec compile time for 5 LoC that does almost nothing is not acceptable

Are you serious? It compiles the whole reader and all data-types involved. It also returns a DataFrame and many different data-types with the support of a DataFrame. Polars' CSV reader doesn't exist in a vacuum.

The csv crate only gives you the text data. Polars is much more opinionated and does much more.

Please tone it down a little with "ridiculous". You don't have to use Polars.

ritchie46 avatar Feb 23 '24 00:02 ritchie46

I am sorry if there is a misunderstanding here, but I am NOT talking about a "clean" or "full" compile here. I am talking about an incremental compile, with all dependencies compiled already. It is absolutely fine for a full/clean build to take quite a while when dependencies have to be compiled.

Easy example:

  1. I do a cargo build of the full program (the above 5 LoC) -> takes maybe a minute or two with all dependencies to be compiled (that is fine)
  2. I add a print statement
  3. I hit cargo build again -> it takes 10-20 seconds, just because of the added print statement

mr-pascal avatar Feb 23 '24 04:02 mr-pascal

There's probably something wrong with your laptop (perhaps it's in low power mode/throttled)? With your exact example, I can clean compile it in 30s and incremental compiles finish in under a second.

This is on an Apple M1 MacBook Pro with 8 P-cores and 2 E-cores. On my desktop, which is x86, a clean compile takes 22s and an incremental compile takes 5-6s (switching to lld by default brings that down to 1.4s). I'm guessing the difference is just due to a different linker.

I'd give trying a different linker a go and that might fix your issues. Though the magnitude of the difference suggests to me that it's not purely the linker.
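
For the reporter's target (x86_64-unknown-linux-gnu), switching the linker to lld could be sketched as below. This assumes `clang` and `lld` are installed on the system; it is one common way to select lld, not necessarily the exact setup used for the timings above.

```toml
# .cargo/config.toml
[target.x86_64-unknown-linux-gnu]
# Assumption: clang and lld are installed and on PATH.
linker = "clang"                             # use clang as the linker driver
rustflags = ["-C", "link-arg=-fuse-ld=lld"]  # ask the driver to link with lld
```

Changing `rustflags` invalidates existing build artifacts, so the first build after adding this is a full rebuild; only the rebuilds after that reflect the new link time.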

kszlim avatar Feb 23 '24 22:02 kszlim

I don't think there are any concrete actions for us to take here.

stinodego avatar Feb 26 '24 22:02 stinodego