umya-spreadsheet icon indicating copy to clipboard operation
umya-spreadsheet copied to clipboard

CPU Utilization and Memory Issues

Open BharathIO opened this issue 1 year ago • 35 comments

Hi, while reading and writing excel file, CPU utilization is going peek to more than 100% and RAM memory consumption is going beyond 1 GB. Can you please check and do the needful.

Attaching Excel template.

Steps to reproduce:

  1. Use below attached XLSX file and read using umya_spreadsheet::reader::xlsx::lazy_read
  2. After reading, try copying row-2 and create 5 to 6 rows
  3. Observe memory and cpu utilization, is is going peek

Vendor_Details.xlsx

BharathIO avatar Oct 08 '24 08:10 BharathIO

@BharathIO Thank you for contacting us. We will investigate.

MathNya avatar Oct 08 '24 09:10 MathNya

One more point i observed.

Whenever we try to create new row / modify existing cell value CPU utilization is going peek into 99% and so on.

Ex: new_book.insert_new_row("Sheet1", &3, &1);

BharathIO avatar Oct 09 '24 09:10 BharathIO

I appreciate if you can share what all the functions i can use to read/write using lazy.

Ex: let mut new_book = umya_spreadsheet::reader::xlsx::lazy_read(path1).unwrap();

Above lazy_read somehow did not eat memory/ more cpu

BharathIO avatar Oct 09 '24 09:10 BharathIO

One more point i observed.

Whenever we try to create new row / modify existing cell value CPU utilization is going peek into 99% and so on.

Ex: new_book.insert_new_row("Sheet1", &3, &1);

Is there a problem with CPU going 100%? In a compute-heavy scenario, you'd want CPU to max out in order for the result to be returned quickly... Now I don't think this repo is multi-threaded, so I assume you're talking maxing out one CPU core?

Now excessive usage of RAM is another matter. I must ask: have you compiled with release? It makes a huge difference.

Also, your XLSX is not particularly small. It may contain only one row, but it is 818KB in size, and actually contains 17528 rows which makes it larger than many large spreadsheet models. I suspect there are lots of formatting in individual cells? Can't tell without analyzing the file in depth.

So you have an 800KB file which is close to 1MB. Parsing this into a Rust structure, it is not unreasonable to expect multiple times the size of data, especially when every record is small and there are a LOT of them. I'd guess 1GB is a bit too much, but I definitely would expect at least a few hundred MB in RAM consumed just by reading the file into memory.

For example, an entry like <foo /> is only a few bytes in the Excel file, but the foo structure may contain hundreds of fields, all of which are default for this entry. The actual type in Rust would be HUGE.

Removing the extra blank rows shrinks the file down to 14KB.

schungx avatar Oct 09 '24 09:10 schungx

Thank you for your update and detailed explanation.

I am currently facing an issue where CPU utilization spikes to 100%. According to the configuration, when CPU usage exceeds a threshold of 60%, the process is shut down, and an email alert is sent to the admin.

I am using the cargo build --release option for compilation. While the file size itself isn't a concern, the CPU and RAM usage during read/write operations is much higher than expected.

I will attach a sample code with CPU and memory statistics for your reference.

I understand that larger files can naturally consume more resources, but in this case, the files are relatively small. However, even with the code below, which only removes a row at index 2, CPU consumption reaches 99%.

Could you kindly provide suggestions or steps to optimize resource usage and reduce the CPU/memory footprint during file read/write operations? I am not concerned about increasing the time it takes for these operations if that helps in reducing resource consumption.

let template_path = "../lov-template-new.xlsx";
    let output_path = ../1_style_issue_output.xlsx";

let  mut book = match umya_spreadsheet::reader::xlsx::read(template_path) {
        Ok(book) => book,
        Err(e) => {
            println!("Error: {:?}", e);
            return;
        }
    };

if let Some(sheet) =  book.get_sheet_mut(&1) {
    let row_index = 2;
    sheet.remove_row(&row_index, &1);
    println!("Removed row: {}", row_index);
   }

match umya_spreadsheet::writer::xlsx::write(&book, output_path) {
        Ok(_) => println!("Success writing to file: {}", output_path),
        Err(e) => println!("Error: {:?}", e),
    };
Screenshot 2024-10-09 at 3 33 37 PM

BharathIO avatar Oct 09 '24 10:10 BharathIO

Must you keep all those blank rows?

Or can you use column styles instead?

schungx avatar Oct 09 '24 13:10 schungx

I would suggest reducing the size of the xlsx files. Excel files are not the easiest to parse and this repo does a very thorough job in exposing very detailed structure. You have to give it some slack by helping it with less work.

schungx avatar Oct 09 '24 13:10 schungx

I appreciate the power of this repo, it has all the features i wanted so far.

Can you please help me how to reduce the size of xlsx without losing any data and styles?

also, i saw that lazy_read and write_light functions are very helpful. An you suggest such functions so that i can use for my functionality?

BharathIO avatar Oct 09 '24 13:10 BharathIO

I would start by finding out why it is so large in the first place. You most likely duplicated formatting styles over all 17000+ rows.

Maybe you should just make some of the column styles and they'll be in effect for all cells in that column.

EDIT: Strange, your XLSX file's data is 6MB (!!!) uncompressed. However, I don't see any specific formatting in any of those cells.

schungx avatar Oct 10 '24 00:10 schungx

Vendor_Details.xlsx

Try this one instead.

schungx avatar Oct 10 '24 00:10 schungx

Thanks, i will try it and let you know with my results.

May i know how you compressed it without losing styles and formattings etc., As per my understanding, new file your shared has below. Highest column - 13, Highest row - 2

All the blank rows are gone now. Please share the steps to do this for any other files in future.

BharathIO avatar Oct 10 '24 06:10 BharathIO

All the blank rows are gone now. Please share the steps to do this for any other files in future.

Simple. Select all the blank columns up to the right end, delete columns. Select first blank column, press Shift-Ctrl-Right-Arrow to select all the columns. Then Edit-Delete.

Select all the blank rows down to the lowest bottom, deleted rows. Select first blank row, press Shift-Ctrl-Down-Arrow to select all rows. Then Edit-Delete.

schungx avatar Oct 10 '24 07:10 schungx

Thanks for the update. I am aware of manual steps to remove blank rows. Is there any way programmatically remove them?

BharathIO avatar Oct 10 '24 08:10 BharathIO

Thanks for the update. I am aware of manual steps to remove blank rows. Is there any way programmatically remove them?

There is API to delete rows and columns...

schungx avatar Oct 10 '24 08:10 schungx

Thanks for the update. I am aware of manual steps to remove blank rows. Is there any way programmatically remove them?

There is API to delete rows and columns...

Is it sheet.remove_row() you are talking about?

I mean, i need a way to remove/delete all empty rows for given input file programmatically with rust.

BharathIO avatar Oct 10 '24 12:10 BharathIO

I mean, i need a way to remove/delete all empty rows for given input file programmatically with rust.

I believe you can loop through the Excel sheet data structures and simply delete all cell-specific data if the cell is empty. However that'll require iterating over a huge number of cells. I would not recommend it.

You are not doing simple manipulation here. You're rewriting Excel.

Alternatively, if all your rows are guaranteed to contain all data or all blanks, then just search downwards until you find a blank row, then remove rows up to max in one single call.

schungx avatar Oct 11 '24 00:10 schungx

Thanks for the info. I am trying couple of ways and got few questions.

Can you please explain the performance differences between

use umya_spreadsheet::{reader::xlsx::lazy_read, writer::xlsx::write_light};

  1. read vs lazy_read
  2. write vs write_light

BharathIO avatar Oct 14 '24 11:10 BharathIO

@BharathIO read reads a file and then deserializes it. lazy_read only reads the file. lazy_read does not deserialize, so it is limited in what it can do, but is faster.

The difference between write and write_light is in the compression format. write_light is faster than write, but increases the file size of the generated files.

MathNya avatar Oct 15 '24 02:10 MathNya

I mean, i need a way to remove/delete all empty rows for given input file programmatically with rust.

I believe you can loop through the Excel sheet data structures and simply delete all cell-specific data if the cell is empty. However that'll require iterating over a huge number of cells. I would not recommend it.

You are not doing simple manipulation here. You're rewriting Excel.

Alternatively, if all your rows are guaranteed to contain all data or all blanks, then just search downwards until you find a blank row, then remove rows up to max in one single call.

We have prepared a function cleanup() that handles this process. However, there is still some difficulty in processing speed.

MathNya avatar Oct 18 '24 10:10 MathNya

Great. I will definitely use cleanup for my use case. May i know how long it took to cleanup all empty rows (around 17K) in the file which i shared?

BharathIO avatar Oct 22 '24 10:10 BharathIO

@BharathIO

Vendor_Details.xlsx

Try this one instead.

We performed a cleanup on this file. It was completed in an instant.

MathNya avatar Oct 23 '24 07:10 MathNya

I would like everybody to try out my PR on: https://github.com/MathNya/umya-spreadsheet/pull/242

Let me know if memory requirements drop.

schungx avatar Nov 16 '24 10:11 schungx

I would like everybody to try out my PR on: #242

Let me know if memory requirements drop.

Great, i could see a bigger change now in terms of Memory & CPU utilization. I will validate few more use cases and post my observations here.

BharathIO avatar Nov 19 '24 06:11 BharathIO

@schungx @mxsrm I’ve made changes to use version 2.2.1, and it seems that the high memory and CPU utilization issues have been resolved. However, I’m noticing that memory consumption is not being released. Is there a way to address this issue? I am trying to process 17k records, so the template row should get removed, and the new 17k records will be inserted.

Excel template:

Vendor_Details_fixed.xlsx

My use case:

Read excel template and replace few variables from vector of rows

Image

Here is my code:

use io_common::err;
use std::path::Path;
use std::sync::Arc;

use io_common::{IoDatasource, Row};
use umya_spreadsheet::{NumberingFormat, Worksheet};

use crate::export::export::EXPORT_SRC;
use crate::export::{
    custom_export_utils::{write_blank, write_date, write_number, write_string},
    export::with_timezone,
};
use crate::utils::memory_utils;
use crate::{get_args, Environment};

use super::export::ExportResult;

static PREFIX: &str = "${row.";
static SUFFIX: &str = "}";

pub fn find_template_row(
    start_row: u32,
    end_row: u32,
    start_col: u32,
    end_col: u32,
    sheet: &mut Worksheet,
) -> Option<u32> {
    let mut row_index = start_row;
    loop {
        // Loop through all the cells in the row until max cell
        for col_index in start_col..end_col {
            let cell = sheet.get_cell((col_index, row_index));
            if let Some(cell) = cell {
                let cell_value = cell.get_value();
                if cell_value.starts_with(PREFIX) && cell_value.ends_with(SUFFIX) {
                    // tracing::debug!("Found template row - {}", row_index);
                    return Some(row_index);
                }
            }
        }
        row_index += 1;
        if row_index > end_row {
            break;
        }
    }
    None
}

pub async fn parse_template(
    template_path: &str,
    total_resp: &Vec<Row>,
    tz: Option<String>,
    user_date_format: String,
    user_datetime_format: String,
    user_number_format: String,
    file_name: String,
    io_ds: &Arc<IoDatasource>,
    env: Arc<Environment>,
) -> anyhow::Result<ExportResult> {
    tracing::info!(
        "Processing custom export using template - {}",
        template_path
    );
    memory_utils::print_memory_usage(env.clone()).await?;
    // tracing::debug!("Parsing template - {}", template_path);
    // tracing::debug!("Total rows - {}", total_resp.len());
    let t_rows = 0;

    // tracing::debug!("User date format - {}", user_date_format);
    // tracing::debug!("User datetime format - {}", user_datetime_format);
    // tracing::debug!("User number format - {}", user_number_format);

    let mut new_book = match umya_spreadsheet::reader::xlsx::lazy_read(Path::new(template_path)) {
        Ok(book) => Box::new(book),
        Err(e) => {
            err!("Export Error", &format!("File error - {e}!"))
        }
    };

    for sheet_index in 0..new_book.get_sheet_count() {
        let sheet = match new_book.get_sheet_mut(&sheet_index) {
            Some(sheet) => sheet,
            None => {
                tracing::debug!("Sheet not found at - {}", sheet_index);
                err!(
                    "Export Error",
                    &format!("Sheet not found at - {sheet_index}!")
                )
            }
        };

        let (highest_column, highest_row) = sheet.get_highest_column_and_row();
        tracing::debug!(
            "Highest column - {}, Highest row - {}",
            highest_column,
            highest_row
        );

        if let Some(template_row_index) =
            find_template_row(1, highest_row + 1, 1, highest_column + 1, sheet)
        {
            tracing::debug!(
                "Found template row at - {}, replace it with actual rows in sheet - {}",
                template_row_index,
                sheet_index
            );
            let next_row_start = template_row_index + 1;
            let total_rows_to_insert = total_resp.len() as u32;

            tracing::debug!("{} rows will be inserted", total_rows_to_insert);

            memory_utils::print_memory_usage(env.clone()).await?;

            let template_row_cells = sheet
                .get_collection_by_row_to_hashmap(&template_row_index)
                .into_iter()
                .map(|(col_index, cell)| (col_index, Box::new(cell.clone())))
                .collect::<std::collections::HashMap<_, _>>();

            let end_row_index = next_row_start + total_rows_to_insert;

            // let mut row_indexes_created: Vec<u32> = vec![];
            let mut row_indexes_created: Box<Vec<u32>> =
                Box::new(Vec::with_capacity(total_rows_to_insert as usize));

            // create all rows at once
            sheet.insert_new_row(&next_row_start, &total_rows_to_insert);
            // calculate the end row index after inserting new rows and push it to row_indexes_created
            for i in 0..total_rows_to_insert {
                row_indexes_created.push(next_row_start + i);
            }

            let template_row_config =
                Box::new(sheet.get_row_dimension(&template_row_index).cloned());
            let mut data_row_index: i64 = -1;

            for row_index in next_row_start..end_row_index {
                // tracing::info!("New Processing Row Index - {}", row_index);
                data_row_index += 1;

                if let Some(data_row) = total_resp.get(data_row_index as usize) {
                    for cell in template_row_cells.values() {
                        let mut data_cell = cell.as_ref().clone();
                        let cell_value = data_cell.get_value_lazy();
                        if cell_value.starts_with(PREFIX) && cell_value.ends_with(SUFFIX) {
                            let attribute = cell_value
                                .trim_start_matches(PREFIX)
                                .trim_end_matches(SUFFIX);
                            // LOV Cell
                            if attribute.contains('.') {
                                let parts: Vec<&str> = attribute.split('.').collect();
                                if parts.len() == 2 {
                                    let lov_attribute_key = parts[0];
                                    let lov_attribute = parts[1];
                                    if let Some(lov_data_val) = data_row.get(lov_attribute_key) {
                                        if lov_data_val.is_json() {
                                            if let Some(json_str) = lov_data_val.as_json_str() {
                                                let lov_rows: Vec<Row> =
                                                    serde_json::from_str(json_str)?;
                                                let total_lov_rows_to_insert =
                                                    lov_rows.len() as u32;

                                                let mut lov_data_row_index: i64 = -1;
                                                let lov_end_row_index =
                                                    next_row_start + lov_rows.len() as u32;

                                                // loop total_lov_rows_to_insert and create rows if not already created
                                                for data_row_idx in 0..total_lov_rows_to_insert {
                                                    let excel_row_idx =
                                                        next_row_start + data_row_idx;
                                                    if !row_indexes_created.contains(&excel_row_idx)
                                                    {
                                                        sheet.insert_new_row(&row_index, &1);
                                                        row_indexes_created.push(excel_row_idx);
                                                    }
                                                }

                                                for lov_row_index in
                                                    next_row_start..lov_end_row_index
                                                {
                                                    lov_data_row_index += 1;
                                                    let mut new_data_cell = data_cell.clone();
                                                    new_data_cell
                                                        .get_coordinate_mut()
                                                        .set_row_num(lov_row_index);
                                                    if let Some(lov_data_row) =
                                                        lov_rows.get(lov_data_row_index as usize)
                                                    {
                                                        write_cell(
                                                            &tz,
                                                            &user_date_format,
                                                            &user_datetime_format,
                                                            &user_number_format,
                                                            io_ds,
                                                            lov_data_row,
                                                            &mut new_data_cell,
                                                            lov_attribute,
                                                        )?;
                                                        sheet.set_cell(new_data_cell);
                                                    }
                                                }
                                            }
                                        }
                                    }
                                }
                            } else {
                                data_cell.get_coordinate_mut().set_row_num(row_index);
                                // Normal Cell
                                write_cell(
                                    &tz,
                                    &user_date_format,
                                    &user_datetime_format,
                                    &user_number_format,
                                    io_ds,
                                    data_row,
                                    &mut data_cell,
                                    attribute,
                                )?;
                                sheet.set_cell(data_cell);
                            }
                        }
                    }
                }
            }
            tracing::info!("Processed all rows in sheet - {}", sheet_index);
            memory_utils::print_memory_usage(env.clone()).await?;

            // Copy template row properties to all created rows
            for row_index in row_indexes_created.iter() {
                if let Some(ref cfg) = template_row_config.as_ref() {
                    let rd = sheet.get_row_dimension_mut(row_index);
                    rd.set_custom_height(*cfg.get_custom_height());
                    rd.set_hidden(*cfg.get_hidden());
                    rd.set_height(*cfg.get_height());
                    rd.set_descent(*cfg.get_descent());
                    rd.set_thick_bot(*cfg.get_thick_bot());
                }
            }

            // Delete the template row
            sheet.remove_row(&template_row_index, &1);
            tracing::debug!(
                "Template row deleted successfully at - {}",
                template_row_index
            );
            drop(row_indexes_created);
            drop(template_row_config);
            drop(template_row_cells);
            memory_utils::print_memory_usage(env.clone()).await?;
        }
    }

    let args = get_args();
    let output_path = format!("/{}{}/{}.xlsx", args.tmp_dir, EXPORT_SRC, file_name);
    match umya_spreadsheet::writer::xlsx::write_light(&new_book, Path::new(&output_path)) {
        Ok(_) => (),
        Err(e) => {
            err!("Output file error", &format!("File error - {e}!"))
        }
    };
    tracing::debug!("File cloned successfully to {output_path}");
    drop(new_book);

    Ok(ExportResult {
        file_path: output_path,
        total_rows: t_rows,
    })
}

fn write_cell(
    tz: &Option<String>,
    user_date_format: &String,
    user_datetime_format: &String,
    user_number_format: &String,
    io_ds: &Arc<IoDatasource>,
    data_row: &Row,
    data_cell: &mut umya_spreadsheet::Cell,
    attribute: &str,
) -> Result<(), anyhow::Error> {
    let cell_format = if let Some(nf) = data_cell.get_style().get_number_format() {
        nf.get_format_code()
    } else {
        ""
    };
    if cell_format == NumberingFormat::FORMAT_GENERAL {
        let new_format = match io_ds.get_attribute_for_code(attribute) {
            Ok(attr) => {
                if attr.is_date() {
                    user_date_format
                } else if attr.is_datetime() || attr.is_string_datetime() {
                    user_datetime_format
                } else if attr.is_integer() || attr.is_double() || attr.is_decimal() {
                    user_number_format
                } else {
                    ""
                }
            }
            Err(e) => {
                tracing::debug!("Error getting attribute for code - {}", e);
                ""
            }
        };
        if !new_format.is_empty() {
            data_cell
                .get_style_mut()
                .get_numbering_format_mut()
                .set_format_code(new_format);
        }
    };
    if let Some(val) = data_row.get(attribute) {
        if val.is_null() {
            write_blank(data_cell);
        } else if val.is_date() || val.is_datetime() || val.is_string_datetime() {
            let val_with_timezone = with_timezone(val, tz)?;
            write_date(data_cell, val_with_timezone);
        } else if val.is_int() || val.is_double() || val.is_decimal() {
            write_number(data_cell, val.double_u()?);
        } else if val.is_string() {
            write_string(data_cell, val.str_u()?);
        } else {
            write_string(data_cell, "Value type is not implemented");
        }
    } else {
        write_blank(data_cell);
    }
    Ok(())
}

BharathIO avatar Jan 17 '25 12:01 BharathIO

How do you know your memory is not released? Sometimes processes do not release memory back to the OS for a while and this may be the reason.

If the heap is fragmented, even if you release all the memory in the middle except for some at the end, the entire heap cant be released by the process back to the OS.

schungx avatar Jan 18 '25 04:01 schungx

Thanks for the quick response. I am monitoring memory using process id, for every run, consumed memory is not being released. Can you please help me to fix the above code?

BharathIO avatar Jan 18 '25 04:01 BharathIO

Thanks for the quick response. I am monitoring memory using process id, for every run, consumed memory is not being released. Can you please help me to fix the above code?

As I mentioned even if your objects are destroyed the memory may not be released back to the OS. Therefore your max memory state may persist for quite a while.

I regularly have processes that release memory from the heap only after 1-2 days or more. Sometimes never.

If the system is under memory pressure the OS will start swapping unnecessary pages out and mostly those would be empty spaces. Some OS will simply remove unallocated pages in the heap and not bother swapping them to disk.

Conclusion: you won't know for real. You need to run under valgrind or something to determine whether there is memory leak.

schungx avatar Jan 18 '25 07:01 schungx

I observed even after couple of hours memory is not released.

BharathIO avatar Jan 18 '25 08:01 BharathIO

I observed even after couple of hours memory is not released.

The only point I can say is try to stress the system to see if the OS reclaims any memory. It is notoriously difficult to cause memory leaks in Rust other than deliberately constructing circular objects, so I seriously doubt that you have leaked memory.

The fact that memory allocated to your process is not released does NOT mean that your process is using all that memory. It can be an empty heap that is just not resized.

schungx avatar Jan 18 '25 09:01 schungx

But I do see a number of drop calls... Why do you need them? I have never need to call drop myself... Rust usually cleans up itself.

schungx avatar Jan 18 '25 09:01 schungx