
Make polars frames lazy and stream into csv

Open coroa opened this issue 1 year ago • 8 comments

Tests run fine. The extra pyarrow dependency should not hurt, since arrow is already a requirement for polars (and soon also for pandas); pandas itself is only a thin Python frontend on top of it.

We should check, for each invocation of write_lazyframe, that explain(streamable=True) shows it can actually run as a streaming pipeline.

If you decide to merge, please squash (the history is ugly :))

coroa avatar May 18 '24 20:05 coroa

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 89.65%. Comparing base (f0c8457) to head (d82e97e).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
- Coverage   89.69%   89.65%   -0.05%     
==========================================
  Files          16       16              
  Lines        4019     4021       +2     
  Branches      939      941       +2     
==========================================
  Hits         3605     3605              
- Misses        281      284       +3     
+ Partials      133      132       -1     

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar May 19 '24 20:05 codecov[bot]

Hey @coroa, thanks for your PR. According to the profiler, the lazy operation takes very long.

Original Pandas Based (figure: mem-pandas, memory usage over time)

Polars Based (Non-lazy) (figure: mem-polars-non-lazy, memory usage over time)

Polars Based (Lazy) (figure: mem-lazy, memory usage over time)

Code for running the benchmark

import pypsa
import psutil
import time
import threading
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# Flag to control the monitoring loop
stop_monitoring = False

# List to store memory usage values
memory_values = []

# Function to monitor memory usage
def monitor_memory_usage(interval=0.1):
    global stop_monitoring
    global memory_values
    process = psutil.Process()
    while not stop_monitoring:
        mem_info = process.memory_info()
        memory_values.append(mem_info.rss / 1024 ** 2)  # Store memory in MB
        time.sleep(interval)

# Start monitoring memory usage in a separate thread
monitor_thread = threading.Thread(target=monitor_memory_usage)
monitor_thread.daemon = True  # Daemonize thread
monitor_thread.start()

# Your original code
n = pypsa.Network(".../pypsa-eur/results/solver-io/prenetworks/elec_s_128_lv1.5__Co2L0-25H-T-H-B-I-A-solar+p3-dist1_2050.nc")
m = n.optimize.create_model()

m.to_file("test.lp", io_api="lp-polars")

# Stop monitoring
stop_monitoring = True
monitor_thread.join()

# Plotting the memory usage
plt.plot(memory_values)
plt.xlabel('Time (in 0.1s intervals)')
plt.ylabel('Memory Usage (MB)')
plt.title('Memory Usage Over Time')
plt.savefig("mem-polars-non-lazy.png")
print(max(memory_values))

FabianHofmann avatar May 21 '24 08:05 FabianHofmann

Interesting that the lazy variant shows no memory savings compared to the other two.

fneum avatar May 21 '24 13:05 fneum

Thanks for the profiling. Very disappointing.

coroa avatar May 21 '24 14:05 coroa

It's possible that .values.reshape(-1) is not zero-copy.
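That hypothesis can be tested in isolation with numpy's shares_memory (a generic sketch, not linopy's actual arrays): reshape(-1) is a zero-copy view only when the source is contiguous.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)  # C-contiguous array
# reshape(-1) on a contiguous array returns a view: no copy is made.
print(np.shares_memory(a.reshape(-1), a))   # True

t = a.T  # transposed view is not C-contiguous
# Here reshape(-1) cannot be expressed as a view, so numpy copies.
print(np.shares_memory(t.reshape(-1), t))   # False
```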

coroa avatar May 21 '24 14:05 coroa

The lazy version has to do everything at least twice, since check_nulls already needs to evaluate everything (that could be improved). I don't know where the factor of 4 comes from, though.

coroa avatar May 21 '24 14:05 coroa

I'll try to debug a bit around to find out where we are scooping up this memory use. Any particular xarray version to focus on? @FabianHofmann

coroa avatar May 21 '24 14:05 coroa

> I'll try to debug a bit around to find out where we are scooping up this memory use. Any particular xarray version to focus on? @FabianHofmann

Cool, but no rush, seems to be stable for the moment. I think it should be independent of the xarray version.

FabianHofmann avatar May 21 '24 20:05 FabianHofmann