linopy
Make polars frames lazy and stream into csv
Tests run fine. The extra pyarrow dependency should not hurt: arrow is already a requirement of polars (and soon also of pandas), and pyarrow only adds the Python frontend on top.
We should check, for each invocation of write_lazyframe, that explain(streamable=True) shows the query can actually run as a streaming pipeline.
If you decide to merge, please squash (the history is ugly :))
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 89.65%. Comparing base (f0c8457) to head (d82e97e).
Additional details and impacted files
@@ Coverage Diff @@
## master #294 +/- ##
==========================================
- Coverage 89.69% 89.65% -0.05%
==========================================
Files 16 16
Lines 4019 4021 +2
Branches 939 941 +2
==========================================
Hits 3605 3605
- Misses 281 284 +3
+ Partials 133 132 -1
Hey @coroa, thanks for your PR. According to the profiler, the lazy operation takes very long.
Original Pandas Based
Polars Based (Non-lazy)
Polars Based (lazy)
Code for running the benchmark
import pypsa
import psutil
import time
import threading
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
# Flag to control the monitoring loop
stop_monitoring = False
# List to store memory usage values
memory_values = []
# Function to monitor memory usage
def monitor_memory_usage(interval=0.1):
    global stop_monitoring
    global memory_values
    process = psutil.Process()
    while not stop_monitoring:
        mem_info = process.memory_info()
        memory_values.append(mem_info.rss / 1024 ** 2)  # Store memory in MB
        time.sleep(interval)
# Start monitoring memory usage in a separate thread
monitor_thread = threading.Thread(target=monitor_memory_usage)
monitor_thread.daemon = True # Daemonize thread
monitor_thread.start()
# Your original code
n = pypsa.Network(".../pypsa-eur/results/solver-io/prenetworks/elec_s_128_lv1.5__Co2L0-25H-T-H-B-I-A-solar+p3-dist1_2050.nc")
m = n.optimize.create_model()
m.to_file("test.lp", io_api="lp-polars")
# Stop monitoring
stop_monitoring = True
monitor_thread.join()
# Plotting the memory usage
plt.plot(memory_values)
plt.xlabel('Time (in 0.1s intervals)')
plt.ylabel('Memory Usage (MB)')
plt.title('Memory Usage Over Time')
plt.savefig("mem-polars-non-lazy.png")
print(max(memory_values))
Interesting that no variant shows memory savings compared to the other two.
Thanks for the profiling. Very disappointing.
It's possible that .values.reshape(-1) is not zero-copy.
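Whether reshape(-1) copies depends on the memory layout of the underlying array, which is easy to check with numpy directly:

```python
import numpy as np

# reshape(-1) on a C-contiguous array returns a view: no data is copied
a = np.arange(12).reshape(3, 4)
flat = a.reshape(-1)
print(np.shares_memory(a, flat))    # True

# On a non-contiguous array (e.g. a transpose) numpy must copy the data
b = a.T
flat_b = b.reshape(-1)
print(np.shares_memory(b, flat_b))  # False
```

So if the arrays coming out of xarray are not contiguous in the expected order, each .values.reshape(-1) would silently duplicate them.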
The lazy version has to do everything at least twice, since the check_nulls step already needs to evaluate everything (that could be improved). I don't know where the factor of 4 comes from, though.
I'll try to debug a bit around to find out where we are scooping up this memory use. Any particular xarray version to focus on? @FabianHofmann
Cool, but no rush, seems to be stable for the moment. I think it should be independent of the xarray version.