Slow execution time of reading a big file
I've tested how fast the platform would analyze a big .csv file (532 MB, 3 235 282 lines). The execution time of the program (code below) is about 25 minutes.
The program should just print the current line number with a very simple comment.
main.py
from scramjet.streams import Stream
lines_number = 0

def count(x):
    global lines_number
    lines_number += 1
    return lines_number

def show_line_number(x):
    global lines_number
    if lines_number < 1000:
        return f"{lines_number} \n"
    elif lines_number > 2000:
        return f"{lines_number} bigger than 2000 \n"
    return None

def run(context, input):
    x = (Stream
         .read_from(input)
         .each(count)
         .map(show_line_number)
         )
    return x
package.json
{
"name": "@scramjet/python-big-files",
"version": "0.22.0",
"lang": "python",
"main": "./main.py",
"author": "XYZ",
"license": "GPL-3.0",
"engines": {
"python3": "3.9.0"
},
"scripts": {
"build:refapps": "yarn build:refapps:only",
"build:refapps:only": "mkdir -p dist/__pypackages__/ && cp *.py dist/ && pip3 install -t dist/__pypackages__/ -r requirements.txt",
"postbuild:refapps": "yarn prepack && yarn packseq",
"packseq": "PACKAGES_DIR=python node ../../scripts/packsequence.js",
"prepack": "PACKAGES_DIR=python node ../../scripts/publish.js",
"clean": "rm -rf ./dist"
}
}
requirements.txt
scramjet-framework-py
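For anyone reproducing this: a rough local harness can time the same pipeline without the platform in between. The sketch below is an assumption-laden example, not part of the reported program — it assumes scramjet-framework-py's asyncio API with an awaitable to_list() sink to drain the stream, and uses a placeholder path:

import asyncio
import time
from scramjet.streams import Stream

lines_number = 0

def count(x):
    global lines_number
    lines_number += 1
    return lines_number

def show_line_number(x):
    if lines_number < 1000:
        return f"{lines_number} \n"
    elif lines_number > 2000:
        return f"{lines_number} bigger than 2000 \n"
    return None

async def main():
    start = time.time()
    with open("data.csv") as f:  # placeholder path
        # to_list() is used here only to drain the stream and force
        # the whole file through the pipeline.
        await (Stream
               .read_from(f)
               .each(count)
               .map(show_line_number)
               .to_list())
    print(f"{lines_number} lines in {time.time() - start:.1f} s")

asyncio.run(main())

Comparing this against the platform run with the same file should show how much time the framework itself accounts for.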
@tomkeee Print operations are slow in every language. If you want to print every line of a big file, keep in mind that this will drastically slow down the run. What is the execution time (processing the full file) without printing?
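For example, something like this (placeholder path) would give the raw read time:

import time

lines_number = 0
start = time.time()
# Same scan with no printing, to isolate the pure file-read cost.
with open("data.csv") as file_in:  # placeholder path
    for _ in file_in:
        lines_number += 1
print(f"{lines_number} lines in {time.time() - start:.1f} s without printing")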
@Tatarinho You are right that print operations are quite slow, yet I tried a similar operation on my local machine (code below) and the execution time was 55 seconds (on Scramjet it was 25 minutes).
import time

lines_number = 0
with open("/home/sirocco/Pulpit/data.csv") as file_in:
    start = time.time()
    for i in file_in:
        if lines_number < 1000:
            print(f"{lines_number} \n")
        elif lines_number > 2000:
            print(f"{lines_number} bigger than 2000 \n")
        lines_number += 1
print(f"the line_numbers is {lines_number}\n execution time: {time.time()-start}")
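Worth noting here: when stdout is a terminal it is line-buffered, so every printed newline forces a flush. As a sketch (same loop, same placeholder path), batching the output into larger writes usually changes the numbers noticeably:

import sys
import time

lines_number = 0
buffered = []
start = time.time()
with open("/home/sirocco/Pulpit/data.csv") as file_in:
    for _ in file_in:
        if lines_number < 1000:
            buffered.append(f"{lines_number} \n")
        elif lines_number > 2000:
            buffered.append(f"{lines_number} bigger than 2000 \n")
        # One write per 10 000 lines instead of one write per line.
        if len(buffered) >= 10_000:
            sys.stdout.write("".join(buffered))
            buffered.clear()
        lines_number += 1
sys.stdout.write("".join(buffered))
print(f"{lines_number} lines, {time.time() - start:.1f} s with batched writes")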

Hi @tomkeee, we'll be looking into this next week.
Hmm... I did some initial digging and ran a test over the local network; a similar program in Node works quite fast, but not as fast as reading straight from disk...
We need to take the network connection into account, but that wouldn't explain 25 minutes.
Could you follow this guide: https://docs.scramjet.org/platform/self-hosted-installation
Then, based on that, could you try your program with the data sent to 127.0.0.1? That would let us exclude the network and the platform configuration as the culprit...
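As a rough baseline that is independent of Scramjet, a plain loopback transfer (sketch below; standard-library sockets only, placeholder path) shows what raw 127.0.0.1 throughput looks like for a file of this size, so the self-hosted numbers have something to be compared against:

import socket
import threading
import time

PATH = "/home/sirocco/Pulpit/data.csv"  # placeholder path

def sink(server):
    # Accept one connection and drain it, counting the bytes received.
    conn, _ = server.accept()
    with conn:
        total = 0
        while chunk := conn.recv(1 << 16):
            total += len(chunk)
    print(f"received {total / 1e6:.0f} MB")

server = socket.create_server(("127.0.0.1", 0))  # ephemeral port
port = server.getsockname()[1]
receiver = threading.Thread(target=sink, args=(server,))
receiver.start()

start = time.time()
with open(PATH, "rb") as f, socket.create_connection(("127.0.0.1", port)) as conn:
    while chunk := f.read(1 << 16):
        conn.sendall(chunk)
receiver.join()
print(f"loopback transfer took {time.time() - start:.1f} s")

If the loopback transfer finishes in seconds while the self-hosted run still takes minutes, the overhead is in the platform pipeline rather than the transport.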