
Slow execution time of reading a big file

Open · tomkeee opened this issue 3 years ago · 4 comments

I've tested how fast the platform analyzes a big .csv file (532 MB, 3,235,282 lines). The execution time of the program (code below) is about 25 minutes.

The program should just print the current line number with a very simple comment.

main.py

from scramjet.streams import Stream

lines_number = 0

def count(x):
    # Increment the global line counter for every chunk that passes through.
    global lines_number
    lines_number += 1
    return lines_number

def show_line_number(x):
    # Format the current line number; skip (return None) for lines 1000-2000.
    global lines_number
    if lines_number < 1000:
        return f"{lines_number} \n"
    elif lines_number > 2000:
        return f"{lines_number} bigger than 2000 \n"
    return None

def run(context, input):
    x = (Stream
        .read_from(input)
        .each(count)
        .map(show_line_number)
    )
    return x

package.json

{
    "name": "@scramjet/python-big-files",
    "version": "0.22.0",
    "lang": "python",
    "main": "./main.py",
    "author": "XYZ",
    "license": "GPL-3.0",
    "engines": {
        "python3": "3.9.0"
    },
    "scripts": {
        "build:refapps": "yarn build:refapps:only",
        "build:refapps:only": "mkdir -p dist/__pypackages__/ && cp *.py dist/ && pip3 install -t dist/__pypackages__/ -r requirements.txt",
        "postbuild:refapps": "yarn prepack && yarn packseq",
        "packseq": "PACKAGES_DIR=python node ../../scripts/packsequence.js",
        "prepack": "PACKAGES_DIR=python node ../../scripts/publish.js",
        "clean": "rm -rf ./dist"
    }
}

requirements.txt

scramjet-framework-py

tomkeee commented on Sep 26 '22

@tomkeee Printing is slow in every language; if you want to print every line of a big file, keep in mind that it will drastically slow down the operation. What is the execution time (processing the full file) without printing?
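
For example, a quick way to check this (just a sketch, assuming the same Stream API as in main.py above) would be to keep the counting step but drop the map(show_line_number) step, so the measured time covers reading and counting only:

from scramjet.streams import Stream

lines_number = 0

def count(x):
    # Count lines as a side effect only; no formatting, no printing.
    global lines_number
    lines_number += 1
    return x

def run(context, input):
    # Same pipeline shape as in main.py, minus the printing map() step.
    return Stream.read_from(input).each(count)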

Tatarinho commented on Sep 26 '22

@Tatarinho You are right that print operations are quite slow, but I tried a similar operation on my local machine (code below) and the execution time was 55 seconds (on Scramjet it was 25 minutes).

import time

lines_number = 0
with open("/home/sirocco/Pulpit/data.csv") as file_in:
    start = time.time()
    for line in file_in:
        # Same logic as show_line_number in main.py: print for lines below 1000
        # and above 2000, skip the range in between.
        if lines_number < 1000:
            print(f"{lines_number} \n")
        elif lines_number > 2000:
            print(f"{lines_number} bigger than 2000 \n")
        lines_number += 1

    print(f"the line_numbers is {lines_number}\n execution time: {time.time() - start}")

[Screenshot from 2022-09-29 09-20-57]

tomkeee commented on Sep 29 '22

Hi @tomkeee, we'll be looking into this next week.

MichalCz commented on Sep 30 '22

Hmm... so I did some initial digging and was able to run a test over a local network; a similar program in Node runs quite fast, but not as fast as reading from disk...

We need to take into account the network connection, but that wouldn't explain 25 minutes.

Could you follow this guide: https://docs.scramjet.org/platform/self-hosted-installation

Then, based on that, can you try your program with the data sent to 127.0.0.1? That would exclude the network and the platform configuration as the culprit...
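
To separate the transfer cost from the processing cost, it can also help to time a plain localhost transfer of the same file as a baseline. This is only an illustrative sketch, not the Scramjet ingestion path, and the URL and port are hypothetical: serve the data directory with something like python3 -m http.server 8080 and then time pulling the 532 MB file over 127.0.0.1:

import time
import urllib.request

# Hypothetical local server and path; adjust to wherever the CSV is served from.
url = "http://127.0.0.1:8080/data.csv"

start = time.time()
total = 0
with urllib.request.urlopen(url) as response:
    while True:
        chunk = response.read(1 << 20)  # read in 1 MiB chunks
        if not chunk:
            break
        total += len(chunk)

print(f"transferred {total} bytes in {time.time() - start:.1f} s")

If that finishes in seconds, the 25 minutes has to come from somewhere other than the local network.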

MichalCz commented on Oct 03 '22