
[Performance]: Spark SQL Java application insert speed is low.

yangtian9999 opened this issue 2 years ago

Is there an existing issue for performance?

  • [X] I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93): v0.5.0
- Hardware parameters: 20G memory
- OS type: ubuntu 20.04 
- Others: use docker run to start MO.

Details of Performance

Spark SQL jar application, running standalone on a single node to execute the Spark Java application, using the Dataset.write() API:

Dataset<Row> testRandom = spark.sql(
        "select fid, abs(int(rand()*1000)) as id from test limit 100000"); // spark sql

testRandom.write()
        .mode(SaveMode.Append)
        .option("batchsize", "10000")
        .option("isolationLevel", "NONE")
        .jdbc(dbConnectionUrl, "test.test1", props);

The inserted data has 2 columns, both of type int, and the insert row count is 100,000; the values are random numbers, for example:

fid  id
0    6008889
1    4988098
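For context, the snippet above uses dbConnectionUrl and props without showing how they are built. A minimal sketch of what they might look like against MatrixOne's MySQL-compatible endpoint (the host, port, account, and rewriteBatchedStatements flag are assumptions for illustration, not values taken from this report):

import java.util.Properties;

// Placeholder endpoint and credentials; adjust to the actual deployment.
// rewriteBatchedStatements=true lets the MySQL driver rewrite batched INSERTs
// into multi-row statements, which usually dominates JDBC write throughput.
String dbConnectionUrl =
        "jdbc:mysql://127.0.0.1:6001/test?rewriteBatchedStatements=true&useSSL=false";

Properties props = new Properties();
props.setProperty("user", "dump");        // assumed default MatrixOne account
props.setProperty("password", "111");     // assumed default password
props.setProperty("driver", "com.mysql.cj.jdbc.Driver");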

Insert speed benchmark (inserting 100,000 rows):

spark to ClickHouse: 2493 ms
spark to MO: 853625 ms
spark to MySQL: 542404 ms

Additional information


yangtian9999 commented Aug 12 '22 03:08

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.1.1</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.30</version>
</dependency>

<dependency>
    <groupId>com.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.3.2-patch11</version>
    <!-- use uber jar with all dependencies included, change classifier to http for smaller jar -->
    <classifier>all</classifier>
    <exclusions>
        <exclusion>
            <groupId>*</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>

yangtian9999 commented Aug 12 '22 03:08

BTW, I found that "isolationLevel" cannot be set to other values. Please check the Spark documentation for more info: https://spark.apache.org/docs/3.1.1/sql-data-sources-jdbc.html

isolationLevel: The transaction isolation level, which applies to the current connection. It can be one of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to the standard transaction isolation levels defined by JDBC's Connection object, with a default of READ_UNCOMMITTED. This option applies only to writing. Please refer to the documentation in java.sql.Connection.
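Spark applies the "isolationLevel" write option by calling Connection.setTransactionIsolation() on the underlying JDBC connection, so whether a value other than NONE works depends on what the driver and server report as supported. A minimal probe sketch to check this against the target database (the URL and credentials are placeholder assumptions, not taken from this report):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;

public class IsolationLevelProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint/credentials; point this at the MO instance under test.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:6001/test", "dump", "111")) {
            DatabaseMetaData meta = conn.getMetaData();
            int[] levels = {
                Connection.TRANSACTION_NONE,
                Connection.TRANSACTION_READ_UNCOMMITTED,
                Connection.TRANSACTION_READ_COMMITTED,
                Connection.TRANSACTION_REPEATABLE_READ,
                Connection.TRANSACTION_SERIALIZABLE
            };
            for (int level : levels) {
                System.out.println("isolation level " + level + " supported: "
                        + meta.supportsTransactionIsolationLevel(level));
            }
        }
    }
}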

yangtian9999 commented Aug 12 '22 11:08

@yangtian9999 I think you can reassign it to CN guys.

XuPeng-SH commented Nov 26 '22 03:11

First, can we check whether (1) the inserts use prepared statements, and (2) they run in one big transaction instead of many small ones? If both are true, reassign to the CN S3 write team. A sketch of what that looks like follows.
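To make that check concrete, here is a hypothetical plain-JDBC sketch (not the reporter's job) of the "prepared + one big transaction" write shape; Spark's JDBC writer batches rows per partition, so how close the real workload gets to this depends on batchsize and the number of partitions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint/credentials; the point is the write shape:
        // one PreparedStatement, addBatch()/executeBatch(), a single commit.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:6001/test?rewriteBatchedStatements=true",
                "dump", "111")) {
            conn.setAutoCommit(false); // one big transaction instead of many small ones
            String sql = "insert into test.test1 (fid, id) values (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < 100_000; i++) {
                    ps.setInt(1, i);
                    ps.setInt(2, (int) (Math.random() * 1000));
                    ps.addBatch();
                    if ((i + 1) % 10_000 == 0) {
                        ps.executeBatch(); // flush every 10k rows, still inside one transaction
                    }
                }
                ps.executeBatch();
            }
            conn.commit(); // single commit for all 100,000 rows
        }
    }
}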

fengttt commented Feb 09 '23 17:02