
[Performance]: Spark SQL Java application insert speed is low.

yangtian9999 opened this issue 2 years ago

Is there an existing issue for performance?

  • [X] I have checked the existing issues.

Environment

- Version or commit-id (e.g. v0.1.0 or 8b23a93): v0.5.0
- Hardware parameters: 20G memory
- OS type: ubuntu 20.04 
- Others: use docker run to start MO.

Details of Performance

Spark SQL jar application, running standalone on a single node to execute the Spark Java application, using the Dataset.write() API:

Dataset<Row> testRandom = spark.sql(
        "select fid, abs(int(rand()*1000)) as id from test limit 100000"); // spark sql

testRandom.write()
        .mode(SaveMode.Append)
        .option("batchsize", "10000")
        .option("isolationLevel", "NONE")
        .jdbc(dbConnectionUrl, "test.test1", props);

The inserted data has 2 columns, both of type int, and the insert row count is 100,000; the values are random numbers, for example:

fid  id
0    6008889
1    4988098
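For context, the snippet above uses dbConnectionUrl and props without showing how they are built. A minimal sketch of what they might look like against MatrixOne's MySQL-compatible endpoint (the host, port, account, and rewriteBatchedStatements flag are assumptions for illustration, not values taken from this report):

import java.util.Properties;

// Placeholder endpoint and credentials; adjust to the actual deployment.
// rewriteBatchedStatements=true lets the MySQL driver rewrite batched INSERTs
// into multi-row statements, which usually dominates JDBC write throughput.
String dbConnectionUrl =
        "jdbc:mysql://127.0.0.1:6001/test?rewriteBatchedStatements=true&useSSL=false";

Properties props = new Properties();
props.setProperty("user", "dump");        // assumed default MatrixOne account
props.setProperty("password", "111");     // assumed default password
props.setProperty("driver", "com.mysql.cj.jdbc.Driver");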

Insert speed benchmark (inserting 100,000 rows):

spark to ClickHouse: 2493 ms
spark to MO: 853625 ms
spark to MySQL: 542404 ms

Additional information


yangtian9999 commented Aug 12 '22 03:08

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.1.1</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.30</version>
</dependency>

<dependency>
    <groupId>com.clickhouse</groupId>
    <artifactId>clickhouse-jdbc</artifactId>
    <version>0.3.2-patch11</version>
    <!-- use uber jar with all dependencies included, change classifier to http for smaller jar -->
    <classifier>all</classifier>
    <exclusions>
        <exclusion>
            <groupId>*</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>

yangtian9999 commented Aug 12 '22 03:08

BTW, I found that "isolationLevel" cannot be set to other values. Please check the Spark documentation for more info: https://spark.apache.org/docs/3.1.1/sql-data-sources-jdbc.html

isolationLevel: The transaction isolation level, which applies to the current connection. It can be one of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE, corresponding to the standard transaction isolation levels defined by JDBC's Connection object, with a default of READ_UNCOMMITTED. This option applies only to writing. Please refer to the documentation in java.sql.Connection.
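Spark applies the "isolationLevel" write option by calling Connection.setTransactionIsolation() on the underlying JDBC connection, so whether a value other than NONE works depends on what the driver and server report as supported. A minimal probe sketch to check this against the target database (the URL and credentials are placeholder assumptions, not taken from this report):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;

public class IsolationLevelProbe {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint/credentials; point this at the MO instance under test.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:6001/test", "dump", "111")) {
            DatabaseMetaData meta = conn.getMetaData();
            int[] levels = {
                Connection.TRANSACTION_NONE,
                Connection.TRANSACTION_READ_UNCOMMITTED,
                Connection.TRANSACTION_READ_COMMITTED,
                Connection.TRANSACTION_REPEATABLE_READ,
                Connection.TRANSACTION_SERIALIZABLE
            };
            for (int level : levels) {
                System.out.println("isolation level " + level + " supported: "
                        + meta.supportsTransactionIsolationLevel(level));
            }
        }
    }
}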

yangtian9999 commented Aug 12 '22 11:08

@yangtian9999 I think you can reassign it to CN guys.

XuPeng-SH commented Nov 26 '22 03:11

First, can we check whether (1) the inserts use prepared statements, and (2) they run in one big transaction instead of many small ones? If both are true, reassign to the CN S3 write team. A sketch of what that looks like follows.
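To make that check concrete, here is a hypothetical plain-JDBC sketch (not the reporter's job) of the "prepared + one big transaction" write shape; Spark's JDBC writer batches rows per partition, so how close the real workload gets to this depends on batchsize and the number of partitions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchInsertSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint/credentials; the point is the write shape:
        // one PreparedStatement, addBatch()/executeBatch(), a single commit.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:6001/test?rewriteBatchedStatements=true",
                "dump", "111")) {
            conn.setAutoCommit(false); // one big transaction instead of many small ones
            String sql = "insert into test.test1 (fid, id) values (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < 100_000; i++) {
                    ps.setInt(1, i);
                    ps.setInt(2, (int) (Math.random() * 1000));
                    ps.addBatch();
                    if ((i + 1) % 10_000 == 0) {
                        ps.executeBatch(); // flush every 10k rows, still inside one transaction
                    }
                }
                ps.executeBatch();
            }
            conn.commit(); // single commit for all 100,000 rows
        }
    }
}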

fengttt commented Feb 09 '23 17:02