
[BUG] option 'ignoreAfterHeader' does not work

Open mgyboom opened this issue 2 years ago • 13 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("demo")
            .getOrCreate();
    Dataset<Row> rows = sparkSession.read()
            .format("com.crealytics.spark.excel")
            .option("dataAddress", "'Sheet1'!A1")
            .option("header", true)
            .option("ignoreAfterHeader", 1L)
            .option("maxRowsInMemory", 20)
            .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
    rows.show();
}
The first rows of the Excel file (header row, the extra description row I want to skip, and sample data):

test_id time_point id_number mobile
测试序号 回溯日期 身份证号 手机号
1 2021/12/8 17:06 cbdddb8e8421b23498480570d7d75330538a6882f5dfdc3b64115c647f3328c4 cbdddb8e8421b23498480570d7d75330538a6882f5dfdc3b64115c647f3328c4
2 2021/12/8 17:06 a0dc2d74b9b0e3c87e076003dbfe472a424cb3032463cb339e351460765a822e a0dc2d74b9b0e3c87e076003dbfe472a424cb3032463cb339e351460765a822e
3 2021/12/8 17:06 55e3192d096e62d4f9cd00e734a949de2b8e55b13d9b85b1d2d2999c9db2e72c 55e3192d096e62d4f9cd00e734a949de2b8e55b13d9b85b1d2d2999c9db2e72c
4 2021/12/8 17:06 9b602e9b9e8556eff1a28962d4580b34d9bf054f4831f4f924d4a6dfad660e88 9b602e9b9e8556eff1a28962d4580b34d9bf054f4831f4f924d4a6dfad660e88
5 2021/12/8 17:06 5c0d4f4953843ed6f3c54ea7ca2cc4a86d8b7723c3bf0f3fd403d4c61a77feca 5c0d4f4953843ed6f3c54ea7ca2cc4a86d8b7723c3bf0f3fd403d4c61a77feca

Dependency: com.crealytics:spark-excel_2.12:0.14.0

I wanted to use 'ignoreAfterHeader' to ignore the second line, but it didn't work.

console output:

+--------+-------------+--------------------+--------------------+
| test_id|   time_point|           id_number|              mobile|
+--------+-------------+--------------------+--------------------+
|测试序号|     回溯日期|            身份证号|              手机号|
|       1|12/8/21 17:06|cbdddb8e8421b2349...|cbdddb8e8421b2349...|
|       2|12/8/21 17:06|a0dc2d74b9b0e3c87...|a0dc2d74b9b0e3c87...|
|       3|12/8/21 17:06|55e3192d096e62d4f...|55e3192d096e62d4f...|
|       4|12/8/21 17:06|9b602e9b9e8556eff...|9b602e9b9e8556eff...|
|       5|12/8/21 17:06|5c0d4f4953843ed6f...|5c0d4f4953843ed6f...|
|       6|12/8/21 17:06|f83340f3147b49827...|f83340f3147b49827...|
|       7|12/8/21 17:06|d712cf4114c03dc43...|d712cf4114c03dc43...|
|       8|12/8/21 17:06|fefad899b5dc20858...|fefad899b5dc20858...|
|       9|12/8/21 17:06|8e7a98f9565619a4d...|8e7a98f9565619a4d...|
|      10|12/8/21 17:06|3eaa72f81914fb894...|3eaa72f81914fb894...|
|      11|12/8/21 17:06|d5744897e47fb6d78...|d5744897e47fb6d78...|
|      12|12/8/21 17:06|6f61c3af9dcc39522...|6f61c3af9dcc39522...|
|      13|12/8/21 17:06|abe1b0a5a9e58808c...|abe1b0a5a9e58808c...|
|      14|12/8/21 17:06|87c186adf88a37443...|87c186adf88a37443...|
|      15|12/8/21 17:06|7b4073a22410aafc3...|7b4073a22410aafc3...|
|      16|12/8/21 17:06|dab089f470a4bcb77...|dab089f470a4bcb77...|
|      17|12/8/21 17:06|1f78641036c71b8e6...|1f78641036c71b8e6...|
|      18|12/8/21 17:06|47fb25b4d4af9f2da...|47fb25b4d4af9f2da...|
|      19|12/8/21 17:06|8f1818a052ee87314...|8f1818a052ee87314...|
+--------+-------------+--------------------+--------------------+

Expected Behavior

I expect the 'ignoreAfterHeader' option to actually skip the given number of rows after the header.

Steps To Reproduce

public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("demo")
            .getOrCreate();
    Dataset<Row> rows = sparkSession.read()
            .format("com.crealytics.spark.excel")
            .option("dataAddress", "'new贷前画像-DCPACP指标3.0'!A1")
            .option("header", true)
            .option("ignoreAfterHeader", 1L)
            .option("maxRowsInMemory", 20)
            .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
    rows.show();
}

Environment

- Spark version: 3.1.1
- Spark-Excel version: 0.14.0
- OS: macOS
- Cluster environment: local[*]

Anything else?

No response

mgyboom avatar Jul 15 '22 12:07 mgyboom

Please try a newer spark-excel version and use .format("excel").
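
For example, with a recent version the read can stay the same apart from the format name (a minimal sketch reusing the options and path from your snippet above):

Dataset<Row> rows = sparkSession.read()
        .format("excel")                 // V2 data source shipped with newer spark-excel versions
        .option("dataAddress", "'Sheet1'!A1")
        .option("header", true)
        .option("ignoreAfterHeader", 1L) // skip one row after the header
        .option("maxRowsInMemory", 20)
        .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
rows.show();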

nightscape avatar Jul 15 '22 12:07 nightscape

Thank you for your reply. When I used .format("excel"), it worked. However, when I tried to run it on k8s, it threw the following exception:

Exception in thread "Thread-25" java.lang.ClassNotFoundException: Failed to find data source: excel. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:743)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at cn.tongdun.sparkdatahandler.handler.impl.ExcelHandler.read(ExcelHandler.java:26)
	at cn.tongdun.sparkdatahandler.Simple$ReadInputFileTask.run(Simple.java:119)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: excel.DefaultSource
	at java.base/java.net.URLClassLoader.findClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:663)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:663)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:663)
	... 6 more

Oddly enough, I don't get this exception when I use .format("com.crealytics.spark.excel"). However, that format does not ignore the rows after the header. @nightscape

mgyboom avatar Jul 15 '22 15:07 mgyboom

Please check these potential duplicates:

  • [#615] [BUG] partitionBy not working as expected (62.91%)

If this issue is a duplicate, please add any additional info to the ticket with the most information and close this one.

github-actions[bot] avatar Jul 16 '22 11:07 github-actions[bot]

@mgyboom it seems like on your cluster you're still using an outdated version of spark-excel...
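
One way to check what the driver on the cluster actually sees is to print all DataSourceRegister service files on the classpath (plain JDK call, just a diagnostic sketch):

try {
    java.util.Enumeration<java.net.URL> urls = Thread.currentThread().getContextClassLoader()
            .getResources("META-INF/services/org.apache.spark.sql.DataSourceRegister");
    while (urls.hasMoreElements()) {
        // One of these URLs should point into the spark-excel jar; otherwise the
        // short name "excel" cannot be resolved and Spark falls back to excel.DefaultSource.
        System.out.println(urls.nextElement());
    }
} catch (java.io.IOException e) {
    e.printStackTrace();
}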

nightscape avatar Aug 20 '22 00:08 nightscape

@nightscape I upgraded spark-excel to version 3.1.2_0.17.1, but when I use .format("excel") the java.lang.ClassNotFoundException: excel.DefaultSource still occurs. This is Spark on k8s, not local.

mgyboom avatar Sep 29 '22 03:09 mgyboom

How about 0.18.0?

pjfanning avatar Sep 29 '22 03:09 pjfanning

How about 0.18.0?

I can't find version 3.1.2_0.18.0 in the mvnrepository.

mgyboom avatar Sep 29 '22 06:09 mgyboom

It's 3.2.2_0.18.0 - see https://mvnrepository.com/artifact/com.crealytics/spark-excel

pjfanning avatar Sep 29 '22 09:09 pjfanning

Currently the cross-Spark publishing does not work. I created a new issue for that: https://github.com/crealytics/spark-excel/issues/648

nightscape avatar Sep 29 '22 15:09 nightscape

@nightscape @pjfanning I unzipped my jar and did not find com.crealytics.spark.v2.excel.DataSource in the META-INF/services/org.apache.spark.sql.DataSourceRegister file. Is it possible that this is causing the ClassNotFoundException? I am packaging my Java program with Maven.
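
If that is the cause, I suppose the services files from the dependencies need to be merged into the uber-jar; with the maven-shade-plugin that would be the ServicesResourceTransformer (just a sketch of the relevant plugin section, assuming shade is what builds my jar):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <configuration>
        <transformers>
            <!-- Merge META-INF/services entries from all dependencies so that
                 spark-excel's DataSourceRegister entry survives the shading step -->
            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
    </configuration>
</plugin>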

mgyboom avatar Sep 30 '22 02:09 mgyboom

It's 3.2.2_0.18.0 - see https://mvnrepository.com/artifact/com.crealytics/spark-excel

Even though I use version 3.2.2_0.18.0, it is still the same.

mgyboom avatar Sep 30 '22 03:09 mgyboom

@mgyboom can you try 0.18.3 which should now be correctly cross-published for all Spark versions.

nightscape avatar Oct 16 '22 08:10 nightscape

@mgyboom can you try 0.18.3 which should now be correctly cross-published for all Spark versions.

Even though I use version 3.1.1_0.18.3, it is still the same:

java.lang.ClassNotFoundException: Failed to find data source: excel. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:743)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at cn.tongdun.sparkdatahandler.handler.impl.ExcelHandler.read(ExcelHandler.java:34)
	at cn.tongdun.sparkdatahandler.handler.impl.ExcelHandler.read(ExcelHandler.java:13)
	at cn.tongdun.sparkdatahandler.BaseMain.read(BaseMain.java:87)
	at cn.tongdun.sparkdatahandler.Sharding.main(Sharding.java:58)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: excel.DefaultSource
	at java.base/java.net.URLClassLoader.findClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:663)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:663)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:663)
	... 19 more

mgyboom avatar Oct 17 '22 08:10 mgyboom