spark-excel
[BUG] option 'ignoreAfterHeader' does not work
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
```java
public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("demo")
            .getOrCreate();
    Dataset<Row> rows = sparkSession.read()
            .format("com.crealytics.spark.excel")
            .option("dataAddress", "'Sheet1'!A1")
            .option("header", true)
            .option("ignoreAfterHeader", 1L)
            .option("maxRowsInMemory", 20)
            .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
    rows.show();
}
```
The input file looks like this (the second row is a Chinese sub-header):

| test_id | time_point | id_number | mobile |
|---|---|---|---|
| 测试序号 | 回溯日期 | 身份证号 | 手机号 |
| 1 | 2021/12/8 17:06 | cbdddb8e8421b23498480570d7d75330538a6882f5dfdc3b64115c647f3328c4 | cbdddb8e8421b23498480570d7d75330538a6882f5dfdc3b64115c647f3328c4 |
| 2 | 2021/12/8 17:06 | a0dc2d74b9b0e3c87e076003dbfe472a424cb3032463cb339e351460765a822e | a0dc2d74b9b0e3c87e076003dbfe472a424cb3032463cb339e351460765a822e |
| 3 | 2021/12/8 17:06 | 55e3192d096e62d4f9cd00e734a949de2b8e55b13d9b85b1d2d2999c9db2e72c | 55e3192d096e62d4f9cd00e734a949de2b8e55b13d9b85b1d2d2999c9db2e72c |
| 4 | 2021/12/8 17:06 | 9b602e9b9e8556eff1a28962d4580b34d9bf054f4831f4f924d4a6dfad660e88 | 9b602e9b9e8556eff1a28962d4580b34d9bf054f4831f4f924d4a6dfad660e88 |
| 5 | 2021/12/8 17:06 | 5c0d4f4953843ed6f3c54ea7ca2cc4a86d8b7723c3bf0f3fd403d4c61a77feca | 5c0d4f4953843ed6f3c54ea7ca2cc4a86d8b7723c3bf0f3fd403d4c61a77feca |
I wanted to use `ignoreAfterHeader` to skip the second row, but it had no effect.
Console output (note that the 测试序号/回溯日期 sub-header row is still present):

```
+--------+-------------+--------------------+--------------------+
| test_id|   time_point|           id_number|              mobile|
+--------+-------------+--------------------+--------------------+
|测试序号|     回溯日期|            身份证号|              手机号|
|       1|12/8/21 17:06|cbdddb8e8421b2349...|cbdddb8e8421b2349...|
|       2|12/8/21 17:06|a0dc2d74b9b0e3c87...|a0dc2d74b9b0e3c87...|
|       3|12/8/21 17:06|55e3192d096e62d4f...|55e3192d096e62d4f...|
|       4|12/8/21 17:06|9b602e9b9e8556eff...|9b602e9b9e8556eff...|
|       5|12/8/21 17:06|5c0d4f4953843ed6f...|5c0d4f4953843ed6f...|
|       6|12/8/21 17:06|f83340f3147b49827...|f83340f3147b49827...|
|       7|12/8/21 17:06|d712cf4114c03dc43...|d712cf4114c03dc43...|
|       8|12/8/21 17:06|fefad899b5dc20858...|fefad899b5dc20858...|
|       9|12/8/21 17:06|8e7a98f9565619a4d...|8e7a98f9565619a4d...|
|      10|12/8/21 17:06|3eaa72f81914fb894...|3eaa72f81914fb894...|
|      11|12/8/21 17:06|d5744897e47fb6d78...|d5744897e47fb6d78...|
|      12|12/8/21 17:06|6f61c3af9dcc39522...|6f61c3af9dcc39522...|
|      13|12/8/21 17:06|abe1b0a5a9e58808c...|abe1b0a5a9e58808c...|
|      14|12/8/21 17:06|87c186adf88a37443...|87c186adf88a37443...|
|      15|12/8/21 17:06|7b4073a22410aafc3...|7b4073a22410aafc3...|
|      16|12/8/21 17:06|dab089f470a4bcb77...|dab089f470a4bcb77...|
|      17|12/8/21 17:06|1f78641036c71b8e6...|1f78641036c71b8e6...|
|      18|12/8/21 17:06|47fb25b4d4af9f2da...|47fb25b4d4af9f2da...|
|      19|12/8/21 17:06|8f1818a052ee87314...|8f1818a052ee87314...|
+--------+-------------+--------------------+--------------------+
```
Expected Behavior
I expect the `ignoreAfterHeader` option to skip the specified number of rows after the header.
Steps To Reproduce
```java
public static void main(String[] args) {
    SparkSession sparkSession = SparkSession.builder()
            .master("local[*]")
            .appName("demo")
            .getOrCreate();
    Dataset<Row> rows = sparkSession.read()
            .format("com.crealytics.spark.excel")
            .option("dataAddress", "'new贷前画像-DCPACP指标3.0'!A1")
            .option("header", true)
            .option("ignoreAfterHeader", 1L)
            .option("maxRowsInMemory", 20)
            .load("file:///Users/td/Downloads/20w_2id_AtypeSM3.xlsx");
    rows.show();
}
```
Environment
- Spark version: 3.1.1
- Spark-Excel version: 0.14.0
- OS: MacOS
- Cluster environment local[*]
Anything else?
No response
Please try a newer spark-excel version and use `.format("excel")`.
Thank you for your reply.
When I used `.format("excel")`, it worked. However, when I ran it on Kubernetes, it threw the following exception:
```
Exception in thread "Thread-25" java.lang.ClassNotFoundException: Failed to find data source: excel. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:743)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at cn.tongdun.sparkdatahandler.handler.impl.ExcelHandler.read(ExcelHandler.java:26)
	at cn.tongdun.sparkdatahandler.Simple$ReadInputFileTask.run(Simple.java:119)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.ClassNotFoundException: excel.DefaultSource
	at java.base/java.net.URLClassLoader.findClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:663)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:663)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:663)
	... 6 more
```
Oddly enough, I don't get this exception when I use `.format("com.crealytics.spark.excel")`. However, that source cannot ignore lines after the header.
@nightscape
Please check these potential duplicates:
- [#615] [BUG] partitionBy not working as expected (62.91%) If this issue is a duplicate, please add any additional info to the ticket with the most information and close this one.
@mgyboom it seems like on your cluster you're still using an outdated version of spark-excel...
@nightscape
I upgraded spark-excel to version 3.1.2_0.17.1, but when I use `.format("excel")`, the exception `java.lang.ClassNotFoundException: excel.DefaultSource` still occurs.
This is Spark on Kubernetes, not local.
How about 0.18.0?
> How about 0.18.0?

I can't find version 3.1.2_0.18.0 on mvnrepository.
It's 3.2.2_0.18.0 - see https://mvnrepository.com/artifact/com.crealytics/spark-excel
Currently the cross-Spark publishing does not work. I created a new issue for that: https://github.com/crealytics/spark-excel/issues/648
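For reference, the coordinates from the link above would be declared in a Maven POM roughly as follows. This is a sketch: the `_2.12` Scala-version suffix is my assumption, based on Spark 3.x builds defaulting to Scala 2.12; check the mvnrepository page for the suffix that matches your cluster's Scala version.

```xml
<dependency>
    <groupId>com.crealytics</groupId>
    <!-- artifactId carries the Scala version suffix -->
    <artifactId>spark-excel_2.12</artifactId>
    <!-- version is <spark-version>_<spark-excel-version> -->
    <version>3.2.2_0.18.0</version>
</dependency>
```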
@nightscape @pjfanning
I unzipped my jar and found that `com.crealytics.spark.v2.excel.DataSource` is not listed in the `META-INF/services/org.apache.spark.sql.DataSourceRegister` file.
Could that be what causes the `ClassNotFoundException`?
I package my Java program with Maven.
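That missing service entry would explain the symptom: Spark resolves the short name `excel` by scanning `META-INF/services/org.apache.spark.sql.DataSourceRegister` files, and uber-jar packaging can let one dependency's copy of that file overwrite another's. If the jar is built with maven-shade-plugin, a `ServicesResourceTransformer` concatenates the service files instead. A sketch of that configuration (the plugin version shown is illustrative):

```xml
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.4.1</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals><goal>shade</goal></goals>
            <configuration>
                <transformers>
                    <!-- Merges META-INF/services files from all dependency jars,
                         so spark-excel's DataSourceRegister entry survives -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>
```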
> It's 3.2.2_0.18.0 - see https://mvnrepository.com/artifact/com.crealytics/spark-excel

Although I am now using version 3.2.2_0.18.0, the problem persists.
@mgyboom can you try 0.18.3, which should now be correctly cross-published for all Spark versions.
> @mgyboom can you try 0.18.3, which should now be correctly cross-published for all Spark versions.

Although I am now using version 3.1.1_0.18.3, the problem persists:
```
java.lang.ClassNotFoundException: Failed to find data source: excel. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:743)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:266)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
	at cn.tongdun.sparkdatahandler.handler.impl.ExcelHandler.read(ExcelHandler.java:34)
	at cn.tongdun.sparkdatahandler.handler.impl.ExcelHandler.read(ExcelHandler.java:13)
	at cn.tongdun.sparkdatahandler.BaseMain.read(BaseMain.java:87)
	at cn.tongdun.sparkdatahandler.Sharding.main(Sharding.java:58)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: excel.DefaultSource
	at java.base/java.net.URLClassLoader.findClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at java.base/java.lang.ClassLoader.loadClass(Unknown Source)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:663)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:663)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:663)
	... 19 more
```
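To narrow this down further, one could dump every `DataSourceRegister` service file visible on the driver's classpath and check whether any of them mentions spark-excel. This is a hypothetical standalone helper (not part of spark-excel or Spark) that uses only the JDK; run it with the same classpath as the failing job. If no entry contains `excel`, the short format name cannot resolve and Spark falls back to looking up `excel.DefaultSource`.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DataSourceRegisterCheck {
    // Spark resolves short names like "excel" by reading these service files;
    // if an uber-jar build drops them, lookup falls back to "excel.DefaultSource".
    static final String SERVICE_FILE =
            "META-INF/services/org.apache.spark.sql.DataSourceRegister";

    // Collect every non-empty line of every DataSourceRegister file on the classpath.
    static List<String> registrations() throws Exception {
        List<String> entries = new ArrayList<>();
        List<URL> files = Collections.list(
                DataSourceRegisterCheck.class.getClassLoader().getResources(SERVICE_FILE));
        for (URL url : files) {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (!line.trim().isEmpty()) {
                        entries.add(line.trim());
                    }
                }
            }
        }
        return entries;
    }

    public static void main(String[] args) throws Exception {
        List<String> entries = registrations();
        System.out.println("DataSourceRegister entries on classpath: " + entries.size());
        for (String e : entries) {
            System.out.println("  " + e);
        }
    }
}
```

On a correctly packaged classpath, the output should include an entry for spark-excel's V2 data source; an empty or spark-only list points at the packaging step rather than at the cluster.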