Can't read data from HDFS when an erasure coding policy is specified
Greenplum version or build
I installed GPDB from the 6.2.1 RPM.
postgres=# select version();
version
------------------------------
PostgreSQL 9.4.24 (Greenplum Database 6.2.1 build commit:d90ac1a1b983b913b3950430d4d9e47ee8827fd4) on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Dec 12 2019 18:35:48
(1 row)
pxf version
$ pxf --version
PXF version 5.10.0
OS version and uname -a
Linux 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Installation information (pg_config)
$ pg_config
BINDIR = /usr/local/greenplum-db-6.2.1/bin
DOCDIR = /usr/local/greenplum-db-6.2.1/share/doc/postgresql
HTMLDIR = /usr/local/greenplum-db-6.2.1/share/doc/postgresql
INCLUDEDIR = /usr/local/greenplum-db-6.2.1/include
PKGINCLUDEDIR = /usr/local/greenplum-db-6.2.1/include/postgresql
INCLUDEDIR-SERVER = /usr/local/greenplum-db-6.2.1/include/postgresql/server
LIBDIR = /usr/local/greenplum-db-6.2.1/lib
PKGLIBDIR = /usr/local/greenplum-db-6.2.1/lib/postgresql
LOCALEDIR = /usr/local/greenplum-db-6.2.1/share/locale
MANDIR = /usr/local/greenplum-db-6.2.1/man
SHAREDIR = /usr/local/greenplum-db-6.2.1/share/postgresql
SYSCONFDIR = /usr/local/greenplum-db-6.2.1/etc/postgresql
PGXS = /usr/local/greenplum-db-6.2.1/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--with-quicklz' '--enable-gpperfmon' '--with-gssapi' '--enable-mapreduce' '--enable-orafce' '--enable-orca' '--with-libxml' '--with-pgport=5432' '--disable-debug-extensions' '--disable-tap-tests' '--with-perl' '--with-python' '--with-includes=/tmp/build/f8c7ee08/gpdb_src/gpAux/ext/rhel7_x86_64/include /tmp/build/f8c7ee08/gpdb_src/gpAux/ext/rhel7_x86_64/include/libxml2' '--with-libraries=/tmp/build/f8c7ee08/gpdb_src/gpAux/ext/rhel7_x86_64/lib' '--with-openssl' '--with-pam' '--with-ldap' '--prefix=/usr/local/greenplum-db-devel' '--mandir=/usr/local/greenplum-db-devel/man' 'CC=gcc -m64' 'CFLAGS=-m64 -O3 -fargument-noalias-global -fno-omit-frame-pointer -g'
CC = gcc -m64
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2 -I/tmp/build/f8c7ee08/gpdb_src/gpAux/ext/rhel7_x86_64/include -I/usr/local/greenplum-db-6.2.1/include
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -fno-aggressive-loop-optimizations -Wno-unused-but-set-variable -Wno-address -m64 -O3 -fargument-noalias-global -fno-omit-frame-pointer -g -std=gnu99 -Werror=uninitialized -Werror=implicit-function-declaration -I/usr/local/greenplum-db-6.2.1/include
CFLAGS_SL = -fPIC
LDFLAGS = -L/tmp/build/f8c7ee08/gpdb_src/gpAux/ext/rhel7_x86_64/lib -Wl,--as-needed -Wl,-rpath,'/usr/local/greenplum-db-devel/lib',--enable-new-dtags -L/usr/local/greenplum-db-6.2.1/lib
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lgpopt -lnaucrates -lgpdbcost -lgpos -lxerces-c -lxml2 -lpam -lrt -lyaml -lgssapi_krb5 -lquicklz -lzstd -lrt -lcrypt -ldl -lm -L/usr/local/greenplum-db-6.2.1/lib
VERSION = PostgreSQL 9.4.24
Hadoop version
hadoop-3.1.0
Expected behavior
Reading HDFS text data through a PXF external table succeeds.
Actual behavior
I can read HDFS data through a PXF external table when the erasure coding policy is unspecified, but reading fails when an erasure coding policy is specified.
Reading from HDFS works when the erasure coding policy is unspecified
HDFS file:
$ hdfs ec -getPolicy -path hdfs://tmp/part-05998
The erasure coding policy of hdfs://tmp/part-05998 is unspecified
PXF reading test:
postgres=# CREATE EXTERNAL TABLE public.pxf_example (
offsetid bigint,
tdid text,
monthid integer,
app_install bigint[]
) LOCATION (
'pxf://tmp/part-05998?PROFILE=hdfs:text'
) ON ALL
FORMAT 'text' (delimiter E'\t' null E'' escape E'\\')
ENCODING 'UTF8'
SEGMENT REJECT LIMIT 1 PERCENT;
CREATE EXTERNAL TABLE
postgres=# select offsetid,tdid,monthid from pxf_example limit 100;
offsetid | tdid | monthid
-------------+-----------------------------------+---------
9830096987 | 3ab70a0bebbf2f599aa1fc5c66baf6705 | 201910
4668257082 | 31fe2d65a078ba0e496c474d704da2603 | 201910
4702428099 | 31c74ce394e56f12560e769d9a2e95594 | 201910
7521396273 | 30fa188c2155852adddcfd51ed1be30c9 | 201910
7478403144 | 3e6f27f147cbe2ce18c022343adeb1225 | 201910
5942421014 | 3cbbccdc5503f0e6bee0cfb18f5d39e63 | 201910
9052621959 | 366a3a33123900d9aabd863982cdbfbac | 201910
9806218394 | 3054d1ef923047a6b3a695497e2f9ad18 | 201910
9447557062 | 339a43c09826de999e4a1759c0550e32a | 201910
10078571309 | 385f1116c88900d490ce9b148ee250fa5 | 201910
Reading from HDFS fails when an erasure coding policy is specified
HDFS file (erasure coding policy set to RS-10-4-1024k; see the sketch after this output for how such a policy is typically applied):
$ hdfs ec -getPolicy -path /spark_logs/part-05998
RS-10-4-1024k
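For anyone trying to reproduce this, the policy was presumably applied to the directory with standard HDFS erasure coding commands along the lines of the following (a sketch; the target directory is an assumption, and an EC policy only affects files written after it is set, so existing files would need to be rewritten to pick it up):
$ hdfs ec -listPolicies
$ hdfs ec -enablePolicy -policy RS-10-4-1024k
$ hdfs ec -setPolicy -path /spark_logs -policy RS-10-4-1024k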
Reading the file with the HDFS client works:
$ hdfs dfs -cat /spark_logs/part-05998 | more
2020-01-02 11:30:41,876 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
9063519718 3c6cc505cdb0c947bfe9bda367d648640 201910 {-5061992991064318030,-6370420019429456499,3317077434980644433,
0602584}
10097449320 3c16b484c845f230b007ec1d8b57f3b4b 201910 {4345025949894747099,-6753995990058415489}
PXF read fails:
postgres=# CREATE EXTERNAL TABLE public.pxf_example_ec (
offsetid bigint,
tdid text,
monthid integer,
app_install bigint[]
) LOCATION (
'pxf://spark_logs/part-05998?PROFILE=hdfs:text'
) ON ALL
FORMAT 'text' (delimiter E'\t' null E'' escape E'\\')
ENCODING 'UTF8'
SEGMENT REJECT LIMIT 1 PERCENT;
CREATE EXTERNAL TABLE
postgres=# select offsetid,tdid,monthid from pxf_example_ec limit 100;
ERROR: remote component error (500) from '127.0.0.1:5888': Type Exception Report Message Could not obtain block: BP-2067671923-172.8.9.1-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998 Description The server encountered an unexpected condition that prevented it from fulfilling the request. Exception java.io.IOException: Could not obtain block: BP-2067671923-172.8.9.1-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998 (libchurl.c:920) (seg41 slice1 172.xx.xx.xx:24001 pid=149842) (libchurl.c:920)
CONTEXT: External table pxf_example_ec, line 1 of file pxf://spark_logs/part-05998?PROFILE=hdfs:text
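One additional check worth running (hedged: these are standard HDFS client commands, not part of PXF) is to ask the NameNode whether it considers the file and its blocks healthy; a negative block ID like the one in the error above is characteristic of an erasure-coded block group rather than a replicated block:
$ hdfs fsck /spark_logs/part-05998 -files -blocks -locations
$ hdfs dfs -checksum /spark_logs/part-05998
Since hdfs dfs -cat can read the file (shown above), the data is reachable through an EC-aware HDFS client; only the read through PXF fails.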
PXF instance log
Jan 02, 2020 11:34:03 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [PXF REST Service] in context with path [/pxf] threw exception
java.io.IOException: Could not obtain block: BP-2067671923-172.x.x.x-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998
at org.greenplum.pxf.service.rest.BridgeResource$1.write(BridgeResource.java:147)
at com.sun.jersey.core.impl.provider.entity.StreamingOutputProvider.writeTo(StreamingOutputProvider.java:71)
at com.sun.jersey.core.impl.provider.entity.StreamingOutputProvider.writeTo(StreamingOutputProvider.java:57)
at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.greenplum.pxf.service.servlet.SecurityServletFilter.lambda$doFilter$0(SecurityServletFilter.java:146)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.greenplum.pxf.service.servlet.SecurityServletFilter.doFilter(SecurityServletFilter.java:158)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:444)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1137)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-2067671923-172.x.x.x-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998
at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1084)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1068)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1047)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:655)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:949)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1004)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.greenplum.pxf.plugins.hdfs.ChunkReader.readChunk(ChunkReader.java:107)
at org.greenplum.pxf.plugins.hdfs.ChunkRecordReader.next(ChunkRecordReader.java:210)
at org.greenplum.pxf.plugins.hdfs.ChunkRecordReader.next(ChunkRecordReader.java:56)
at org.greenplum.pxf.plugins.hdfs.HdfsSplittableDataAccessor.readNextObject(HdfsSplittableDataAccessor.java:132)
at org.greenplum.pxf.service.bridge.ReadBridge.getNext(ReadBridge.java:94)
at org.greenplum.pxf.service.rest.BridgeResource$1.write(BridgeResource.java:138)
... 37 more
Jan 02, 2020 11:34:10 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [PXF REST Service] in context with path [/pxf] threw exception
java.io.IOException: Could not obtain block: BP-2067671923-172.x.x.x-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998
at org.greenplum.pxf.service.rest.BridgeResource$1.write(BridgeResource.java:147)
at com.sun.jersey.core.impl.provider.entity.StreamingOutputProvider.writeTo(StreamingOutputProvider.java:71)
at com.sun.jersey.core.impl.provider.entity.StreamingOutputProvider.writeTo(StreamingOutputProvider.java:57)
at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.greenplum.pxf.service.servlet.SecurityServletFilter.lambda$doFilter$0(SecurityServletFilter.java:146)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.greenplum.pxf.service.servlet.SecurityServletFilter.doFilter(SecurityServletFilter.java:158)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:444)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1137)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-2067671923-172.x.x.x-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998
at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1084)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1068)
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1047)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:655)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:949)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1004)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.greenplum.pxf.plugins.hdfs.ChunkReader.readLine(ChunkReader.java:155)
at org.greenplum.pxf.plugins.hdfs.ChunkRecordReader.<init>(ChunkRecordReader.java:146)
at org.greenplum.pxf.plugins.hdfs.LineBreakAccessor.getReader(LineBreakAccessor.java:71)
at org.greenplum.pxf.plugins.hdfs.HdfsSplittableDataAccessor.getNextSplit(HdfsSplittableDataAccessor.java:119)
at org.greenplum.pxf.plugins.hdfs.HdfsSplittableDataAccessor.openForRead(HdfsSplittableDataAccessor.java:88)
at org.greenplum.pxf.service.bridge.ReadBridge.beginIteration(ReadBridge.java:72)
at org.greenplum.pxf.service.rest.BridgeResource$1.write(BridgeResource.java:131)
... 37 more
Do you have any news? Can PXF work with HDFS erasure coding?
Supporting erasure coding is a longer-term goal for us, but it is on our radar and we'll hopefully get to it in the next couple of quarters.
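Until that support lands, a possible interim workaround (a sketch, not verified against PXF 5.10; the /spark_logs_replicated directory name is hypothetical) is to copy the erasure-coded files into a directory that has no erasure coding policy, so the copies are rewritten with ordinary replication and PXF can read them:
$ hdfs dfs -mkdir -p /spark_logs_replicated
$ hdfs ec -getPolicy -path /spark_logs_replicated      (should report "unspecified")
$ hdfs dfs -cp /spark_logs/part-05998 /spark_logs_replicated/
and then point the external table LOCATION at 'pxf://spark_logs_replicated/part-05998?PROFILE=hdfs:text'.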