
Can't read data from HDFS when an erasure coding policy is specified

Open · weiyong-dba opened this issue on Jan 02, 2020 · 2 comments

Greenplum version or build

I installed GPDB from the 6.2.1 RPM.

postgres=# select version();
                                                                                               version                                                                                                
------------------------------
 PostgreSQL 9.4.24 (Greenplum Database 6.2.1 build commit:d90ac1a1b983b913b3950430d4d9e47ee8827fd4) on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 6.4.0, 64-bit compiled on Dec 12 2019 18:35:48
(1 row)

PXF version

$ pxf --version
PXF version 5.10.0

OS version and uname -a

Linux 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Installation information (pg_config)

$ pg_config
BINDIR = /usr/local/greenplum-db-6.2.1/bin
DOCDIR = /usr/local/greenplum-db-6.2.1/share/doc/postgresql
HTMLDIR = /usr/local/greenplum-db-6.2.1/share/doc/postgresql
INCLUDEDIR = /usr/local/greenplum-db-6.2.1/include
PKGINCLUDEDIR = /usr/local/greenplum-db-6.2.1/include/postgresql
INCLUDEDIR-SERVER = /usr/local/greenplum-db-6.2.1/include/postgresql/server
LIBDIR = /usr/local/greenplum-db-6.2.1/lib
PKGLIBDIR = /usr/local/greenplum-db-6.2.1/lib/postgresql
LOCALEDIR = /usr/local/greenplum-db-6.2.1/share/locale
MANDIR = /usr/local/greenplum-db-6.2.1/man
SHAREDIR = /usr/local/greenplum-db-6.2.1/share/postgresql
SYSCONFDIR = /usr/local/greenplum-db-6.2.1/etc/postgresql
PGXS = /usr/local/greenplum-db-6.2.1/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--with-quicklz' '--enable-gpperfmon' '--with-gssapi' '--enable-mapreduce' '--enable-orafce' '--enable-orca' '--with-libxml' '--with-pgport=5432' '--disable-debug-extensions' '--disable-tap-tests' '--with-perl' '--with-python' '--with-includes=/tmp/build/f8c7ee08/gpdb_src/gpAux/ext/rhel7_x86_64/include /tmp/build/f8c7ee08/gpdb_src/gpAux/ext/rhel7_x86_64/include/libxml2' '--with-libraries=/tmp/build/f8c7ee08/gpdb_src/gpAux/ext/rhel7_x86_64/lib' '--with-openssl' '--with-pam' '--with-ldap' '--prefix=/usr/local/greenplum-db-devel' '--mandir=/usr/local/greenplum-db-devel/man' 'CC=gcc -m64' 'CFLAGS=-m64 -O3 -fargument-noalias-global -fno-omit-frame-pointer -g'
CC = gcc -m64
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2 -I/tmp/build/f8c7ee08/gpdb_src/gpAux/ext/rhel7_x86_64/include -I/usr/local/greenplum-db-6.2.1/include
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -fno-aggressive-loop-optimizations -Wno-unused-but-set-variable -Wno-address -m64 -O3 -fargument-noalias-global -fno-omit-frame-pointer -g -std=gnu99 -Werror=uninitialized -Werror=implicit-function-declaration -I/usr/local/greenplum-db-6.2.1/include
CFLAGS_SL = -fPIC
LDFLAGS = -L/tmp/build/f8c7ee08/gpdb_src/gpAux/ext/rhel7_x86_64/lib -Wl,--as-needed -Wl,-rpath,'/usr/local/greenplum-db-devel/lib',--enable-new-dtags -L/usr/local/greenplum-db-6.2.1/lib
LDFLAGS_EX = 
LDFLAGS_SL = 
LIBS = -lpgcommon -lpgport -lgpopt -lnaucrates -lgpdbcost -lgpos -lxerces-c -lxml2 -lpam -lrt -lyaml -lgssapi_krb5 -lquicklz -lzstd -lrt -lcrypt -ldl -lm -L/usr/local/greenplum-db-6.2.1/lib
VERSION = PostgreSQL 9.4.24

Hadoop version

hadoop-3.1.0

Expected behavior

Read HDFS text data through a PXF external table.

Actual behavior

I can read HDFS data through a PXF external table when the erasure coding policy is unspecified, but the read fails when an erasure coding policy is specified.

Reading from HDFS works when the erasure coding policy is unspecified

HDFS file:

$hdfs ec -getPolicy -path hdfs://tmp/part-05998
The erasure coding policy of hdfs://tmp/part-05998 is unspecified

PXF read test:

postgres=# CREATE EXTERNAL TABLE public.pxf_example (
    offsetid bigint,
    tdid text,
    monthid integer,
    app_install bigint[]
) LOCATION (
    'pxf://tmp/part-05998?PROFILE=hdfs:text'
) ON ALL 
FORMAT 'text' (delimiter E'\t' null E'' escape E'\\')
ENCODING 'UTF8'
SEGMENT REJECT LIMIT 1 PERCENT;
CREATE EXTERNAL TABLE

postgres=# select offsetid,tdid,monthid from  pxf_example limit 100;
  offsetid   |               tdid                | monthid 
-------------+-----------------------------------+---------
  9830096987 | 3ab70a0bebbf2f599aa1fc5c66baf6705 |  201910
  4668257082 | 31fe2d65a078ba0e496c474d704da2603 |  201910
  4702428099 | 31c74ce394e56f12560e769d9a2e95594 |  201910
  7521396273 | 30fa188c2155852adddcfd51ed1be30c9 |  201910
  7478403144 | 3e6f27f147cbe2ce18c022343adeb1225 |  201910
  5942421014 | 3cbbccdc5503f0e6bee0cfb18f5d39e63 |  201910
  9052621959 | 366a3a33123900d9aabd863982cdbfbac |  201910
  9806218394 | 3054d1ef923047a6b3a695497e2f9ad18 |  201910
  9447557062 | 339a43c09826de999e4a1759c0550e32a |  201910
 10078571309 | 385f1116c88900d490ce9b148ee250fa5 |  201910

Reading from HDFS fails when the erasure coding policy is specified

HDFS file (erasure coding policy set to RS-10-4-1024k):

$hdfs ec -getPolicy -path /spark_logs/part-05998
RS-10-4-1024k
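
For reference, this is roughly how such a policy is typically enabled and applied on a Hadoop 3.x cluster. This is only a sketch: the exact commands used on this cluster may differ, and /spark_logs is assumed to be the directory the file was written into.

$ hdfs ec -listPolicies
# RS-10-4-1024k is not enabled by default, so it must be enabled cluster-wide first
$ hdfs ec -enablePolicy -policy RS-10-4-1024k
# setting the policy on a directory only affects files written after this point
$ hdfs ec -setPolicy -path /spark_logs -policy RS-10-4-1024k
# verify the policy on the file
$ hdfs ec -getPolicy -path /spark_logs/part-05998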

Reading the file with the HDFS client works:

$hdfs dfs -cat /spark_logs/part-05998 | more
2020-01-02 11:30:41,876 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
9063519718      3c6cc505cdb0c947bfe9bda367d648640       201910  {-5061992991064318030,-6370420019429456499,3317077434980644433,
0602584}
10097449320     3c16b484c845f230b007ec1d8b57f3b4b       201910  {4345025949894747099,-6753995990058415489}

PXF read fails:

postgres=# CREATE EXTERNAL TABLE public.pxf_example_ec (
    offsetid bigint,
    tdid text,
    monthid integer,
    app_install bigint[]
) LOCATION (
    'pxf://spark_logs/part-05998?PROFILE=hdfs:text'
) ON ALL 
FORMAT 'text' (delimiter E'\t' null E'' escape E'\\')
ENCODING 'UTF8'
SEGMENT REJECT LIMIT 1 PERCENT;
CREATE EXTERNAL TABLE

postgres=# select offsetid,tdid,monthid from  pxf_example_ec limit 100;
ERROR:  remote component error (500) from '127.0.0.1:5888':  Type  Exception Report   Message  Could not obtain block: BP-2067671923-172.8.9.1-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998   Description  The server encountered an unexpected condition that prevented it from fulfilling the request.   Exception   java.io.IOException: Could not obtain block: BP-2067671923-172.8.9.1-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998 (libchurl.c:920)  (seg41 slice1 172.xx.xx.xx:24001 pid=149842) (libchurl.c:920)
CONTEXT:  External table pxf_example_ec, line 1 of file pxf://spark_logs/part-05998?PROFILE=hdfs:text
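
One way to narrow this down (a sketch, not a confirmed fix): run the HDFS CLI against the same site configuration PXF reads. PXF 5.x keeps its Hadoop client configuration under $PXF_CONF/servers/default; if the commands below succeed with that configuration, the site files are fine and the failure is more likely in how the PXF service itself reads the striped (negative block ID) file, which would be consistent with PXF not yet supporting erasure coding, as noted in the reply below.

# point the Hadoop CLI at the configuration directory PXF uses (assumes the default PXF server)
$ export HADOOP_CONF_DIR="$PXF_CONF/servers/default"
$ hdfs dfs -cat /spark_logs/part-05998 | head -n 3
# confirm the block reported in the error actually exists and is healthy on HDFS
$ hdfs fsck /spark_logs/part-05998 -files -blocks -locations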

PXF instance log:

Jan 02, 2020 11:34:03 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [PXF REST Service] in context with path [/pxf] threw exception
java.io.IOException: Could not obtain block: BP-2067671923-172.x.x.x-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998
        at org.greenplum.pxf.service.rest.BridgeResource$1.write(BridgeResource.java:147)
        at com.sun.jersey.core.impl.provider.entity.StreamingOutputProvider.writeTo(StreamingOutputProvider.java:71)
        at com.sun.jersey.core.impl.provider.entity.StreamingOutputProvider.writeTo(StreamingOutputProvider.java:57)
        at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306)
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
        at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.greenplum.pxf.service.servlet.SecurityServletFilter.lambda$doFilter$0(SecurityServletFilter.java:146)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
        at org.greenplum.pxf.service.servlet.SecurityServletFilter.doFilter(SecurityServletFilter.java:158)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:444)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1137)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-2067671923-172.x.x.x-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998
        at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1084)
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1068)
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1047)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:655)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:949)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1004)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.greenplum.pxf.plugins.hdfs.ChunkReader.readChunk(ChunkReader.java:107)
        at org.greenplum.pxf.plugins.hdfs.ChunkRecordReader.next(ChunkRecordReader.java:210)
        at org.greenplum.pxf.plugins.hdfs.ChunkRecordReader.next(ChunkRecordReader.java:56)
        at org.greenplum.pxf.plugins.hdfs.HdfsSplittableDataAccessor.readNextObject(HdfsSplittableDataAccessor.java:132)
        at org.greenplum.pxf.service.bridge.ReadBridge.getNext(ReadBridge.java:94)
        at org.greenplum.pxf.service.rest.BridgeResource$1.write(BridgeResource.java:138)
        ... 37 more

Jan 02, 2020 11:34:10 AM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [PXF REST Service] in context with path [/pxf] threw exception
java.io.IOException: Could not obtain block: BP-2067671923-172.x.x.x-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998
        at org.greenplum.pxf.service.rest.BridgeResource$1.write(BridgeResource.java:147)
        at com.sun.jersey.core.impl.provider.entity.StreamingOutputProvider.writeTo(StreamingOutputProvider.java:71)
        at com.sun.jersey.core.impl.provider.entity.StreamingOutputProvider.writeTo(StreamingOutputProvider.java:57)
        at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306)
        at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
        at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
        at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
        at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:303)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.greenplum.pxf.service.servlet.SecurityServletFilter.lambda$doFilter$0(SecurityServletFilter.java:146)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
        at org.greenplum.pxf.service.servlet.SecurityServletFilter.doFilter(SecurityServletFilter.java:158)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:219)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:110)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:444)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:169)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:1025)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:445)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1137)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:637)
        at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:317)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-2067671923-172.x.x.x-1530169621728:blk_-9223372016158418064_1669579651 file=/spark_logs/part-05998
        at org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1084)
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1068)
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1047)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:655)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:949)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1004)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.greenplum.pxf.plugins.hdfs.ChunkReader.readLine(ChunkReader.java:155)
        at org.greenplum.pxf.plugins.hdfs.ChunkRecordReader.<init>(ChunkRecordReader.java:146)
        at org.greenplum.pxf.plugins.hdfs.LineBreakAccessor.getReader(LineBreakAccessor.java:71)
        at org.greenplum.pxf.plugins.hdfs.HdfsSplittableDataAccessor.getNextSplit(HdfsSplittableDataAccessor.java:119)
        at org.greenplum.pxf.plugins.hdfs.HdfsSplittableDataAccessor.openForRead(HdfsSplittableDataAccessor.java:88)
        at org.greenplum.pxf.service.bridge.ReadBridge.beginIteration(ReadBridge.java:72)
        at org.greenplum.pxf.service.rest.BridgeResource$1.write(BridgeResource.java:131)
        ... 37 more

— weiyong-dba, Jan 02 '20

Do you have any news? Can PXF work with HDFS erasure coding?

— RuslanFialkovsky, Jul 14 '20

Supporting erasure coding is a longer-term goal for us, but it is on our radar and we'll hopefully get to it in the next couple of quarters.

— frankgh, Jul 15 '20