
[Prize Essay] DSS Application Practice in Programmatic Advertising

Open urzeric opened this issue 4 years ago • 1 comment

1. Application Scenario

Longxi Technology's (珑玺科技) big data management platform (DMP) is built on the Hive/Spark toolchain on top of Hadoop. Previously, all related tasks were shell scripts driven by scheduled jobs; as business requirements grew more complex, the accumulated scripts became hard to maintain, raising the cost of iteration and of communication among project members. At that point we came across this project, newly open-sourced by WeBank, which lets us run the following big data workflows on a single platform:

  • Data exchange
  • Data development
  • Data quality
  • Data visualization
  • Data sending

Together these cover data collection, deduplication, cleansing and normalization, so that nearly every task in our data governance flows smoothly through DSS, improving the overall experience of big data management and analysis.

Our DMP is integrated mainly on Hadoop 2.6, with its functionality centred on metadata handling and on deduplicating, cleansing and standardizing data, implementing the platform's OLAP workflows. The business requirements it ultimately serves fall into four areas:

  • Metadata management
  • User tag accumulation
  • Anti-fraud data analysis
  • BI report output

2. Problems Solved

Support for multiple data sources

The platform's main data sources are MySQL and Hive; it even supports TiDB, a recent NewSQL distributed database, as well as other third-party data formats. Before adopting this platform we had to perform frequent data conversions, with a high cost of context switching. After introducing DSS, the integrated data exchange module lets us bring these sources in smoothly for cross-source analysis, which has made multi-source data management much more efficient.
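As a rough, purely illustrative sketch of what the data exchange module replaces, this is the kind of manual MySQL-to-Hive hand-off our old cron scripts performed (host, credentials, table and path names below are made up):

# export a day of rows from MySQL (placeholder host/credentials/table)
mysql -h mysql-host -u dmp -p'***' -N -e \
  "SELECT id, device_id, ts FROM ad_events WHERE dt = '2020-11-12'" > /tmp/ad_events.tsv
# stage the file on HDFS and load it into a Hive partition
# (assumes dmp.ad_events is partitioned by dt and tab-delimited)
hdfs dfs -mkdir -p /tmp/ad_events
hdfs dfs -put -f /tmp/ad_events.tsv /tmp/ad_events/
hive -e "LOAD DATA INPATH '/tmp/ad_events/ad_events.tsv' INTO TABLE dmp.ad_events PARTITION (dt = '2020-11-12')"

With DSS, an equivalent transfer is configured once in the data exchange module, and the imported table can then be joined against existing Hive tables directly.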

Data script development

Picture1

Before DSS was deployed, every task on the platform was a hand-written shell script driving the analysis pipeline, and as the business iterated and requirements multiplied those scripts became ever harder to maintain. The Scriptis data development module is fully compatible with HQL, MySQL and PySpark, so we could reuse most of our existing scripts, and execution progress and results are visualized, which has improved our data analysis efficiency.

Picture2
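As a hedged illustration (table and column names are hypothetical), a cron-driven wrapper like the one below can be retired, with the embedded HQL pasted into a Scriptis .hql script unchanged:

# old-style shell wrapper: scheduling and HQL mixed together
dt=$(date -d '1 day ago' +%F)
hive -e "
INSERT OVERWRITE TABLE dmp.campaign_daily_uv PARTITION (dt = '${dt}')
SELECT campaign_id, COUNT(DISTINCT device_id) AS uv
FROM dmp.ad_events
WHERE dt = '${dt}'
GROUP BY campaign_id
"

In Scriptis only the SQL body is kept; scheduling moves into the DSS workflow, and the result set is shown in the page instead of being buried in a log file.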

3. Best Practices

Reading data from Alibaba Cloud OSS

Most of our data is stored in Alibaba Cloud Object Storage Service (OSS), so extra configuration is needed to read it: the Alibaba Cloud OSS jar packages have to be synced into every Linkis lib directory.

Appendix: notes on integrating Alibaba Cloud OSS with CDH

cd /opt/linkis
find . -name "lib" -print | awk '{printf("cp /home/hadoop/aliyun-oss-jars/*.jar /opt/linkis/%s/\n", $1)}' |sh
find . -name "lib" -print | awk '{printf("cp /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/jars/hadoop-oss-cdh-5.14.4/httpclient-4.5.2.jar /opt/linkis/%s/\n", $1)}' |sh
find . -name "lib" -print | awk '{printf("cp /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/jars/hadoop-oss-cdh-5.14.4/httpcore-4.4.4.jar /opt/linkis/%s/\n", $1)}' |sh
find . -name "lib" -print | awk '{printf("cp /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/jdom-1.1.jar /opt/linkis/%s/\n", $1)}' |sh
$ cd aliyun-oss-jars/
$ ls -l
total 2932
-rw-r--r-- 1 hadoop hadoop 116337 Jan 2 10:59 aliyun-java-sdk-core-3.4.0.jar
-rw-r--r-- 1 hadoop hadoop 788137 Jan 2 10:59 aliyun-java-sdk-ecs-4.2.0.jar
-rw-r--r-- 1 hadoop hadoop 215492 Jan 2 10:59 aliyun-java-sdk-ram-3.0.0.jar
-rw-r--r-- 1 hadoop hadoop 13277 Jan 2 10:59 aliyun-java-sdk-sts-3.0.0.jar
-rw-r--r-- 1 hadoop hadoop 562719 Jan 2 10:59 aliyun-sdk-oss-3.4.1.jar
-rw-r--r-- 1 hadoop hadoop 71074 Jan 2 15:12 hadoop-aliyun-2.6.0-cdh5.14.4.jar
-rw-r--r-- 1 hadoop hadoop 736658 Jan 2 15:10 httpclient-4.5.2.jar
-rw-r--r-- 1 hadoop hadoop 326724 Jan 2 15:10 httpcore-4.4.4.jar
-rw-r--r-- 1 hadoop hadoop 153115 Jan 2 15:10 jdom-1.1.jar

Sync the Alibaba Cloud libraries into DSS:

cd /opt/aliyun-oss-jars/
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-spark-enginemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-hive-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-spark-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-resourcemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/eureka/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-jdbc-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/modulebak/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/module/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-publicservice/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-bml/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-python-enginemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-python-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-gateway/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-hive-enginemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-metadata/lib/%s\n", $2, $2)}' | sh
cd /opt/linkis
find . -name "lib" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/hadoop-aliyun-2.6.0-cdh5.14.4.jar /opt/linkis/%s/lib/hadoop-aliyun.jar\n", $2)}' | sh
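Placing the jars is only half of the setup: the OSS endpoint and access keys also need to be configured on the Hadoop side (fs.oss.endpoint, fs.oss.accessKeyId and fs.oss.accessKeySecret in core-site.xml), which the CDH-side OSS integration referenced above should already cover. A quick sanity check that the node can reach the bucket (bucket and path are placeholders):

# should list the directory if hadoop-aliyun and the OSS credentials are in place
hadoop fs -ls oss://your-dmp-bucket/warehouse/

If this works on the nodes running the Linkis engines, OSS-backed tables should be readable from Hive and Spark jobs submitted through DSS as well.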

The database/table tree on the right side of Scriptis never loads and keeps refreshing

Step 1: Modify the file (comment out the privilege-filtered queries and simply return all databases and tables)

vim /home/hadoop/Linkis-20191218/metadata/target/classes/com/webank/wedatasphere/linkis/metadata/hive/dao/impl/HiveMetaDao.xml
<select id="getDbsByUser" resultType="java.lang.String" parameterType="java.lang.String">
        <!--select NAME from(
        select t2.NAME  as NAME
        from DB_PRIVS t1, DBS t2
        where (lcase(t1.PRINCIPAL_NAME) = #{userName,jdbcType=VARCHAR}
        OR t1.PRINCIPAL_NAME IN (SELECT ROLE FROM(SELECT r.ROLE_NAME AS ROLE, u.PRINCIPAL_NAME AS USER FROM ROLES r LEFT JOIN (SELECT * FROM ROLE_MAP WHERE PRINCIPAL_TYPE = 'USER') u ON r.ROLE_ID = u.ROLE_ID)AS T where T.USER = #{userName,jdbcType=VARCHAR}))
        and lcase(t1.DB_PRIV) in ('select','all') and t1.DB_ID =t2.DB_ID
        union all
        select t3.NAME as NAME
        from TBL_PRIVS t1, TBLS t2 , DBS t3
        where t1.TBL_ID=t2.TBL_ID and lcase(t1.TBL_PRIV) in ('select','all') and (
        lcase(t1.PRINCIPAL_NAME) = #{userName,jdbcType=VARCHAR} or lcase(t1.PRINCIPAL_NAME) in (SELECT ROLE FROM(SELECT r.ROLE_NAME AS ROLE, u.PRINCIPAL_NAME AS USER FROM ROLES r LEFT JOIN (SELECT * FROM ROLE_MAP WHERE PRINCIPAL_TYPE = 'USER') u ON r.ROLE_ID = u.ROLE_ID)AS T where T.USER = #{userName,jdbcType=VARCHAR}))
        and t2.DB_ID=t3.DB_ID) a
        GROUP BY NAME
        order by NAME-->
        select name from DBS
    </select>

    <select id="getTablesByDbNameAndUser" resultType="map"  parameterType="map">
        <!--select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
        from DB_PRIVS t1,TBLS t2, DBS t3
        where  t1.DB_ID =t3.DB_ID
        and t2.DB_ID=t3.DB_ID
        and lcase(t1.DB_PRIV) in ('select','all')
        and lcase(t1.PRINCIPAL_NAME) = #{userName,jdbcType=VARCHAR}
        and t3.NAME = #{dbName,jdbcType=VARCHAR}
        union
        select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
        from DB_PRIVS t1,TBLS t2, DBS t3
        where  t1.DB_ID =t3.DB_ID
        and t2.DB_ID=t3.DB_ID
        and lcase(t1.DB_PRIV) in ('select','all')
        and lcase(t1.PRINCIPAL_NAME) in (select ROLE_NAME from ROLES where ROLE_ID in (select ROLE_ID from ROLE_MAP where PRINCIPAL_NAME = #{userName,jdbcType=VARCHAR}))
        and t3.NAME = #{dbName,jdbcType=VARCHAR}
        union
        select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
        from TBL_PRIVS t1, TBLS t2 , DBS t3
        where t1.TBL_ID=t2.TBL_ID
        and t2.DB_ID=t3.DB_ID
        and lcase(t1.TBL_PRIV) in ('select','all')
        and t1.PRINCIPAL_NAME = #{userName,jdbcType=VARCHAR}
        and t3.NAME = #{dbName,jdbcType=VARCHAR}
        union
        select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
        from TBL_PRIVS t1, TBLS t2 , DBS t3
        where t1.TBL_ID=t2.TBL_ID
        and t2.DB_ID=t3.DB_ID
        and lcase(t1.TBL_PRIV) in ('select','all')
        and t1.PRINCIPAL_NAME in (select ROLE_NAME from ROLES where ROLE_ID in (select ROLE_ID from ROLE_MAP where PRINCIPAL_NAME = #{userName,jdbcType=VARCHAR}))
        and t3.NAME = #{dbName,jdbcType=VARCHAR}
        order by NAME;-->
        select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
        from TBLS t2 , DBS t3
        where 
         t2.DB_ID=t3.DB_ID
        and t3.NAME = #{dbName,jdbcType=VARCHAR}
    </select>

Step 2: Go into Linkis-20191218/metadata and rebuild

root@cdh04:/home/hadoop/Linkis-20191218/metadata# mvn clean 
root@cdh04:/home/hadoop/Linkis-20191218/metadata# mvn install

Step 3: Confirm that the corresponding jar has been updated

cp linkis-metadata-0.9.2.jar /opt/linkis/linkis-metadata/lib/linkis-metadata-0.9.2.jar 
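A quick way to confirm the copy took effect is to check the jar's timestamp in the service's lib directory:

ls -l /opt/linkis/linkis-metadata/lib/linkis-metadata-0.9.2.jar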

Step 4: Restart all linkis-metadata microservices
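Start/stop script names vary across Linkis versions, so the ones below are assumptions; use whatever scripts ship under the service's bin directory in your install:

cd /opt/linkis/linkis-metadata
sh bin/stop-linkis-metadata.sh    # assumed script name
sh bin/start-linkis-metadata.sh   # assumed script name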

Error code errCode: 10905 (BML upload request fails with 404)

Error log:

[2019-12-24 22:47:39.120 [INFO ] [qtp296594285-26] c.w.w.d.s.l.LockAspect (63) [around] - 调用方法:addRootFlow
2019-12-24 22:47:39.120 [INFO ] [qtp296594285-26] c.w.w.d.s.l.LockAspect (72) [around] - projectVersionID为:1
2019-12-24 22:47:39.135 [INFO ] [qtp296594285-26] c.w.w.d.s.l.LockAspect (80) [around] - 执行过程出现异常 
com.webank.wedatasphere.linkis.httpclient.exception.HttpClientResultException: errCode: 10905 ,
		desc: URL http://127.0.0.1:9001/api/rest_j/v1/bml/upload request failed! ResponseBody is 
		{"timestamp":"2019-12-24T14:47:39.132+0000","status":404,"error":"Not Found","message":"/api/rest_j/v1/bml/upload","path":"/api/rest_j/v1/bml/upload"}. ,
		ip: cdh04 ,port: 9004 ,serviceKind: dss-server
		]

Solution: go to the Linkis deployment directory and remove the jasper-* jars from every microservice's lib (the command below renames them out of the way), then restart the affected services:

find . -name "jasper-*" -print | awk  '{printf("mv %s %s_del\n", $1,$1)}' |sh
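Re-running a similar find from the deployment directory should now print nothing, confirming that no jasper jar remains on any microservice's classpath (the renamed *_del files no longer match):

cd /opt/linkis
find . -name "jasper-*.jar" -print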

Software versions:

  • Hadoop 2.6.0
  • MySQL 5.7.26
  • Hive 1.1.0
  • HBase 1.2.0
  • Spark2 2.2.0

 
