[Essay Contest] DSS in Practice for Programmatic Advertising
1. Application Scenarios
The big-data management platform (DMP) at Longxi Technology (珑玺科技) is built on the Hadoop toolchain, mainly Hive and Spark. All of the related jobs used to be shell scripts run by scheduled tasks; as business requirements grew more complex, the accumulated scripts became difficult to maintain and drove up the cost of iteration and of communication between project members. At that point we came across the project WeBank had just open-sourced, which lets a single platform cover the following big-data workflows:
- Data exchange
- Data development
- Data quality
- Data visualization
- Data sending
Together these cover the collection, deduplication, cleansing, and normalization of data, so that nearly every step of data governance flows smoothly through DSS, which has improved the overall experience of managing and analyzing big data.
Our DMP platform is integrated mainly on Hadoop 2.6, with its core functionality in metadata processing and in the deduplication, cleansing, and standardization of data. It implements the platform's OLAP workflows, and the business requirements it ultimately serves fall into four areas:
- Metadata management
- User tag accumulation
- Anti-fraud data analysis
- BI report output
2. Problems Solved
Multi-data-source support
The platform's main data sources are MySQL and Hive, plus TiDB, a distributed NewSQL database, as well as other third-party data formats. Before adopting DSS we had to run frequent data conversion jobs, and the cost of switching between contexts was high. With the data exchange module integrated into the platform, the various data sources can now be brought together for cross-analysis very smoothly, which has improved the efficiency of multi-data-source management.
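For a concrete sense of what the data exchange module saves us, here is a rough sketch of the manual equivalent we used to script by hand on CDH: pulling a MySQL dimension table into Hive with Sqoop so it can be joined against Hive fact tables. The connection string, database, and table names are hypothetical, and this is not how DSS itself performs the transfer:
# import a MySQL dimension table into Hive so it can be cross-analyzed with Hive data
# (hypothetical host, database, and table names)
sqoop import \
  --connect jdbc:mysql://mysql-host:3306/ad_platform \
  --username dmp_reader -P \
  --table campaign \
  --hive-import --hive-database dmp --hive-table dim_campaign \
  -m 1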
Data script development

Before DSS was deployed, every task on the platform was a hand-written shell script driving the analysis pipeline, and as the business iterated and requirements grew, those scripts became harder and harder to maintain. The Scriptis data development module is fully compatible with HQL, MySQL, and PySpark, so most of our existing scripts could be reused; on top of that, execution progress and results are visualized, which has improved the efficiency of data analysis.
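For illustration, a minimal sketch of the kind of cron-driven shell wrapper we used to maintain; the table and column names are hypothetical. In Scriptis only the INSERT ... SELECT statement survives, kept as an .hql script node in the workflow:
#!/bin/bash
# Old style: a cron-driven shell wrapper around Hive (hypothetical table/column names).
DT=$(date -d "yesterday" +%Y%m%d)
hive -e "
INSERT OVERWRITE TABLE dmp.dwd_bid_log PARTITION (dt='${DT}')
SELECT device_id, media_id, MAX(bid_price) AS bid_price
FROM dmp.ods_bid_log
WHERE dt='${DT}'
GROUP BY device_id, media_id;   -- deduplicate raw bid records
"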

3. Best Practices
Reading data from Alibaba Cloud OSS
Most of our data is stored in Alibaba Cloud Object Storage Service (OSS), so additional jar files have to be provided before Linkis can read from OSS.
Sync the Alibaba Cloud OSS jar packages into the lib directories under Linkis.
(See also: the notes on integrating Alibaba Cloud OSS with CDH.)
cd /opt/linkis
find . -name "lib" -print | awk '{printf("cp /home/hadoop/aliyun-oss-jars/*.jar /opt/linkis/%s/\n", $1)}' |sh
find . -name "lib" -print | awk '{printf("cp /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/jars/hadoop-oss-cdh-5.14.4/httpclient-4.5.2.jar /opt/linkis/%s/\n", $1)}'
find . -name "lib" -print | awk '{printf("cp /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/jars/hadoop-oss-cdh-5.14.4/httpcore-4.4.4.jar /opt/linkis/%s/\n", $1)}'
find . -name "lib" -print | awk '{printf("cp /opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/hadoop/lib/jdom-1.1.jar /opt/linkis/%s/\n", $1)}'
$ cd aliyun-oss-jars/
$ ls -l
total 2932
-rw-r--r-- 1 hadoop hadoop 116337 Jan 2 10:59 aliyun-java-sdk-core-3.4.0.jar
-rw-r--r-- 1 hadoop hadoop 788137 Jan 2 10:59 aliyun-java-sdk-ecs-4.2.0.jar
-rw-r--r-- 1 hadoop hadoop 215492 Jan 2 10:59 aliyun-java-sdk-ram-3.0.0.jar
-rw-r--r-- 1 hadoop hadoop 13277 Jan 2 10:59 aliyun-java-sdk-sts-3.0.0.jar
-rw-r--r-- 1 hadoop hadoop 562719 Jan 2 10:59 aliyun-sdk-oss-3.4.1.jar
-rw-r--r-- 1 hadoop hadoop 71074 Jan 2 15:12 hadoop-aliyun-2.6.0-cdh5.14.4.jar
-rw-r--r-- 1 hadoop hadoop 736658 Jan 2 15:10 httpclient-4.5.2.jar
-rw-r--r-- 1 hadoop hadoop 326724 Jan 2 15:10 httpcore-4.4.4.jar
-rw-r--r-- 1 hadoop hadoop 153115 Jan 2 15:10 jdom-1.1.jar
Sync the Alibaba Cloud libraries into DSS (symlinked into each Linkis microservice's lib directory):
cd /opt/aliyun-oss-jars/
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-spark-enginemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-hive-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-spark-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-resourcemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/eureka/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-jdbc-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/modulebak/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/module/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-publicservice/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-bml/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-python-enginemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-python-entrance/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-gateway/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-ujes-hive-enginemanager/lib/%s\n", $2, $2)}' | sh
find . -name "*.jar" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/%s /opt/linkis/linkis-metadata/lib/%s\n", $2, $2)}' | sh
cd /opt/linkis
find . -name "lib" -print | awk -F'/' '{printf("ln -s /opt/aliyun-oss-jars/hadoop-aliyun-2.6.0-cdh5.14.4.jar /opt/linkis/%s/lib/hadoop-aliyun.jar\n", $2)}'
The data panel on the right side of Scriptis never loads and keeps refreshing
The default queries in the Hive metadata DAO filter databases and tables by the user's Hive privileges (DB_PRIVS / TBL_PRIVS); in our environment those privilege tables had no matching entries for the login users, so the panel got nothing back and kept spinning. The fix below simplifies the queries to list all databases and tables.
Step 1: Modify the file
vim /home/hadoop//Linkis20191218/metadata/target/classes/com/webank/wedatasphere/linkis/metadata/hive/dao/impl/HiveMetaDao.xml
<select id="getDbsByUser" resultType="java.lang.String" parameterType="java.lang.String">
<!--select NAME from(
select t2.NAME as NAME
from DB_PRIVS t1, DBS t2
where (lcase(t1.PRINCIPAL_NAME) = #{userName,jdbcType=VARCHAR}
OR t1.PRINCIPAL_NAME IN (SELECT ROLE FROM(SELECT r.ROLE_NAME AS ROLE, u.PRINCIPAL_NAME AS USER FROM ROLES r LEFT JOIN (SELECT * FROM ROLE_MAP WHERE PRINCIPAL_TYPE = 'USER') u ON r.ROLE_ID = u.ROLE_ID)AS T where T.USER = #{userName,jdbcType=VARCHAR}))
and lcase(t1.DB_PRIV) in ('select','all') and t1.DB_ID =t2.DB_ID
union all
select t3.NAME as NAME
from TBL_PRIVS t1, TBLS t2 , DBS t3
where t1.TBL_ID=t2.TBL_ID and lcase(t1.TBL_PRIV) in ('select','all') and (
lcase(t1.PRINCIPAL_NAME) = #{userName,jdbcType=VARCHAR} or lcase(t1.PRINCIPAL_NAME) in (SELECT ROLE FROM(SELECT r.ROLE_NAME AS ROLE, u.PRINCIPAL_NAME AS USER FROM ROLES r LEFT JOIN (SELECT * FROM ROLE_MAP WHERE PRINCIPAL_TYPE = 'USER') u ON r.ROLE_ID = u.ROLE_ID)AS T where T.USER = #{userName,jdbcType=VARCHAR}))
and t2.DB_ID=t3.DB_ID) a
GROUP BY NAME
order by NAME-->
select name from DBS
</select>
<select id="getTablesByDbNameAndUser" resultType="map" parameterType="map">
<!--select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
from DB_PRIVS t1,TBLS t2, DBS t3
where t1.DB_ID =t3.DB_ID
and t2.DB_ID=t3.DB_ID
and lcase(t1.DB_PRIV) in ('select','all')
and lcase(t1.PRINCIPAL_NAME) = #{userName,jdbcType=VARCHAR}
and t3.NAME = #{dbName,jdbcType=VARCHAR}
union
select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
from DB_PRIVS t1,TBLS t2, DBS t3
where t1.DB_ID =t3.DB_ID
and t2.DB_ID=t3.DB_ID
and lcase(t1.DB_PRIV) in ('select','all')
and lcase(t1.PRINCIPAL_NAME) in (select ROLE_NAME from ROLES where ROLE_ID in (select ROLE_ID from ROLE_MAP where PRINCIPAL_NAME = #{userName,jdbcType=VARCHAR}))
and t3.NAME = #{dbName,jdbcType=VARCHAR}
union
select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
from TBL_PRIVS t1, TBLS t2 , DBS t3
where t1.TBL_ID=t2.TBL_ID
and t2.DB_ID=t3.DB_ID
and lcase(t1.TBL_PRIV) in ('select','all')
and t1.PRINCIPAL_NAME = #{userName,jdbcType=VARCHAR}
and t3.NAME = #{dbName,jdbcType=VARCHAR}
union
select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
from TBL_PRIVS t1, TBLS t2 , DBS t3
where t1.TBL_ID=t2.TBL_ID
and t2.DB_ID=t3.DB_ID
and lcase(t1.TBL_PRIV) in ('select','all')
and t1.PRINCIPAL_NAME in (select ROLE_NAME from ROLES where ROLE_ID in (select ROLE_ID from ROLE_MAP where PRINCIPAL_NAME = #{userName,jdbcType=VARCHAR}))
and t3.NAME = #{dbName,jdbcType=VARCHAR}
order by NAME;-->
select t2.TBL_NAME as NAME, t2.TBL_TYPE as TYPE, t2.CREATE_TIME as CREATE_TIME, t2.LAST_ACCESS_TIME as LAST_ACCESS_TIME, t2.OWNER as OWNER
from TBLS t2 , DBS t3
where
t2.DB_ID=t3.DB_ID
and t3.NAME = #{dbName,jdbcType=VARCHAR}
</select>
Step 2: Go into Linkis-20191218/metadata and rebuild
root@cdh04:/home/hadoop/Linkis-20191218/metadata# mvn clean
root@cdh04:/home/hadoop/Linkis-20191218/metadata# mvn install
Step 3: Make sure the corresponding package has been updated
cp linkis-metadata-0.9.2.jar /opt/linkis/linkis-metadata/lib/linkis-metadata-0.9.2.jar
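A quick way to confirm that the deployed jar really carries the modified mapper; this is only a sketch, with the path inside the jar following from the target/classes path shown in Step 1:
# print the mapper XML packaged in the deployed jar and check that the simplified query is there
unzip -p /opt/linkis/linkis-metadata/lib/linkis-metadata-0.9.2.jar \
  com/webank/wedatasphere/linkis/metadata/hive/dao/impl/HiveMetaDao.xml | grep -c "select name from DBS"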
Step 4: Restart all linkis-metadata microservice instances
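A sketch of the restart; the exact control scripts depend on how Linkis was deployed, so check what the service's bin/ directory actually provides before running this:
cd /opt/linkis/linkis-metadata
ls bin/                              # see which stop/start scripts this deployment ships
sh bin/stop*.sh && sh bin/start*.sh  # restart linkis-metadata so the rebuilt jar is loaded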
Error code: errCode: 10905
Error log:
[2019-12-24 22:47:39.120 [INFO ] [qtp296594285-26] c.w.w.d.s.l.LockAspect (63) [around] - 调用方法:addRootFlow
2019-12-24 22:47:39.120 [INFO ] [qtp296594285-26] c.w.w.d.s.l.LockAspect (72) [around] - projectVersionID为:1
2019-12-24 22:47:39.135 [INFO ] [qtp296594285-26] c.w.w.d.s.l.LockAspect (80) [around] - 执行过程出现异常
com.webank.wedatasphere.linkis.httpclient.exception.HttpClientResultException: errCode: 10905 ,
desc: URL http://127.0.0.1:9001/api/rest_j/v1/bml/upload request failed! ResponseBody is
{"timestamp":"2019-12-24T14:47:39.132+0000","status":404,"error":"Not Found","message":"/api/rest_j/v1/bml/upload","path":"/api/rest_j/v1/bml/upload"}. ,
ip: cdh04 ,port: 9004 ,serviceKind: dss-server
]
Solution:
Go to the Linkis deployment directory and move aside the jasper packages under the lib directory of every microservice (they are renamed to *_del rather than deleted, which makes the change easy to revert):
find . -name "jasper-*" -print | awk '{printf("mv %s %s_del\n", $1,$1)}' |sh
Software versions:
- Hadoop 2.6.0
- MySQL 5.7.26
- Hive 1.1.0
- HBase 1.2.0
- Spark2 2.2.0