The time-consuming problem of converting csv data to RDF

Open nullgogo opened this issue 4 years ago • 5 comments

Problem Description:

With 8 CSV files, converting about 600M of data into RDF took more than a day. We also tested converting two of the CSV files to RDF separately, which still took more than a few hours.

Data source:

The data comes from a CMDB: 8 CSV files in total, including host (18M), vm (18M), software (160M), and other data. There are one-to-many and many-to-many semantic relationships between these data sets.

[Image 1]

Config.ini and mapping.ttl Configuration:

[Image 2] [Image 3]

Execute:

[Image 4]

Environment: OS: CentOS 7; CPU cores: 64; memory: 96G

nullgogo avatar Mar 05 '21 03:03 nullgogo

@eiglesias34 We would like to ask the team to help us look into the performance problem above:

1. [Problem domain] Our AIOps team is building our infrastructure operations KG using SDM-RDFizer.
2. Please give us some suggestions or directions for deeper investigation.
3. If you need any other information, please tell me.

Thanks!

tangyong avatar Mar 05 '21 07:03 tangyong

Dear @tangyong

Many thanks for sharing this use case. We have implemented new optimization techniques to speed up the execution of the joins in the mappings. Please let us arrange a meeting, and we can share with you the new version, which is still in the development stage. Please contact me at [email protected]

Best regards, Maria-Esther Vidal

mevs avatar Mar 05 '21 10:03 mevs
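
Since the reply above points at join execution, here is a minimal, generic sketch of why joins between large CSV sources can dominate conversion time. It is not SDM-RDFizer code and does not show the optimizations mentioned above, which this thread does not describe. The column names (`host_id`, `vm_id`), the sample rows, and the `http://example.org/` IRIs are all hypothetical. It contrasts a nested-loop join with a hash-indexed join over a one-to-many host/vm relationship.

```python
import csv
import io
from collections import defaultdict


def nested_loop_join(children, parents, child_key, parent_key):
    """Compare every child row with every parent row: O(len(children) * len(parents))."""
    return [(c, p) for c in children for p in parents if c[child_key] == p[parent_key]]


def hash_join(children, parents, child_key, parent_key):
    """Index the parent rows by key once, then probe the index per child row."""
    index = defaultdict(list)
    for p in parents:
        index[p[parent_key]].append(p)
    return [(c, p) for c in children for p in index.get(c[child_key], [])]


# Hypothetical miniature stand-ins for the host and vm CSV files.
HOSTS = "host_id,name\nh1,alpha\nh2,beta\n"
VMS = "vm_id,host_id\nv1,h1\nv2,h1\nv3,h2\n"

hosts = list(csv.DictReader(io.StringIO(HOSTS)))
vms = list(csv.DictReader(io.StringIO(VMS)))


def to_triples(pairs):
    """Render each (vm, host) pair as one N-Triples line (IRIs are made up)."""
    return [
        f"<http://example.org/vm/{vm['vm_id']}> "
        f"<http://example.org/runsOn> "
        f"<http://example.org/host/{host['host_id']}> ."
        for vm, host in pairs
    ]


triples = to_triples(hash_join(vms, hosts, "host_id", "host_id"))
for line in triples:
    print(line)
```

Both functions return the same pairs; the difference is only in how many comparisons are made, which matters once each file holds millions of rows.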

Thanks very much, @mevs! I will arrange a meeting and contact you.

tangyong avatar Mar 05 '21 10:03 tangyong

Dear @mevs ,

I have discussed this with my team. We would first like to obtain the new optimized version so that we can compare the performance improvement and give you feedback. I will send my request to your email.

Thanks!

tangyong avatar Mar 06 '21 03:03 tangyong

Dear @mevs @dachafra @eiglesias34 ,

We have prepared a dataset that reproduces the problem and would like to send it to you to assist with the investigation and fix. If you have time to help us, please tell me how to share the dataset (~800M), and we will upload it to shared storage.

Thanks!
Best regards, Tang.

tangyong avatar Mar 10 '21 10:03 tangyong