
[Data Export] Shorten the OSS uploading time at the beginning of every month

Open tyn1998 opened this issue 2 years ago • 5 comments

Description

Hi community,

Is it possible to shorten the OSS uploading time at the beginning of every month? Or could you choose a fixed day of the month and announce it as a due date by which all data export tasks are completed?

This is really important for downstream apps that consume OpenDigger's valuable data.


tyn1998 avatar Jun 02 '23 05:06 tyn1998

@tyn1998 I think we had this discussion before, and the solution was to put the update time into the metadata of each repo. For example, in https://oss.x-lab.info/open_digger/github/X-lab2017/open-digger/meta.json there is a field called updatedAt, a timestamp that indicates when the data was last updated. You can use that field to find out whether the data has been updated for the current month.
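As a sketch, a downstream app could use that field to decide whether this month's data has landed. This assumes updatedAt is a millisecond Unix timestamp; the month check and the checkRepo helper are illustrative, not part of OpenDigger itself:

```javascript
// Check whether a repo's data has been refreshed for the current month,
// based on the `updatedAt` field in its meta.json.
// Assumption: `updatedAt` is a millisecond Unix timestamp.
function isUpdatedThisMonth(updatedAtMs, now = new Date()) {
  const updated = new Date(updatedAtMs);
  return (
    updated.getUTCFullYear() === now.getUTCFullYear() &&
    updated.getUTCMonth() === now.getUTCMonth()
  );
}

// Illustrative usage with the global fetch available in Node 18+.
async function checkRepo(owner, repo) {
  const url = `https://oss.x-lab.info/open_digger/github/${owner}/${repo}/meta.json`;
  const meta = await (await fetch(url)).json();
  return isUpdatedThisMonth(meta.updatedAt);
}
```

A consumer could poll checkRepo once a day early in the month and start its own pipeline only after it returns true.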

frank-zsy avatar Jun 02 '23 09:06 frank-zsy

Hi @frank-zsy, thanks for your reply.

I know about the meta.json files. Actually, in this issue I am asking whether techniques like parallel computing and parallel uploading could be adopted to speed up the data export and upload processes, so that hopefully all export tasks can be completed within 24 hours, or even within a few hours.

I noticed that writeFileSync (the synchronous version of writeFile) is used in the cron tasks to write JSON files to the file system of your machine. Would it be faster if fs.writeFile were used instead, so that subsequent computing tasks don't have to wait for file writing?

I also assume that after the cron tasks are executed another set of scripts(not included in this repository) are run to upload the exported files to the aliyun oss. Could those scripts for uploading files be improved to shorten the uploading time?

What is the bottleneck now? Computing or uploading?

tyn1998 avatar Jun 02 '23 10:06 tyn1998

Understood, so I will elaborate on the tasks here. There are several steps in the data update process.

  • First, we need to wait for all of last month's data to be imported into the ClickHouse instance. Since we are in UTC+8, that is around 10am or 11am on the first day of the month.
  • Then, before the metrics export task, we need to calculate the OpenRank for all repos and users for the last month.
    • We calculate the activity for the whole GitHub collaboration network and import it into the Neo4j database.
    • Then we calculate the OpenRank of all the repos and users.
  • After that, we need to export the OpenRank values from the Neo4j database back into the ClickHouse instance, so that the metrics export task can read OpenRank values directly from ClickHouse, which is much faster than reading from the graph database.
  • Then we can run the monthly export task to generate the metrics data; it normally takes hours to finish. I agree that writeFile instead of writeFileSync may improve performance, but most of the time is consumed by ClickHouse computation, so the improvement may be limited.
  • We also need to export the networks for all the repos and users. This is a CPU-intensive task on the Neo4j database and may take even more time than the metrics export, although the two tasks can run at the same time.
  • The final step is to upload all the files to OSS with a shell script using oss-util. This step takes 5-6 hours to complete for about 23 million files.
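The dependency structure of the steps above can be sketched in Node.js. Every task name below is a hypothetical stub standing in for the real ClickHouse/Neo4j jobs and upload scripts, and the delays are placeholders:

```javascript
// Hypothetical orchestration of the monthly pipeline described above.
// Each step is a stub; in reality these are ClickHouse queries, Neo4j jobs,
// and shell scripts that each take hours.
async function runStep(name, ms) {
  await new Promise((resolve) => setTimeout(resolve, ms)); // simulate work
  return name;
}

async function monthlyPipeline() {
  const order = [];
  const log = (name) => order.push(name);

  // These steps depend on each other, so they run sequentially.
  log(await runStep('wait-clickhouse-import', 10));
  log(await runStep('activity-to-neo4j', 10));
  log(await runStep('openrank-calc', 10));
  log(await runStep('openrank-to-clickhouse', 10));

  // Metrics export and network export are independent, so run them in parallel.
  const [metrics, networks] = await Promise.all([
    runStep('metrics-export', 20),
    runStep('network-export', 30),
  ]);
  log(metrics);
  log(networks);

  // Upload starts only after all files exist.
  log(await runStep('oss-upload', 10));
  return order;
}
```

The key point the sketch captures is that only the metrics export and the network export can overlap; everything else forms a sequential chain, which is why the total wall-clock time stays above 12 hours even when fully automated.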

So if we start all the tasks at 11am on the first day of the month: the OpenRank data import, calculation, and export may take about 2 hours; metrics computation and network export may take another 5 hours; and the data upload may take 5-6 more hours.

So if we can make the whole process parallel and automated, it may take about 12-13 hours to complete, finishing around midnight on the first day of the month.

But right now the process is not fully automated, so the data may only be updated around the 2nd day of the month. For 2023.5, for example, the data was updated this morning.

frank-zsy avatar Jun 02 '23 11:06 frank-zsy

@frank-zsy Thanks for the detailed elaboration! This is the first time I have seen the complete steps for exporting the monthly data, and I am convinced that the tasks are indeed time-consuming.

I recommend writing the steps mentioned above into src/cron/README.md so that more interested people can learn how OpenDigger exports its data every month :D

tyn1998 avatar Jun 02 '23 11:06 tyn1998

Agreed, I will add the information to the README file. As for improving performance, I think several things can be done:

  • Use asynchronous functions instead of synchronous ones in the data export task.
  • Actually, I have already tried multiple ways to upload the data to OSS, and currently I think the time consumption is hard to reduce.
    • I tried compressing all the metrics data, uploading the archive to OSS, and then using a serverless service to extract the files from it. But compressing more than 20 million files into a single archive is very challenging and time-consuming; furthermore, the serverless service that uncompresses files on OSS does not provide enough resources (only 2c4g, i.e. 2 cores and 4 GB of memory), so it will definitely crash during decompression.
    • Iterating over more than 20 million files under a single folder is also very challenging, and oss-util does a great job in that situation. Right now, the script I use is:
```shell
ossutilmac64 sync ~/github_data/open_digger/github oss://xlab-open-source/open_digger/github \
  --force --job=1000 --meta "Expires:2023-07-01T22:00:00+08:00" \
  --config-file=~/.ossutilconfig-xlab
```

The script uploads files with 1000 parallel jobs and sets object metadata, which makes the process a little longer than a plain upload. A bigger parallel job parameter, or deploying the task in the same VPC as OSS, may reduce the time, but not by much I think, because the network payload is not very high right now; the file iteration process itself is probably what takes the time.

frank-zsy avatar Jun 02 '23 11:06 frank-zsy