
Create partitions with more uniform sizes

ajnavarro opened this issue on Nov 16 '18 • 10 comments

Partitions with similar sizes

Description

Right now, our partitioning implementation creates one partition per repository. That means partitions can have wildly different sizes, causing problems such as spending a really long time processing 5% of the total amount of repositories (a single thread parsing one huge repo).

With this proposal, I want to normalize partition sizes by changing the way we split the data across threads.

Proposal

Change the current partition ID:

[REPOSITORY_ID]

To:

[REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO]

Value descriptions

  • [REPOSITORY_ID]: gitbase ID identifying the repository. Examples: gitbase, kubernetes, spark.
  • [PACKFILE_HASH]: hash of the packfile. Example: ccbe18d4ba50a14a28855a7b53880d93d87dc498
  • [OFFSET_FROM]: packfile offset of the first object in the partition. Example: 23144
  • [OFFSET_TO]: packfile offset of the last object in the partition. Example: 7009211 (see the Go sketch after this list)
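
For illustration, here is a minimal Go sketch of the proposed key. The PartitionKey type and its encoding are hypothetical, not actual gitbase code:

```go
package partition

import "fmt"

// PartitionKey identifies a slice of a single packfile within a repository,
// following the [REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO] scheme.
type PartitionKey struct {
	RepositoryID string // e.g. "gitbase", "kubernetes", "spark"
	PackfileHash string // e.g. "ccbe18d4ba50a14a28855a7b53880d93d87dc498"
	OffsetFrom   uint64 // packfile offset of the first object in the partition
	OffsetTo     uint64 // packfile offset of the last object in the partition
}

// String renders the key in the proposed textual form.
func (k PartitionKey) String() string {
	return fmt.Sprintf("%s+%s+%d+%d", k.RepositoryID, k.PackfileHash, k.OffsetFrom, k.OffsetTo)
}
```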

How to do repartition

We need to decide the number of objects that each partition will hold in order to generate the offsets. That object count will be a fixed magic number and won't be changeable: if it could be changed via configuration, indexes and other internal modules would stop working correctly.

Partitions will be computed by consulting the packfile indexes when a PartitionIter is requested. All partitions will contain the same number of objects, except the last one. This operation should be fast enough to run per query; we can cache the partition names if necessary.
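
A minimal sketch of that repartitioning step, reusing the PartitionKey type from the sketch above and assuming the packfile's object offsets can be read from its index in sorted order (the objectsPerPartition value here is illustrative; the proposal only says it must be fixed):

```go
package partition

// objectsPerPartition is the fixed "magic number" from the proposal;
// the actual value would be chosen once and never changed.
const objectsPerPartition = 10000

// makePartitions splits a packfile's sorted object offsets into equally
// sized partitions; only the last one may hold fewer objects.
func makePartitions(repoID, packHash string, offsets []uint64) []PartitionKey {
	var parts []PartitionKey
	for i := 0; i < len(offsets); i += objectsPerPartition {
		last := i + objectsPerPartition - 1
		if last >= len(offsets) {
			last = len(offsets) - 1
		}
		parts = append(parts, PartitionKey{
			RepositoryID: repoID,
			PackfileHash: packHash,
			OffsetFrom:   offsets[i],
			OffsetTo:     offsets[last],
		})
	}
	return parts
}
```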

This approach also holds up under repository updates: partitions are tied to the packfile hash, and a packfile's content cannot change; it can only be deleted, or new packfiles can be created.

How to handle these partitions in each main table

Object tables (Commits, Blobs, Trees, Tree entries)

Each partition will only iterate the objects in its range, so no objects will be duplicated in the output.

References

Only references pointing to an object inside the partition's range will be output for that partition. We can do this using the packfile indexes, and it should be fast.
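
A hedged sketch of that filter, again reusing the PartitionKey type from above: resolve each reference's target hash to its packfile offset through the index, and keep the reference only if the offset falls inside the partition's range. The idxfile.Index interface comes from go-git v4, but treat the surrounding code as illustrative:

```go
package partition

import (
	"gopkg.in/src-d/go-git.v4/plumbing"
	"gopkg.in/src-d/go-git.v4/plumbing/format/idxfile"
)

// referencesFor returns the names of the references whose target object
// lives inside the given partition's offset range.
func referencesFor(part PartitionKey, refs map[string]plumbing.Hash, idx idxfile.Index) []string {
	var out []string
	for name, target := range refs {
		off, err := idx.FindOffset(target)
		if err != nil {
			continue // target object is not in this packfile
		}
		if uint64(off) >= part.OffsetFrom && uint64(off) <= part.OffsetTo {
			out = append(out, name)
		}
	}
	return out
}
```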

Files

Files will appear only in the partition where the root tree entry is present. It might be necessary to fetch tree_entries from other packfile ranges to generate the path using the repository.

Caveats

Problems

  • We need to check how go-git behaves when several threads use it concurrently.

Benefits

  • Partitions will be more homogeneous and will take roughly the same time to process, giving the user a good approximation of the query's size and the remaining time.

ajnavarro • Nov 16 '18

Files will appear only in the partition where the blob is present. It might be necessary to fetch tree_entries from other packfile ranges to generate the path using the repository.

I think it will be easier to implement and more consistent with the other tables if we partition according to the root tree_entry, not the blob hash.

Only references pointing to an object inside the partition's range will be output for that partition. We can do this using the packfile indexes, and it should be fast.

Can't we just partition them as any other table?

smola • Nov 28 '18

@smola I changed the files table to be partitioned by the root tree entry in the main issue description.

How can we make references work like any other table? They are not in the packfile, so they don't appear in the index, and we cannot apply the following partitioning to them:

[REPOSITORY_ID]+[PACKFILE_HASH]+[OFFSET_FROM]+[OFFSET_TO]

Because references are just that, references to an object in the packfile, we can simply show in each partition the references that point to objects inside that partition.

ajnavarro • Nov 28 '18

How can we make references work like any other table?

I guess they would need a different partitioning scheme. For example: REPOSITORY_ID+PARTITION_NUM, then selecting references with something like fast_hash(reference_name) % PARTITION_NUM.
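
A small Go sketch of that alternative scheme; FNV-1a stands in for fast_hash, and numPartitions for PARTITION_NUM (both are placeholders, not anything gitbase defines):

```go
package partition

import "hash/fnv"

const numPartitions = 16 // hypothetical PARTITION_NUM

// referencePartition assigns a reference to one of numPartitions buckets
// by hashing its name, independently of the object it points to.
func referencePartition(referenceName string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(referenceName)) // hash.Hash writes never return an error
	return h.Sum32() % numPartitions
}
```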

The problem I see with partitioning them based on what they point to is that they can reference tag objects or commit objects, so the logic is not that trivial?

smola • Nov 29 '18

@smola If they are objects, they will be in the index; it doesn't matter whether it is a tag object or a commit object.

Maybe I'm not completely understanding what you mean.

ajnavarro • Nov 29 '18

@smola Actually, I was thinking about ways to determine partitions upfront. But I see you're proposing to inspect repositories to find the partitions:

Partitions will be computed by consulting the packfile indexes when a PartitionIter is requested. All partitions will contain the same number of objects, except the last one. This operation should be fast enough to run per query; we can cache the partition names if necessary.

So, yeah, your approach of "references by offset of the object they point to" should work.

smola • Nov 29 '18

Static partition sizes have some benefits. For instance, you don't have to calculate a mapping of where everything is. On the other hand, they still won't be perfectly balanced. We could try to distribute objects across multiple partitions and track their sizes (a kind of round-robin), but that would require an extra mapping of where each object is, or which objects each partition holds. Last but not least, I'm not sure we need perfectly balanced partitions. For small sizes, it may be better to have one bigger partition instead of many tiny partitions (because of the extra iteration and context switching if we want to parallelize queries).

kuba-- • Jan 30 '19

With https://github.com/src-d/gitbase/issues/686 in mind, this proposal makes less sense.

We should think of other ways to parallelize queries. Instead of just parallelizing per repository, we can parallelize parts of the query tree after reading the data from the tables.

On the other hand, if we still want to create more than one partition per repository, we should think about how to split the DAG into several parallel parts, as another idea.

ajnavarro • Jan 30 '19

We based our parallelization logic on some ideas from this paper:

  • https://blog.acolyer.org/2015/02/11/encapsulation-of-parallelism-in-the-volcano-query-processing-system/
  • https://core.ac.uk/download/pdf/54846465.pdf

There you can see other ways to parallelize data processing.

ajnavarro • Jan 30 '19

Do we still want to do this @ajnavarro?

erizocosmico • Oct 09 '19

It will solve a lot of the problems that we currently have (unknown remaining query processing time, repositories taking a long time to process, and so on). I thought about partitioning at the repository level plus splitting the DAG into more or less equal parts, but I didn't have time for this.

ajnavarro • Oct 09 '19