volo Generate thrift code in multiple mods/files

Feature Request

Crates

volo-thrift / pilota

Motivation

Pilota often generates all code in a single file, which can result in extremely large files when the thrift IDL is substantial.

This issue is evident in hive_metastore, where the generated file size approaches 10MiB. Such large files can slow down IDEs and cause web pages to crash.

Try: https://docs.rs/hive_metastore/latest/src/hive_metastore/hms.rs.html#52510-52520

Proposal

Generate thrift code in multiple mods/files by struct / service.

Alternatives

None.

Jun 19 '24 11:06 Xuanwo

I think this is a workaround for the lack of IDE and website capabilities. However, it may be worth doing.

We need detailed design, especially in large-scale idl (thousands of structs, enums, etc.) to avoid IDE problems, website problems, and most importantly, naming conflicts caused by too many files, and to allow users to reference all files at once in a convenient way and maintain the namespace relationship, which will be a more troublesome part.

At present, this feature does not affect normal use and is not a high priority.

Jun 19 '24 13:06 PureWhiteWu

This issue becomes worse the more entites there are, with some files growing in size as large as several hunders megabyes.

For the design, the solution might be simply utilizing the include! macro to include the generated parts into the volo_gen.rs. For each entity, a separate file can be generated, named volo_gen_$N.rs, where $N is an incrementing counter. Then, this file will be included in volo_gen.rs using include!("volo_gen_$N.rs");.

On most filesystems, many files in a single directory can lead to performance degradations. To workaround this problem, a separate directory can be generated for each 1_000 or 1_000_000 files. Then, the files would be included like include!("directory_$M/volo_gen_$N.rs");.

This solution preserves the single entry point volo_gen.rs, while allowing to reduce its file size. If the resulting file volo_gen.rs gets too big even with this solution, then volo_gen_$N.rs files can be generated not for a single entity, but for a batch of entities, e.g. for each 10 entities.

Sep 09 '24 12:09 missingdays

I think maybe we can try separating by mods first? Or if the file is still too large, we can try separating by struct/enum/service/const?

For example, we have two idls article.thrift and author.thrift:

// article.thrift
namespace rs article

struct Article {
    1: required i64 id,
    2: required string title,
    3: required string content,
    4: required i64 author_id,
}

struct GetArticleRequest {
    1: required i64 id,
}

struct GetArticleResponse {
    1: required Article article,
}

service ArticleService {
    GetArticleResponse GetArticle(1: GetArticleRequest req),
}

// author.thrift
namespace rs article.author // just for example

struct Author {
    1: required i64 id,
    2: required string username,
    3: required string email,
}

In that case, we can generate the following layout:

 ┌─ volo_gen.rs
 │  ┌─ struct_Article.rs
 │  │  ┌─ mod.rs
 │  │  ├─ struct_Author.rs
 │  ├─ author
 │  ├─ struct_GetArticleResponse.rs
 │  ├─ struct_GetArticleRequest.rs
 │  ├─ service_ArticleService.rs
 │  ├─ mod.rs
 ├─ article
 $OUT_DIR

With the following example code:

// in volo_gen.rs
mod article {
    include!("article/mod.rs");
}

// in article/mod.rs
include!("struct_Article.rs");
include!("struct_GetArticleRequest.rs");
include!("struct_GetArticleResponse.rs");
include!("service_ArticleService.rs");

mod author {
    include!("author/mod.rs");
}

// in article/author/mod.rs
include!("struct_Author.rs");

This will result in more readable code than just generate something like volo_gen_1.rs, volo_gen_2.rs... What do you think?

Furthermore, I'm not quite familiar with how IDEs(RustRover, VSCode + rust-analyzer) deal with include! macro, will this still result in performance problems(such as, include all files together and then analyze, which will not solve the performance problem) or some feature disabled(such as doesn't support analyzing include! macro in include! macro)? @missingdays

Sep 09 '24 13:09 PureWhiteWu

That naming scheme is better for sure, and will make exploring the generated files easier.

will this still result in performance problems

While there's no 100% guarantee that no performance problems will arise, the main problem I see is the actual file size. To analyze files with include!, IDEs usually don't replace with macro call with the file contents, but rather analyze the included file separately.

some feature disabled

A well-formed files which contain valid Rust code should be well-supported by IDEs, even if they are include!d.

This generation scheme could also be optional, enabled only by an explicit flag. Or it can be turned on automatically only if the big scheme is detected. For example, if there are less than 1000 entities - generate a single file, and a lot of separate files otherwise.

Sep 09 '24 13:09 missingdays

@PureWhiteWu if the issue is not a high priority for you right now, I can try implementing it, though I would probably need some guidance on some entry points.

Sep 10 '24 08:09 missingdays

@missingdays Great! Thanks very much! In fact, we really want to realize this function earlier, but we have not had the manpower.

Feel free to contact @Millione if you need any help.

Sep 10 '24 09:09 PureWhiteWu

After looking around the code a bit, I noticed that there's a Workspace mode supported by pilota, and that @Millione has done some work in supporting this mode in volo-build/src/workspace.rs, though it seems to not be used at the moment

If I'm reading the implementation correctly, it creates a workspace with one crate for each definition. So it looks like it solves the same problem, but in a more straightforward way

Sep 10 '24 16:09 missingdays

In fact, the problem we encountered comes from the both workspace and normal mode. Although the workspace mode has split at the idl level, the single idl and its dependencies are still too complicated.

In addition, because we generate all common-used(used by multi idl) structures in a common library by default, the generated files in this library are very large (this is the problem we encountered when using RustRover).

Sep 11 '24 02:09 PureWhiteWu

This needs a support on the pilota side, I've created a pull request with the proposed solution here https://github.com/cloudwego/pilota/pull/270

Sep 23 '24 15:09 missingdays