datafaker Architectural changes creating either modules or abstractions for the generators

Context

This is an excellent fork from JavaFaker that adds everything that the project commuters aren't approving. Most of the requests are either improving the library or adding new generators.

Problem state

One of the problems from the fork is that we got a lot of "not so focused and useful" generators, as a lot of movies something (in JavaFaker almost of half generators are movie-related).

Proposition

Either create modules for those kinds of generators (movies, for example) or the creation of one more abstraction layer to group different generators.

New Module

It would be a new library that can be added on the top of the main one. For example:

Main library with "most important generators"

<dependency>
    <groupId>net.datafaker</groupId>
    <artifactId>datafaker</artifactId>
    <version>1.3.0</version>
</dependency>

Library extension

<dependency>
    <groupId>net.datafaker</groupId>
    <artifactId>datafaker-movies-module</artifactId>
    <version>1.0.0</version>
</dependency>

Pros

See only generators that make sense in some context
Isolation during development, fixes, and releases
Centralized contributions

Cons

Hard to define the "most important generators"
Possible changes in the current architecture
Different places to maintain, as most of the generators would be modules
Breaking change added for this adoption
Time-consuming to implement

New abstraction

It consists of the addition of one more object during the Fluent Interface usage, like:

faker.medical().bloodType()...
faker.business().creditCardType()...
faker.movies().avatar()...
faker.personal().address()...
faker.sports().formula1()...

Pros

Beter code usage from the consumer side, there it can focus on finding and using the generators
More consistent architecture
Code isolation

Cons

Breaking change added for this adoption

Benefits

Also, in a corporate world, it's sad trying to use a generator and see a lot of those generators.

Final comments

I believe that the adoption of one of the approaches will bring benefits to this project. And I am open to helping in the implementation.

May 15 '22 12:05 eliasnogueira

Hi @eliasnogueira , thanks for this great writeup. From a high level view, I agree with you, and I also think we should break up the data and the modules. However, as you've correctly identified, there's a few challenges, and I'll add a few more: ideally, I'd like to keep the API "stable" as much as possible, which is tricky with the way how Datafaker has been created. However, I don't think it's impossible, though I haven't attempted it yet, so I could be wrong.

So, what I have in mind is the following:

datafaker

In the above solution, Datafaker-all would depend on all the individual modules, and would have the same API as Datafaker has right now, but it would just delegate the logic to the individual modules. The modules would depend on "core", which has the logic for loading ymls, executing expressions, etc, everything to make the individual modules work.

If then you, say in a corporate environment, only want the some core providers, such a the Person, Internet, Address, etc ones, you could depend on datafaker-people (or whatever this module would be named), and when I'm developing a "fun" sideproject, I'll either depend on datafaker-all, or datafaker-fun + datafaker-media or so.

In code, this could look like the following:

// when using datafaker-all
val allFaker = Faker()
println(faker.person().name())
println(faker.tv().show()) // would still work, since it would just delegate to datafaker-media

vs

// when using datafaker-media
val mediaFaker = MediaFaker()
println(faker.tv().show())

Would this perhaps be an approach which could work? I think that, in theory, it would address most of the cons you mentioned:

Not needed to define what "core" providers are, since all providers have now been split to separate modules
Pretty easy to maintain, it's just a maven multi module, and whenever you add something to a child module, the only thing you need to do is to add it to the "main" Faker too

But alternatively, you could also just create your own CorporateFaker, and only expose Datafakers providers you're interested in, and ignore the rest. Wouldn't that be just as simple?

May 15 '22 13:05 bodiam

Hi @bodiam, I loved this explanation and approach. I have only a few questions related to the diagram and possible architecture...

wouldn't be the datafaker-all project just a BOM implementation?
- I can see it as the collections of all possible modules, without the necessity to add the datafaker-core
would datafaker-core be a dependency for any datafaker-module?
- because in this way we can isolate any module version and still let people opt to use one or more modules

I don't know if the core would have some generators or if it would be only a core, where we need to add at least one module. I understood from one of the code snippets provided that you are thinking about having some in the core, is that right?

Please involve me if you will start a PoC to apply this. :-)

May 15 '22 15:05 eliasnogueira

Hi @eliasnogueira ! No, it wouldn't be just a BOM, because of the current Faker API. It has hardcoded methods to the fakers, and I'd prefer not to break the api. So, the "faker-all", imo, must be a thin layer on top of all the modules to keep the api the same as it is now. Maybe it world be called faker-api or so, instead of faker-all, but just having a BOM with all the modules wouldn't allow you to write this like:

faker.address.street

For example, since "street" is a method on the faker class.

And yes, faker-core would be a dependency for any faker module indeed.

Note : the above is how I currently have it in my head and how I think it could work. There's no guarantee that it will work until we do some poc I guess.

Btw, Core itself would have 0 generators, it's just for the plumbing of loading files and doing lookups in the yml files.

May 15 '22 23:05 bodiam

5 more cents in that direction:

currently providers are registered in a way like

    public Address address() {
        return getProvider(Address.class, () -> new Address(this));
    }

it could be extracted into a separate e.g. CatalogProvider with registration something like

    public <T extends Address> T address() {
        return (T) getProvider(Address.class, () -> new Address(faker));
    }

it could allow to have a default implementation in core module (existing legacy which could be marked as deprecated). At the same time such approach allows to extend it within e.g. new modules or even client's code. 2. Probably for expressions SPI could be a handy feature

May 16 '22 21:05 snuyanzin

@snuyanzin @eliasnogueira It doesn't work 100% yet, tests are failing, etc, but I wanted to give a bit of insights in terms of code what I had in mind. This is just an experiment, I'll probably remove the branch, etc, but you can find the idea here:

https://github.com/datafaker-net/datafaker/tree/experimental-new-api

In short, I've moved the core infra into datafaker-core. Faker itself is a bit of a mess, and should be split in two parts, which I've tried to do with InternalFaker vs Faker. InternalFaker has the logic to evaluate expressions etc, while Faker has almost no logic: it's just a wrapper around everything in the project. You can see an example for EnglandFootball and TheItCrowd, which I've moved to datafaker-sports and datafaker-media.

I'm not sure if it's needed, but I like the idea, is to have a MediaFaker, and a SportsFaker. These can be used instead of the existing Faker class, which I'll call UberFaker for now. UberFaker will encapsulate all other module fakers, so as to break the existing API. Then, if you want to use only the basic stuff, like fake persons, addresses, etc, you can use BasicFaker, without the need for the Sports and Media. You'll still get a bit of extra functionality, such as the option to create your own fakers, maybe CSV (though I'm considering to move that also out of the core, into a datafaker-formats, or datafaker-json, but we'll see).

If you open the branch, it compiles, but it doesn't work. That is because I haven't moved the yml file yet, and because I disabled some code in FakeValuesGrouping. I think that's fixable.

May 21 '22 04:05 bodiam

What if, in this way to isolate/categorize the generator, we have:

the categories as isolated modules (new artifacts that can be added)
change in the instances, as each module will have their class

For example, datafaker-core will remain with the all base code, and only The possible Business Faker module which would have for example company and creditCard generators would be used as:

FakerBusiness fakerBusiness = new FakerBusiness();
fakerBusiness.company().name();
fakerBusiness.creditCard().number();

For a Business Media module/library:

FakerMedia fakerMedia = new FakerMedia();
fakerMedia.movies().lordOfTheRings().characters();
fakerMedia.tvShows().friends().joeyJokes();

Each Faker<module> class will inherit all the necessary classes, loaders, etc... So we can isolate the generators, files, etc...

This would change the DataFaker approach but would add a new identity to it, removing partially the roots and not done improvements inherited from JavaFaker.

May 21 '22 11:05 eliasnogueira

Not sure what you mean. Isn't that how the new code is organized, except that it has a compatibility layer on top of it?

May 21 '22 13:05 bodiam

What do you think about factories?

Business fakerBusiness = Faker.getInstance(Business.class);
fakerBusiness.company().name();
fakerBusiness.creditCard().number();

or

Media fakerMedia = Faker.getInstance(Media.class);
fakerMedia.movies().lordOfTheRings().characters();
fakerMedia.tvShows().friends().joeyJokes();

it will allow to extract/load more fakers from libs. No need to remember all possible faker names and probably cope with clash of namings in future. Old way could still exist however being marked as deprecated .

May 21 '22 14:05 snuyanzin

Actually, it is @bodiam 😅 After writing that I also did an experiment locally, where I ended up with the same changes you did in the branch.

@snuyanzin this would bring, from my perspective, more clarity that the "user" is using an external dependency/library for that. I liked this approach.

@bodiam what do you think to create those different projects under https://github.com/datafaker-net to test the approach with, at least two data libraries as a private repo to prove its concept?

May 21 '22 14:05 eliasnogueira

@snuyanzin the thing I like about the current API is that the IDE helps a lot. If it becomes dynamic, by providing a class a "getInstance", that clarity is a little lost I think. Besides, I see hardly no reason to use a getInstance method vs a new BusinessFaker(), though maybe the 'getInstance' method can do some magic or so. Btw, I so like that organisation of stuff in "movies", "tvshows", etc, either in packages, modules, or both. I had a look a Rubyfaker, I think they make a pretty reasonable split to start with: https://github.com/faker-ruby/faker/tree/master/doc

correction I had another look, and I think their organisation is pretty poor actually. I think we can do better :-)

@eliasnogueira I'm happy to move the modules eventually to different projects under the datafaker.net github, but I think that can be done at a later stage: the branch with the Maven multimodule project works fine for the time being, and I see no advantage yet to split it up.

May 21 '22 14:05 bodiam

@eliasnogueira can you share your code branch, or if it's not on a branch, can you put it on one and push it? I'd love to see your work, I might get some inspiration from it. Thanks!

May 21 '22 15:05 bodiam

@bodiam I'm actually creating the same thing as you, but not as a monorepo... I've created two different projects locally changing basically nothing, except the access of some protected attributes/inner classes in the main Faker class to be able to use the getLocale(), getProvider() and access the RandomService and FakeValuesService attributes outside the core lib.

I believe it's a matter of personal opinion... I don't like to work on a monorepo project when the modules are supposed to be independent of the others (of course except by the main core dependency). When we evolve all the modules at the same time, where we must have the same version for everything, the monorepo might help, but in the case of different code, versions, and isolation I can't see much benefit.

What do you prefer?

working in the experimental branch to improve a little bit the modules
see it as different projects

May 21 '22 15:05 eliasnogueira

@bodiam May be getInstance name is not the best one. There could also be something like of e.g.

Business fakerBusiness = Faker.of(Business.class);
fakerBusiness.company().name();
fakerBusiness.creditCard().number();

The main difference is that this static method brings is that under the hood there could be logic for different fakers loaded from libs. E.g. it could load and return not only generator for Business data but also any of it's successor. In case new each time it requires code changes to

If it becomes dynamic, by providing a class a "getInstance", that clarity is a little lost I think.

The only possible place of loosing clarity I see here is args for getInstance, of, or whatever. IMHO it could be solved with e.g. marker interface for fakers. Or is there anything else I miss?

May 21 '22 15:05 snuyanzin

@snuyanzin My apologies, maybe I wasn't clear. What I meant was is that I'm not really a big fan of Faker.<method>(<some-provider>). How would a user know which providers there are with only using their IDE?

May 21 '22 23:05 bodiam

@eliasnogueira I prefer neither option, and I def don't see it as different projects. It adds the overhead of having multiple projects be on different branches, while having to make sure modules are compatible with the main core. I don't see a benefit at all for doing that.

To be honest, I hardly see a reason for this change at all. As I said before, if the number of fakers is in the way for one way or the other, make a Faker on top which hides all of them except for the one you need. The data is loaded on request, so that shouldn't be an issue, the binary of this library is quite small, so I'm just wondering which problem we're trying to solve here. I like the potential idea of it, but that's it, it's not a problem high on my list of addressing, if it's a problem at all.

But if it is a problem, as I said before, share some code please.

May 22 '22 06:05 bodiam

How would a user know which providers there are with only using their IDE?

imagine there is a marker interface

public interface Fake {
}

which providers implement and there is a method

    public static <T extends Fake> T of(Class<T> clazz) {
       ...
    };

Then just looking at implementers of Fake interface a user can see all the providers

May 22 '22 06:05 snuyanzin

@snuyanzin When I'm typing: Faker.of(...., I'm just not sure if that would give me easy code completion besides looking at all the implementers of the Fake interface. I think the current code completion is a bit easier, or am I wrong in this?

May 22 '22 07:05 bodiam

@bodiam I have some main points

Not see thousands of generators when I use faker....
Evolve the architecture inherited by JavaFaker
Make the user choose the generators he wants
Try to improve the general codebase by adding more serious data

One thing that really bothers me, personally, is to use faker. and see things like babylon5 that even make sense for me, and I bet it does not make sense for 99% of the users.

I'm trying to find a library that can accept serious contributions and not merge things like the babylon5 instead of the CPFGenation or any kind of serious and good contributions. I want a reliable library, an extensible one, that can be used for real in companies.

That's why, I think, we start a conversation in the JavaFaker issue and we have this ticket.

I understand you might not want to evolve it because it's working, but in the other wand you would end up with thousands of uncategorized generators, unused ones, and old inherited code.

Even professionally we evolve and categorized the data approach, most of the time in the way @snuyanzin mentioned as Factories, and as you can see me teaching others with this as an example

We want to help you, as the main commuter and owner, to evolve it and make it the main replacement for JavaFaker. :-) If it's not changing the architecture, the abstraction layer implementation as the Ruby library has would be awesome to see.

May 22 '22 13:05 eliasnogueira

@eliasnogueira I told you a few times before : create your own Faker, expose the methods of the Datafaker class you need and delegate to this version of the faker.

May 22 '22 22:05 bodiam

Sorry @bodiam but I not getting the advice. Do you mean that I can create, for me internally, a layer on top of DataFaker to manipulate the generator the way I want?

May 23 '22 08:05 eliasnogueira

class EliasFaker {
   private Faker faker = new Faker();

   public Address address() {
       return faker.address();
   }

   public DateAndTime date() {
      return faker.dateAndTime();
   }
}

var betterFaker = new EliasFaker();
betterFaker.<ctrl+space>

Now the code completion will only show the address and date prefix, and it will be hard for anyone to see all the extra methods in your code. Wouldn't that be an option for you?

May 23 '22 10:05 bodiam

Hi @bodiam , It is not an option for me because, in this case, I can use whatever library I want to do the job, and the main focus of the issue is to understand if we can, together, improve datafaker.

Thank you for the help and effort you put into this discussion.

May 23 '22 10:05 eliasnogueira

@bodiam I tend to agree that the API in its current state may become problematic. The Faker class will become more and more polluted over time. I think splitting things out into the separate domains you were referencing would be a good start. Another middle ground might be to provide the Faker.of(Foo.class) factory as something available from faker-core, in order to provide flexibility in order to meet the request here, while retaining the API you prefer.

May 24 '22 03:05 jpeffer

@bodiam ok, I got your point about code completion... (I hope completion for of case it will be solved by IDE vendors in future).

Besides that there is another issue where constructors will not help.

There are some users of expression functionality which allows via expressions invocation of different providers e.g.

faker.expression("#{options.option '23','2','5','$','%','*'}"); // *
faker.expression("#{date.birthday 'yy DDD hh:mm:ss'}"); // 61 327 08:11:45

There is an idea to support SPI. That means that a custom provider could be done in a separate project. And there is just a user, e.g. flink-faker user (typically not a java/kotlin developer person) could just add a prebuilt jars to start using this custom providers. So this a case where such static methods could help.

What do you think to keep both functionality at the moment? Ok. there is a number of providers in the project which are accessible via api. + static methods which could be used for both: internal and external providers?

May 24 '22 20:05 snuyanzin

Well now there is some separation done. Currently it is still within the same artifact however it is possible to work with or without videogames/movies/sport...

For those who work with Faker and want to continue work with it nothing will be changed. So it is still possible to have code like

Faker faker = new faker();
faker.movies().avatar()...
faker.personal().address()...
faker.sports().formula1()...

At the same time if there is a wish to work with only some base things then BaseFaker could be used e.g.

BaseFaker faker = new Faker();
faker.number()...

Same for movies or sport or videogames

MovieFaker faker = new MovieFaker();
faker.movies().avatar()...

Ok, what if I want to work with sport and movie but without videogames? it is also possible, you need just create a class implementing marker interfaces (no method implementation is required)

class MyFaker extends BaseFaker implements MovieProviders, SportProviders { }
MyFaker faker = new MyFaker();
faker.movies().avatar()...
faker.sports().formula1()...

and that's it.

And yes, all autocompletion things within IDE will work accordingly

// cc @eliasnogueira i guess it should be more or less closer to what you've described may be except a separate maven module

Sep 24 '22 08:09 snuyanzin

@snuyanzin this approach is not closer, but exactly an ideal one to provide more clarity and code organization! But you've closed your PR :(

What's a possible next step now?

Sep 24 '22 13:09 eliasnogueira

The changes that I described in my previous comment are already in main repo. The PR that I closed i did in much smaller steps with less code changes but similar functionality and without extractions of maven modules.

The possible steps:

probably to see what else could be extracted
something else

Sep 24 '22 13:09 snuyanzin

Thank you @snuyanzin ! I will take a look and do some "experiments".

What would be the best approach for you? a direct PR creation or an issue first to discuss the change?

Sep 24 '22 13:09 eliasnogueira

Depends on scope if it is an easy fix or something like that then PR is fine If it is something that probably should be discussed or could bring a large impact then i think should be an issue with discussion

Sep 24 '22 13:09 snuyanzin