datafaker
datafaker copied to clipboard
Architectural changes creating either modules or abstractions for the generators
Context
This is an excellent fork from JavaFaker that adds everything that the project commuters aren't approving. Most of the requests are either improving the library or adding new generators.
Problem state
One of the problems from the fork is that we got a lot of "not so focused and useful" generators, as a lot of movies something (in JavaFaker almost of half generators are movie-related).
Proposition
Either create modules for those kinds of generators (movies, for example) or the creation of one more abstraction layer to group different generators.
New Module
It would be a new library that can be added on the top of the main one. For example:
Main library with "most important generators"
<dependency>
<groupId>net.datafaker</groupId>
<artifactId>datafaker</artifactId>
<version>1.3.0</version>
</dependency>
Library extension
<dependency>
<groupId>net.datafaker</groupId>
<artifactId>datafaker-movies-module</artifactId>
<version>1.0.0</version>
</dependency>
Pros
- See only generators that make sense in some context
- Isolation during development, fixes, and releases
- Centralized contributions
Cons
- Hard to define the "most important generators"
- Possible changes in the current architecture
- Different places to maintain, as most of the generators would be modules
- Breaking change added for this adoption
- Time-consuming to implement
New abstraction
It consists of the addition of one more object during the Fluent Interface usage, like:
faker.medical().bloodType()...
faker.business().creditCardType()...
faker.movies().avatar()...
faker.personal().address()...
faker.sports().formula1()...
Pros
- Beter code usage from the consumer side, there it can focus on finding and using the generators
- More consistent architecture
- Code isolation
Cons
- Breaking change added for this adoption
Benefits
Also, in a corporate world, it's sad trying to use a generator and see a lot of those generators.
Final comments
I believe that the adoption of one of the approaches will bring benefits to this project. And I am open to helping in the implementation.
Hi @eliasnogueira , thanks for this great writeup. From a high level view, I agree with you, and I also think we should break up the data and the modules. However, as you've correctly identified, there's a few challenges, and I'll add a few more: ideally, I'd like to keep the API "stable" as much as possible, which is tricky with the way how Datafaker has been created. However, I don't think it's impossible, though I haven't attempted it yet, so I could be wrong.
So, what I have in mind is the following:
In the above solution, Datafaker-all would depend on all the individual modules, and would have the same API as Datafaker has right now, but it would just delegate the logic to the individual modules. The modules would depend on "core", which has the logic for loading ymls, executing expressions, etc, everything to make the individual modules work.
If then you, say in a corporate environment, only want the some core providers, such a the Person, Internet, Address, etc ones, you could depend on datafaker-people (or whatever this module would be named), and when I'm developing a "fun" sideproject, I'll either depend on datafaker-all, or datafaker-fun + datafaker-media or so.
In code, this could look like the following:
// when using datafaker-all
val allFaker = Faker()
println(faker.person().name())
println(faker.tv().show()) // would still work, since it would just delegate to datafaker-media
vs
// when using datafaker-media
val mediaFaker = MediaFaker()
println(faker.tv().show())
Would this perhaps be an approach which could work? I think that, in theory, it would address most of the cons you mentioned:
- Not needed to define what "core" providers are, since all providers have now been split to separate modules
- Pretty easy to maintain, it's just a maven multi module, and whenever you add something to a child module, the only thing you need to do is to add it to the "main" Faker too
But alternatively, you could also just create your own CorporateFaker, and only expose Datafakers providers you're interested in, and ignore the rest. Wouldn't that be just as simple?
Hi @bodiam, I loved this explanation and approach. I have only a few questions related to the diagram and possible architecture...
- wouldn't be the
datafaker-all
project just a BOM implementation?- I can see it as the collections of all possible modules, without the necessity to add the
datafaker-core
- I can see it as the collections of all possible modules, without the necessity to add the
- would
datafaker-core
be a dependency for anydatafaker-module
?- because in this way we can isolate any
module
version and still let people opt to use one or more modules
- because in this way we can isolate any
I don't know if the core would have some generators or if it would be only a core, where we need to add at least one module. I understood from one of the code snippets provided that you are thinking about having some in the core, is that right?
Please involve me if you will start a PoC to apply this. :-)
Hi @eliasnogueira ! No, it wouldn't be just a BOM, because of the current Faker API. It has hardcoded methods to the fakers, and I'd prefer not to break the api. So, the "faker-all", imo, must be a thin layer on top of all the modules to keep the api the same as it is now. Maybe it world be called faker-api or so, instead of faker-all, but just having a BOM with all the modules wouldn't allow you to write this like:
faker.address.street
For example, since "street" is a method on the faker class.
And yes, faker-core would be a dependency for any faker module indeed.
Note : the above is how I currently have it in my head and how I think it could work. There's no guarantee that it will work until we do some poc I guess.
Btw, Core itself would have 0 generators, it's just for the plumbing of loading files and doing lookups in the yml files.
5 more cents in that direction:
- currently providers are registered in a way like
public Address address() {
return getProvider(Address.class, () -> new Address(this));
}
it could be extracted into a separate e.g. CatalogProvider
with registration something like
public <T extends Address> T address() {
return (T) getProvider(Address.class, () -> new Address(faker));
}
it could allow to have a default implementation in core
module (existing legacy
which could be marked as deprecated). At the same time such approach allows to extend it within e.g. new modules or even client's code.
2. Probably for expressions SPI could be a handy feature
@snuyanzin @eliasnogueira It doesn't work 100% yet, tests are failing, etc, but I wanted to give a bit of insights in terms of code what I had in mind. This is just an experiment, I'll probably remove the branch, etc, but you can find the idea here:
https://github.com/datafaker-net/datafaker/tree/experimental-new-api
In short, I've moved the core infra into datafaker-core. Faker itself is a bit of a mess, and should be split in two parts, which I've tried to do with InternalFaker
vs Faker
. InternalFaker has the logic to evaluate expressions etc, while Faker has almost no logic: it's just a wrapper around everything in the project. You can see an example for EnglandFootball
and TheItCrowd
, which I've moved to datafaker-sports and datafaker-media.
I'm not sure if it's needed, but I like the idea, is to have a MediaFaker, and a SportsFaker. These can be used instead of the existing Faker class, which I'll call UberFaker for now. UberFaker will encapsulate all other module fakers, so as to break the existing API. Then, if you want to use only the basic stuff, like fake persons, addresses, etc, you can use BasicFaker, without the need for the Sports and Media. You'll still get a bit of extra functionality, such as the option to create your own fakers, maybe CSV (though I'm considering to move that also out of the core, into a datafaker-formats, or datafaker-json, but we'll see).
If you open the branch, it compiles, but it doesn't work. That is because I haven't moved the yml file yet, and because I disabled some code in FakeValuesGrouping. I think that's fixable.
What if, in this way to isolate/categorize the generator, we have:
- the categories as isolated modules (new artifacts that can be added)
- change in the instances, as each module will have their class
For example, datafaker-core
will remain with the all base code, and only
The possible Business Faker
module which would have for example company
and creditCard
generators would be used as:
FakerBusiness fakerBusiness = new FakerBusiness();
fakerBusiness.company().name();
fakerBusiness.creditCard().number();
For a Business Media
module/library:
FakerMedia fakerMedia = new FakerMedia();
fakerMedia.movies().lordOfTheRings().characters();
fakerMedia.tvShows().friends().joeyJokes();
Each Faker<module>
class will inherit all the necessary classes, loaders, etc...
So we can isolate the generators, files, etc...
This would change the DataFaker approach but would add a new identity to it, removing partially the roots and not done improvements inherited from JavaFaker.
Not sure what you mean. Isn't that how the new code is organized, except that it has a compatibility layer on top of it?
What do you think about factories?
Business fakerBusiness = Faker.getInstance(Business.class);
fakerBusiness.company().name();
fakerBusiness.creditCard().number();
or
Media fakerMedia = Faker.getInstance(Media.class);
fakerMedia.movies().lordOfTheRings().characters();
fakerMedia.tvShows().friends().joeyJokes();
it will allow to extract/load more fakers from libs. No need to remember all possible faker names and probably cope with clash of namings in future. Old way could still exist however being marked as deprecated .
Actually, it is @bodiam 😅 After writing that I also did an experiment locally, where I ended up with the same changes you did in the branch.
@snuyanzin this would bring, from my perspective, more clarity that the "user" is using an external dependency/library for that. I liked this approach.
@bodiam what do you think to create those different projects under https://github.com/datafaker-net to test the approach with, at least two data libraries as a private repo to prove its concept?
@snuyanzin the thing I like about the current API is that the IDE helps a lot. If it becomes dynamic, by providing a class a "getInstance", that clarity is a little lost I think. Besides, I see hardly no reason to use a getInstance method vs a new BusinessFaker()
, though maybe the 'getInstance' method can do some magic or so. Btw, I so like that organisation of stuff in "movies", "tvshows", etc, either in packages, modules, or both. I had a look a Rubyfaker, I think they make a pretty reasonable split to start with: https://github.com/faker-ruby/faker/tree/master/doc
correction I had another look, and I think their organisation is pretty poor actually. I think we can do better :-)
@eliasnogueira I'm happy to move the modules eventually to different projects under the datafaker.net github, but I think that can be done at a later stage: the branch with the Maven multimodule project works fine for the time being, and I see no advantage yet to split it up.
@eliasnogueira can you share your code branch, or if it's not on a branch, can you put it on one and push it? I'd love to see your work, I might get some inspiration from it. Thanks!
@bodiam I'm actually creating the same thing as you, but not as a monorepo...
I've created two different projects locally changing basically nothing, except the access of some protected
attributes/inner classes in the main Faker
class to be able to use the getLocale()
, getProvider()
and access the RandomService
and FakeValuesService
attributes outside the core lib.
I believe it's a matter of personal opinion... I don't like to work on a monorepo project when the modules are supposed to be independent of the others (of course except by the main core dependency). When we evolve all the modules at the same time, where we must have the same version for everything, the monorepo might help, but in the case of different code, versions, and isolation I can't see much benefit.
What do you prefer?
- working in the experimental branch to improve a little bit the modules
- see it as different projects
@bodiam May be getInstance
name is not the best one. There could also be something like of
e.g.
Business fakerBusiness = Faker.of(Business.class);
fakerBusiness.company().name();
fakerBusiness.creditCard().number();
The main difference is that this static method brings is that under the hood there could be logic for different fakers loaded from libs. E.g. it could load and return not only generator for Business
data but also any of it's successor.
In case new
each time it requires code changes to
If it becomes dynamic, by providing a class a "getInstance", that clarity is a little lost I think.
The only possible place of loosing clarity I see here is args for getInstance
, of
, or whatever. IMHO it could be solved with e.g. marker interface for fakers. Or is there anything else I miss?
@snuyanzin My apologies, maybe I wasn't clear. What I meant was is that I'm not really a big fan of Faker.<method>(<some-provider>)
. How would a user know which providers there are with only using their IDE?
@eliasnogueira I prefer neither option, and I def don't see it as different projects. It adds the overhead of having multiple projects be on different branches, while having to make sure modules are compatible with the main core. I don't see a benefit at all for doing that.
To be honest, I hardly see a reason for this change at all. As I said before, if the number of fakers is in the way for one way or the other, make a Faker on top which hides all of them except for the one you need. The data is loaded on request, so that shouldn't be an issue, the binary of this library is quite small, so I'm just wondering which problem we're trying to solve here. I like the potential idea of it, but that's it, it's not a problem high on my list of addressing, if it's a problem at all.
But if it is a problem, as I said before, share some code please.
How would a user know which providers there are with only using their IDE?
imagine there is a marker interface
public interface Fake {
}
which providers implement and there is a method
public static <T extends Fake> T of(Class<T> clazz) {
...
};
Then just looking at implementers of Fake
interface a user can see all the providers
@snuyanzin When I'm typing: Faker.of(....
, I'm just not sure if that would give me easy code completion besides looking at all the implementers of the Fake interface. I think the current code completion is a bit easier, or am I wrong in this?
@bodiam I have some main points
- Not see thousands of generators when I use
faker....
- Evolve the architecture inherited by JavaFaker
- Make the user choose the generators he wants
- Try to improve the general codebase by adding more serious data
One thing that really bothers me, personally, is to use faker.
and see things like babylon5
that even make sense for me, and I bet it does not make sense for 99% of the users.
I'm trying to find a library that can accept serious contributions and not merge things like the babylon5
instead of the CPFGenation
or any kind of serious and good contributions. I want a reliable library, an extensible one, that can be used for real in companies.
That's why, I think, we start a conversation in the JavaFaker issue and we have this ticket.
I understand you might not want to evolve it because it's working, but in the other wand you would end up with thousands of uncategorized generators, unused ones, and old inherited code.
Even professionally we evolve and categorized the data approach, most of the time in the way @snuyanzin mentioned as Factories, and as you can see me teaching others with this as an example
We want to help you, as the main commuter and owner, to evolve it and make it the main replacement for JavaFaker. :-) If it's not changing the architecture, the abstraction layer implementation as the Ruby library has would be awesome to see.
@eliasnogueira I told you a few times before : create your own Faker, expose the methods of the Datafaker class you need and delegate to this version of the faker.
Sorry @bodiam but I not getting the advice. Do you mean that I can create, for me internally, a layer on top of DataFaker to manipulate the generator the way I want?
class EliasFaker {
private Faker faker = new Faker();
public Address address() {
return faker.address();
}
public DateAndTime date() {
return faker.dateAndTime();
}
}
var betterFaker = new EliasFaker();
betterFaker.<ctrl+space>
Now the code completion will only show the address and date prefix, and it will be hard for anyone to see all the extra methods in your code. Wouldn't that be an option for you?
Hi @bodiam , It is not an option for me because, in this case, I can use whatever library I want to do the job, and the main focus of the issue is to understand if we can, together, improve datafaker.
Thank you for the help and effort you put into this discussion.
@bodiam I tend to agree that the API in its current state may become problematic. The Faker class will become more and more polluted over time. I think splitting things out into the separate domains you were referencing would be a good start. Another middle ground might be to provide the Faker.of(Foo.class) factory as something available from faker-core, in order to provide flexibility in order to meet the request here, while retaining the API you prefer.
@bodiam ok, I got your point about code completion... (I hope completion for of
case it will be solved by IDE vendors in future).
Besides that there is another issue where constructors will not help.
There are some users of expression
functionality which allows via expressions invocation of different providers e.g.
faker.expression("#{options.option '23','2','5','$','%','*'}"); // *
faker.expression("#{date.birthday 'yy DDD hh:mm:ss'}"); // 61 327 08:11:45
There is an idea to support SPI. That means that a custom provider could be done in a separate project.
And there is just a user, e.g. flink-faker
user (typically not a java/kotlin developer person) could just add a prebuilt jars to start using this custom providers. So this a case where such static methods could help.
What do you think to keep both functionality at the moment? Ok. there is a number of providers in the project which are accessible via api. + static methods which could be used for both: internal and external providers?
Well now there is some separation done. Currently it is still within the same artifact however it is possible to work with or without videogames/movies/sport...
For those who work with Faker
and want to continue work with it nothing will be changed.
So it is still possible to have code like
Faker faker = new faker();
faker.movies().avatar()...
faker.personal().address()...
faker.sports().formula1()...
At the same time if there is a wish to work with only some base things then BaseFaker
could be used e.g.
BaseFaker faker = new Faker();
faker.number()...
Same for movies or sport or videogames
MovieFaker faker = new MovieFaker();
faker.movies().avatar()...
Ok, what if I want to work with sport and movie but without videogames? it is also possible, you need just create a class implementing marker interfaces (no method implementation is required)
class MyFaker extends BaseFaker implements MovieProviders, SportProviders { }
MyFaker faker = new MyFaker();
faker.movies().avatar()...
faker.sports().formula1()...
and that's it.
And yes, all autocompletion things within IDE will work accordingly
// cc @eliasnogueira i guess it should be more or less closer to what you've described may be except a separate maven module
@snuyanzin this approach is not closer, but exactly an ideal one to provide more clarity and code organization! But you've closed your PR :(
What's a possible next step now?
The changes that I described in my previous comment are already in main repo. The PR that I closed i did in much smaller steps with less code changes but similar functionality and without extractions of maven modules.
The possible steps:
- probably to see what else could be extracted
- something else
Thank you @snuyanzin ! I will take a look and do some "experiments".
What would be the best approach for you? a direct PR creation or an issue first to discuss the change?
Depends on scope if it is an easy fix or something like that then PR is fine If it is something that probably should be discussed or could bring a large impact then i think should be an issue with discussion