cookiecutter-data-science icon indicating copy to clipboard operation
cookiecutter-data-science copied to clipboard

Rename `src` directory to `project_name`

Open oczkoisse opened this issue 6 years ago • 11 comments

First off, thanks for the wonderful template! I think this should work very well for my future projects. However, I could appreciate your thoughts on some aspects of the template:

  • Filenames like make_features.py and make_dataset.py in the src package suggest that those are scripts, not modules. Wouldn't it be better separation to let the reusable code live in modules inside src package while the scripts reside in a scripts directory in project root? This way both the scripts and the notebooks could use the modules inside the src package. Could you please elaborate the rationale behind the current template layout, which seems to have just scripts in the src package?

  • If I make my own project-specific packages, do they go under src? If I have a custom package that models, say, the dataset structure, does it make more sense to put it under src/data instead?

One minor nitpick: the name src could have been renamed to the project name itself. It would be clearer, I think, and also integrate well with a popular project structure suggested here, although I do understand that it may be a matter of personal preference.

oczkoisse avatar Aug 30 '18 08:08 oczkoisse

As someone who recently discovered this template:

I had the same initial thought, but I think (and someone please correct me if I am wrong) that one should view the src subfolders (data, models, features) as separate apps. In that case, it makes sense that the make_dataset.py and make_features.py are stored in the folder specific to the app as a "main" file. This structure has high functional cohesion and is similar to how the projects in the web framework Django is organised: https://simpleisbetterthancomplex.com/tutorial/2018/08/27/how-to-create-custom-django-management-commands.html

One minor nitpick: the name src could have been renamed to the project name itself. It would be clearer, I think, and also integrate well with a popular project structure suggested here, although I do understand that it may be a matter of personal preference.

As you suggest, this is probably a matter of personal preference. For instance, I would usually call such a folder python because I usually also have a significant amount of SQL and bash scripts in sql and bash folders, respectively.

TobiasSkovgaardJepsen avatar Aug 31 '18 17:08 TobiasSkovgaardJepsen

(1) Agreed that src -> {{project name}} makes sense. We'd accept that PR. (2) At times it can be useful to distinguish scripts from packages. In our experience, we don't find that we need this distinction in most of the ML/DS projects we do. Often our code that ends up in a package also has a CLI that you can enter like a script. If projects get built up to the point where they have scripts and packages, users can create a new folder. Just not sure there is a strong case for this to be built in as boilerplate for the default. I think the simplicity of one folder for source code is probably best for now.

pjbull avatar Aug 31 '18 20:08 pjbull

@TobiasSkovgaardJepsen Thanks for replying! I agree that considering the src subfolders as separate apps provides better separation and cohesion. Maybe I missed something in the documentation, but having no background in Django, this wasn't immediately clear to me. The fact that src is a package made me immediately think of the subfolders as subpackages, and everything in them as modules. My bad :) I do think, however, that this rationale could be briefly mentioned in the docs for better understanding for newcomers.

@pjbull I'd be happy to create a PR at some point, given that I'll probably need to get up to speed on cookiecutter. Along with @TobiasSkovgaardJepsen 's reply above, your clarification about the use of modules like CLI apps makes a lot of sense too. Thanks!

oczkoisse avatar Sep 01 '18 03:09 oczkoisse

Renaming this issue to represent (1) above so that we can be in line with Python best practices, which in turn should help with any confusion about where packages are installed from as has been brought up in #124

pjbull avatar Oct 24 '18 18:10 pjbull

I've been running with a variant of this idea for a couple of months. I added a {{ cookiecutter.module_name }} as there have been cases where the project name is a little unwieldy for use as the module name. I also set the default to src, so the existing behavior is unchanged unless you choose to use it.

In practice, I tend to use src until there's a reason to rename the module. (often, that's never).

I'll put together a diff for this

hackalog avatar Oct 25 '18 01:10 hackalog

On the make_features.py and so on, I think these should be invoked as modules. e.g. My makefile does things like:

fetch:
	$(PYTHON_INTERPRETER) -m {{ cookiecutter.module_name }}.data.make_dataset fetch

that -m is important, as the make_dataset.py can then do relative imports from within the src module itself

hackalog avatar Oct 25 '18 01:10 hackalog

I've made these changes (both adding module_name and using -m when invoking things like make_dataset.py in PR #148

hackalog avatar Oct 25 '18 01:10 hackalog

I was about to open a new issue for this and then found this. So I'm adding my thoughts here. I think this thread is moving in the right direction.

I do think that the cookiecutter should be using module_name as the default (@hackalog ). Or at least as one of the questions during the cookiecutter process you should be asking whether you want to use module_name or src. Here's my reasoning and I hope it comes across as simple and good-intentioned.

First, every python package on the planet uses its own name and not src. We'd be just following convention/best practices (which seems to be in line with what @pjbull is saying). Second, there's a very immediate problem if you use src for every package/egg you build using this pipeline. Specifically, you start having conflicts as soon as you have more than one package. So you if you have cat_herding and predictive_cats both wind up installing themselves to src. It shouldn't be src.models.predict for both. It should be cat_herding.models.predict and predictive_cats.models.predict.

My guess is that the original intent of this project was to act as a good template for just playing around. However, at some point you might want to build an actual package out of it. All the architecture is already there, it's just currently breaking convention.

Keep up the great work.

BadToast avatar Oct 25 '18 14:10 BadToast

@pjbull, any news on this front, regarding @hackalog's PR? Thanks!

leblancfg avatar Jan 09 '19 18:01 leblancfg

Changing the package name away from src has many benefits as pointed out above.

What can be considered for the directory structure though would not be a simple rename of the src directory to {{project name}}. The disadvantage of this is then that we have project name/project name where one would immediately ask why there is a subfolder with exactly the same name as the parent. A way out of that would be to adopt the directory structure similar to what is suggested by the pyscaffold people:

./project name
    /docs
    /[...]
    /src
        /project name
            __init__.py
            skeleton.py
            /models
                predict.py
                [...]

    /[...]

Then we keep the nice high level directory description of the underlying files and still a descriptive name for the module.

asteppke avatar Sep 03 '19 16:09 asteppke

The disadvantage of this is then that we have project name/project name where one would immediately ask why there is a subfolder with exactly the same name as the parent.

Sorry, but that really does not sound like a strong argument in any sense? Similarly, the first question people would ask in the structure that you proposed is "Why is there a subfolder in src with the same name as the parent of src?"

nraw avatar Nov 13 '19 23:11 nraw

Closing—implemented for V2.

jayqi avatar Aug 30 '23 21:08 jayqi