kedro-datasets: dependencies and package structure. Are we doing the right thing?
Context
Let’s pause and take stock of where we are in https://github.com/kedro-org/kedro/issues/1457. This is where I think things stand:
- @idanov planned for us to move kedro datasets into a new package, `kedro-datasets`. This would mean users do `pip install kedro-datasets[pandas.CSVDataSet]` and imports become `from kedro_datasets import ...`
- @deepyaman suggested using a namespaced package for `kedro-datasets`. In short, this would mean that it’s still a separate `pip install`able package but the import path would still come from the `kedro` namespace: `from kedro.datasets import ...`
- this was generally agreed to be a good idea. The motivation for splitting out `kedro-datasets` is more for distribution purposes rather than us suggesting that datasets could be used independently of kedro
- this would mean that instead of doing `pip install kedro[pandas.CSVDataSet]`, a user would do `pip install kedro kedro-datasets[pandas.CSVDataSet]`. I argued that this doesn’t seem like such a smooth user journey and also it’s actually a bit confusing to `pip install kedro-datasets` but then import `from kedro.datasets` rather than `from kedro_datasets`
- hence we decided we would maintain the “redirect” in which `kedro`’s `extras_require` would ensure that doing `pip install kedro[pandas.CSVDataSet]` would work as it does now. The intention with this is not purely backwards compatibility: it would be the recommended way to install `kedro-datasets`, so that e.g. even in requirements.txt files you would not specify `kedro-datasets` but instead `kedro[...]`. See #1495 for more details
- @noklam raised a very good question about how documentation would work for `kedro-datasets`: #1651. We decided that it should remain part of the core kedro documentation (i.e. live in the same place as the API docs on RTD that it does now). I set out a plan for how we could achieve this, but it’s very complicated and not 100% satisfactory
- while trying to come up with a solution for the documentation question, my spidey sense started tingling. Something didn’t feel quite right, and I thought that beyond the complexity of handling documentation, there may be some deeper issue here with how we’re handling `kedro-datasets`. I discussed with @deepyaman briefly, who had some interesting ideas
Note. Regardless of whether it's a namespace package or not, most times something from `kedro-datasets` is used you wouldn’t actually need to do this import explicitly, since in the data catalog you don’t need to specify the full import path to the dataset type but rather just `pandas.CSVDataSet`.
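To make the note above concrete, here is a minimal sketch of how a short type name such as `pandas.CSVDataSet` could be resolved by trying a list of package prefixes in order. This is not kedro's actual implementation; the `resolve_type` helper and the prefixes used below are assumptions for illustration, and the demo resolves a standard-library class so that it runs without kedro installed.

```python
import importlib


def resolve_type(short_name: str, prefixes: list[str]) -> type:
    """Try each prefix in order and return the first class that can be imported.

    Hypothetical helper: the prefix list would play the role of e.g.
    "kedro.datasets." or "kedro_datasets." in the discussion above.
    """
    for prefix in prefixes:
        module_path, _, class_name = (prefix + short_name).rpartition(".")
        if not module_path:
            continue  # nothing to import for a bare class name
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # this prefix does not provide the module; try the next
        if hasattr(module, class_name):
            return getattr(module, class_name)
    raise ValueError(f"Cannot resolve {short_name!r} with prefixes {prefixes!r}")


# Demo with the standard library: "decoder.JSONDecoder" is found under "json."
decoder_cls = resolve_type("decoder.JSONDecoder", ["nonexistent_pkg_xyz.", "json."])
print(decoder_cls)
```

With a lookup like this, the catalog entry stays the same whichever package ends up providing the class; only the prefix list changes.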
Concerns
The current concerns are (feel free to add if anyone has any others):
- are we making bad circular dependencies?
- is there just a whole better packaging model that we’ve not considered? e.g. metapackage, which doesn’t directly conflict with the current approach but would influence decisions we make now
Overall, the `kedro-datasets` work is quite complex. When it was first planned, we were not aware of the possibility of a namespace package, which changes the way we think about it quite a bit. I am concerned that we have not quite got the scheme right yet and might be missing something that would reduce overall complexity. My suggestion to resolve the situation:
- let’s discuss the circular dependencies issue. Hopefully it’s not a problem at all, but I would like to feel more confident about this
- let’s investigate how other libraries are handling similar situations. e.g. I believe the idea for `kedro-datasets` might have been inspired by how `django` packages different components (?). @deepyaman mentioned `jupyter`’s metapackage approach. Again, maybe what we are doing is the best approach, but I would like to feel more confident about this. Just as we missed the possibility of namespace packages in the first place, maybe we’re missing something big here
We don’t need to completely pause work on `kedro-datasets` while we resolve these questions, but I think the outcome does affect some of the tickets (e.g. #1651, #1495). I do think, however, that we shouldn’t release `kedro-datasets` before we’re really confident on these.
Circular dependencies
This is what first set my spidey sense tingling.
1. `kedro` is a dependency of `kedro-datasets`.
2. to enable `pip install kedro[pandas.CSVDataSet]`, `kedro-datasets` becomes an optional dependency of `kedro` through `extras_require`
(1) initially seemed to be non-negotiable to me but @deepyaman pointed out maybe that's not right (see below conversation). We don’t have to do (2) since we can just require people to `pip install kedro-datasets`, but it felt like at least a “nice to have” before.
Key question: is this form of circular dependency going to cause problems?
- if yes, we need to change one of the above 2 points, i.e. either not specify `kedro` as a dependency of `kedro-datasets` or revert the decision to enable `pip install kedro[pandas.CSVDataSet]` and go back to `pip install kedro-datasets`. This would overall simplify things quite a bit but comes with some disadvantages (most important: not such a smooth user experience, less important: import paths don’t match package name)
- if no, great. Let’s continue as we are. But we need to think carefully about exactly what `kedro`’s `extras_require` points to (e.g. `kedro-datasets~=1.0` is the current plan) and likewise what `kedro-datasets` specifies as its `kedro` version specifier
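As an illustration of the “redirect” discussed above, here is a rough sketch of how `kedro`’s `extras_require` mapping could forward each extra to the matching extra of `kedro-datasets`. The `redirect_extras` helper and the list of extras are assumptions made for this example; only the `~=1.0` specifier comes from the plan mentioned above.

```python
def redirect_extras(dataset_extras: list[str], datasets_spec: str = "~=1.0") -> dict[str, list[str]]:
    """Build an `extras_require`-style mapping that forwards to kedro-datasets.

    Hypothetical helper: each extra of `kedro` simply requires the extra of
    the same name from `kedro-datasets`, constrained by `datasets_spec`.
    """
    return {
        extra: [f"kedro-datasets[{extra}]{datasets_spec}"]
        for extra in dataset_extras
    }


# Illustrative subset of extras; a real list would be much longer.
extras_require = redirect_extras(["pandas.CSVDataSet", "pandas.ParquetDataSet"])
for extra, requirements in extras_require.items():
    print(f"kedro[{extra}] -> {requirements}")
```

The specifier passed as `datasets_spec` is exactly the decision point raised above: a compatible-release pin such as `~=1.0` prevents `pip` from selecting a `kedro-datasets` 2.x release alongside that `kedro` version.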
My discussion with @deepyaman:
(Note the last comment here is considering that we should not allow `pip install kedro[...]` and instead be explicit about `pip install kedro-datasets`.)
Are we missing something?
Maybe there is a whole different way of handling the `kedro` vs. `kedro-datasets` split which would resolve the question of dependencies, what a user should `pip install`, how to handle the namespace, etc. e.g. @deepyaman suggested a `kedro` metapackage in which `kedro-framework` and `kedro-datasets` are both namespaced packages underneath that.
We don’t need to commit to implementing the `kedro-framework` split now if we don’t want to, but I think it would be good to get a feeling for whether this is a route we might want to go down in future because it influences our current decision on how to handle `kedro-datasets`. e.g. it might convince us that `pip install kedro[pandas.CSVDataSet]` is good or bad.
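To compare the two layouts, here is a small sketch that models each as a dependency graph and checks it for cycles. The graphs are assumptions based on the discussion: in particular, the edge from `kedro-datasets` to `kedro-framework` in the metapackage layout is a guess, and the `kedro` to `kedro-datasets` edge in the current plan is only an optional (extras) dependency.

```python
def has_cycle(requires: dict[str, list[str]]) -> bool:
    """Depth-first search for a cycle in a {distribution: [dependencies]} graph."""
    visiting: set[str] = set()
    done: set[str] = set()

    def visit(name: str) -> bool:
        if name in done:
            return False
        if name in visiting:
            return True  # we came back to a node that is still being explored
        visiting.add(name)
        cyclic = any(visit(dep) for dep in requires.get(name, []))
        visiting.discard(name)
        done.add(name)
        return cyclic

    return any(visit(name) for name in requires)


# Current plan: kedro-datasets requires kedro, kedro optionally requires kedro-datasets.
current_plan = {"kedro": ["kedro-datasets"], "kedro-datasets": ["kedro"]}

# Hypothetical metapackage layout: `kedro` ships no code and only pulls in the others.
metapackage = {
    "kedro": ["kedro-framework", "kedro-datasets"],
    "kedro-framework": [],
    "kedro-datasets": ["kedro-framework"],
}

print("current plan has a cycle:", has_cycle(current_plan))
print("metapackage has a cycle:", has_cycle(metapackage))
```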
I think I like the sound of the metapackage. When I was reading through the Python docs for Packaging namespace packages I found it describes a structure of `kedro` with `kedro-framework` and `kedro-datasets` inside rather than having a separate package altogether under the same namespace. This blog post kind of describes what we were trying to achieve but the conclusion is that namespacing doesn't offer a great solution.
We'll need to figure out if the metapackage solution will solve the problem of different release velocities, which I think it does, but would like to see how it works in practice.
Just to add one more thing 🔥: the current setup, with `kedro` as a normal package and `kedro-datasets` as a namespace package, seems to break the dev install `pip install -e .`. There may be a solution for it, but I don't know yet.
This is the exact issue that I was worried about in https://github.com/kedro-org/kedro/issues/1652. Regardless of the final solution, we should definitely add this to the checklist.
- [ ] Making sure development install will work properly.
CC @AhdraMeraliQB
One more thing to throw into the mix... While talking through https://github.com/kedro-org/kedro-plugins/pull/49 with @AhdraMeraliQB I found that she was using distribution name `kedro.datasets` rather than `kedro-datasets` as I had been expecting, i.e. you would do `pip install kedro.datasets`.
Actually it turns out that `.` and `-` and `_` are all treated the same way by `pip install`. Note that surprisingly you can already do `pip install kedro_viz` or `kedro-viz` or `kedro.viz`. This actually answers the minor issue of `pip install kedro-datasets` not matching the import path `from kedro.datasets` because you can do `pip install kedro.datasets`. What it doesn't address is the more important issue of user experience since you would still need to `pip install` a separate package. I also don't know whether it's actually a good idea because it doesn't seem common to name packages with `.` this way even if it works.
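The equivalence of `.`, `-` and `_` follows from the name normalization rule in PEP 503, which collapses runs of those characters into a single `-` and lowercases the result. A minimal sketch of that rule (the `normalize` function name is just for this example):

```python
import re


def normalize(name: str) -> str:
    """Normalize a distribution name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


for candidate in ["kedro.datasets", "kedro_datasets", "kedro-datasets", "Kedro_Viz"]:
    print(f"{candidate!r} -> {normalize(candidate)!r}")
```

This only concerns the distribution name used with `pip install`; the import name is still whatever package directory the distribution ships.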
If you're not sure about the difference between distribution name and package name I highly recommend these: https://stackoverflow.com/questions/53346450/is-it-acceptable-to-have-python-package-names-with-numbers-in-it https://stackoverflow.com/questions/62834928/how-to-find-the-package-name-for-a-specific-module. The classic example is that you do `pip install scikit-learn` but `from sklearn import ...`.
Note. Just to put down my latest thoughts on this while I remember: I think `kedro-datasets` being a namespaced package that is dependent on `kedro` still feels like the right thing to do. What I'm not so sure about is whether we should allow `pip install kedro[...]` or just force an explicit `pip install kedro-datasets`.
@noklam doing `pip install -e .` on `kedro-datasets` works ok for me but it's possible I've missed something here or am using a contaminated environment...
@AntonyMilneQB Would be great if you can check the output of these:
```python
import sys
import kedro
import kedro.datasets

# Show where each module was loaded from, and the search path used.
print(kedro)
print(kedro.datasets)
print(sys.path)
```
My hypothesis is that it doesn't work because Python's module finder has a "first match wins" concept. As soon as it finds `kedro` in `xxxx/site-packages/kedro/`, it will think `kedro.datasets` is in `xxxx/site-packages/kedro/datasets/`. Therefore, `pip install .` would work fine because it installs the namespaced package in `site-packages/kedro/*`, but the development install doesn't do that.
@noklam ah yes, you are absolutely right - I must have been doing something wrong before. Your theory sounds like a good one, but judging by this it seems like this should work because Python looks through all the places in `sys.path` until it finds `kedro.datasets`. Maybe we've got something wrong here since the right path is definitely in my `sys.path` (cwd always is I think) but Python's `import` isn't picking up on it 🤔 Either way, it's irritating but not a show stopper since it only affects develop install.
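The "first match wins" behaviour can be reproduced without kedro. The sketch below builds two temporary `sys.path` entries: one holds the core package and the other holds only a `datasets` subpackage under the same top-level name. The package names (`demo_core_regular`, `demo_core_namespace`) and the `can_import_sub` helper are made up for this demo. When the core package has an `__init__.py` (a regular package), its `__path__` is fixed to a single directory and the subpackage living elsewhere is not found; when neither portion has an `__init__.py`, both directories are merged into one namespace package.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_tree(root: Path, files: dict[str, str]) -> None:
    """Write the given {relative_path: content} files under root."""
    for rel, content in files.items():
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)


def can_import_sub(core_has_init: bool) -> bool:
    """Return True if `<pkg>.datasets` imports when it lives in another sys.path entry."""
    pkg = "demo_core_regular" if core_has_init else "demo_core_namespace"
    with tempfile.TemporaryDirectory() as tmp:
        core, extra = Path(tmp, "core"), Path(tmp, "extra")
        core_files = {f"{pkg}/framework.py": "X = 1\n"}
        if core_has_init:
            core_files[f"{pkg}/__init__.py"] = ""  # makes it a regular package
        make_tree(core, core_files)
        make_tree(extra, {f"{pkg}/datasets/__init__.py": "NAME = 'datasets'\n"})
        sys.path[:0] = [str(core), str(extra)]
        importlib.invalidate_caches()
        try:
            importlib.import_module(f"{pkg}.datasets")
            return True
        except ModuleNotFoundError:
            return False
        finally:
            # Undo the sys.path and sys.modules changes made by the demo.
            sys.path.remove(str(core))
            sys.path.remove(str(extra))
            for name in [m for m in sys.modules if m.split(".")[0] == pkg]:
                del sys.modules[name]


print("regular core package:", can_import_sub(core_has_init=True))
print("namespace portions only:", can_import_sub(core_has_init=False))
```

This matches the hypothesis above: a non-editable install copies the subpackage into the same `site-packages/kedro/` directory as the regular package, whereas a develop install leaves it in a different location on `sys.path`.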
I'm actually also getting some weird behaviour on https://github.com/kedro-org/kedro-plugins/pull/49 where `pip install ".[pandas.CSVDataSet]"` seems to install all the requirements rather than just the relevant `extras_require` (doesn't matter if it's develop install or not). Looking at setup.py I can see why this would be, but I was sure that was working differently this afternoon so not quite sure what's going on there 🤔 @AhdraMeraliQB
> let’s investigate how other libraries are handling similar situations. e.g. I believe the idea for `kedro-datasets` might have been inspired by how `django` packages different components (?). @deepyaman mentioned `jupyter`’s metapackage approach. Again, maybe what we are doing is the best approach, but I would like to feel more confident about this. Just as we missed the possibility of namespace packages in the first place, maybe we’re missing something big here
Agree that it would make sense to see/learn from the experience of more projects here. If somebody can find/share how Django does this, that would be great, because I haven't found it yet. :) While I think the metapackage approach sounds clean in theory, I wonder if it's overcomplicating things, if Kedro-Framework is essentially required, and Kedro-Datasets is the only additional package. Also, which (if any) of these approaches expect the underlying packages to be independent, and which support packages depending on each other (possibly again going back to the question of avoiding circular dependencies)?
Notes from technical design discussion on 10 August
- Cost-benefit of doing the split. In general several people did not feel convinced that the cost of this work (engineering time to set up, complexity of multiple release flows, maintenance of docs) justified the benefits. Possibly this is because we do not fully understand the benefits and need @idanov to explain them more. From what I can tell, the benefits would be:
  - original reasons outlined in https://github.com/kedro-org/kedro/issues/1457, which are to do with different release cadence of datasets vs. core and breaking changes. However, this only seems to have been a problem once or twice in the past, and the release cadence of new Python versions seems to be once per year for recent 3.x, which does not feel too far out of sync with kedro breaking releases. @yetudada noted that people often pin their kedro version to not even allow for patch releases (I think our telemetry data demonstrates this when you look at the number of e.g. 0.17.4 downloads), which makes minor breaking changes in patches less of a problem. Overall, based on our current understanding and after quite a bit of discussion, people did not find the "breaking changes" argument to be compelling in practice
  - CI should be simpler and more stable in `kedro` (a reason I gave in #1457). But we could achieve this without doing the `kedro-datasets` split; it would just be a nice extra we get out of doing it
  - philosophical reasons: we should have a lean core and then be pluggable. Not discussed anywhere, but personally I quite like this perspective on Kedro.
- @yetudada confirmed what was stated above that point 2 of Circular dependencies is a "nice to have" and not essential, i.e. it's not ideal but is acceptable if users need to do `pip install kedro kedro-datasets[pandas.CSVDataSet]`. If we are happy to do this then it would immediately resolve the circular dependencies problem and #1495 should be done after all
- @yetudada confirmed that having the kedro-datasets documentation in the same place as it currently sits in the API docs of kedro is important. I'm not sure whether this is made easier or harder with a namespace package - see https://github.com/kedro-org/kedro/issues/1651#issuecomment-1211339215
- We questioned whether point 1 of Circular dependencies (`kedro` is a dependency of `kedro-datasets`) was actually necessary. We think the reasoning for this originated from keeping `AbstractDataSet` etc. in `kedro`, but no one thought it was obvious why we should do that (see next steps)
- It was generally agreed that we should look at how other packages handle this situation. No one had any particular knowledge on this already so it would need some further investigation
Next steps
- [x] Discuss above points with @idanov. Are we missing some of the benefits? What exactly is the breaking changes argument and how often would it be important in practice? Does he think the circular dependencies is an issue, and, if so, how would he solve it? What does he think of?
- [x] https://github.com/kedro-org/kedro/issues/1776
- [x] https://github.com/kedro-org/kedro/issues/1777
Hi kedro team,
I've followed with great attention your journey on making `kedro-datasets` an independent package, and I'd like to share my thoughts on some of the questions which seem still open on this topic.
Question 1 : Should kedro really split kedro-datasets in a separate package?
In my opinion, this is a big yes because it will tremendously improve enterprise support, provided some specific implementation that I'll detail further.
The major benefit I expect from this split, apart from the ones summarised above by @AntonyMilneQB, is the ability to upgrade only partially between major versions of the framework (technically, in terms of SemVer, I am talking about minor versions, but you understand what I mean: kedro-0.16, kedro-0.17, kedro-0.18).
Kedro is becoming more and more prevalent in the industry, but users can't pay migrations costs very often. My team moved this summer from 0.16.5 to 0.18.2, and reading the discord or the various github issues, it seems that many users are still stuck in 0.16 and 0.17 versions. The download statistics on pepy also indicate that 0.17.x is more used than 0.18.x series, and that 0.16.x, albeit less downloaded, is not completely abandoned by users.
I feel from personal experience (maybe it would need some user research to confirm / quantify it) that what scares users and prevents them from migrating are the template changes. This is a bit ironic since changing the template is often a matter of a couple of minutes, but there is a cost of understanding where objects go in each new template. My intuition is that most of them would migrate much more often if they could just `pip install kedro` with the newest version.
Some good news though: the motivation for migration is very often (once again, based on personal experience) to get some improvements for datasets, for instance:
- newer datasets that do not exist in old versions
- annoying bugs in some datasets (e.g. old implementations of `MatplotlibWriter`)
- incomplete features for some datasets (e.g. old implementations of `ApiDataSet`)
- new `fsspec` protocols in more recent versions (`smb`, `ftp`, `abfss`...)
- upgrading old dependencies in kedro requirements which create conflicts with other libraries (e.g. `fsspec<0.7` in `kedro-0.16` is breaking many packages!)
It would feel much more modular and safer to be able to upgrade an application in production gradually by upgrading only the `kedro-datasets` version in its requirements rather than modifying the entire template, and it would make it possible to address all the common feature requests above.
Obviously, users will have to migrate entirely at some point, but being able to upgrade datasets much faster than we are able to do now would be a tremendous improvement for production maintenance (my team has maintained custom plugins for `fsspec` connections with unsupported protocols for two years because we were not able to migrate, while it would be awesome to just upgrade a version number with kedro-datasets!).
Question 2 : What should be the dependencies relationship between kedro and kedro-datasets?
Three scenarios are on the table at the moment. I assess them based on question 1, i.e. assuming we want to enable upgrading kedro-datasets with very old kedro versions:
1. `kedro-datasets` imports `kedro` and reciprocally. This scenario is a no go for me. Apart from the circular dependency issue you are facing and discussing above, this makes the desired feature of easily upgrading only `kedro-datasets` (cf. question 1) almost impossible to achieve. Indeed, kedro-datasets would reinstall a newer version of kedro incompatible with the template of your old project, unless the requirement bounds are very extensive, which is unlikely.
2. `kedro` imports `kedro-datasets` but not the opposite. This feels quite natural, because it avoids asking users to install both packages. However, I would find this very unpleasant if the `kedro-datasets` upper bound was too tight and prevented me from upgrading easily. This is very likely if any upper bound is set, because many breaking changes in `kedro-datasets` would not be breaking from the kedro point of view (i.e. a breaking change will occur in one specific dataset implementation, but there is no breaking change in the "core" module, i.e. the AbstractDataset will still have `load` and `save` methods). It is very likely that users do want to benefit from breaking changes to specific dataset implementations and be able to upgrade the package, which will raise pip version conflicts if the upper bound is set too tight.
3. Ask users to install `kedro` and `kedro-datasets` separately. This is my preferred option, because this would make the updates over versions very easy, since the user would be responsible for managing the dependencies.
I understand that it is less user friendly and that you would likely get a lot of users claiming that they'd like to have both installed automatically, but if they get a very clear error message on their first `kedro run`, I guess it should be pretty ok. Another possibility is to make `kedro-datasets` a dependency of kedro with no upper bound, but I'm pretty sure you won't like this option :)
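To illustrate the upper-bound concern, here is a deliberately simplified sketch of a version range check. It only handles plain dotted numeric versions and is not a PEP 440 implementation; the `allowed` helper and the version numbers are assumptions for the example.

```python
from typing import Optional


def parse(version: str) -> tuple[int, ...]:
    """Parse a plain dotted version such as '1.4.2' (no pre/post releases)."""
    return tuple(int(part) for part in version.split("."))


def allowed(version: str, lower: str, upper: Optional[str] = None) -> bool:
    """True if lower <= version and, when an upper bound is given, version < upper."""
    v = parse(version)
    return parse(lower) <= v and (upper is None or v < parse(upper))


# A kedro release that requires kedro-datasets >=1.0,<2.0 rejects a breaking
# 2.0.0 release of kedro-datasets, even if the user only needs one fixed dataset.
print(allowed("2.0.0", lower="1.0", upper="2.0"))

# Without an upper bound the same upgrade is accepted.
print(allowed("2.0.0", lower="1.0"))
```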
As a side note, I totally agree that documentation should still be hosted in the same place whatever is decided in the end, for 2 reasons:
- it would make clear that users have to install kedro-datasets
- everything will be searchable in the same place
Question 3 : what part of the `kedro.io` folder should move to kedro-datasets?
Basically there seems to be a consensus that all specific implementations + lambda / memory / partitioned / cached datasets (as well as the `load_obj` utils) should be moved to `kedro-datasets`, and it feels completely natural.
Regarding the `AbstractDataSet` and `AbstractVersionedDataSet`, I am completely convinced they belong in `kedro-datasets`. The key arguments are:
- not moving them would make `kedro-datasets` have kedro as a dependency, which is my worst scenario as described in question 2
- this would make it possible to create custom datasets without importing kedro. From an upgrade perspective, it would be great to benefit from any improvement to these abstract classes for a custom implementation inside a project.
- kedro should not know how `AbstractDataSet` works under the hood. The only "contract" between the two is that a dataset has a `load` and a `save` method. This kind of contract already exists: you assume the pickle library has `load` and `dumps` methods.
Regarding the `DataCatalog`, I have less strong feelings, but I feel that it should be part of kedro-datasets too. This is the native "container" for datasets, and I don't think people have ever customized it (but I may be wrong!), and if someone wants to use the package without the rest of kedro, it seems natural to have these utilities accessible directly.
Question 4: should kedro namespace kedro-datasets?
From the first time this idea was suggested, I have felt there are many more drawbacks than advantages, but I understand the arguments at stake here.
Overall, I think that there are many cons to this:
- the engineering setup cost seems higher than what you expected at first (but actually, this is not really a con; it is totally up to you to estimate if it is worth the cost)
- it is very confusing for users to know what's going on internally. It seems quite easy to understand that `kedro_datasets.pandas.CSVDataSet` imports the module (and it is eventually easy to go check the code), while `kedro.pandas.CSVDataSet` obfuscates a lot the fact that the code lies in the `kedro-datasets` package.
- if I understand well, the main motivation for this namespacing is to help people who use absolute imports in their catalog instead of the usual relative import to upgrade transparently (i.e. the ones who currently use `kedro.extras.datasets.pandas.CSVDataSet` instead of `pandas.CSVDataSet`). This does not seem a good motivation because:
  - these people are very likely a very small part of users
  - these people will have to suffer migration costs to 0.19.0 whatsoever, and I am deeply convinced that upgrading the catalog path will be extremely easy for them because they understand the underlying import mechanism.
  - even worse, this could be counterproductive because:
    - it may be counterintuitive for them (why should I still use `kedro.extras` when the code is in `kedro-datasets`?)
    - it likely stands against their initial motivation (I guess that using the absolute path is to make clear to readers where the code is, and if you read an import written as `kedro.extras.datasets.pandas.CSVDataSet` but there is no such folder in the kedro repo, this is very confusing)
- There are very dangerous side effects:
  - @noklam's issue on editable install and "first match wins"
  - What about autocomplete in IDEs? Will it be affected by namespacing?
Non answered questions:
- what is the way to package and release a distribution of subpackages instead of a single package?
I am no expert and don't know what the recommended best practices are here, but `tidyverse` is a well known distribution in R which may be informative. The key idea is that you can install each package separately (e.g. `kedro-datasets` and `kedro-framework`) AND install the entire distribution (`pip install kedro`), so you can have both flexibility and ease of upgrade (if packages are installed separately) and ease of install (if the user installs the entire distribution).
Following a discussion between @idanov, @AntonyMilneQB, @AhdraMeraliQB and @noklam this afternoon, here's where things stand:
- Circular dependencies was indeed deemed to be an issue with the current approach. The solution to this is to drop assumption 2 in the top post, i.e. we will no longer enable `pip install kedro[pandas.CSVDataSet]`. This was always a "nice to have" so fine to drop. `kedro` will be a dependency of `kedro-datasets` and not vice versa. We must think carefully before upper-bounding the `kedro` dependency so as to avoid conflicts as suggested by @Galileo-Galilei above.
- @idanov felt strongly that `io` should remain in `kedro`. Hence `kedro-datasets` will contain just the datasets. See https://github.com/kedro-org/kedro/issues/1776#issuecomment-1234432081 for the reasoning.
- We would still like to have `kedro-datasets` as a namespace package as in https://github.com/kedro-org/kedro-plugins/pull/49, but this is just a "nice to have". There are issues currently around `pytest` (probably related to the develop install) which we are going to try to resolve. If it turns out that what we’re doing (mixing namespace `kedro` + non-namespace `kedro` package) is really not recommended and does not work well then we will make `kedro-datasets` a non-namespace package. This is an easy change to make.
- We should not release `kedro-datasets` until the namespace package question is fully resolved.
- A `kedro` metapackage is a nice idea that would resolve the circular dependency issue and still allow for `pip install kedro`. However, it feels like overkill for now, i.e. even more additional complexity with not enough obvious benefit.
Next steps
- [ ] Discuss which direction to proceed in regarding namespacing; @noklam has outlined some alternative solutions below. We should bring this back to technical design and decide whether we should continue with namespacing, and if so, how to implement it.
- [ ] #1495 is back again
- [x] #1496 is no longer needed
@Galileo-Galilei thank you very much, as ever, for your extremely carefully thought out and helpful response! We discussed this all again this afternoon - see above for the summarised outcomes. Let me respond to each of your points in turn here.
Question 1 : Should kedro really split kedro-datasets in a separate package?
Your answer to this really helps to motivate what we're doing and has given me a lot more confidence that it's a worthwhile change, thank you. It hugely helps to have your outside perspective of using kedro in the wild here 🙇
Question 2 : What should be the dependencies relationship between kedro and kedro-datasets?
This is very helpful, not least because it's identified a major weakness of the proposal in https://github.com/kedro-org/kedro/issues/1776, which is now off the table. Your point about pip dependency conflicts is very important. If our solution does not allow for users to easily do a breaking upgrade to `kedro-datasets` while leaving the `kedro` version unchanged then we've failed to achieve the main incentive for this piece of work. Our solution here is your preferred solution 3. `kedro-datasets` will have a `kedro` dependency but with a suitable version specified so that pip dependency conflicts will not be an issue (so no upper bound I guess).
Question 3 : what part of the kedro.io folder should move to kedro-datasets?
Here we have gone in the opposite direction and decided that all of `io` should in fact remain in `kedro`. This is something of a change from where the consensus was heading, but @idanov made it clear that kedro dataset implementations are really what we're trying to split off here, and not the `AbstractDataSet` or even "core" dataset implementations like `MemoryDataSet`.
Question 4: should kedro namespace kedro-datasets
This is still something of an open question. In principle we are still in favour of using a namespace package but only if we can get round some of the technical difficulties that we're currently facing in https://github.com/kedro-org/kedro-plugins/pull/49. The main argument in favour of a namespace package is not really to keep the imports the same - as you say, that is a small number of users since we are dropping `extras` from the path so those people will need to change the import paths anyway. Instead it's the "feel" you get that `kedro-datasets` is still part of the `kedro` package once it has been `pip install`ed. We do understand that this use of namespace packages is not so well known and less obvious to users though. @deepyaman curious if you have any more thoughts here.
Non answered questions
Your `tidyverse` example sounds very similar to the metapackage idea that we were considering in https://github.com/kedro-org/kedro/issues/1777. We think this probably does have some advantages but it was not clear overall that the additional complexity would be worth it at the moment.
This is mostly just re-stating the same thing in different ways. Feel free to edit this. cc @AntonyMilneQB @AhdraMeraliQB
Requirements
- [ ] Easy to upgrade `kedro-datasets` without upgrading `kedro` (Primary Goal)
- [ ] Keeping the `kedro` namespace (nice to have) - backward compatibility is not relevant here, since we are moving `kedro.extras.datasets` -> `kedro.datasets` anyway (i.e. `from kedro.extras.datasets import SomeDataSet`)
- [ ] Possible to keep `pip install kedro[xxx]`
- [ ] Editable install or `pytest` should be possible without installing the package first
- [ ] Did I miss something more?
| | 1. Original Proposal - no namespace, classic `kedro_datasets` | 2. Namespace Package | 3. Namespace-ish Package - `dask`, `dask.distributed` | 4. Metapackage `kedro`, `kedro-framework`, `kedro-datasets` |
|---|---|---|---|---|
| Easy to upgrade | Depends on how we set the upper bound (if any), assuming `kedro-datasets` depends on `kedro` | Depends on how we set the upper bound (if any), assuming `kedro-datasets` depends on `kedro` | Depends on how we set the upper bound (if any), assuming `kedro-datasets` depends on `kedro` | Depends on how we set the upper bound (if any) |
| Keep `kedro` namespace | No | Yes | Yes-ish | Not sure |
| Circular Dependency - possible to do `pip install kedro[xxx]` | Circular dependency exists | Circular dependency exists | Circular dependency exists | No circular dependency |
| Editable install or `pytest` should be possible without installing the package first | Not a problem | Current approach is problematic - need more research on namespace package | Not a problem | Not a problem |
| placeholder | | | | |
More explanation for approach 3 - the dask way
Dask has a package `dask` but also a `dask.distributed` namespace. It doesn't use a real Python namespace package but uses a trick instead.
In short, the real packages here are `dask` and `distributed`, but `dask` keeps a `dask/distributed.py` that just imports everything from `distributed` into the namespace of `dask.distributed`. As a result, `pytest` just works, but in a slightly weird way: the tests use `import distributed` instead of `dask.distributed`, as evidenced here:
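```python
# dask/distributed.py
# `_import_error_message` is not shown in this excerpt; it is assumed to be
# a module-level string defined earlier in the same file.
try:
    from distributed import *
except ImportError as e:
    if e.msg == "No module named 'distributed'":
        raise ImportError(_import_error_message) from e
    else:
        raise
```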
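To see what this kind of re-export shim implies, here is a small self-contained sketch that builds an alias module exposing the public names of another module, roughly what `from distributed import *` does inside `dask/distributed.py`. The `make_reexport_shim` helper and the `demo_shim` alias are made up for the demo, and the standard-library `json` module stands in for the real package.

```python
import importlib
import sys
import types


def make_reexport_shim(alias: str, target: str) -> types.ModuleType:
    """Register a module named `alias` that re-exports the public names of `target`."""
    real = importlib.import_module(target)
    shim = types.ModuleType(alias)
    # Mirror star-import semantics: use __all__ if present, else non-underscore names.
    public = getattr(real, "__all__", None) or [
        name for name in vars(real) if not name.startswith("_")
    ]
    for name in public:
        setattr(shim, name, getattr(real, name))
    sys.modules[alias] = shim
    return shim


shim = make_reexport_shim("demo_shim", "json")
from demo_shim import JSONDecoder  # import through the alias

# The object is the very same class; its __module__ still names the real package.
print(JSONDecoder is sys.modules["json"].JSONDecoder, JSONDecoder.__module__)
```

For the kedro case this would mean that `kedro.datasets` could be a thin alias while the real code, tests and tracebacks all refer to the separately installed package.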
Notes from technical design discussion on 14 September
After consideration of the 4 approaches outlined above, we agreed that the most correct way to proceed would be the metapackage (option 4), but the engineering costs involved were not justified by the added value of being able to import from `kedro.datasets` instead of `kedro_datasets`. Additionally, once implemented, it is very difficult to reverse metapackaging whilst minimising how the users are affected - it is currently just too high of a commitment. As such, we will be closing this issue and #1693.
Points to follow up on
@yetudada highlighted the added complexity for users should we continue to separate out `kedro-datasets` without namespacing. We should conduct some user interviews to gauge how they feel about splitting out the datasets.
- [ ] #1850