V1 version metadata for "Provides" drifts from V0
There's a few things I noticed changed randomly within a dist when I switched to V1
( Dist path, Module Name, Provided version )
id/C/CH/CHOROBA/XML-XSH2-2.1.25.tar.gz,XML::XSH2::FilteredIterator,(undef)
-id/C/CH/CHOROBA/XML-XSH2-2.1.25.tar.gz,XML::XSH2::Functions,(undef)
+id/C/CH/CHOROBA/XML-XSH2-2.1.25.tar.gz,XML::XSH2::Functions,2.1.25
id/C/CH/CHOROBA/XML-XSH2-2.1.25.tar.gz,XML::XSH2::Grammar,(undef)
@@ ... @@
id/P/PL/PLAVEN/Padre-1.00.tar.gz,Padre::Cache,1.00
id/P/PL/PLAVEN/Padre-1.00.tar.gz,Padre::Command,1.00
id/P/PL/PLAVEN/Padre-1.00.tar.gz,Padre::Comment,1.00
-id/P/PL/PLAVEN/Padre-1.00.tar.gz,Padre::Config,1.00
+id/P/PL/PLAVEN/Padre-1.00.tar.gz,Padre::Config,(undef)
id/P/PL/PLAVEN/Padre-1.00.tar.gz,Padre::Config::Apply,1.00
id/P/PL/PLAVEN/Padre-1.00.tar.gz,Padre::Config::Host,1.00
id/P/PL/PLAVEN/Padre-1.00.tar.gz,Padre::Config::Human,1.00
@@ ... @@
id/T/TO/TODDR/Exporter-5.72.tar.gz,Exporter,5.72
-id/T/TO/TODDR/Exporter-5.72.tar.gz,Exporter::Heavy,5.68
+id/T/TO/TODDR/Exporter-5.72.tar.gz,Exporter::Heavy,5.72
@@ ... @@
Some of these look like an improvement, others ( Like Padre ) look like a regression.
@kentfredric: since latest release of Padre has "version": "1.00" in ES... I wonder how did you produce this list.
Can you please provide the misbehaving code?
The "real code" is here: https://metacpan.org/source/KENTNL/Gentoo-Util-VirtualDepend-0.003017/maint/update-module-to-gentoo.pl
But I can trivially reproduce the failure:
#!perl
use strict;
use warnings;
use MetaCPAN::Client;
my $client = MetaCPAN::Client->new( version => 'v1' );
my $modules = $client->module(
{
all => [
{ release => 'Padre-1.00' }, #
{ author => 'PLAVEN' },
{ authorized => 1 },
{ indexed => 1 }, #
{ mime => 'text/x-script.perl-module' }, #
]
}
);
while ( my $mod = $modules->next ) {
for my $module (@{ $mod->module || [] }) {
next unless $module->{authorized};
next unless $module->{indexed};
printf "%s => %s\n", $module->{name}, $module->{version} || '(undef)';
}
}
perl /tmp/repro.pl | grep -i Config
Padre::Config::Project => 1.00
Padre::Config => 1.00
Padre::Config::Patch => 1.00
Padre::Config::Human => 1.00
Padre::DB::HostConfig => 1.00
Padre::Config::Host => 1.00
Padre::Config::Setting => 1.00
Padre::Config => (undef) <----
Padre::Wx::Role::Config => 1.00
Padre::Config::Apply => 1.00
I've dug into it and found the following:
This is coming from Module::Metadata parsing.
This undef is the same on v0.
You are querying the file (module) index and you get 2 records for Padre::Config, the one with no version entry under module is specifically from the file lib/Padre/Config/Setting.pm and for module.name = "Padre::Config".
use Module::Metadata;
my $info = Module::Metadata->new_from_file( "Padre-1.00/lib/Padre/Config/Setting.pm" );
for my $pkg ( $info->packages_inside ) {
printf "%s: %s\n", $pkg, $info->version($pkg) || '(undef)';
}
output:
Padre::Config::Setting: 1.00
Padre::Config: (undef)
Its strange, because I've been running this code now for over a year, and this is the first time its had "undef" turn up instead of the version. Its possible the reason for it is simply the record order changed, and my logic just blindly takes the last of whatever, and so record order changing would do that.
Its just a bit weird seeing things change randomly when upstream hasn't shipped a new release in years.
MetaCPAN::Client can't guarantee order as it uses Search::Elasticsearch::Scroll which disables sorting (heavy on the ES server) - so I would recommend not taking the last record but choose the highest/lowest (normally) depends on your context.
In your posted output you can see the 2 records and indeed the undef is the second one.
I don't know the internal works of Module::Metadata, but if there's an issue here it should be taken up with its maintainers - I'm not even sure what the issue is there, it might be just bad configuration in the Padre archive as well.
Its just I don't know what data to use here, because both "undef" and a defined version are reporting they're indexed. It cant be indexed twice at different versions for the same module.
That's not how PAUSE works.
I would look at $mod->name or $mod->path in your code and give priority to the module that is actually matching the file you are looking through.
You can see the bad value for Padre::Config came from Setting.pm / lib/Padre/Config/Setting.pm (which to me seem odd that is even in its module list, being one directory level down and without inheritance or other obvious relation there) while the good one was from Config.pm / lib/Padre/Config.pm
@kentfredric is this still an open issue?
I don't believe so. Its a curiosity still in the data being inaccurate, but I don't know the code enough or the logic of Module::Metadata to know if/why the "duplicate provides for a single package" is something that needs fixing or not.
I don't see a MetaCPAN issue here... as this can be isolated to Module::Metadata
I don't see a corresponding issue in either:
https://rt.cpan.org/Dist/Display.html?Name=Module-Metadata or https://github.com/Perl-Toolchain-Gang/Module-Metadata/issues
And I don't understand the problem from MetaCPAN's side to file one myself.
If its a bug in MM, then this should track that issue when filed, and be closed when MM is updated to make the issue go away.
I wish to revise my stand, I don't think this has anything to do with MM either.
After rewriting this comment several times while going over the code and output, I believe that this is an issue in the way your code chooses the file to read the version value from.
Let me elaborate -
Your output changed because the order of the files you check changed. The value per file did not change. undef exists in your original data set as well, but you just picked the first match per module which happened to be good for v0. This means the code was doing the wrong thing, which happened to work due to inadvertently consistent ordering.
MetaCPAN cannot nor should it guarantee the order of files you are looping over. v0 is likely consistent because the data wasn't reindexed there. v1 is probably consistent in a different order in which the files where indexed (this would depend on internal ES data representation and handling).
I believe a proper check for the version number should be to match against the values of $mod->name or $mod->path because the correct value (I'm a bit guestimating here) comes from the file that bares the same name as the module and not one of the other ones. (This could be limited to only some cases, I don't know for sure.)Choosing the appropriate (and available) values using my suggestion guarantees that no order change will affect your code in the future.
The example of Padre::Config shows you have given priority to the version value of lib/Padre/Config/Setting.pm instead of lib/Padre/Config.pm based on previous order of appearance in results, which does not seem to be the correct thing to do.
In summary, this appears to me as assuming on ordering from the API information (where none should be provided and has changed) in code that determines values using inaccurate heuristics.
Is it possible for you to make the changes I suggested above and tell me if this resolves your problem?
Sorry, I may have not been clear enough.
The residual issue I believe is still present is unrelated to the order. I have already worked around my ordering issues, that is not what I believe needs to be followed up on.
What needs followed up on, is how that data being undef got there in the first place, because from a distributions standpoint, there being 2 instances of the same module and both being deemed "indexed" is clearly false.
I don't care for the order, its that there are two of them which is the residual issue.
If this data comes from Module::Metadata ( which you say it does ), then Module::Metadata may need hints to be more careful about what it returns.
If its something MetaCPAN are doing, then it should be handling this "indexed" bit more correctly, because its clear that it cant be indexed at PAUSE, and so the tag "indexed" loses its meaning.
Let me try to follow the code and try to explain for better understanding (mine) what I think it's doing (so feel free to correct me here) -
- iterating over the module index for matching a release name (which represent the .pm files in the release) -- those are named
$mod - for each file -- you loop over the associated packages (named modules) in that file (the .module path in ES) -- those hashes are
$module
now, for some reason (that we are trying to figure out here). some releases like Padre contain wrong associations (like Config::Settings which probably shouldn't declare it has Config module internally) which are also missing version information (the undef).
if this is the source of the problem, the list of packages within a file is provided to us by MM, I'll paste that code again to make it clear here:
use Module::Metadata;
my $info = Module::Metadata->new_from_file( "Padre-1.00/lib/Padre/Config/Setting.pm" );
for my $pkg ( $info->packages_inside ) {
printf "%s: %s\n", $pkg, $info->version($pkg) || '(undef)';
}
output:
Padre::Config::Setting: 1.00
Padre::Config: (undef)
THAT Padre::Config is extremely suspicious and doesn't make any sense (to me, at least).
My suggested checks against either $mod->name or $mod->path ( or trusting them less if $module's name is shorter and version is missing ) can help skip over such suspicious cases (with a hope this will eliminate only the bad cases)
As for MM's suspicious output - I'll open a question issue.
Ugh. Fun, its parsing it out of a HEREDOC :/
ok, since you know now more than me - i'll let you decide on the next step before I open the issue. what do you want me to do?
My suggested checks against either $mod->name or $mod->path ( or trusting them less if $module's name is shorter and version is missing ) can help skip over such suspicious cases (with a hope this will eliminate only the bad cases)
Indeed, and I have in fact done this, and this works, but the point is really that the "indexed" property, if true, should reflect the reality upstream.
So either:
- a) MM should not be reporting this thing
- b) MetaCPAN should have logic to not set
module.indexed = trueif PAUSE wouldn't.
If module.indexed says true, but PAUSE does not, what is the value of module.indexed ? Surely we should just abolish the whole thing.
@karenetheridge is it even possible to skip over HEREDOC code?
https://v1.metacpan.org/source/PLAVEN/Padre-1.00/lib/Padre/Config/Setting.pm#L108
Somebody has probably opened a bug about heredoc parsing already, I just don't think MM can solve that. ( Unless there's some other API that will weed out that scenario )
I should also point out that Module::Metadata does not exhibit this problem as such when using the provides function:
/Padre-1.00 $ perl -MModule::Metadata -MData::Dump=pp -e'print pp(Module::Metadata->provides dir => q[lib], version => 2 ))'
Which I suspect is due to https://v1.metacpan.org/source/ETHER/Module-Metadata-1.000033/lib/Module/Metadata.pm#L296
So perhaps using that to extract metadata would be better than having MetaCPAN iterate the files themselves.
- b) MetaCPAN should have logic to not set
module.indexed = trueif PAUSE wouldn't.If
module.indexedsaystrue, but PAUSE does not, what is the value ofmodule.indexed? Surely we > should just abolish the whole thing.
I think that's a separate issue. if you like - please open another one for it so we can discuss it properly.