macports-base
macports-base copied to clipboard
portindex.tcl: Add option to generate index of changed ports also
Usage: portindex -diff
Purpose: Generate an additional file along with the full PortIndex that contains ports that have changed since the last indexing and shows the status whether the port was updated, added or deleted.
I am trying to devise an efficient way to keep the "ports information" up-to-date in a web-app that would display a dynamic page for each port [demo]. I am storing data from the port index in a database- which should contain latest information about each port.
For this purpose, I have attempted to add an option to the PortIndex command which would generate a file with only the ports that have changed since last indexing. The command is working fine till now. But I believe that the code can be improved.
Any suggestions and feedbacks on the code would be really helpful.
I have gone through the PR, it looks fine at the first look but I do have some comments, mostly about nitpicking. What I do want to know is what is the previous port index file we are taking the diff with (it has nothing to do with your PR but I see it in the original code as well, I will also be learning about the codebase along with you :) ). Do we have a cron job running somewhere? ( @mojca )
Also, I don't know if it's okay to expose the diff option to users as well?
If possible, it would be great if you can link the email in the PR's description above to know why was the above feature required in the first place (or maybe a Trac ticket?).
@umeshksingla: after something gets changed in git, we need to update the database in the app. One option is to send the full portindex to the app (probably more than 20 MB file), but it would be much more efficient if the app would only receive the differences since the last import: which ports were added, deleted, or modified.
We have https://build.macports.org/builders/jobs-portindex which generates portindex. We would probably add generation of json on the same buildslave (requested by repology author). I would either run differential portindex on the same server where the app would be running, or make sure that buildbot always submits the differences to the app (after each commit). But it would probably be more reliable and efficient to run this next to the app than on some remote server. It's a pretty trivial process (but needs macports running, of course).
I assume that the previous PortIndex is simply the PortIndex in current directory (I don't know if we have that configurable), but I didn't check.
I don't know what harm a diff option could do to users, but I prefer having one "useless to users" options than implementing super strange workarounds elsewhere and repeat 90% of the code written here to get the same effect. Users should not be using -diff
for generating the regular portindex, that's for sure. And the diff option should not overwrite the default PortIndex.
If you could give it some testing etc., that would be great. If something is not yet clear, just ask.
If possible, it would be great if you can link the email in the PR's description above to know why was the above feature required in the first place (or maybe a Trac ticket?).
Thank You, will do.
Well, I built the PortIndex for the first time and I realize the pain and time it takes justifying the need for above. I don't think the size of the port index file is an issue (in context to storage) but yes, bigger file means more time to parse.
@umeshksingla: yes, it takes "forever" to build it for the first time. It's not about its size alone, it's about the fact that one might take several hundred thousand queries to the database to update from a full portindex. Differential output probably won't help you with the first run though ...
I have tested the PR by adding and modifying ports, works correctly. Need to handle the deletion cases as well now.
Some side notes:
- It'd be better to have the diff filename with a timestamp instead of just 'ChangedPorts'. "Changed" should be reported with respect to some reference.
- Is it possible to have 'U', 'A', 'D' (for updated, added, deleted) in the output file? I know this way we are able to reuse most of the existing code but having U/A/D would help while updating in application's db.
- Be sure not to delete the ports in your db when they get deleted in portindex, we'd like to keep the old statistics. Simply mark it as deleted or obsolete in a separate column.
I have tested the PR by adding and modifying ports, works correctly. Need to handle the deletion cases as well now.
Yes, I am working on deletions now. However, the approach I could think of would double the time of execution. Currently, PortIndex command traverses the entire 'macports-ports' directory and for each portfile it checks if there is a previous entry for that port in the old PortIndex (for comparison).
To handle deletions, we would have to do exactly the opposite comparison. We would have to open the old PortIndex and for each entry we will search the directory if the portfile exists, if yes fine, else mark that port as deleted. This is doable. To me it seems like the only way of detecting deletions.
- It'd be better to have the diff filename with a timestamp instead of just 'ChangedPorts'. "Changed" should be reported with respect to some reference.
That would actually be helpful and would remove the possibility of skipping any update. Thank You
- Is it possible to have 'U', 'A', 'D' (for updated, added, deleted) in the output file? I know this way we are able to reuse most of the existing code but having U/A/D would help while updating in application's db.
Yes, it is possible. Will add it with the next update. To make it consistent with portindex2json.tcl
, I will add it as a key to the portinfo
array. Something like status: "U"
- Be sure not to delete the ports in your db when they get deleted in portindex, we'd like to keep the old statistics. Simply mark it as deleted or obsolete in a separate column.
Okay! I shall update the PR if the 'detecting deletions' method is fine.
I have added the feature to detect deletions also. I have tested it with many possibilities.
On 2019-4-20 07:43 , Arjun Salyan wrote:
@umeshksingla <https://github.com/umeshksingla>: yes, it takes "forever" to build it for the first time.
That is because it has to print "Adding port XXXX" for each port. If we stop printing this- the time is only a few seconds.
This seemed unlikely to me, so I did the experiment. I commented out the lines 'puts "Adding port $portdir"' and 'puts "Adding subport $sub"' and timed generating a new PortIndex. This was the result:
portindex 1072.62s user 503.53s system 93% cpu 28:12.59 total
This seemed unlikely to me, so I did the experiment. I commented out the lines 'puts "Adding port $portdir"' and 'puts "Adding subport $sub"' and timed generating a new PortIndex. This was the result: portindex 1072.62s user 503.53s system 93% cpu 28:12.59 total
Yes, I just thought that since generating the PortIndex in second (or further) run takes quite less time, even though $fd
is rewritten, so the printing of thousand of entries might be taking up the time. But then I tried and seeing the results I had taken back my comment. It seems that the time consuming step here is picking data from portfiles, but in future runs it is not needed- we just reuse data from old PortIndex.
Sorry, I should have tried first.
@mojca just asking to avoid going through the code, is there any relation between the name of a port and the name of the directory containing that port's portfile in macports-ports?
@umeshksingla: port name and directory name should match by our policy (I'm not saying we have one officially written, but I hope this is the case for all of our ports), but of course you have subports (when this is definitely not the case).
I don't know if this is also required by the code (you may always expect user mistakes), but I faintly remember experiencing somewhat strange behaviour with portindex when the portname didn't match. Probably the portindex keeping parsing the same file.
The portindex script does use the portdir to check whether Portfiles have been indexed already and skip them if so, yes.
Since you worked on capturing diff, I just wanted to say that it's actually a standard problem in the industry dealing with a large amount of data. They call it "change data capture". I tried looking for a good read but only got this link. You can read it if you are curious, can give you a slight hint on log-based transactions and triggers.
Since you worked on capturing diff, I just wanted to say that it's actually a standard problem in the industry dealing with a large amount of data. They call it "change data capture". I tried looking for a good read but only got this link. You can read it if you are curious, can give you a slight hint on log-based transactions and triggers.
Thank You Umesh. This work on portindex
is one of the most interesting problems of the project so far. I shall read more about it.