macports-base icon indicating copy to clipboard operation
macports-base copied to clipboard

portindex.tcl: Add option to generate index of changed ports also

Open arjunsalyan opened this issue 5 years ago • 15 comments

Usage: portindex -diff Purpose: Generate an additional file along with the full PortIndex that contains ports that have changed since the last indexing and shows the status whether the port was updated, added or deleted.

I am trying to devise an efficient way to keep the "ports information" up-to-date in a web-app that would display a dynamic page for each port [demo]. I am storing data from the port index in a database- which should contain latest information about each port.

For this purpose, I have attempted to add an option to the PortIndex command which would generate a file with only the ports that have changed since last indexing. The command is working fine till now. But I believe that the code can be improved.

Any suggestions and feedbacks on the code would be really helpful.

arjunsalyan avatar Apr 06 '19 07:04 arjunsalyan

I have gone through the PR, it looks fine at the first look but I do have some comments, mostly about nitpicking. What I do want to know is what is the previous port index file we are taking the diff with (it has nothing to do with your PR but I see it in the original code as well, I will also be learning about the codebase along with you :) ). Do we have a cron job running somewhere? ( @mojca )

Also, I don't know if it's okay to expose the diff option to users as well?

If possible, it would be great if you can link the email in the PR's description above to know why was the above feature required in the first place (or maybe a Trac ticket?).

umeshksingla avatar Apr 16 '19 05:04 umeshksingla

@umeshksingla: after something gets changed in git, we need to update the database in the app. One option is to send the full portindex to the app (probably more than 20 MB file), but it would be much more efficient if the app would only receive the differences since the last import: which ports were added, deleted, or modified.

We have https://build.macports.org/builders/jobs-portindex which generates portindex. We would probably add generation of json on the same buildslave (requested by repology author). I would either run differential portindex on the same server where the app would be running, or make sure that buildbot always submits the differences to the app (after each commit). But it would probably be more reliable and efficient to run this next to the app than on some remote server. It's a pretty trivial process (but needs macports running, of course).

I assume that the previous PortIndex is simply the PortIndex in current directory (I don't know if we have that configurable), but I didn't check.

I don't know what harm a diff option could do to users, but I prefer having one "useless to users" options than implementing super strange workarounds elsewhere and repeat 90% of the code written here to get the same effect. Users should not be using -diff for generating the regular portindex, that's for sure. And the diff option should not overwrite the default PortIndex.

If you could give it some testing etc., that would be great. If something is not yet clear, just ask.

mojca avatar Apr 16 '19 06:04 mojca

If possible, it would be great if you can link the email in the PR's description above to know why was the above feature required in the first place (or maybe a Trac ticket?).

Thank You, will do.

arjunsalyan avatar Apr 16 '19 06:04 arjunsalyan

Well, I built the PortIndex for the first time and I realize the pain and time it takes justifying the need for above. I don't think the size of the port index file is an issue (in context to storage) but yes, bigger file means more time to parse.

umeshksingla avatar Apr 18 '19 04:04 umeshksingla

@umeshksingla: yes, it takes "forever" to build it for the first time. It's not about its size alone, it's about the fact that one might take several hundred thousand queries to the database to update from a full portindex. Differential output probably won't help you with the first run though ...

mojca avatar Apr 18 '19 05:04 mojca

I have tested the PR by adding and modifying ports, works correctly. Need to handle the deletion cases as well now.

Some side notes:

  1. It'd be better to have the diff filename with a timestamp instead of just 'ChangedPorts'. "Changed" should be reported with respect to some reference.
  2. Is it possible to have 'U', 'A', 'D' (for updated, added, deleted) in the output file? I know this way we are able to reuse most of the existing code but having U/A/D would help while updating in application's db.
  3. Be sure not to delete the ports in your db when they get deleted in portindex, we'd like to keep the old statistics. Simply mark it as deleted or obsolete in a separate column.

umeshksingla avatar Apr 18 '19 14:04 umeshksingla

I have tested the PR by adding and modifying ports, works correctly. Need to handle the deletion cases as well now.

Yes, I am working on deletions now. However, the approach I could think of would double the time of execution. Currently, PortIndex command traverses the entire 'macports-ports' directory and for each portfile it checks if there is a previous entry for that port in the old PortIndex (for comparison).

To handle deletions, we would have to do exactly the opposite comparison. We would have to open the old PortIndex and for each entry we will search the directory if the portfile exists, if yes fine, else mark that port as deleted. This is doable. To me it seems like the only way of detecting deletions.

  1. It'd be better to have the diff filename with a timestamp instead of just 'ChangedPorts'. "Changed" should be reported with respect to some reference.

That would actually be helpful and would remove the possibility of skipping any update. Thank You

  1. Is it possible to have 'U', 'A', 'D' (for updated, added, deleted) in the output file? I know this way we are able to reuse most of the existing code but having U/A/D would help while updating in application's db.

Yes, it is possible. Will add it with the next update. To make it consistent with portindex2json.tcl, I will add it as a key to the portinfo array. Something like status: "U"

  1. Be sure not to delete the ports in your db when they get deleted in portindex, we'd like to keep the old statistics. Simply mark it as deleted or obsolete in a separate column.

Okay! I shall update the PR if the 'detecting deletions' method is fine.

arjunsalyan avatar Apr 18 '19 16:04 arjunsalyan

I have added the feature to detect deletions also. I have tested it with many possibilities.

arjunsalyan avatar Apr 19 '19 21:04 arjunsalyan

On 2019-4-20 07:43 , Arjun Salyan wrote:

@umeshksingla <https://github.com/umeshksingla>: yes, it takes
"forever" to build it for the first time.

That is because it has to print "Adding port XXXX" for each port. If we stop printing this- the time is only a few seconds.

This seemed unlikely to me, so I did the experiment. I commented out the lines 'puts "Adding port $portdir"' and 'puts "Adding subport $sub"' and timed generating a new PortIndex. This was the result:

portindex 1072.62s user 503.53s system 93% cpu 28:12.59 total

jmroot avatar Apr 20 '19 07:04 jmroot

This seemed unlikely to me, so I did the experiment. I commented out the lines 'puts "Adding port $portdir"' and 'puts "Adding subport $sub"' and timed generating a new PortIndex. This was the result: portindex 1072.62s user 503.53s system 93% cpu 28:12.59 total

Yes, I just thought that since generating the PortIndex in second (or further) run takes quite less time, even though $fd is rewritten, so the printing of thousand of entries might be taking up the time. But then I tried and seeing the results I had taken back my comment. It seems that the time consuming step here is picking data from portfiles, but in future runs it is not needed- we just reuse data from old PortIndex. Sorry, I should have tried first.

arjunsalyan avatar Apr 20 '19 07:04 arjunsalyan

@mojca just asking to avoid going through the code, is there any relation between the name of a port and the name of the directory containing that port's portfile in macports-ports?

umeshksingla avatar May 03 '19 20:05 umeshksingla

@umeshksingla: port name and directory name should match by our policy (I'm not saying we have one officially written, but I hope this is the case for all of our ports), but of course you have subports (when this is definitely not the case).

I don't know if this is also required by the code (you may always expect user mistakes), but I faintly remember experiencing somewhat strange behaviour with portindex when the portname didn't match. Probably the portindex keeping parsing the same file.

mojca avatar May 03 '19 22:05 mojca

The portindex script does use the portdir to check whether Portfiles have been indexed already and skip them if so, yes.

jmroot avatar May 04 '19 01:05 jmroot

Since you worked on capturing diff, I just wanted to say that it's actually a standard problem in the industry dealing with a large amount of data. They call it "change data capture". I tried looking for a good read but only got this link. You can read it if you are curious, can give you a slight hint on log-based transactions and triggers.

umeshksingla avatar May 27 '19 18:05 umeshksingla

Since you worked on capturing diff, I just wanted to say that it's actually a standard problem in the industry dealing with a large amount of data. They call it "change data capture". I tried looking for a good read but only got this link. You can read it if you are curious, can give you a slight hint on log-based transactions and triggers.

Thank You Umesh. This work on portindex is one of the most interesting problems of the project so far. I shall read more about it.

arjunsalyan avatar May 28 '19 08:05 arjunsalyan