
junest startup slow on systems with many users

Open joernhees opened this issue 9 years ago • 5 comments

I'm running junest as a user on one of our university servers. As you can see below, there are > 20K users. On startup junest seems to spend a considerable amount of time in these commands:

$ time getent passwd | wc
  21690   50253 1341078

real	0m2.871s
user	0m0.129s
sys	0m0.103s
$ time getent group | wc
    590     590 1425507

real	0m4.265s
user	0m0.052s
sys	0m0.029s

https://github.com/fsquillace/junest/blob/6d4e5f7404da996f0f80c45c588d52eae90e9575/lib/core.sh#L423

Aside from causing long junest startup times, this may also cause problems for multiple parallel junest sessions: during startup the shared ${JUNEST_HOME}/etc/{passwd,group} might be incomplete, causing weird problems for running / simultaneously starting sessions.

Would you accept a PR that prefers an existing ${JUNEST_HOME}/etc/{passwd,group} and only clears it with a special option (e.g., --clear-caches or maybe implicitly on -f or -r)?

joernhees avatar Nov 13 '16 17:11 joernhees

Hey @joernhees

Thanks for this change! (and sorry for the delay, there are too many projects to deal with lately :smile:)

I just want to give you more context about this issue. A while ago, JuNest worked by directly using the /etc/passwd and /etc/group files of the native Linux system. This caused problems in some edge cases, such as: https://github.com/fsquillace/junest/issues/81

In particular, the problem was that systems that use a name service (like LDAP or Active Directory) to store user information keep it remotely, where it can only be retrieved via the getent command.

The proper solution for that was https://github.com/fsquillace/junest/commit/846bcc9c1f8c781bcd94b97523d8604cc601c1ce, and that's the reason why JuNest is slow for your use case now.

I think that having a cache might be too dangerous (from the customer perspective, I would rather see JuNest be correct and slow).

So instead, I would go for an option that falls back to the previous solution (directly mapping the existing /etc/passwd and /etc/group files). WDYT?

Regarding the concurrency issue, I still have difficulty reproducing it (due to its heisenbug nature), but I guess that if the problem is related to conflicts on the generated files, we could create tmp files for each JuNest session in order to isolate them. So, if that is the case, we can treat it as a separate concern/issue.

fsquillace avatar Nov 20 '16 17:11 fsquillace

hmm, considering that systems with many users typically use some name service, that fallback option wouldn't help users in those cases at all. On our cluster /etc/passwd doesn't contain my user...

If caching is not an option at all, let me ask which passwd lines and group infos we actually need... if it's just the executing user's, one could maybe use this:

  • getent passwd $USER instead of getent passwd
  • groups is more complicated, as i'm in groups with > 20000 other users, so those lines are just very, very long, which makes getent slow for them. Maybe if all other groups are negligible, those lines could be reconstructed with the following, which just iterates over the user's own groups with the id command: gids=( $(id -G) ) ; gnames=( $(id -Gn) ) ; for ((i=0; i<${#gnames[@]}; ++i)); do printf '%s:x:%s:%s\n' "${gnames[i]}" "${gids[i]}" "$USER" ; done
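
The reconstruction described in the second bullet can be sketched as a small function. This is only an illustration under stated assumptions: print_own_groups is a hypothetical helper, not part of JuNest, and it assumes that group names contain no whitespace, so that id -G and id -Gn yield the same number of fields in the same order.

```shell
# Hypothetical helper, not part of JuNest: print one group(5)-style line
# per group of the current user, listing only that user as a member.
# Assumption: group names contain no whitespace, so `id -G` and `id -Gn`
# produce the same number of fields in the same order.
# The function body is a subshell, so its variables do not leak out.
print_own_groups() (
    user=$(id -un)
    gids=$(id -G)
    # Load the group names into the positional parameters.
    set -- $(id -Gn)
    for gid in $gids; do
        # $1 is the name that belongs to the current numeric group ID.
        printf '%s:x:%s:%s\n' "$1" "$gid" "$user"
        shift
    done
)
```

Calling print_own_groups writes lines of the form name:x:gid:user to standard output. The other members of each group are deliberately dropped, which is what keeps the lines short.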

If the above isn't enough, maybe caching would be an option if junest -f calls could auto-clear such a cache? junest -f is used to set up the environment, so it's somewhat reasonable to expect a user to run that again after changing big parts in the host env (like user ids for example). Normal invocations on the other hand would be a lot faster...

Wrt. the concurrency issue: yes, this might be an issue of its own, but it shows more often if the getent call takes long (as on big systems). You might be able to reproduce with a sleep 10 in the getent. No matter how the speed issue above is solved, i see two ways to solve the concurrency issue:

  1. completely use tempfiles instead of a shared ~/.junest/etc (so every invocation uses its own files, cleanup will be important though)
  2. write the results of the getent calls into a tempfile, then mv $tempfile ~/.junest/etc/passwd etc. (that way the time in which a ~/.junest/etc/passwd... is incomplete is minimal).
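
Option 2 could look roughly like the sketch below. write_atomically is a hypothetical helper name, not something JuNest provides. It assumes mktemp is available and creates the temporary file next to the target, so that the final mv is a rename within one filesystem.

```shell
# Hypothetical helper, not part of JuNest: run a command, capture its
# standard output in a temporary file next to the target, and rename the
# temporary file over the target only if the command succeeded.
# Usage: write_atomically TARGET COMMAND [ARGS...]
write_atomically() (
    target=$1
    shift
    # Same directory as the target, so mv is a rename, not a copy.
    tmp=$(mktemp "${target}.XXXXXX") || exit 1
    if "$@" > "$tmp"; then
        mv -f "$tmp" "$target"
    else
        # Leave any existing target untouched and clean up.
        rm -f "$tmp"
        exit 1
    fi
)
```

With that in place, something like write_atomically "${JUNEST_HOME}/etc/passwd" getent passwd would leave readers seeing either the old file or the complete new one, never a half-written file.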

joernhees avatar Nov 25 '16 22:11 joernhees

On systems with many names in LDAP, it takes a negligible amount of time for me. So I guess the fact that it is slow in your case might be due to other issues, like how nsswitch.conf is configured or network connectivity.

In any case, I noticed that passing the list of users or groups directly to getent passwd or getent group makes it use a different lookup call, getpwnam and getgrnam respectively, which allows a faster way to retrieve such data.

So, as a fallback option we could:

  • use getent passwd $USER instead of getent passwd
  • use getent group $(id -G) instead of getent group

Can you check on your system how long getent group $(id -G) takes for you? It should be much faster.
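
As a rough sketch of the two targeted lookups, the commands can be wrapped in two small functions. The function names are made up for illustration and this is not the actual JuNest code. It assumes that getent is installed and that the current user and all of their groups are resolvable.

```shell
# Hypothetical helpers, not the actual JuNest code.

# Print only the current user's passwd entry instead of the whole database.
own_passwd_entry() {
    getent passwd "${USER:-$(id -un)}"
}

# Print only the entries of the groups the current user belongs to.
# The unquoted $(id -G) is intentional: each numeric group ID must be
# passed to getent as a separate key.
own_group_entries() {
    getent group $(id -G)
}
```

Timing them the same way as in the original report, for example time own_group_entries | wc, should show whether the keyed lookups avoid the cost of enumerating every entry.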

For the concurrency issue: https://github.com/fsquillace/junest/issues/165

fsquillace avatar Nov 27 '16 11:11 fsquillace

i could bet i tried this before, but getent group $(id -G) is indeed very fast... (75 ms)

so that's definitely a better option then.

joernhees avatar Nov 27 '16 13:11 joernhees

Edit: Ah, I guess this is what the -n switch is for... :sweat_smile:


I found that getent passwd was adding an inordinate amount of startup time, so I did this to prevent ${JUNEST_HOME}/etc/{passwd,group} from being recreated on every startup and so speed things up:

diff --git i/lib/core/common.sh w/lib/core/common.sh
index 601e46c..51c51c0 100644
--- i/lib/core/common.sh
+++ w/lib/core/common.sh
@@ -261,17 +261,23 @@ function copy_passwd_and_group(){
     # is configured.
     # Try to at least get the current user via `getent passwd $USER` since it uses
     # a more reliable and faster system call (getpwnam(3)).
-    if ! getent_cmd passwd > ${JUNEST_HOME}/etc/passwd || \
-        ! getent_cmd passwd ${USER} >> ${JUNEST_HOME}/etc/passwd
+    if ! [ -r ${JUNEST_HOME}/etc/passwd ]
     then
-        warn "getent command failed or does not exist. Binding directly from /etc/passwd."
-        copy_file /etc/passwd ${JUNEST_HOME}/etc/passwd
+        if ! getent_cmd passwd > ${JUNEST_HOME}/etc/passwd || \
+            ! getent_cmd passwd ${USER} >> ${JUNEST_HOME}/etc/passwd
+        then
+            warn "getent command failed or does not exist. Binding directly from /etc/passwd."
+            copy_file /etc/passwd ${JUNEST_HOME}/etc/passwd
+        fi
     fi
 
-    if ! getent_cmd group > ${JUNEST_HOME}/etc/group
+    if ! [ -r ${JUNEST_HOME}/etc/group ]
     then
-        warn "getent command failed or does not exist. Binding directly from /etc/group."
-        copy_file /etc/group ${JUNEST_HOME}/etc/group
+        if ! getent_cmd group > ${JUNEST_HOME}/etc/group
+        then
+            warn "getent command failed or does not exist. Binding directly from /etc/group."
+            copy_file /etc/group ${JUNEST_HOME}/etc/group
+        fi
     fi
     return 0
 }

(There may be a cleaner way of doing this)

For a network with ~37k accounts this drops startup time from ~12 seconds to about 0.1s.

jonathonf avatar Mar 16 '20 22:03 jonathonf