junest
junest startup slow on systems with many users
I'm running junest as a user on one of our university servers. As you can see below, there are > 20K users. On startup junest seems to spend a considerable amount of time in these commands:
$ time getent passwd | wc
21690 50253 1341078
real 0m2.871s
user 0m0.129s
sys 0m0.103s
$ time getent group | wc
590 590 1425507
real 0m4.265s
user 0m0.052s
sys 0m0.029s
https://github.com/fsquillace/junest/blob/6d4e5f7404da996f0f80c45c588d52eae90e9575/lib/core.sh#L423
Aside from causing long junest startup times, this may also cause problems for multiple parallel junest sessions: during startup the shared ${JUNEST_HOME}/etc/{passwd,group} might be incomplete, causing weird problems for running / simultaneously starting sessions.
Would you accept a PR that prefers an existing ${JUNEST_HOME}/etc/{passwd,group} and only clears it with a special option (e.g., --clear-caches or maybe implicitly on -f or -r)?
Hey @joernhees
Thanks for this change! (and sorry for the delay, there are too many projects to deal with lately :smile:)
I just want to give you more context about this issue.
A while ago JuNest worked by directly using the /etc/passwd and /etc/group files of the native Linux system. This caused problems in some edge cases, such as: https://github.com/fsquillace/junest/issues/81
In particular, systems that use a name service (like LDAP or Active Directory) store user information remotely, and that information can only be retrieved via the getent command.
The proper solution for that was https://github.com/fsquillace/junest/commit/846bcc9c1f8c781bcd94b97523d8604cc601c1ce. And that's the reason why JuNest is slow for your use case now.
I think that having a cache might be too dangerous (from the customer's perspective, I think I'd prefer JuNest to be correct and slow).
So instead, I would go for an option that falls back to the previous solution (directly mapping to the existing /etc/passwd and /etc/group files). WDYT?
Regarding the concurrency issue, I still have difficulty reproducing it (due to its heisenbug nature), but I guess that if the problem is related to conflicts on the generated files, we could create tmp files for each JuNest session in order to isolate them. So, if that's the case, we can treat it as a separate concern/issue.
hmm, considering that systems with many users typically use some name service, that fallback option wouldn't help users in those cases at all. On our cluster /etc/passwd doesn't contain my user...
If caching is not an option at all, let me ask which passwd lines and group infos we actually need... if it's just the executing user's, one could maybe use this:
- getent passwd $USER instead of getent passwd
- groups is more complicated as i'm in groups with > 20000 other users, so those lines are just very very long, which makes getent for them slow. Maybe if all other groups are negligible, those lines could be reconstructed with the following, which just iterates over the user's groups with the id command:
gids=( $(id -G) ) gnames=( $(id -Gn) ) ; for ((i=0; i<${#gnames[@]}; ++i)); do printf "%s:x:%s:$USER\n" "${gnames[i]}" "${gids[i]}" ; done
If the above isn't enough, maybe caching would be an option if junest -f calls could auto-clear such a cache? junest -f is used to set up the environment, so it's somewhat reasonable to expect a user to run that again after changing big parts in the host env (like user ids for example). Normal invocations on the other hand would be a lot faster...
Wrt. the concurrency issue: yes, this might be an issue of its own, but it shows more often if the getent call takes long (as on big systems). You might be able to reproduce with a sleep 10 in the getent. No matter how the speed issue above is solved, i see two ways to solve the concurrency issue:
- completely use tempfiles instead of a shared ~/.junest/etc (so every invocation uses its own files, cleanup will be important though)
- write the results of the getent calls into a tempfile, then mv $tempfile ~/.junest/etc/passwd etc. (that way the time in which a ~/.junest/etc/passwd ... is incomplete is minimal).
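The second option (write to a tempfile, then mv) could look roughly like the sketch below. The helper name write_atomically and the scratch directory are made up for illustration and are not part of JuNest; the sketch assumes bash and a mktemp that accepts a template. The tempfile is created next to the target so that mv is a rename on the same filesystem, which means readers see either the old file or the complete new one.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of JuNest): run a command and publish its
# output to TARGET only once the command has finished successfully.
# Usage: write_atomically TARGET COMMAND [ARGS...]
write_atomically() {
    local target="$1"; shift
    local tmp
    # Create the tempfile in the same directory as the target so that the
    # final mv is a rename within one filesystem.
    tmp="$(mktemp "${target}.XXXXXX")" || return 1
    if "$@" > "$tmp"; then
        mv -f "$tmp" "$target"
    else
        # On failure, leave any existing target untouched and clean up.
        rm -f "$tmp"
        return 1
    fi
}

# Demo against a scratch directory rather than a real ~/.junest/etc.
demo_dir="$(mktemp -d)"
write_atomically "${demo_dir}/passwd" \
    printf '%s\n' 'demo:x:1000:1000::/home/demo:/bin/sh'
cat "${demo_dir}/passwd"
```

In JuNest the command would be something like getent passwd instead of the printf used here. Note that mktemp creates the file with mode 0600, so a chmod may be needed if other users must read the result.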
On systems with many names in LDAP it takes a negligible amount of time for me. So, I guess the fact that it is slow in your case might be due to some other issue, like how nsswitch.conf is configured, or network connectivity.
In any case, I noticed that calling getent passwd or getent group with an explicit list of users or groups makes different calls (getpwnam and getgrnam, respectively), which retrieve such data much faster.
So, as a fallback option we could:
- use getent passwd $USER instead of getent passwd
- use getent group $(id -G) instead of getent group
Can you check in your system how long getent group $(id -G) takes for you? It should be by far faster.
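As a rough sketch of the two lookups listed above (variable names are illustrative, and this is not the actual JuNest code): only the current user's passwd entry and the entries for the groups the user belongs to are requested, so getent is called with explicit keys and never enumerates the whole database. It assumes getent and id are available on the host.

```shell
#!/usr/bin/env bash

# Resolve the current user name without relying on $USER being set.
user="$(id -un)"

# Keyed lookup: asks only for this one user instead of dumping every entry.
passwd_entry="$(getent passwd "$user")" \
    || echo "warning: no passwd entry found for ${user}" >&2

# id -G prints the numeric ids of all groups the user is in; passing them as
# keys makes getent look up each group individually. The command substitution
# is deliberately unquoted so that each id becomes its own argument.
group_entries="$(getent group $(id -G))" \
    || echo "warning: some groups could not be resolved" >&2

printf '%s\n' "$passwd_entry"
printf '%s\n' "$group_entries"
```

The group lines returned this way still contain the full member lists, so they can be long, but only the user's own groups are fetched.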
For the concurrency issue: https://github.com/fsquillace/junest/issues/165
i could bet i tried this before, but getent group $(id -G) is indeed very fast... (75 ms)
so that's definitely a better option then.
Edit: Ah, I guess this is what the -n switch is for... :sweat_smile:
I found that getent passwd was adding an inordinate amount of startup time, so I did this to prevent ${JUNEST_HOME}/etc/{passwd,group} from being recreated on every startup and so speed things up:
diff --git i/lib/core/common.sh w/lib/core/common.sh
index 601e46c..51c51c0 100644
--- i/lib/core/common.sh
+++ w/lib/core/common.sh
@@ -261,17 +261,23 @@ function copy_passwd_and_group(){
# is configured.
# Try to at least get the current user via `getent passwd $USER` since it uses
# a more reliable and faster system call (getpwnam(3)).
- if ! getent_cmd passwd > ${JUNEST_HOME}/etc/passwd || \
- ! getent_cmd passwd ${USER} >> ${JUNEST_HOME}/etc/passwd
+ if ! [ -r ${JUNEST_HOME}/etc/passwd ]
then
- warn "getent command failed or does not exist. Binding directly from /etc/passwd."
- copy_file /etc/passwd ${JUNEST_HOME}/etc/passwd
+ if ! getent_cmd passwd > ${JUNEST_HOME}/etc/passwd || \
+ ! getent_cmd passwd ${USER} >> ${JUNEST_HOME}/etc/passwd
+ then
+ warn "getent command failed or does not exist. Binding directly from /etc/passwd."
+ copy_file /etc/passwd ${JUNEST_HOME}/etc/passwd
+ fi
fi
- if ! getent_cmd group > ${JUNEST_HOME}/etc/group
+ if ! [ -r ${JUNEST_HOME}/etc/group ]
then
- warn "getent command failed or does not exist. Binding directly from /etc/group."
- copy_file /etc/group ${JUNEST_HOME}/etc/group
+ if ! getent_cmd group > ${JUNEST_HOME}/etc/group
+ then
+ warn "getent command failed or does not exist. Binding directly from /etc/group."
+ copy_file /etc/group ${JUNEST_HOME}/etc/group
+ fi
fi
return 0
}
(There may be a cleaner way of doing this)
For a network with ~37k accounts this drops startup time from ~12 seconds to about 0.1s.
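One thing to keep in mind with this patch is that the cached files are never refreshed once they exist. Combining it with the earlier suggestion of clearing the cache on demand might look like the sketch below. The function refresh_cache and its force argument are hypothetical, not an existing JuNest option; how such a flag would be wired to something like junest -f is left open.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of JuNest): regenerate CACHE_FILE from a
# command only if it is missing/empty or a refresh is explicitly forced.
# Usage: refresh_cache CACHE_FILE true|false COMMAND [ARGS...]
refresh_cache() {
    local cache_file="$1" force="$2"; shift 2
    if [ "$force" != "true" ] && [ -s "$cache_file" ]; then
        return 0    # reuse the existing, non-empty cache
    fi
    "$@" > "$cache_file"
}

# Demo against a scratch directory; echo stands in for the slow getent call.
demo_dir="$(mktemp -d)"
refresh_cache "${demo_dir}/passwd" false echo "first"    # file missing: created
refresh_cache "${demo_dir}/passwd" false echo "second"   # cache present: kept
refresh_cache "${demo_dir}/passwd" true  echo "third"    # forced: regenerated
cat "${demo_dir}/passwd"
```

Normal startups would pass false and stay fast, while a setup-style invocation could pass true to pick up changes on the host.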