Allow NiFi properties to be configured via environment variables
Is your feature request related to a problem?
No
Describe the solution you'd like to see
Provide an alternative way to configure `nifi.properties` via environment variables instead of needing to supply a `nifi.properties` file to NiFi. The set of supported environment variables can be found in the collection of scripts here: https://github.com/apache/nifi/tree/main/nifi-docker/dockerhub/sh
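For illustration, a few of the environment variables those scripts handle and the `nifi.properties` keys they map to (a non-exhaustive sample from memory; the linked scripts remain the authoritative source):

```go
// Package nifienv is illustrative only: a sample of the environment-variable
// to nifi.properties mappings performed by the NiFi Docker start scripts.
// Non-exhaustive; see the linked scripts for the full, authoritative set.
package nifienv

var EnvToProperty = map[string]string{
	"NIFI_WEB_HTTP_HOST":       "nifi.web.http.host",
	"NIFI_WEB_HTTP_PORT":       "nifi.web.http.port",
	"NIFI_CLUSTER_IS_NODE":     "nifi.cluster.is.node",
	"NIFI_ZK_CONNECT_STRING":   "nifi.zookeeper.connect.string",
	"NIFI_SENSITIVE_PROPS_KEY": "nifi.sensitive.props.key",
}
```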
Tasks:

- Determine if it's possible for nifikop to configure NiFi's properties using only environment variables. If it is, then nifikop can stop maintaining a template of `nifi.properties` in the code base and rely on the one that ships with NiFi. Users can still override the `nifi.properties` via the `NifiCluster` CRD.
- Where possible, replace nifikop's overriding of `nifi.properties` with environment variables. This can be done by nifikop adding environment variables to the `Pod` specification (see the sketch after this list). Users may already configure environment variables via the `ReadOnlyConfig.AdditionalSharedEnvs` configuration, so no changes are needed to enable this.
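A minimal sketch of the second task, assuming the standard `k8s.io/api/core/v1` types; the helper name and the property map below are hypothetical, not existing nifikop code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// addNiFiEnvVars appends environment variables understood by the NiFi image's
// start scripts to the NiFi container, instead of rendering the same values
// into a nifi.properties template. (Hypothetical helper for illustration.)
func addNiFiEnvVars(container *corev1.Container, envs map[string]string) {
	for name, value := range envs {
		container.Env = append(container.Env, corev1.EnvVar{Name: name, Value: value})
	}
}

func main() {
	nifi := corev1.Container{Name: "nifi"}
	addNiFiEnvVars(&nifi, map[string]string{
		"NIFI_WEB_HTTP_PORT": "8080", // example value only
	})
	fmt.Println(nifi.Env)
}
```

Users can already achieve the same effect themselves through `ReadOnlyConfig.AdditionalSharedEnvs`; the sketch only shows how the operator itself could populate the container's `Env` list instead of templating `nifi.properties`.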
Describe alternatives you've considered
No response
Additional context
No response
Looking through the code of Lighter, I found that there is a check for a "zombie". I imagine it is done to cover the case in which a session is somehow killed/terminated with no intervention from Lighter itself, for example when the session pod is killed manually (in the case of k8s).
I think that is not the case: if a session/batch job is killed externally, Lighter would get a "not found" response and could mark the session/job as failed without a zombie check.
If I remember correctly (@Minutis, maybe you do?), this was originally implemented to work around cases when the Kubernetes driver pod is being re-scheduled on a different node. In our case we were running Spark batch applications on AWS spot instances, and there were some cases when the node was taken away and its pods were re-scheduled on a different node. For some time the k8s API did not return the info about the pod.
This functionality makes sense only in this specific configuration; I think we could make it configurable and allow users to switch it off.
It looks like you have an unusual use case for Lighter sessions; maybe it would make sense for you to look into our "Permanent session" feature. There is a way to configure a session that should run forever, and in cases when it dies or is killed by external forces, Lighter will recreate that session under the same ID. We didn't have time to document this feature (we'll do it eventually).
> I think that is not the case: if a session/batch job is killed externally, Lighter would get a "not found" response and could mark the session/job as failed without a zombie check.
I can do some extra checking, but I think you are wrong, at least for sessions. Putting a session into the `Error` state happens in 2 places:

- from `launchSession`, triggered by `processScheduledSessions`
- from `checkZombie`, which is called by `processApplicationIdle` and `processApplicationRunning` -> both called from the scheduled `trackRunning`

The first case is pretty simple and happens on an error after the session is launched. The second case is more interesting, as `checkZombie` is called when Lighter can NOT get `ApplicationInfo` from the backend - in k8s, when the pod is gone.
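A minimal sketch of that second path (in Go purely for illustration; Lighter is not written in Go, and all identifiers below are made up, not Lighter's actual ones):

```go
package zombiesketch

import "time"

// The delay discussed below: a session whose backend info cannot be found is
// only moved to the error state after this much time has passed.
const zombieTimeout = 30 * time.Minute

type State string

const StateError State = "ERROR"

// checkZombie sketches the behaviour described above: the backend (e.g. the
// k8s API) returned no ApplicationInfo for the session, so the session is
// marked as an error, but only once it has been missing for zombieTimeout.
func checkZombie(missingSince time.Time, setState func(State)) {
	if time.Since(missingSince) >= zombieTimeout {
		setState(StateError)
	}
}
```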
A pod can be gone because the node it has been running on was restarted/removed, because the k8s scheduler decided to kill the pod, or maybe because of human error. No matter what the reason was, it can happen that the session state is set to `ApplicationState.ERROR` only after at least 30 mins, not earlier...
I think this 30-minute delay is:
- not configurable,
- crazy long
I did some tests with Lighter 0.0.50 on k8s. When I killed my session pod (`kubectl delete pod ...`), the state of the session was updated to 'error' after 30 mins.
I am just wondering if anything can be done about it. Of course, one option is to simply make it configurable and let users decide what the "zombie" time shall be. On the other hand, I am also wondering whether it would make sense to somehow discover "error cases" earlier, for example by using some kind of https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/ for k8s.
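For the lifecycle-hook idea, a rough sketch of what that could look like on the driver pod, using current `k8s.io/api/core/v1` types. The callback endpoint is purely hypothetical (Lighter does not necessarily expose one), it assumes `curl` exists in the driver image, and a preStop hook would still not fire when a node disappears abruptly:

```go
package hooksketch

import corev1 "k8s.io/api/core/v1"

// preStopNotify sketches a preStop hook on the Spark driver container that
// would call back to Lighter when the pod is being terminated gracefully.
// The URL is hypothetical; this only covers graceful termination, not a node
// that vanishes without running the hook.
func preStopNotify(sessionID string) *corev1.Lifecycle {
	return &corev1.Lifecycle{
		PreStop: &corev1.LifecycleHandler{
			Exec: &corev1.ExecAction{
				Command: []string{"curl", "-X", "POST",
					"http://lighter:8080/hypothetical/sessions/" + sessionID + "/terminated"},
			},
		},
	}
}
```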
My biggest concern is that if something bad happens to a session (like being killed permanently or being rescheduled on another node), it can take too much time before I can find out about it (via the Lighter REST API). If the session pod is killed, I can still submit statements to it which will never be executed, and after 30 mins the session will end up in the `error` state.
Permanent sessions can be helpful, but at some point I will need to kill them - sessions occupy too many resources, and because I am trying to use custom Python libraries I cannot easily reuse them.
Yes, that is how it works now. If the pod is killed (not deleted), Lighter should mark it as "error" sooner, but if the pod is removed, it takes 30 mins.
The zombie check was implemented without having sessions in mind, and the value was set to 30 min when we were trying to work around an issue in our infrastructure - something related to longer spot-instance start times and pod recreation; I can't really remember what exactly.
I'm not saying it is configurable now, but we can make it configurable. In many cases it does not make sense to have this check at all, since we can treat a "not found" pod as an error right away.
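As a sketch of what "configurable, with the option to skip the delay entirely" could look like (illustrative Go, not Lighter code; the setting name is made up):

```go
package zombieconfig

import "time"

// Config sketches a hypothetical setting: a zero ZombieTimeout would mean
// "no zombie grace period", i.e. a pod that is not found is treated as an
// error right away.
type Config struct {
	ZombieTimeout time.Duration
}

// shouldMarkError decides whether a session whose backend info is missing
// should be moved to the error state yet.
func (c Config) shouldMarkError(missingSince time.Time) bool {
	if c.ZombieTimeout <= 0 {
		return true // zombie check disabled: fail immediately on "not found"
	}
	return time.Since(missingSince) >= c.ZombieTimeout
}
```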
Sorry for delayed response.
I think the zombie case was also happening with the YARN backend. When YARN as a whole service is restarted, all information about running jobs is gone. That means that if any batch applications were running and YARN was restarted, Lighter could no longer get any information about the jobs and marked them as zombies after the delay.
I think it makes sense to make it configurable. It can also be handled differently in different backends, but of course the code would be harder to manage, since the same functionality would need a different implementation depending on the backend serving the jobs.
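One way to keep that manageable (a sketch only, not a proposal in Lighter's actual code base) is to let each backend report how a missing application should be treated, so the shared tracking loop stays the same:

```go
package backendsketch

import "time"

// Backend is a hypothetical abstraction: each backend (k8s, YARN, ...)
// decides how long a "not found" application may stay in limbo before the
// shared tracking code marks it as an error.
type Backend interface {
	// ZombieGracePeriod returns how long to wait after the backend stops
	// reporting an application before treating it as failed; zero means
	// "fail immediately".
	ZombieGracePeriod() time.Duration
}

type kubernetesBackend struct{}

// Placeholder value: a deleted pod is usually gone for good, so a short or
// zero grace period could work; spot-instance setups might configure longer.
func (kubernetesBackend) ZombieGracePeriod() time.Duration { return 0 }

type yarnBackend struct{}

// Placeholder value: after a full YARN restart, job info disappears
// temporarily, so a longer grace period avoids marking recoverable jobs
// as failed.
func (yarnBackend) ZombieGracePeriod() time.Duration { return 30 * time.Minute }
```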