Data Factory Validation Activity Outputs?

Open abhayea opened this issue 2 years ago • 19 comments

BUG!

Use case: I do not want to use the Get Metadata activity. Instead, I want to use the Validation activity to decide whether to copy blobs to the destination, based on whether they already exist there.

The Validation activity output "exists": true is really useful. We can combine it with an If or Switch activity to decide what to do next.

Problem: if the output of the Validation activity is "exists": true, the activity succeeds and everything downstream works as defined. But if the output would be false (which I am not able to see in the output of the activity), it does nothing and keeps running until it times out.

Validation and If could be a good combination, but the pipeline only proceeds when "exists": true. If the file does not exist in the destination, nothing happens; the activity keeps running until it times out.

Now you may say that I can set the timeout for the Validation activity to the lowest value, let it time out, use the activity's Failure dependency, and move ahead in the pipeline, but this will mark the pipeline run as FAILED.

Reference: if I understood the GitHub issue below correctly, this is a very old bug that should have been resolved back in 2019. https://github.com/MicrosoftDocs/azure-docs/issues/31280

reference pipeline run id: 310e60f2-ed55-4a78-b61a-e957c8704f61


abhayea avatar Aug 02 '22 18:08 abhayea

@MartinJaffer-MSFT @mimckitt ^

abhayea avatar Aug 02 '22 18:08 abhayea

Thank you for reaching out. At this time we are reviewing the ask and will provide an update as appropriate

KranthiPakala-MSFT avatar Aug 03 '22 00:08 KranthiPakala-MSFT

Hello @abhayea . Thank you for the background.

I took a look and did a couple of tests. The issue you linked to was about the validation activity not having any output when the timeout happened. That is fixed now. The fix did not affect whether the activity itself reports success or failure. It just made it possible to fetch the "exists" property after a timeout.

I pointed a validation activity to a deleted blob, and captured the activity output in set variable.

{
    "name": "ff",
    "value": "{\"exists\":false,\"effectiveIntegrationRuntime\":\"AutoResolveIntegrationRuntime (West US)\",\"executionDuration\":0,\"durationInQueue\":{\"integrationRuntimeQueue\":0},\"billingReference\":{\"activityType\":\"PipelineActivity\",\"billableDuration\":[{\"meterType\":\"AzureIR\",\"duration\":0.016666666666666666,\"unit\":\"Hours\"}]}}"
}

I also tried making the validation activity use an offline Self-Hosted Integration Runtime. The result was much different, the output was null. This is because ADF couldn't even reach the runtime to attempt the check.

Now onto the difficulty you are facing. As I understand, you like how the activity tells you whether a file exists or not, but the "failure" status of activity on timeout is causing you distress due to pipeline logic saying pipeline failed.

There are multiple ways to work around this and make the pipeline say success. It is all in the organization of dependency connections between activities.

I imagine your current organization looks like this: a validation followed by an on-success copy, and an on-failure web activity or something else to report the unreadiness.

image

I recently discovered that placing an on-skipped dependency from the copy to something else (in this picture, a wait activity is used) can cause the pipeline to not fail. There is a caveat: should something upstream fail and the validation not run, this skipped dependency will still trigger.

image

A more complex but more familiar pattern replaces the success dependency with a combination of on-completion and on-skip. The pipeline reports failure because the success path wasn't run, so I avoid using success at all.

image

A completion dependency runs on either success or failure, which are mutually exclusive outcomes of an activity. So here I connect the recovery activity to the copy activity by an on-skip dependency, and the copy activity to the validation by on-completion. The recovery only runs when validation fails, so it does not run when validation succeeds. By limiting the copy to run only when recovery does NOT run, I have created a success condition without a success dependency. This does not result in a pipeline failure outcome.
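The dependency logic described here can be sketched in Python. This is a toy model, not ADF code: the activity names validation, recovery, and copy and the simulate helper are hypothetical, and it assumes an activity runs only when every one of its dependencies is satisfied and is otherwise marked Skipped. It does not model how the overall pipeline status is computed.

```python
from typing import Dict, List, Tuple

# Upstream statuses that satisfy each dependency condition (assumed semantics).
SATISFIES = {
    "Succeeded": {"Succeeded"},
    "Failed": {"Failed"},
    "Completed": {"Succeeded", "Failed"},
    "Skipped": {"Skipped"},
}

# activity -> list of (upstream activity, accepted dependency conditions).
# Entries are listed in dependency order so one pass is enough.
Graph = Dict[str, List[Tuple[str, List[str]]]]

GRAPH: Graph = {
    "validation": [],
    "recovery": [("validation", ["Failed"])],
    "copy": [("validation", ["Completed"]), ("recovery", ["Skipped"])],
}


def simulate(graph: Graph, outcomes: Dict[str, str]) -> Dict[str, str]:
    """Return the final status of each activity.

    outcomes gives the status an activity produces if it actually runs;
    activities not listed there are assumed to succeed.
    """
    status: Dict[str, str] = {}
    for name, deps in graph.items():
        # Every dependency must be satisfied by at least one of its conditions.
        ready = all(
            any(status[upstream] in SATISFIES[cond] for cond in conds)
            for upstream, conds in deps
        )
        status[name] = outcomes.get(name, "Succeeded") if ready else "Skipped"
    return status


if __name__ == "__main__":
    for validation_result in ("Succeeded", "Failed"):
        print(validation_result, simulate(GRAPH, {"validation": validation_result}))
```

In this model the copy runs only when validation succeeded, even though no Succeeded condition appears anywhere in the graph.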

MartinJaffer-MSFT avatar Aug 03 '22 19:08 MartinJaffer-MSFT

@MartinJaffer-MSFT thank you for a quick turnaround. Appreciate it.

image

My use case: I will explain it in a simpler manner

The Validation activity checks whether the file exists in an Azure Blob Storage account (the SINK). I use the Validation activity output as the If activity input.

If the output is "exists": false -> a copy activity gets invoked and an alert is sent saying fresh file detected.
If the output is "exists": true -> an alert is sent saying duplicate file detected.

Quick questions

  1. I tried multiple scenarios, and in none of them did I see the output "exists": false.
  2. If I have to use a combination of skip, complete, and success, how do you suggest I set up the activity dependencies so that the work is still executed and I don't see a pipeline failure?

Note: if the output is "exists": true -> this condition works as expected and sends out an alert.

But I never see the case "exists": false, so the If activity's false branch never gets invoked and the activity keeps running until it times out.

Ideally, if the file does not exist in the destination, Validation should give the output "exists": false.

abhayea avatar Aug 03 '22 20:08 abhayea

image

I tried this, and in this case only the completion dependency (web activity alert_duplicate_file_detected) gets executed, regardless of whether I upload a blob that already exists or one that does not.

But please look into why the Validation activity does not produce any output if the file does not exist, i.e. "exists": false.

abhayea avatar Aug 03 '22 21:08 abhayea

but I never see the case: "exists":false and hence IF activity false never gets invoked and keeps running until it timeout

Ahh, here is the confusion, @abhayea. Shorten the timeout. After the activity times out, THEN the output is "exists": false.

The idea of the Validation activity is to make the pipeline wait until data is ready. So if the data is never ready, then it is a failure. Personally, I feel the behavior and the name do not match, and are kind of misleading.

Effectively, the Validation activity packages a lookup and a wait inside an until loop.

Validation is one of the few activities where I really pay attention to the timeout.

{
    "name": "pipeline21",
    "properties": {
        "activities": [
            {
                "name": "Validation1",
                "type": "Validation",
                "dependsOn": [],
                "userProperties": [],
                "typeProperties": {
                    "dataset": {
                        "referenceName": "NonexistantBlob",
                        "type": "DatasetReference"
                    },
                    "timeout": "0.00:00:20",
                    "sleep": 10,
                    "minimumSize": 5
                }
            },
            {
                "name": "Get output",
                "type": "SetVariable",
                "dependsOn": [
                    {
                        "activity": "Validation1",
                        "dependencyConditions": [
                            "Completed"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "variableName": "validation_output",
                    "value": {
                        "value": "@activity('Validation1').output.exists",
                        "type": "Expression"
                    }
                }
            }
        ],
        "variables": {
            "validation_output": {
                "type": "Boolean"
            }
        },
        "annotations": []
    }
}
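The "lookup and a wait inside an until loop" description can be sketched in Python as a mental model. This is only an illustration: run_validation, its parameters, and the returned dictionary are hypothetical, and the real activity's scheduling, queue time, and minimumSize check are not modeled.

```python
import time
from typing import Callable, Dict


def run_validation(
    exists: Callable[[], bool],
    timeout_s: float,
    sleep_s: float,
    wait: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Poll exists() until it returns True or the timeout budget is used up."""
    elapsed = 0.0
    while True:
        if exists():
            # Data is ready: the activity succeeds right away.
            return {"status": "Succeeded", "exists": True}
        if elapsed >= timeout_s:
            # Budget exhausted: treated as a failure, with exists reported as False.
            return {"status": "Failed", "exists": False}
        wait(sleep_s)
        elapsed += sleep_s


if __name__ == "__main__":
    # Skip real sleeping in the demo; values mirror the pipeline's timeout and sleep.
    no_wait = lambda seconds: None
    print(run_validation(lambda: False, timeout_s=20, sleep_s=10, wait=no_wait))
```

Seen this way, "exists": false can only ever appear together with a timeout, which is why a short timeout is the lever for getting that output quickly.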

MartinJaffer-MSFT avatar Aug 04 '22 17:08 MartinJaffer-MSFT

You were very close on the logic, @abhayea . See image below.

image

Validate -> copy (complete)
Validate -> duplicate_detected (fail)
duplicate_detected -> copy (skipped)
copy -> fresh_detected (success)

MartinJaffer-MSFT avatar Aug 04 '22 17:08 MartinJaffer-MSFT

Actually, it's the opposite, @MartinJaffer-MSFT.

The Validation activity checks whether a file exists, and only succeeds if the file does exist, giving an outcome of "exists": true.

So, if the file does exist, the outcome is "exists": true, which is correct, and the Validation activity does what it's supposed to. Since the file already exists, the pipeline should send out an alert saying Duplicate file detected and not copy the file over.

Now, if the file does not exist, then following the suggested solution of timing out the Validation activity (which I do not like) and moving ahead in the pipeline, we should copy the file over, and once the file is copied, publish an alert saying Fresh file detected and file was copied to the destination.

Now, this is the solution that worked for me!

image

In the above scenario, the pipeline does not fail whether the file exists or not. It's doing the job perfectly.

Now, I did not use a Completion dependency. Am I doing something wrong here? If yes, how should I architect my pipeline so that it uses Completion and executes the job correctly?

BUT

My question still remains the same: why do we need to time out the Validation activity, and why does it NOT give the outcome "exists": false if the file does not exist in the destination?

abhayea avatar Aug 04 '22 21:08 abhayea

Oh! I take my case back.

So I copied 4 files, which already existed in the destination. 3 out of 4 pipeline runs performed correctly, giving a response of Duplicate file detected.

event 1: worked as it should
event 2: worked as it should
event 3: Strange Behaviour!
event 4: worked as it should

image

As you will notice from the above image, my timeout is set to 11 seconds and sleep is 10 seconds.

Now see the pipeline run:

image

Given the Validation activity outcome "exists": true, it should ideally have succeeded and moved on to sending the duplicate file detected alert, without copying the file over.

BUT

The Validation activity timed out after 55 seconds, despite the outcome "exists": true. Because the activity failed, the pipeline moved on to copy the file and send the fresh file detected alert.

Isn't this strange behavior?

abhayea avatar Aug 04 '22 22:08 abhayea

Also, please clarify why we need to time out the Validation activity, and why it does NOT give the outcome "exists": false if the file does not exist in the destination.

It clearly works if the file exists in the destination dataset, but NOT if the file does not exist in the destination. Isn't this a BUG in the Validation activity?

abhayea avatar Aug 05 '22 14:08 abhayea

You are right, this is bizarre behavior, and does not match what I experience. Is there a chance you did a trigger run and not a debug run? Trigger runs use the last published version. Debug runs use what is on the screen. So maybe an old version was ... wait no that can't be the case, you screenshotted everything together, from monitoring screen. Urghh. This is turning into a deeper issue than I expected.

The source is Azure Blob, not ADLS Gen2, right? The timeout is on a missing blob, right? Is this published, or is this debug? I'm going to repro all the details I can from your screenshots, @abhayea.

Timing out with exists:true could only conceivably happen with a file existing, but smaller in size than you tell it to get. Anything else must be a bug.

The duration being larger than expected, well there is a LOT of slop in activities. I expected at maximum 30 seconds, 10 for the first check, 10 for the second check, 10 for the time in queue or other slop.

Well, there is another approach. I give you a support ticket, and a support engineer can take your pipeline run ID and dig through the back-end logs to find out precisely what is going on. My access to logs expired, so I can't look even if I do have your consent and the pipeline run id and activity run id.

MartinJaffer-MSFT avatar Aug 05 '22 16:08 MartinJaffer-MSFT

By the way, why did you not want to use Get Metadata? That can return the same exists information without a timeout @abhayea .

MartinJaffer-MSFT avatar Aug 05 '22 16:08 MartinJaffer-MSFT

I actually used the trigger and not debug.

The destination Azure Blob Storage account is not Data Lake Gen2. Account kind: StorageV2 (general purpose v2)

The Validation activity is seriously riddled with bugs, and this should be raised internally ASAP! This can have a catastrophic impact if someone starts using it with production data without realizing it. Timing out an activity to get a failure dependency is just a hack and should not be used in real data environments. The product team should take this seriously.

Just as the Validation activity runs properly when the file DOES exist, it should work properly when the file does NOT exist as well.

Please create a support ticket so that it is followed up.

I did not intend to use Get Metadata because it is slower than Validation, but now I am left with no choice.

abhayea avatar Aug 05 '22 18:08 abhayea

It's getting funnier and funnier.

I ended up using Get Metadata.

This is the output of Get Metadata WHEN the file does exist in the destination:

{
    "exists": true,
    "size": 7,
    "effectiveIntegrationRuntime": "newtom (East US 2)",
    "executionDuration": 0,
    "durationInQueue": {
        "integrationRuntimeQueue": 11
    },
    "billingReference": {
        "activityType": "PipelineActivity",
        "billableDuration": [
            {
                "meterType": "AzureIR",
                "duration": 0.016666666666666666,
                "unit": "Hours"
            }
        ]
    }
}

& this is the output of get metadata WHEN the file does not exist in the destination

{
    "exists": false,
    "effectiveIntegrationRuntime": "newtom (East US 2)",
    "executionDuration": 0,
    "durationInQueue": {
        "integrationRuntimeQueue": 0
    },
    "billingReference": {
        "activityType": "PipelineActivity",
        "billableDuration": [
            {
                "meterType": "AzureIR",
                "duration": 0.016666666666666666,
                "unit": "Hours"
            }
        ]
    }
}

The difference is that when "exists": false, Get Metadata does not produce the size output. Another BUG?

Quick help: I was intending to use

@activity('Get Metadata1').output.size
@activity('Get Metadata1').output.exists

as the two conditions in my If activity to check whether the file exists or not.

The check is that the file name AND the file size should both be the same. If both conditions are true, send out an alert: Duplicate file detected.

image

How can I use both checks in my If activity expression? Can you help me form the expression?

abhayea avatar Aug 05 '22 18:08 abhayea

One more interesting fact! I tried using a variable to store the file size from the Get Metadata activity and got this error.

Operation on target Set variable1 failed: The expression 'activity('Get Metadata1').output.size' cannot be evaluated because property 'size' doesn't exist, available properties are 'exists, effectiveIntegrationRuntime, executionDuration, durationInQueue, billingReference'.

so if the "exists": false, then get metadata does not even evaluate file size ?

abhayea avatar Aug 05 '22 19:08 abhayea

Okay, so if a file does not exist, then it cannot have a size. You may argue that the size is zero, but note that you can also have a file of size 0.

So what you want to use is the null-safe operator.

@activity('Get Metadata1').output?.size

The ? goes before the dot; I tend to get this confused with .?size. The ? null-safe operator makes it so no error is thrown when the property is missing.

To make the behavior more like what you expect, let's coalesce so the value 0 is chosen instead of returning null.

@coalesce(activity('Get Metadata1').output?.size, 0)
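What null-safe access and coalesce do to the Get Metadata outputs can be mimicked in Python. This is only an analogy: null_safe_get and coalesce here are hypothetical Python helpers, not the ADF functions. The found and missing dictionaries are trimmed from the outputs posted earlier in the thread; the empty one is an invented example of a zero-byte file.

```python
from typing import Any, Mapping, Optional


def null_safe_get(obj: Optional[Mapping[str, Any]], key: str) -> Any:
    """Return obj[key], or None when obj is None or the key is missing."""
    if obj is None:
        return None
    return obj.get(key)


def coalesce(*values: Any) -> Any:
    """Return the first argument that is not None, or None if all are None."""
    for value in values:
        if value is not None:
            return value
    return None


found = {"exists": True, "size": 7}
missing = {"exists": False}          # no "size" property at all
empty = {"exists": True, "size": 0}  # hypothetical zero-byte file

if __name__ == "__main__":
    for output in (found, missing, empty):
        print(output["exists"], coalesce(null_safe_get(output, "size"), 0))
```

After coalescing, a missing file and an empty file both yield 0, so exists is still needed to tell them apart.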

I'm not certain why Validation would be slower than Get Metadata. Both have to fetch the same information, @abhayea. Behind everything, a REST call is made to the storage API to get the metadata on the target.

MartinJaffer-MSFT avatar Aug 08 '22 15:08 MartinJaffer-MSFT

That is of no help to me, because I want to use the size property regardless. It is a limitation of the Get Metadata activity that when "exists": false it does not produce the size output. I want a condition such that if the file size is more than 500 MB, nothing is copied to the destination, and if the file size is less than 500 MB, the file is copied to the destination.

Whether the file size is 0 or not, Get Metadata should produce the file size. Can you confirm with the product team why this behaviour is happening? Please open a support ticket for this as well. I guess that since no one creates a support ticket, or people just accept this, it does not reach the product team for improvement.

Also, have you opened the support ticket for Validation?

And can you help me with this?

I intend to combine both conditions with a logical AND. How can I use 2 properties in one expression?

@activity('Get Metadata1').output.size
@activity('Get Metadata1').output.exists

So it would be something like: @activity('Get Metadata1').output.size is less than 500 MB and @activity('Get Metadata1').output.exists = false.

Disregard that Get Metadata won't evaluate the file size if exists is false. I have another way to get the file size. I want to use 2 properties in 1 If expression.

abhayea avatar Aug 08 '22 17:08 abhayea

Here I use both of them inside a Set Variable. The output of the set variable is a boolean on whether your stated conditions are fulfilled. @abhayea

@if(
  and(
    activity('Get Metadata1').output.exists,
    less(
      activity('Get Metadata1').output?.size,
      pipeline().parameters.max_file_size
      )
    ),
  true,false
)
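For reference, here is the same check written as plain Python. This is a sketch, not ADF: should_copy is a hypothetical name, and it assumes a missing size should make the check false, which has not been verified against how ADF's less() treats null.

```python
from typing import Any, Mapping


def should_copy(output: Mapping[str, Any], max_file_size: int) -> bool:
    """True only when the file exists and its size is below max_file_size."""
    exists = bool(output.get("exists", False))
    size = output.get("size")  # None when Get Metadata omits the property
    # Assumption: a missing size fails the check instead of raising an error.
    return exists and size is not None and size < max_file_size


if __name__ == "__main__":
    print(should_copy({"exists": True, "size": 7}, 5000))
    print(should_copy({"exists": False}, 5000))
```

Both inputs to and() are already booleans, so in the expression itself the outer if(..., true, false) mainly serves readability.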

Full pipeline json:

{
    "name": "pipeline11",
    "properties": {
        "activities": [
            {
                "name": "Get Metadata1",
                "type": "GetMetadata",
                "dependsOn": [],
                "policy": {
                    "timeout": "7.00:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "dataset": {
                        "referenceName": "Binary2",
                        "type": "DatasetReference"
                    },
                    "fieldList": [
                        "exists",
                        "size"
                    ],
                    "storeSettings": {
                        "type": "AzureBlobStorageReadSettings",
                        "enablePartitionDiscovery": false
                    },
                    "formatSettings": {
                        "type": "BinaryReadSettings"
                    }
                }
            },
            {
                "name": "Should I copy",
                "type": "SetVariable",
                "dependsOn": [
                    {
                        "activity": "Get Metadata1",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "variableName": "do_copy",
                    "value": {
                        "value": "@if(\n  and(\n    activity('Get Metadata1').output.exists,\n    less(\n      activity('Get Metadata1').output?.size,\n      pipeline().parameters.max_file_size\n      )\n    ),\n  true,false\n)",
                        "type": "Expression"
                    }
                }
            }
        ],
        "parameters": {
            "max_file_size": {
                "type": "int",
                "defaultValue": 5000
            }
        },
        "variables": {
            "do_copy": {
                "type": "Boolean",
                "defaultValue": false
            }
        },
        "annotations": []
    }
}

MartinJaffer-MSFT avatar Aug 09 '22 15:08 MartinJaffer-MSFT

@chez-charlie @ssabat Can you confirm on the expected behavior of Validation activity and Get Metadata activity? Thanks

MartinJaffer-MSFT avatar Aug 09 '22 15:08 MartinJaffer-MSFT