Resilient Azure Data Lake Analytics (ADLA) Jobs with Azure Functions

Azure Data Lake Analytics is an on-demand analytics job service that allows writing queries to transform data and grab insights efficiently. The analytics service can handle jobs of any scale instantly by setting the dial for how much power you need.

JObs

In many organizations, these jobs could play a crucial role and reliability of these job executions could be business critical. Lately I have encountered a scenario where a particular USQL job has failed with following error message:

Usql – Job failed due to internal system error – NM_CANNOT_LAUNCH_JM

A bit of research on Google revealed, it’s a system error, which doesn’t leave a lot of diagnostic clue to reason out. Retrying this job manually (by button clicking on portal) yielded success! Which makes it a bit unpredictable and uncertain. However, uncertainty like this is sort of norm while developing Software for Cloud. We all read/heard about Chaos Monkeys of Netflix.

What is resiliency?

Resiliency is the capability to handle partial failures while continuing to execute and not crash. In modern application architectures — whether it be micro services running in containers on-premises or applications running in the cloud — failures are going to occur. For example, applications that communicate over networks (like services talking to a database or an API) are subject to transient failures. These temporary faults cause lesser amounts of downtime due to timeouts, overloaded resources, networking hiccups, and other problems that come and go and are hard to reproduce. These failures are usually self-correcting. (Source)
Today I will present an approach that mitigated this abrupt job failure.

The Solution Design

Basically, I wanted to have a job progress watcher, waiting to see a failed job and then resubmit that job as a retry-logic. Also, don’t want to retry more than once, which has potential to repeat a forever-failure loop. I can have my watcher running at a frequency – like every 5 minutes or so.

Azure Functions

Azure Functions continuously impressing me for its lightweight built and consumption-based pricing model. Functions can run with different triggers, among them time schedule trigger- that perfectly fits my purpose.

Prerequisites

The function app needs to retrieve failed ADLA jobs and resubmit them as needed. This can be achieved with the Microsoft.Azure.Management.DataLake.Analytics, Version=3.0.0.0 NuGet package. We will also require Microsoft.Rest. ClientRuntime.Azure.Authentication, Version=2.0.0.0 NuGet package for Access Token retrievals.

Configuration

We need a Service Principal to be able to interact with ADLA instance on Azure. Managed Service Identity (written about it before) can also be used to make it secret less. However, in this example I will use Service Principal to keep it easier to understand. Once we have our Service Principal, we need to configure them in Function Application Settings.

Hacking the function

[FunctionName("FN_ADLA_Job_Retry")]

public static void Run([TimerTrigger("0 0 */2 * * *")]TimerInfo myTimer, TraceWriter log)

{

var accountName = GetEnvironmentVariable("ADLA_NAME");

var tenantId = GetEnvironmentVariable("TENANT_ID");

var clientId = GetEnvironmentVariable("SERVICE_PRINCIPAL_ID");

var clientSecret = GetEnvironmentVariable("SERVICE_PRINCIPAL_SECRET");

 

ProcessFailedJobsAsync(tenantId, clientId, clientSecret, accountName).Wait();

}

That’s our Azure Function scheduled to be run every 2 hours. Once we get a trigger, we retrieve the AD tenant ID, Service Principal ID, secret and the account name of target ADLA.

Next thing we do, write a method that will give us a ADLA REST client – authenticated with Azure AD, ready to make a call to ADLA account.

private static async Task GetAdlaClientAsync(

string clientId, string clientSecret, string tenantId)

{

var creds = new ClientCredential(clientId, clientSecret);

var clientCreds = await ApplicationTokenProvider

.LoginSilentAsync(tenantId, creds);

 

var adlsClient = new DataLakeAnalyticsJobManagementClient(clientCreds);

return adlsClient;

}

The DataLakeAnalyticsJobManagementClient class comes from Microsoft.Azure.Management.DataLake.Analytics, Version=3.0.0.0 NuGet package that we have already installed into our project.

Next, we will write a method that will get us all the failed jobs,

private static async Task<Microsoft.Rest.Azure.IPage>

GetFailedJobsAsync(string accountName, DataLakeAnalyticsJobManagementClient client)

{

// We are ignoring the data pages that has older jobs

// If that's important to you, use CancellationToken to retrieve those pages

return await client.Job

.ListAsync(accountName,

new ODataQuery(job => job.Result == JobResult.Failed));

}

We have now the capability to retrieve failed jobs, great! Now we should write the real logic that will check for failed jobs that never been retried and resubmit them.

private const string RetryJobPrefix = "RETRY-";

public static async Task ProcessFailedJobsAsync(

string tenantId, string clientId, string clientSecret, string accountName)

{

var client = await GetAdlaClientAsync(clientId, clientSecret, tenantId);

 

var failedJobs = await GetFailedJobsAsync(accountName, client);

 

foreach (var failedJob in failedJobs)

{

// If it's a retry attempt we will not kick this off again.

if (failedJob.Name.StartsWith(RetryJobPrefix)) continue;

 

// we will retry this with a name prefixed with a RETRY

var retryJobName = $"{RetryJobPrefix}{failedJob.Name}";

 

// Before we kick this off again, let's check if we already have retried this before..

if (!(await HasRetriedBeforeAsync(accountName, client, retryJobName)))

{

var jobDetails = await client.Job.GetAsync(accountName, failedJob.JobId.Value);

var newJobID = Guid.NewGuid();

 

var properties = new USqlJobProperties(jobDetails.Properties.Script);

var parameters = new JobInformation(

retryJobName,

JobType.USql, properties,

priority: failedJob.Priority,

degreeOfParallelism: failedJob.DegreeOfParallelism,

jobId: newJobID);

 

// resubmit this job now

await client.Job.CreateAsync(accountName, newJobID, parameters);

}

}

}

private async static Task HasRetriedBeforeAsync(string accountName,

DataLakeAnalyticsJobManagementClient client, string name)

{

var jobs = await client.Job

.ListAsync(accountName,

new ODataQuery(job => job.Name == name));

 

return jobs.Any();

}

This is it all!

Final thoughts!

We can’t avoid failures, but we can respond in ways that will keep our system up or at least minimize downtime. In this example, when one Job fails unpredictably, its effects can cause the system to fail.

We should build our own mitigation against these uncertain factors – with automation.