Zero-Secret application development with Azure Managed Service Identity

Committing secrets along with application code to a repository is one of the most common mistakes developers make. This can get nasty when an application is developed for cloud deployment. You have probably read the story of the developer who checked AWS S3 secrets into GitHub. He corrected the mistake within 5 minutes, but still received a hefty invoice because of bots that crawl open source sites looking for secrets. There are many tools that can scan code for potential secret leakages, and they can be embedded in a CI/CD pipeline. These tools do a great job of finding deliberate or unintentional commits that contain secrets before they get merged to a release/master branch. However, they do not protect against every potential secret leak. Developers still need to carefully review their code on every commit.

Azure Managed Service Identity (MSI) can address this problem in a very neat way. MSI makes it possible to design applications that are secret-less: there is no need to have any secrets (especially database connection strings, storage keys etc.) in the application code at all.

Secret management in application

Let's recall how we were doing secret management yesterday. For simplicity's sake, say we have a web application that is backed by a SQL server. This means we almost certainly have a configuration key (the SQL connection string) in our configuration file. If we have storage accounts, we might have a Shared Access Signature (aka SAS token) in our config file as well.

As we can see, we keep adding secrets one after another to our configuration file – in plain text. We now need credential scanner tasks in our pipelines and separate local configuration files (for local development), and we need to mitigate the risk of checking secrets into the repository.

Azure Key Vault as secret store

Azure Key Vault can simplify all of the above a lot and make things much cleaner. We can store the secrets in a Key Vault and, in the CI/CD pipeline, fetch them from the vault and write them into configuration files just before we publish the application code to the cloud infrastructure. VSTS build and release pipelines have the concept of a Library, which can be linked with Key Vault secrets and is designed to do exactly that. The configuration file in this case contains string placeholders that are replaced with secrets during the release execution.

The above works great, but you still have a configuration file with placeholders for all the secrets (when you have multiple services that need secrets), which makes it difficult to manage local and cloud deployments side by side. An improvement is to keep all the secrets in Key Vault and let the application load them at runtime (during the startup event) directly from the vault. This is much easier to manage and a pretty clean solution. The local environment can use a different Key Vault than production, the configuration logic becomes extremely simple, and the configuration file now holds only one secret: a Service Principal secret, which is used to talk to the Key Vault during startup.

So we have all the secrets stored in a vault and exactly one secret in our configuration file – nice! But if we accidentally commit this single secret, all the other secrets in the vault are compromised too. What can we do to make this more secure? Let's recap our knowledge about service principals before we draw the solution.

What is Service Principal?

A resource that is secured by an Azure AD tenant can only be accessed by a security principal. A user is granted access to an AD-secured resource via his security principal, known as a User Principal. When a service (a piece of software) wants to access a secured resource, it needs to use the security principal of an Azure AD Application Object. We call these Service Principals; you can think of a Service Principal as an instance of an Azure AD Application. A service principal has a secret, often referred to as the Client Secret, which is analogous to the password of a user principal. The Service Principal ID (also known as Application ID or Client ID) and the Client Secret together authenticate an application to Azure AD for secure resource access. In our earlier example, we needed to keep this client secret (the only secret) in our configuration file to gain access to the Key Vault. Client secrets have an expiration period, and it is up to the application developers to renew them to keep things secure. In a large solution this can easily become a difficult job: keeping all the service principal secrets renewed with short expiration times.

Managed Service Identity

Managed Service Identity is explained in detail in the Microsoft documentation. In layman's terms, MSI literally is a Service Principal, created directly by Azure, whose client secret is stored and rotated by Azure as well – hence it is "managed". If we create an Azure web app and turn on Managed Service Identity for it (which is just a toggle switch), Azure provisions an Application Object in AD (the Azure Active Directory of the tenant), creates a Service Principal for it and stores the client secret somewhere we never need to see. This MSI now represents the web application's identity in Azure AD.

Managed Service Identity can be provisioned in the Azure Portal, with Azure PowerShell or with the Azure CLI, as below:

az login
az group create --name myResourceGroup --location westus
az appservice plan create --name myPlan --resource-group myResourceGroup --sku S1
az webapp create --name myApp --plan myPlan --resource-group myResourceGroup
az webapp identity assign --name myApp --resource-group myResourceGroup

Or via Azure Resource Manager Template:

{
  "apiVersion": "2016-08-01",
  "type": "Microsoft.Web/sites",
  "name": "[variables('appName')]",
  "location": "[resourceGroup().location]",
  "identity": {
    "type": "SystemAssigned"
  },
  "properties": {
    "name": "[variables('appName')]",
    "serverFarmId": "[resourceId('Microsoft.Web/serverfarms', variables('hostingPlanName'))]",
    "hostingEnvironment": "",
    "clientAffinityEnabled": false,
    "alwaysOn": true
  },
  "dependsOn": [
    "[resourceId('Microsoft.Web/serverfarms', variables('hostingPlanName'))]"
  ]
}

Going back to our Key Vault example, with MSI we can now eliminate the client secret of the Service Principal from our application code.

But wait! We used to read keys/secrets from the Key Vault during application startup, and we needed that client secret to do so. How are we going to talk to the Key Vault now, without the secret?

Using MSI from App service

Azure provides a couple of environment variables for App Services that have MSI enabled:

  • MSI_ENDPOINT
  • MSI_SECRET

The first one is a URL that our application can send a token request to, passing the value of MSI_SECRET in a request header; the response is an access token that lets us talk to the Key Vault. This sounds a bit complex, but fortunately we don't need to do it by hand.
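
For the curious, the raw call looks roughly like the sketch below (a sketch only, assuming the 2017-09-01 App Service MSI API version and Newtonsoft.Json for parsing – the library mentioned next does all of this for you):

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class RawMsiTokenClient
{
    // Illustrative only: fetches a token straight from the local MSI endpoint.
    public static async Task<string> GetTokenAsync(string resource)
    {
        var endpoint = Environment.GetEnvironmentVariable("MSI_ENDPOINT");
        var secret = Environment.GetEnvironmentVariable("MSI_SECRET");

        using (var client = new HttpClient())
        {
            // MSI_SECRET travels in the "Secret" request header
            client.DefaultRequestHeaders.Add("Secret", secret);

            var response = await client.GetAsync(
                $"{endpoint}?resource={Uri.EscapeDataString(resource)}&api-version=2017-09-01");
            response.EnsureSuccessStatusCode();

            var payload = JObject.Parse(await response.Content.ReadAsStringAsync());
            return (string)payload["access_token"];
        }
    }
}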

The Microsoft.Azure.Services.AppAuthentication library for .NET wraps these complexities for us and provides an easy API to get the access token.

We need to add references to the Microsoft.Azure.Services.AppAuthentication and Microsoft.Azure.KeyVault NuGet packages to our application.

Now we can get the access token to communicate with the Key Vault in our startup code as follows:


using Microsoft.Azure.Services.AppAuthentication;
using Microsoft.Azure.KeyVault;

// ...

var azureServiceTokenProvider = new AzureServiceTokenProvider();

string accessToken = await azureServiceTokenProvider.GetAccessTokenAsync("https://management.azure.com/");

// OR

var kv = new KeyVaultClient(
    new KeyVaultClient.AuthenticationCallback(azureServiceTokenProvider.KeyVaultTokenCallback));
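
With the KeyVaultClient in place, reading a secret during startup is a one-liner (the vault URL and secret name below are placeholders):

var connectionString = (await kv.GetSecretAsync(
    "https://contoso-vault.vault.azure.net/", "SqlConnectionString")).Value;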

This is neat, agree? We now have our application configuration file that has no secrets or keys whatsoever. Isn’t it cool?

Step up – activating zero-secret mode

We have managed to deploy our web application with zero secrets above. However, we still have secrets for the SQL database, storage accounts etc. in our Key Vault – we just don't have to put them in our configuration files. They are still there and are loaded during the startup event of our web application. This is a great improvement, of course, but MSI allows us to take it one step further.

Azure AD Authentication for Azure Services

To leverage MSI's full potential we should use Azure AD authentication (RBAC controls). For instance, we have been using Shared Access Signatures or SQL connection strings to communicate with Azure Storage/Service Bus and SQL servers. With AD authentication, we instead use a security principal that has a role assignment with Azure RBAC.

Azure is gradually enabling AD authentication for resources. As of today (at the time of writing this blog) the following services/resources support AD authentication with Managed Service Identity.

Service                | Resource ID                   | Status    | Date           | Assign access via
Azure Resource Manager | https://management.azure.com/ | Available | September 2017 | Azure portal, PowerShell, Azure CLI
Azure Key Vault        | https://vault.azure.net       | Available | September 2017 |
Azure Data Lake        | https://datalake.azure.net/   | Available | September 2017 |
Azure SQL              | https://database.windows.net/ | Available | October 2017   |
Azure Event Hubs       | https://eventhubs.azure.net   | Available | December 2017  |
Azure Service Bus      | https://servicebus.azure.net  | Available | December 2017  |
Azure Storage          | https://storage.azure.com/    | Preview   | May 2018       |

Read more updated info here.

AD authentication finally allows us to completely remove those secrets from the Key Vault and directly access the storage accounts, Data Lake stores and SQL servers with MSI tokens. Let's see some examples to understand this.

Example: Accessing Storage Queues with MSI

In our earlier example, we talked about the Azure web app for which we have enabled Managed Service Identity. In this example we will see how we can put a message into an Azure Storage Queue using MSI. Let's assume our web application name is:

contoso-msi-web-app

Once we have enabled the managed service identity for this web app, Azure provisioned an identity (an AD Application object and a Service Principal for it) with the same name as the web application, i.e. contoso-msi-web-app.

Now we need to create a role assignment for this Service Principal so that it can access the storage account. We can do that in the Azure Portal: go to the storage account's IAM blade (the access control page) and add a role for this principal. Of course, you can also do that with PowerShell.

If you are not doing it in the Portal, you need to know the ID of the MSI. Here's how you get that (in an Azure CLI console):


az resource show -n $webApp -g $resourceGroup --resource-type Microsoft.Web/sites --query identity

You should see output like the following:

{
  "principalId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "tenantId": "xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx",
  "type": null
}

The Principal ID is what you are after. We can now assign roles for this principal as follows:

# $principalId holds the MSI principal ID retrieved in the previous step
$existingRoleDef = Get-AzureRmRoleAssignment `
    -ObjectId $principalId `
    -RoleDefinitionName "Contributor" `
    -ResourceGroupName "RGP NAME"
If ($existingRoleDef -eq $null) {
    New-AzureRmRoleAssignment `
        -ObjectId $principalId `
        -RoleDefinitionName "Contributor" `
        -ResourceGroupName "RGP NAME"
}

You can run these commands in the CD pipeline with inline Azure PowerShell tasks in VSTS release pipelines.

Let’s write a MSI token helper class.
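
A minimal sketch of such a helper, using only the Microsoft.Azure.Services.AppAuthentication package, could look like this:

using System.Threading.Tasks;
using Microsoft.Azure.Services.AppAuthentication;

// Thin wrapper around AzureServiceTokenProvider that hands out MSI tokens per resource.
public static class MsiTokenHelper
{
    private static readonly AzureServiceTokenProvider TokenProvider = new AzureServiceTokenProvider();

    public static Task<string> GetTokenAsync(string resource)
    {
        // e.g. "https://storage.azure.com/" for Azure Storage,
        //      "https://database.windows.net/" for Azure SQL
        return TokenProvider.GetAccessTokenAsync(resource);
    }
}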

We will use the Token Helper in a Storage Account helper class.
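
A hypothetical storage helper built on top of that token (the account name is a placeholder) might look like this:

using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Queue;

public class StorageQueueHelper
{
    private readonly string accountName;

    public StorageQueueHelper(string accountName)
    {
        this.accountName = accountName;
    }

    public async Task<CloudQueue> GetQueueAsync(string queueName)
    {
        // Exchange the web app's MSI for a storage access token
        var accessToken = await MsiTokenHelper.GetTokenAsync("https://storage.azure.com/");
        var credentials = new StorageCredentials(new TokenCredential(accessToken));

        var queue = new CloudQueue(
            new Uri($"https://{accountName}.queue.core.windows.net/{queueName}"),
            credentials);

        await queue.CreateIfNotExistsAsync();
        return queue;
    }
}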

Now, let’s write a message into the Storage Queue.
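
Putting the pieces together, a sketch of enqueueing a message (account and queue names are placeholders) could be:

var queueHelper = new StorageQueueHelper("contosomsistorage");
var queue = await queueHelper.GetQueueAsync("orders");

// The message lands in the queue without any storage key or SAS token involved
await queue.AddMessageAsync(new CloudQueueMessage("Hello from MSI!"));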

Isn’t it awesome?

Another example, this time SQL server

As of now, Azure SQL Database does not support creating logins or users from service principals created by Managed Service Identity. Fortunately, we have a workaround: we can add the MSI principal to an AAD group as a member, and then grant that group access to the database.

We can use the Azure CLI to create the group and add our MSI to it:

az ad group create --display-name sqlusers --mail-nickname 'NotNeeded'
az ad group member add -g sqlusers --member-id xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx

Again, we are using the MSI principal ID as the member id parameter here.
Next, we need to allow this group to access the SQL database. PowerShell to the rescue again:

$query = @"
CREATE USER [$adGroupName] FROM EXTERNAL PROVIDER
GO
ALTER ROLE db_owner ADD MEMBER [$adGroupName]
"@
sqlcmd.exe -S "tcp:$sqlServer,1433" `
    -N -C -d $database -G -U $sqlAdmin.UserName `
    -P $sqlAdmin.GetNetworkCredential().Password `
    -Q $query

Let's write a token helper class for SQL, as we did before for the storage queue.
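
A sketch of such a helper, reusing the token provider from before and assuming System.Data.SqlClient, could be:

using System.Data.SqlClient;
using System.Threading.Tasks;

public static class SqlConnectionHelper
{
    public static async Task<SqlConnection> OpenConnectionAsync(string connectionString)
    {
        // Note: the connection string carries no user id or password
        var connection = new SqlConnection(connectionString);
        connection.AccessToken = await MsiTokenHelper.GetTokenAsync("https://database.windows.net/");
        await connection.OpenAsync();
        return connection;
    }
}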

We are almost done; now we can run SQL commands from the web app like this:
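
For illustration, a query from the web app could then look like this (server and database names are placeholders):

using (var connection = await SqlConnectionHelper.OpenConnectionAsync(
    "Server=tcp:contoso-sql-server.database.windows.net,1433;Database=contoso-db;"))
using (var command = new SqlCommand("SELECT COUNT(*) FROM dbo.Users", connection))
{
    var userCount = await command.ExecuteScalarAsync();
}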

Voila!

Conclusion

Managed Service Identity is awesome and powerful; it really helps build applications whose security is easy to manage over a long period. Especially when you have lots of applications, you end up with a huge number of service principals, and managing their secrets over time and keeping track of their expirations is a nightmare. Managed Service Identity makes it so much nicer!

 

Thanks for reading!

Secure Azure Web sites with Web Application Gateway with end-to-end SSL connections

The Problem

In order to meet higher compliance demands, and often as a security best practice, we want to put an Azure web site behind a Web Application Firewall (aka WAF). The WAF provides mitigations for known malicious attack vectors, as defined in the OWASP top 10 security vulnerabilities. Azure Application Gateway is a layer 7 load balancer that provides WAF out of the box. However, restricting Web App access to the Application Gateway is not trivial.
To achieve the best isolation, and hence protection, we can provision an Azure App Service Environment (aka ASE) and put all the web apps inside the virtual network of the ASE. This is by far the most secure way to lock down a web application and other Azure resources from internet access. But an ASE deployment has other consequences: it is costly, and because the web apps are totally isolated, sitting in a private VNET, the dev team needs to adopt an unusual deployment pipeline to continuously deploy changes into the web apps. That is not an ideal solution for many scenarios.
However, there's an intermediate solution architecture that provides WAF protection without getting into the complexities that an ASE brings into the solution architecture, allowing a sort of best of both worlds. The architecture looks like the following:

The idea is to provision an Application Gateway inside a virtual network and configure it as a reverse proxy for the Azure web app. This means the web app should never receive traffic directly, only through the gateway. The gateway needs to be configured with the custom domain and SSL certificates. Once a request is received, the gateway terminates the SSL connection and creates a new SSL connection to the back-end web apps configured in a back-end pool. For development purposes, the back-end apps can use the Azure wildcard certificate (*.azurewebsites.net), but for production scenarios it's recommended to use a custom certificate. To make sure no direct traffic gets through to the Azure web apps, we also need to white-list the gateway IP address in the web apps. This blocks every request except the ones coming through the gateway.

How to do that?

I have prepared an Azure Resource Manager template in this GitHub repo that will provision the following:

  • Virtual network (Application Gateway needs a Virtual network).
  • Subnet for the Application Gateway into the virtual network.
  • Public IP address for the Application Gateway.
  • An Application Gateway that is pre-configured to protect any Azure Web site.

How to provision?

Before you run the scripts you need the following:
  • Azure subscription
  • Azure web site to guard with WAF
  • SSL certificate to configure the front-end listeners. (This is the gateway certificate that will be presented to the end users – basically the browsers – of your apps.) Typically a Personal Information Exchange (aka pfx) file.
  • The password of the pfx file.
  • SSL certificate that is used to protect the Azure web sites, typically a *.cer file. This can be the *.azurewebsites.net wildcard certificate for development purposes.
You need to fill out the parameters.json file with the appropriate values, some examples are given below:
        "vnetName": {
            "value": "myvnet"
        },
        "appGatewayName": {
            "value": "mygateway"
        },
        "azureWebsiteFqdn": {
            "value": "myapp.azurewebsites.net"
        },
        "frontendCertificateData": {
            "value": ""
        },
        "frontendCertificatePassword": {
            "value": ""
        },
        "backendCertificateData": {
            "value": ""
        }
Here, frontendCertificateData needs to be the Base64-encoded content of your pfx file.
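
One way to produce that value (the file path below is a placeholder) is a small snippet like this:

var pfxBytes = System.IO.File.ReadAllBytes(@"C:\certs\frontend-cert.pfx");
// Paste the resulting string into the frontendCertificateData parameter
var frontendCertificateData = System.Convert.ToBase64String(pfxBytes);
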
Once you have the pre-requisites, go to PowerShell and run:
    $> ./deploy.ps1 `
        -subscriptionId "" `
        -resourceGroupName ""
This will provision the Application Gateway in your resource group.

Important !

The final piece of work that you need to do is to whitelist the IP address of the Application Gateway in your Azure Web App. This makes sure nobody can get direct access to your Azure web app; all traffic must come through the gateway.

Contribute

Contribution is always appreciated.

CQRS and ES on Azure Table Storage

Lately I was playing with Event Sourcing and the Command Query Responsibility Segregation (aka CQRS) pattern on Azure Table storage, and I thought of creating a lightweight library that facilitates writing such applications. I ended up with a NuGet package to do this; here is the GitHub repository.

A lightweight CQRS supporting library with Event Store based on Azure Table Storage.

Quick start guide

Install

Install the SuperNova.Storage NuGet package into the project.

Install-Package SuperNova.Storage -Version 1.0.0

The dependencies of the package are:

  • .NETCoreApp 2.0
  • Microsoft.Azure.DocumentDB.Core (>= 1.7.1)
  • Microsoft.Extensions.Logging.Debug (>= 2.0.0)
  • SuperNova.Shared (>= 1.0.0)
  • WindowsAzure.Storage (>= 8.5.0)

Implementation guide

Write Side – Event Sourcing

Once the package is installed, we can start sourcing events in an application. For example, let’s start with a canonical example of UserController in a Web API project.

We can use dependency injection to make the EventStore available in our controller.

Here’s an example where we register an instance of Event Store with DI framework in our Startup.cs

// Config object encapsulates the table storage connection string
services.AddSingleton(new EventStore( ... provide config ));

Now the controller:

[Produces("application/json")]
[Route("users")]
public class UsersController : Controller
{
public UsersController(IEventStore eventStore)
{
this.eventStore = eventStore; // Here capture the event store handle
}

... other methods skipped here
}

Aggregate

Implementing event sourcing becomes much handier when it's fostered with Domain Driven Design (aka DDD). We are going to assume that we are familiar with DDD concepts (especially Aggregate Roots).

An aggregate is our consistency boundary (read: transactional boundary) in Event Sourcing. (Technically, aggregate IDs are our partition keys on the Event Store table – therefore, we can only apply an atomic operation at the level of a single aggregate root.)

Let’s create an Aggregate for our User domain entity:

using SuperNova.Shared.Messaging.Events.Users;
using SuperNova.Shared.Supports;

public class UserAggregate : AggregateRoot
{
private string _userName;
private string _emailAddress;
private Guid _userId;
private bool _blocked;

... other members skipped here
}

Once we have the aggregate class written, we should come up with the events that are relevant to this aggregate. We can use Event storming to come up with the relevant events.

Here are the events that we will use for our example scenario:

public class UserAggregate : AggregateRoot
{

... skipped other codes

#region Apply events
private void Apply(UserRegistered e)
{
this._userId = e.AggregateId;
this._userName = e.UserName;
this._emailAddress = e.Email;
}

private void Apply(UserBlocked e)
{
this._blocked = true;
}

private void Apply(UserNameChanged e)
{
this._userName = e.NewName;
}
#endregion

... skipped other codes
}
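
The event classes themselves are not shown above; hypothetical shapes, inferred from the Apply methods (the real definitions live in SuperNova.Shared.Messaging.Events.Users), might look like this:

using System;

// Hypothetical event shapes, inferred from how they are consumed in Apply(...)
public class UserRegistered
{
    public Guid AggregateId { get; set; }
    public string UserName { get; set; }
    public string Email { get; set; }
}

public class UserBlocked
{
    public Guid AggregateId { get; set; }
}

public class UserNameChanged
{
    public Guid AggregateId { get; set; }
    public string NewName { get; set; }
}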

Now that we have our business events defined, we will define our commands for the aggregate:

public class UserAggregate : AggregateRoot
{
#region Accept commands
public void RegisterNew(string userName, string emailAddress)
{
Ensure.ArgumentNotNullOrWhiteSpace(userName, nameof(userName));
Ensure.ArgumentNotNullOrWhiteSpace(emailAddress, nameof(emailAddress));

ApplyChange(new UserRegistered
{
AggregateId = Guid.NewGuid(),
Email = emailAddress,
UserName = userName
});
}

public void BlockUser(Guid userId)
{
ApplyChange(new UserBlocked
{
AggregateId = userId
});
}

public void RenameUser(Guid userId, string name)
{
Ensure.ArgumentNotNullOrWhiteSpace(name, nameof(name));

ApplyChange(new UserNameChanged
{
AggregateId = userId,
NewName = name
});
}
#endregion


... skipped other codes
}

So far so good!

Now we will modify the web api controller to send the correct command to the aggregate.

public class UserPayload 
{
public string UserName { get; set; }
public string Email { get; set; }
}

// POST: User
[HttpPost]
public async Task Post(Guid projectId, [FromBody]UserPayload user)
{
Ensure.ArgumentNotNull(user, nameof(user));

var userId = Guid.NewGuid();

await eventStore.ExecuteNewAsync(
Tenant, "user_event_stream", userId, async () => {

var aggregate = new UserAggregate();

aggregate.RegisterNew(user.UserName, user.Email);

return await Task.FromResult(aggregate);
});

return new JsonResult(new { id = userId });
}

And another API to modify existing users in the system:

//PUT: User
[HttpPut("{userId}")]
public async Task Put(Guid projectId, Guid userId, [FromBody]string name)
{
Ensure.ArgumentNotNullOrWhiteSpace(name, nameof(name));

await eventStore.ExecuteEditAsync(
Tenant, "user_event_stream", userId,
async (aggregate) =>
{
aggregate.RenameUser(userId, name);

await Task.CompletedTask;
}).ConfigureAwait(false);

return new JsonResult(new { id = userId });
}

That's it! We have our WRITE side completed. The event store now contains the events for the user event stream.

EventStore

Read Side – Materialized Views

We can consume the events in a separate console worker process and generate the materialized views for the READ side.

The readers (the console application – an Azure Web Worker, for instance) act like feed processors and have their own lease collection, which makes them fault tolerant and resilient. If a reader crashes, it catches up from the last event version that was materialized successfully. It does polling – instead of using a message broker (Service Bus, for instance) – on purpose, to speed things up and avoid latencies during event propagation. Scalability is ensured by dedicating a lease per tenant and event stream, which provides pretty high scalability.

How to listen for events?

In a worker application (typically a console application) we will listen for events:

private static async Task Run()
{
var eventConsumer = new EventStreamConsumer(
... skipped for simplicity
"user-event-stream",
"user-event-stream-lease");

await eventConsumer.RunAndBlock((evts) =>
{
foreach (var @evt in evts)
{
if (evt is UserRegistered userAddedEvent)
{
readModel.AddUserAsync(new UserDto
{
UserId = userAddedEvent.AggregateId,
Name = userAddedEvent.UserName,
Email = userAddedEvent.Email
}, evt.Version);
}

else if (evt is UserNameChanged userChangedEvent)
{
readModel.UpdateUserAsync(new UserDto
{
UserId = userChangedEvent.AggregateId,
Name = userChangedEvent.NewName
}, evt.Version);
}
}

}, CancellationToken.None);
}

static void Main(string[] args)
{
Run().Wait();
}

Now we have a document collection (we are using Cosmos DB/DocumentDB in this example for materialization, but it could essentially be any database) that is updated as we store events in the event stream.
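
The readModel used in the consumer above isn't shown; a hypothetical shape, inferred from the calls made on it, could be:

using System;
using System.Threading.Tasks;

// Hypothetical read-model contract, inferred from readModel.AddUserAsync/UpdateUserAsync above
public class UserDto
{
    public Guid UserId { get; set; }
    public string Name { get; set; }
    public string Email { get; set; }
}

public interface IUserReadModel
{
    Task AddUserAsync(UserDto user, long version);
    Task UpdateUserAsync(UserDto user, long version);
}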

Conclusion

The library is very lightweight and heavily influenced by Greg Young's event store model and aggregate model. Feel free to use/contribute.

Thank you!

Azure template to provision Docker swarm mode cluster

What is a swarm?

The cluster management and orchestration features embedded in the Docker Engine are built using SwarmKit. Docker engines participating in a cluster are running in swarm mode. You enable swarm mode for an engine by either initializing a swarm or joining an existing swarm. A swarm is a cluster of Docker engines, or nodes, where you deploy services. The Docker Engine CLI and API include commands to manage swarm nodes (e.g., add or remove nodes), and deploy and orchestrate services across the swarm.

I was recently trying to come up with a script that generates a Docker swarm cluster – ready to take container workloads – on Microsoft Azure. I thought Azure Container Service (ACS) would already support that; however, I figured out that's not the case. Azure doesn't support Docker swarm mode in ACS yet – at least as of today (25th July 2017) – which forced me to come up with my own RM template that does the job.

What’s in it?

The RM template will provision the following resources:

  • A virtual network
  • An availability set for manager nodes
  • 3 virtual machines with the AV set created above. (the numbers, names can be parameterized as per your needs)
  • A load balancer (with a public port that round-robins to the 3 VMs on port 80, and inbound NAT rules from ports 5000, 5001 and 5002 to SSH port 22 on the 3 machines).
  • Configuration of the 3 VMs as Docker swarm mode managers.
  • A virtual machine scale set (VMSS) in the same VNET.
  • 3 nodes that are joined as workers into the above swarm.
  • A load balancer for the VMSS (that allows inbound NAT starting from port 50000 to SSH port 22 on the VMSS instances).

The design can be visualized with the following diagram:

There's a handy PowerShell script that can help automate provisioning these resources. But you can also just click the "Deploy to Azure" button below.

Thanks!

The entire set of scripts can be found in this GitHub repo. Feel free to use them as needed!

IAC – Using Azure RM templates

As cloud software development heavily leverages virtualized systems and developers have started using Continuous Integration (CI), many things have started to change. The number of environments developers have to deal with has gone up significantly. Developers now release much more frequently, in many cases multiple times in a single day. All these releases have to be tested and validated. This brings up a new requirement: to spin up an environment fast which is identical to production.

The need for an automated way of provisioning such environments fast (and in a repeatable manner) became obvious, and hence IaC (which stands for Infrastructure as Code) kicked in.

There are numerous tools (Puppet, Ansible, Vagrant etc.) that help build such coded environments. The Azure Resource Manager template brings a new way of doing IaC when an application is targeted to build and run on Azure. Most of these tools (including RM templates) are even idempotent, which ensures that you can run the same configuration multiple times while achieving the same result.

From Microsoft Azure web site:

Azure applications typically require a combination of resources (such as a database server, database, or website) to meet the desired goals. Rather than deploying and managing each resource separately, you can create an Azure Resource Manager template that deploys and provisions all of the resources for your application in a single, coordinated operation. In the template, you define the resources that are needed for the application and specify deployment parameters to input values for different environments. The template consists of JSON and expressions which you can use to construct values for your deployment.

I was excited the first time I saw this in action in one of the Channel 9 videos and couldn't wait to give it a go. The idea of having a template that describes all the Azure resources (Service Bus, SQL Azure, VMs, Web Apps etc.) in a template file, and having the capability to parameterize it with different values that vary over different environments, can be very handy for CI/CD scenarios. The templates can be nested, which also makes them more modular and more manageable.

Lately I have had the pleasure of digging deeper into Azure RM templates, as we are using them for the project I am working on these days. I wanted to come up with a sample template that shows how to use an RM template to construct resources, which allows me to share my learnings. The scripts can be found in this GitHub repo.

One problem that I didn't know how to handle yet was the credentials needed in order to provision the infrastructure; for instance, the VM passwords, SQL passwords etc. I don't think anybody wants to check their passwords into the source control system, visible in Azure RM parameter JSON files. To address this issue, the solution I came up with for now is the following: I uploaded the RM parameter JSON files into a private container of a Blob Storage account (note that the storage account is in the same Azure subscription where the infrastructure I intend to provision lives). A PowerShell script then downloads a Shared Access Signature (SAS) token for that Blob storage container and uses it to download the parameters JSON blob into a PSCustomObject, and removes the locally downloaded JSON file. In the next step, it converts the PSCustomObject into a hash table which is passed to the Azure RM cmdlet to kick off the provisioning process. That way, there is no need to have a file checked in to the source control system that contains credentials. Also, the administrators who manage the Azure subscription can create a private Blob storage container and use the Azure Storage Explorer to create and update their credentials in the RM parameters JSON file. A CI process can download the parameters files just in time before provisioning the infrastructure.

Production-ready mesosphere cluster on Azure with single command

Lately I have been busy exploring how a production-ready Mesosphere cluster can be built on top of Azure. The journey was interesting because it went through quite a few technologies that I was almost oblivious to before I started, but at the same time I was so excited and amazed by their capabilities that I felt I should share this experience. This article is aimed at explaining these technologies to a beginner rather than to any DevOps ninjas.

Before I go forward, let's set up some very basic descriptions of the few technologies/components that are used in the process. What can be better than starting with Docker?

Docker

From the Docker site: Docker is an open platform for developing, shipping, and running applications. Docker is designed to deliver your applications faster. With Docker you can separate your applications from your infrastructure and treat your infrastructure like a managed application. Docker helps you ship code faster, test faster, deploy faster, and shorten the cycle between writing code and running code. Docker uses Linux Containers (LXC) and the AuFS file system. One can easily confuse it with virtual machines (in fact there are a few questions on Stack Overflow about this), but Docker differs in many aspects from virtual machines. It is significantly more lightweight compared to a VM. More importantly, it can work with delta changes. Let's try to understand what that means with an example scenario:

Let's say we have an application that runs in a web server (Apache, for example) and serves an HTML document and JavaScript files. We can now define a script (DSL) file that describes how the application is constructed: a Dockerfile. A Dockerfile describes the application, specifying that it needs an OS (let's say Ubuntu), then the Apache web server, then that the HTML and JS files should be copied into a certain directory, that a port needs to be opened for Apache, and so on.

With that Dockerfile, we can instruct Docker (a daemon process running after installing Docker) to build an image from this file. Once the image is built, we can ask Docker to run that image (like an instance) and that's it! The application is running. The image can be metaphorically seen as a VHD for virtual machines.

It gets more interesting when the Docker registry (a.k.a. the hub) comes into the picture. Notice that in our Dockerfile we first said we need Ubuntu as the OS. So how does that become part of our image during the Docker build? There is a public registry (Docker Hub, pretty much like GitHub) where plenty of images are made available by numerous contributors. There is a base image that only builds an image with the Ubuntu OS, and in our Dockerfile we simply mentioned that image as our base image. On top of that image we added the Apache web server (like a layer) and then our HTMLs (a second layer). When the Docker daemon builds the image, it looks in the local cache for the base Ubuntu image and, when it's not found, fetches it from the public Docker registry. Then it creates the other layers on top of it to compose the image we want. Now if we change our HTMLs (add/remove them) and ask the Docker daemon to build again, it will be significantly faster, because it recognizes the deltas and doesn't download Ubuntu or Apache again. It only changes the layer that has changes and delivers a new image, which we can run, and our changes will be reflected as expected. One can also define their own private Docker registry; in that case the images will not be publicly exposed – suitable for enterprise business applications.

This feature makes it really powerful for a continuous deployment process, where the build pipelines can output a Docker image of the application component, push it to the registry (public or private hub), and in production do a pull (as it recognizes deltas, it will be faster) and run that new image. Pretty darn cool! To learn more about Docker, visit their site.

Vagrant

Vagrant is a tool for building complete development environments, sandboxed in a virtual machine. It helps enforce good practices by encouraging the use of automation so that development environments are as close to production as possible.

It's kind of a tool that addresses the infamous "works on my machine" problem. A developer can build an environment and create a Vagrantfile for it; Vagrant makes sure that the same Vagrantfile gives other developers the exact same environment to run the same application.

A Vagrantfile is like a Dockerfile (described above) where VMs are defined (with their network needs, port forwarding, file sharing etc.). Once Vagrant executes such a file with a

vagrant up

command on the console, it uses a virtual machine provider (Oracle VirtualBox, for example) to provision the VMs, and once the machine is booted, it also allows us to write scripts in Ansible, Puppet, Chef, Terraform or even plain old bash that will be executed in the guest VMs to prepare them as needed. Bash isn't idempotent out of the box; however, tools like Ansible and Terraform are idempotent, which makes them really the tools of choice. Vagrant in conjunction with these system configuration technologies can provide true Infrastructure as Code.

It's over a year now since MSOpenTech developed an Azure provider for Vagrant, which allows us to manage infrastructure in a Vagrantfile and possibly use the same file to provision identical infrastructure both on a developer's local machine and in an Azure production area, exactly the same way and easily (possibly with a single command).

So, now we know that Docker ensures we can containerize and ship an application exactly the way we like into production, and that Vagrant, with or without Ansible, Puppet etc., can build the required infrastructure; we can run a few application instances nice and smooth in production. But the problem gets a little complicated when we want to scale our applications up/out/down. In a microservice scenario the problem gets amplified quite a bit: an application can easily end up having numerous dockerized containers running on multiple machines. Managing, or even keeping track of, those application instances can easily become a nightmare. It's obvious that some automation is needed to manage the container instances: scale some of them up/out as needed (based on demand), allocate resources (CPU, RAM) unevenly to these applications based on their needs, spread them over multiple machines to achieve high availability, and make them fault tolerant by spinning up new instances in case of a failure.

A hell of a lot of work! The good news is we don't need to develop that beast; there are solutions that address such scenarios, and Mesosphere is one of them.

Mesosphere

Mesosphere – as their site describes it,

it’s like a new kind of operating system. The Mesosphere Datacenter Operating System (DCOS) is a new kind of operating system that spans all of the servers in a physical or cloud-based datacenter, and runs on top of any Linux distribution.

It's as big a thing as it sounds; it indeed is. The Mesosphere DCOS includes a rich ecosystem of components. The components this article focuses on are as follows:

Apache ZooKeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

Mesos

Mesos site says:

Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. It is an open source software originally developed at the University of California at Berkeley. It can run many applications on a dynamically shared pool of nodes.

It is battle tested; prominent users of Mesos include Twitter, Airbnb etc.

Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments. It can scale out to massive clusters of 10,000+ nodes. It has fault-tolerant replicated masters and slaves using ZooKeeper, and supports Docker containers.

Mesos has one "leader" mesos-master (with multiple standby masters managed by ZooKeeper, which makes it resilient) and multiple mesos-slaves, which are the worker nodes. The worker nodes issue "offers" (the capabilities of the machines) to Mesos. Mesos also supports "frameworks", which can play with the offers that are made available to the master. Such a framework is essentially a scheduler that decides which workloads are assigned to which worker based on the offers it receives from Mesos. One such framework we will be looking at is Marathon.

Marathon

Marathon is a cluster-wide init and control system for services in cgroups or Docker containers.

Marathon is roughly a scheduler framework (actually more than that, but we will see that later) that works together with Chronos and sits on top of Mesos.

Marathon provides a REST API for starting, stopping, and scaling applications. Marathon can run in highly-available mode by running multiple copies. The state of running tasks gets stored in the Mesos state abstraction.

Marathon is a meta framework: It can start other Mesos frameworks such as Chronos or Storm with it to ensure they survive machine failures. It can launch anything that can be launched in a standard shell (thus, Docker images too).

See them in action

We now have some basic understanding of these components, especially the Mesosphere cluster, so let's build a Vagrant configuration that will build a Mesosphere cluster on our local Windows machine (a laptop is sufficient; I used a Windows 8.1 machine as my playground). We will be using three Mesos masters, all of which also have ZooKeeper and Marathon installed on them, and we will have three Mesos slave machines to run workloads. To prepare the laptop we need to download and install Vagrant first. The next step is creating the Vagrantfile that contains the infrastructure as code. Here is the script snippet that defines the master VMs; the entire Vagrantfile can be found here.

https://gist.github.com/MoimHossain/a4a52bbd729170715a4d.js

As we can see here, we are defining the master machines with IP addresses starting from 192.0.2.1 and continuing with 192.0.2.2, 192.0.2.3 (a Vagrantfile is a Ruby file, therefore it's an absolutely programmable script). We can now literally go to this directory from the command prompt and run


$> vagrant up

This should create three VMs in the local Oracle VirtualBox (that's the default provider here). However, once the machines get created we need to install Mesos, Marathon and ZooKeeper on them and also configure them on those machines. Here comes the provisioning part. The code snippet here shows that at the end we tell Vagrant to provision the guest OS with a bash command file. This is not the best option in my opinion (because it's not idempotent); Ansible or Terraform would be better options, but bash makes it easy to understand the stuff.

The master provisioning script is also in the same GitHub repo.

Let's quickly walk through some crucial parts of the script.


sudo apt-get -y install mesosphere

# Set up the ZooKeeper configuration with all the master machine IP addresses
sudo sed -i -e s/localhost:2181/192.0.2.101:2181,192.0.2.102:2181,192.0.2.103:2181/g /etc/mesos/zk

The script in GitHub has comments that explain what these configurations do, so I will not repeat them here. The basic idea is installing and configuring the Mesos masters and Marathon instances for the cluster.

The Vagrantfile also creates three slave machines; these are the machines where the workloads will be executed. The slave machines are configured with the Mesos slave software components in the same way we provisioned the master machines. There is a slave script in the above-mentioned GitHub repo.

Now we are pretty much ready to kick it off. Just vagrant up, and your laptop now has a virtual cluster that is conceptually production ready! Of course no one should use Oracle VirtualBox to build a cluster on a single piece of hardware; that doesn't make sense. But the code and the idea are absolutely ready to use with a different provider, like Azure, AWS, any other cloud vendor, or even a proprietary bare-metal data center.

Taking it one step further

Let's build the same cluster on Microsoft Azure. MSOpenTech has very recently created an Azure provider for Vagrant; we will be using that here. However, there are some limitations that took me a while to work around. The first problem is that Vagrant provisioning scripts need to know and use the IP addresses of the VMs that are created by the provider. For VirtualBox that's not an issue: we can define the IP addresses up front in our Vagrantfile. But in Azure, the IP addresses are assigned dynamically. Also, we need to use the internal IP addresses of the machines, not the virtual public IP addresses. Using virtual IP addresses would cause the master servers to communicate with each other by going out and then coming back in through the Azure load balancer – costly and slow. Using an Azure virtual network we can define IP ranges, but we can never guarantee which machine gets exactly which IP address. I managed to work around this issue by using the Azure CLI and PowerShell.

The workaround goes like this: a PowerShell script boots the entire provisioning process (light.ps1); it uses Vagrant to do the VM provisioning (creating a cloud service for all six machines) and to create and attach disks for them. Once Vagrant has finished booting up the machines, the PowerShell script gets control back. It then uses Azure cmdlets to read the machine metadata from the cloud service that was just provisioned.

This metadata returns the internal IP addresses of the machines. The script then creates some bash files in a local directory – to configure Mesos, Marathon, ZooKeeper etc. – using the IP addresses retrieved earlier.

Once these provisioning files are available on disk, the PowerShell script calls Vagrant again to provision each machine using those dynamically created bash files. The process finally creates the Azure endpoints for the appropriate servers so that we can access the Mesos and Marathon consoles from our local machine to administer and monitor the cluster we have just created. The entire set of scripts and Vagrantfiles can be found in this Git repo.

The process takes about 25 to 30 minutes depending on internet speed, but it ends up with a production-ready Mesos cluster up and running on Windows Azure. All we need to do is get the PowerShell script and Vagrantfile and launch "Light.ps1" from the PowerShell command line. Which is kind of cool!

The script has already created endpoints for Mesos and Marathon on the VMs. We can now visit the Mesos management console by following a URL like http://cloudservicename.cloudapp.net:5050. It may be the case that a different master is leading the cluster; in that case, the port may be 5051 or 5052, but the console will display that message too.

Similarly, the Marathon management console can be found at http://cloudservicename.cloudapp.net:8080, where we can monitor and scale tasks with button clicks. It also has a powerful REST API which can be leveraged to take this even further.

Summary

It's quite a lot of stuff going on here, especially for someone who is new to this territory. But what I can say is, the possibilities it offers probably pay off the effort of learning and dealing with these technologies.

RabbitMQ High-availability clusters on Azure VM

Background

Recently I had to look into a reliable AMQP solution (publish-subscribe queue model) in order to build a message broker for a large application. I started with Azure Service Bus and RabbitMQ. It didn't take long to understand that RabbitMQ is much more attractive than Service Bus because of the efficiency and cost comparison when there are large numbers of messages. See the image taken from Mariusz Wojcik's blog.

Setting up RabbitMQ on a Windows machine is relatively easy; the RabbitMQ web site documents nicely how to do that. However, when it comes to installing a RabbitMQ cluster on some cloud VMs, I found Linux (Ubuntu) VMs handier because of their faster booting. I hadn't used a *nix OS for quite a long time, so I found the journey interesting enough to write a post about it.

Spin up VMs on Azure

We need two Linux VMs; both will have RabbitMQ installed as a server and they will be clustered. The high-level picture of the design looks like the following:

Login to the Azure portal and create two VM instances based on the Ubuntu Server 14.04 LTS images on Azure VM depot.

I have named them MUbuntu1 and MUbuntu2. The VMs need to be in the same cloud service and the same availability set to achieve redundancy and high availability. The availability set ensures that the Azure Fabric Controller will recognize this scenario and will not take all the VMs down together when it does maintenance tasks, i.e. OS patches/updates for example.

Once the VM instances are up and running, we need to define some endpoints for RabbitMQ; they also need to be load balanced. We go to the MUbuntu1 details in the management portal and add two endpoints: port 5672 for RabbitMQ connections from client applications and port 15672 for the RabbitMQ management portal application. Scott Hanselman has described in detail how to create load balanced VMs. Once we create them it will look like the following:

Now we can SSH into both of these machines (Azure has already mapped the SSH port 22 to a port which can be found on the right side of the dashboard page for the VM).

Install RabbitMQ

Once we SSH into the terminals of both of the machines we can install RabbitMQ by executing the following commands:



sudo add-apt-repository 'deb http://www.rabbitmq.com/debian/ testing main'
sudo apt-get update
sudo apt-get -q -y --force-yes install rabbitmq-server

The above apt-get will install Erlang and the RabbitMQ server on both machines. Erlang nodes use a cookie to determine whether they are allowed to communicate with each other: for two nodes to be able to communicate they must have the same cookie. Erlang will automatically create a random cookie file when the RabbitMQ server starts up. The easiest way to proceed is to allow one node to create the file, and then copy it to all the other nodes in the cluster. On our VMs the cookie is typically located at /var/lib/rabbitmq/.erlang.cookie

We are going to create the cookie on both machines by executing the following commands:



echo 'ERLANGCOOKIEVALUE' | sudo tee /var/lib/rabbitmq/.erlang.cookie
sudo chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
sudo chmod 400 /var/lib/rabbitmq/.erlang.cookie
sudo invoke-rc.d rabbitmq-server start

Install Management portal for RabbitMQ

Now we can also install the RabbitMQ management portal so we can monitor the queue from a browser. The following commands will install the management plugin:



sudo rabbitmq-plugins enable rabbitmq_management
sudo invoke-rc.d rabbitmq-server stop
sudo invoke-rc.d rabbitmq-server start

So far so good. Now we create a user that we will use to connect to the queue from the clients and for monitoring. You can manage users any time later, too.



sudo rabbitmqctl add_user <username> <password>
sudo rabbitmqctl set_user_tags <username> administrator
sudo rabbitmqctl set_permissions -p / <username> '.*' '.*' '.*'

Configuring the cluster

So far we have two RabbitMQ servers up and running; it's time to connect them as a cluster. To do so, we need to go to one of the machines and join it to the cluster. The following commands will do that:


sudo rabbitmqctl stop_app
sudo rabbitmqctl join_cluster rabbit@MUbuntu1
sudo rabbitmqctl start_app
sudo rabbitmqctl set_cluster_name RabbitCluster

We can verify if the cluster is configured properly via RabbitMQ management portal:

Or from SSH terminal:

Queues within a RabbitMQ cluster are located on a single node by default. They need to be made mirrored across multiple nodes. Each mirrored queue consists of one master and one or more slaves, with the oldest slave being promoted to the new master if the old master disappears for any reason. Messages published to the queue are replicated to all slaves. Consumers are connected to the master regardless of which node they connect to, with slaves dropping messages that have been acknowledged at the master. Queue mirroring therefore enhances availability, but does not distribute load across nodes (all participating nodes each do all the work). This solution requires a RabbitMQ cluster, which means that it will not cope seamlessly with network partitions within the cluster and, for that reason, is not recommended for use across a WAN (though of course, clients can still connect from as near and as far as needed). Queues have mirroring enabled via policy. Policies can change at any time; it is valid to create a non-mirrored queue, and then make it mirrored at some later point (and vice versa). More on this are documented in RabbitMQ site. For this example, we will replicate all queues by executing this on SSH:


rabbitmqctl set_policy ha-all "" '{"ha-mode":"all","ha-sync-mode":"automatic"}'

That should be it. The cluster is now up and running; we can create a quick .NET console application to test it. I have created two console applications and a library that has one class as the message contract. The VS solution looks like this:

We will use EasyNetQ to connect to RabbitMQ, which we can add via NuGet to the publisher and subscriber projects.

In the contract project (class library), we have the following classes in a single code file:


namespace Contracts
{
public class RabbitClusterAzure
{
public const string ConnectionString =
@"host=;username=;password=";
}


public class Message
{
public string Body { get; set; }
}
}

The publisher project has the following code in program.cs


namespace Publisher
{
class Program
{
static void Main(string[] args)
{
using (var bus = RabbitHutch.CreateBus(RabbitClusterAzure.ConnectionString))
{
var input = "";
Console.WriteLine("Enter a message. 'Quit' to quit.");
while ((input = Console.ReadLine()) != "Quit")
{
Publish(bus, input);
}
}
}

private static void Publish(IBus bus, string input)
{
bus.Publish(new Contracts.Message
{
Body = input
});
}
}
}

Finally, the subscriber project has the following code in the program.cs


namespace Subscriber
{
class Program
{
static void Main(string[] args)
{
using (var bus = RabbitHutch.CreateBus(RabbitClusterAzure.ConnectionString))
{
var retValue = bus.Subscribe("Sample_Topic", HandleTextMessage);

Console.WriteLine("Listening for messages. Hit to quit.");
Console.ReadLine();
}
}

static void HandleTextMessage(Contracts.Message textMessage)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine("Got message: {0}", textMessage.Body);
Console.ResetColor();
}
}
}

Now we can run the publisher and multiple instances of the subscriber, and it will dispatch messages in round-robin fashion (direct exchange). We can also take one of the VMs down and it will not lose any messages.

We can also see the traffic to the VMs (and the cluster instance too) directly from the Azure portal.

Conclusion

I have to admit, I found it extremely easy and convenient to configure and run RabbitMQ clusters. The steps are simple and setting it up just works.