AWS and other interesting stuff


OpsWorks For DevOps Certification - Part 2 - Introduction

This is the second post of a two-part blog. In part 1, I had an initial look at OpsWorks by setting up a stack to run MediaWiki with a PHP layer and an RDS layer. In this post I’ll record my findings as I dive deep into the OpsWorks interface and concepts.

Components And Features Summary

  • DataBags (custom JSON)

    • can be set for
      • Stack
      • Layer
      • Deployment
      • App
  • Stack

    • Region Selection: OpsWorks is a Global Service; you can specify the region to use when you create a stack.
    • Windows and Linux cannot be mixed in a stack
    • Stack Defaults - Can be overwritten per-layer. They include:
      • Default Subnet
      • Default Operating System
      • Default SSH key
      • Default Root Device Type
      • Default Instance Profile (IAM Role)
    • Stack Custom JSON: will be available in the stack data bag
    • Regional API endpoint: reduces API latencies, improves instance response times, and limits impact from cross-region dependency failures.
    • Use Custom Cookbooks: allows you to supply a repository URL (Git, Subversion, S3 Archive, HTTP Archive)
      • Chef 11.10+ supports Berkshelf, a dependency manager. If this is enabled, your repository can have a Berksfile in it that defines the default source and other cookbooks. This overcomes a previous limitation where you could only specify one custom cookbook repository, i.e. you’d need to duplicate recipes into your repository to use them.
  • Layer

    • Interesting: Once you add a layer of a certain type you can’t add another of the same type e.g. PHP App
  • Instance

    • Type can be:
      • 24/7
      • Time-based
      • Remember to disable the schedule for an instance when you want to delete it otherwise the schedule will start it again :-)
      • Load-based
    • When you first add an instance it is in the stopped state and you need to manually start it
    • Instance is named -
    • You can’t edit an instance when it is running
  • App

    • Type
      • PHP, Ruby on Rails, Static, Java, AWS Flow (Ruby)
    • Data source type:
      • RDS
      • OpsWorks - e.g. MySQL Layer
      • None
    • Application source
      • Git, Subversion, HTTP Archive, S3 Archive
    • Environment Variables
    • Domain Name
    • Enable SSL - you need to provide the key and cert
  • Deployments

    • Deploy App
    • Run Command: This is where you run all commands from, even non-deployment related ones


  • Stacks are a logical container for related resources e.g. create 3 stacks: one each for development, staging and production
  • Stacks can be cloned


Instances must be associated with at least one layer. An example of an instance belonging to 2 layers would be: one layer installs Nginx and another installs php-fpm.

Instance States:

  • requested
  • booting
  • running_setup
  • online
  • stopping
  • stopped

  • Instances can have multiple Apps deployed.
    • When you deploy an app you can select specific instances or a whole layer.
    • Remember, the recipes for the layer will be run for the new app too e.g. database configuration etc. That may mean a new layer with the same set of instances is more appropriate.
  • Instances can belong to one or more layers.

There are multiple types of instances. Using them in combination is often the best option e.g. use 24/7 instances for a baseline of capacity, time-based for predicted spikes, and load-based for unpredicted spikes.

Time-based instances

You configure these just like you would a 24/7 instance, but once you’ve done that you can select a time schedule for when those instances are on, either every day of the week or on specific days.

The time is marked as dark green where it occurs every day and light green where it is a specific day.

Load-based instances

You can add load-based instances. The “Scaling configuration” option needs to be on:

You can scale up and down based on Layer averages:

  • CPU
  • Memory
  • Load
  • or, CloudWatch Alarms

You specify:

  • Batch size
  • Time that thresholds are exceeded or undershot
  • and, the time to ignore metrics after a scaling event
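As a mental model, the thresholds above combine roughly like this (an illustrative sketch only, not OpsWorks’ actual implementation; the parameter names mirror the console fields):

```ruby
# Illustrative sketch of the load-based scaling decision. NOT OpsWorks code:
# parameter names are stand-ins for the console fields described above.
def scaling_action(metric_avg, up_threshold:, down_threshold:,
                   seconds_breached:, breach_window:, in_cooldown:)
  return :none if in_cooldown                       # ignore metrics after a scaling event
  return :none if seconds_breached < breach_window  # threshold must be breached long enough
  return :scale_up   if metric_avg > up_threshold
  return :scale_down if metric_avg < down_threshold
  :none
end
```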

I caused a load on one of the servers using this command:

for i in $(seq 1 5); do dd if=/dev/urandom | bzip2 -9 >> /dev/null & done

The 3 load-based instances were started as expected:

I then reduced the load …

killall dd

… and the 3 instances stopped as expected.


You are only able to stop/start 24/7 instances

The Setup lifecycle hook is run on every instance boot

EC2 or On-premises

EC2 or on-premises instances need to be associated with a custom layer.


Elastic IPs

You can register an EIP with a stack …

… and then with a specific instance:

As expected, the EIP can be moved between instances without reboot.


You can register a volume with a stack …

… and then associate it with a stopped instance:

And it is mounted:

$ df -h /mnt/myvolume/
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdi      1014M   33M  982M   4% /mnt/myvolume

You have to stop the instance to disassociate it.


You can register an RDS instance with a stack …

… and then with an app

… but an app can only have one data source:

You can use an OpsWorks Data Source i.e. a MySQL layer configured by OpsWorks:

Registering RDS with the stack creates an RDS layer:

When you change an app’s data source you need to do a redeploy.


Instances are grouped by layer:

You can change the layer an instance is associated with when it is stopped:

You need to stop and delete all instances before you can delete a layer:


Configure Event

The Configure lifecycle event runs on all instances in a stack when:

  • an instance enters or leaves the online state
  • an EIP is associated with or disassociated from an instance
  • an ELB is attached to or detached from a layer

This is a good example …

# write out opsworks.php
template "#{deploy[:deploy_to]}/shared/config/opsworks.php" do
  cookbook 'php'
  source 'opsworks.php.erb'
  mode '0660'
  owner deploy[:user]
  group deploy[:group]
  variables(
    :database => deploy[:database],
    :memcached => deploy[:memcached],
    :layers => node[:opsworks][:layers],
    :stack_name => node[:opsworks][:stack][:name]
  )
  only_if do
    File.exists?("#{deploy[:deploy_to]}/shared/config")
  end
end
… that configures /srv/www/<stack-name>/shared/config/opsworks.php with sections depending on the layers setup e.g. RDS and Memcached

class OpsWorksDb {
  public $adapter, $database, $encoding, $host, $username, $password, $reconnect;

  public function __construct() {
    $this->type = 'mysql';
    $this->data_source_provider = 'rds';
    $this->adapter = '';
    $this->database = 'mediawiki';
    $this->encoding = 'utf8';
    $this->host = 'opsworksmediawiki.<REDACTED>';
    $this->username = 'mediawiki';
    $this->password = '<REDACTED>';
    $this->reconnect = 'true';
  }
}

class OpsWorksMemcached {
  public $host, $port;

  public function __construct() {
    $this->host = '';
    $this->port = '11211';
  }
}

My Arbitrary Example


The deploy hook supplies variables for use in recipes e.g.

:host =>     (deploy[:database][:host] rescue nil),
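The `rescue nil` idiom guards against keys that may be absent, e.g. when an app has no data source. A standalone sketch with sample data (the hash contents here are illustrative, not the real deploy attributes):

```ruby
# Standalone sketch of the `rescue nil` guard. The deploy hash is sample data
# standing in for the real deploy attributes supplied by OpsWorks.
deploy = { :database => { :host => 'db.example.com', :username => 'app' } }

host     = (deploy[:database][:host] rescue nil)   # the key exists
no_cache = (deploy[:memcached][:host] rescue nil)  # nil - :memcached is absent
```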


Layers

Layers are used to represent and configure components of a stack e.g. a layer for web servers, a layer for the database and a layer for the load balancer.

OpsWorks has built-in layers for common configurations that we can customise, or we can create our own layers.

General Settings

  • Shutdown timeout: the time the shutdown recipes are allowed to run for.
  • Auto healing is enabled by default. This restarts a layer’s instances in case of failure.
  • Custom JSON

Auto Healing

Instances perform ongoing heartbeat-style health checks with the OpsWorks orchestration engine. If the heartbeat fails, OpsWorks treats the instance as unhealthy and will perform an auto-heal.

To test this, I put the instance into single user mode …

telinit 1

… so that the Instance Status Check failed.

OpsWorks duly rebooted the instance as expected.

Exam Notes:

  • an auto-healed instance will return with the same OS as the previous instance, even if the OS was changed at a stack level.
  • auto-healing is only triggered by agent to orchestration communication failure; it won’t fix performance issues. i.e. full crash is required.
  • an EBS-backed instance may end up in a start_failed state, and manual intervention will be required.


The auto-healing process:

  • OpsWorks loses communication for 5+ minutes
  • EBS backed instance is stopped
    • online > stopping > stopped
    • Then started
    • requested > pending > booting > online
    • Or start_failed state which requires manual remediation. This is more likely on an EBS backed instance.
  • Instance store-backed instance is terminated
    • The instance is terminated, root volume deleted, a new instance is launched with the same hostname, config, and layer membership. Old instance’s EBS volumes are attached, new public and private IPs are assigned. Old instance’s EIPs are associated.
      • online > shutting_down
      • Then launch
      • requested > pending > booting > online
  • OpsWorks triggers a configure event on all instances


There are Built-in Chef Recipes and Custom Chef Recipes that can be set for these lifecycle events:

Lifecycle events:

  • Setup
  • Configure
  • Deploy
  • Undeploy
  • Shutdown

You can also add OS Packages


  • Elastic Load Balancer
    • Each layer can have an ELB associated with it
    • Any existing instances will be removed from the load balancer
    • Instance Shutdown Behavior
      • Wait for an instance’s connections to drain from the ELB before shutting down (you need to enable connection draining on the ELB)
      • Shut down an instance without waiting for connections to drain
  • Automatically Assign IP Addresses
    • Public IP address
    • Elastic IP address - assigned on first boot
    • If both are selected, a public IP is only assigned if an EIP can’t be

EBS Volumes

You can attach volumes - including RAID 0, 1 and 10 - to all instances in a layer.

When you delete an instance with EBS volumes attached you need to tick the checkbox to delete the volumes too.


You can set the security groups and the instance role (overriding the default for a stack)

Note: security groups are not applied to running instances, only new instances.

Layer Types


A predefined layer …

  • Load balancer - HAProxy
  • App Server - Static, Rails, PHP, Node.js

… or a custom one.

ECS Layer

  • Each stack can have one ECS cluster, and each ECS cluster can have one stack.
  • OpsWorks allows you to add 24/7, time-based and load-based instances to an ECS cluster. It does not manage containers.
    • OpsWorks sets up the instance by installing Docker and the Amazon ECS agent and registering it with the cluster.


You can associate and disassociate an RDS instance or cluster with a stack and it will be presented as an RDS layer.

A stack can have multiple RDS layers, but an application can only have 1 data source (RDS or OpsWorks (MySQL))


You can set a name, data source, application source, environment variables, domains and SSL.

Data Source

The app’s source repository can have a Berksfile defined.

Exam Notes:

Berkshelf overcomes a Chef limitation: before version 11.10 you could only specify one custom cookbook repository, which meant you had to duplicate community cookbooks into your own repository. Berkshelf is a dependency manager; its Berksfile lets you reference remote sources, dependencies and options.
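A minimal Berksfile sketch to make this concrete. The cookbook names and the Git URL here are illustrative, not from any real repository:

```ruby
# Berksfile - a minimal sketch; the cookbook names and URL are illustrative.
source 'https://supermarket.chef.io'    # default source for community cookbooks

cookbook 'memcached'                    # resolved from the source above
cookbook 'mycustomcookbook', path: './mycustomcookbook'  # local to this repo
cookbook 'another',
  git: 'https://github.com/example/another-cookbook.git' # remote Git source
```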

Environment Variables

Protected Environment Variables - users cannot see the variable’s value via the console, SDK or CLI; only the application can see the value. If the user has the appropriate permissions they can still change a protected variable’s value.



Application Version Consistency

  • With Git, use tags to explicitly define the approved source version instead of the master branch
  • Use S3 archives. You can use versioning on the bucket and tag the object with the release version.

Deployment Methods

Rollback Command - OpsWorks keeps the 5 most recent deployments (we can keep more ourselves if we use a versioned S3 bucket)
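The retention rule can be pictured as a pruned list (an illustrative sketch of the behaviour, not OpsWorks code; the version labels are made up):

```ruby
# Sketch of the rollback history: only the five most recent deploys are kept.
history = []
%w[v1 v2 v3 v4 v5 v6 v7].each do |release|
  history.unshift(release)              # newest deployment goes first
  history.pop while history.size > 5    # prune anything beyond five
end
# A rollback steps back through this list, so at most four rollbacks are
# possible from the current release.
```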

  • Manual Deploy
    • Use the “Deploy” command for Apps and the “Update Custom Cookbooks” command for cookbooks. Note, Deploy runs automatically on new instances.
    • Pros
      • Fast
    • Cons
      • Updates all instances at the same time
      • A problem with the new App version can cause downtime
  • Rolling Deployments
    • Pros
      • Can prevent downtime
      • Does not require doubling resource capacity
    • Cons
      • Failed deployments reduce capacity and require re-deployment to affected instances
    • Implementation
      • de-register instance from load balancer (enable connection draining to de-register after connections have drained)
      • if deploy succeeds re-register with load balancer, else rollback the instance
      • repeat until all instances have been updated
  • Use separate stacks (A/B and Blue/Green deployments)
    • One stack for each environment e.g. Dev, Staging, Prod
    • Once ready for production, we can switch traffic from the current production stack to the stack that has our latest version.
      • e.g. clone Dev to new Staging stack, then switch traffic from Prod to Staging
    • Pros
      • Prevents downtime
      • Failed deployments can be rolled back by swapping environments
      • A small subset of users are affected by failed deployments since we use weighted routing
    • Cons
      • Doubles capacity while both environments are running
      • Uses DNS changes, which are not always reliable (clients may cache records beyond the TTL)
    • Implementation
      • Make sure Green environment is ready for production (security groups etc)
      • Create a new load balancer for the Green environment, and pre-warm if necessary
      • Once the ELB and instances are healthy
        • Change Weighted Routing Polices in Route53 to gradually switch load from Blue to Green
        • Monitor your application and the Green environment and gradually increase traffic to it (this helps with pre-warming the ELB too)
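The rolling-deployment loop described above can be sketched as follows. `elb` and `deployer` are hypothetical stand-ins for the real ELB and OpsWorks deploy calls, not an actual API:

```ruby
# Sketch of a rolling deployment. The elb and deployer objects are
# hypothetical stand-ins for ELB registration and OpsWorks deploy calls.
def rolling_deploy(instances, batch_size, elb, deployer)
  failed = []
  instances.each_slice(batch_size) do |batch|
    batch.each do |instance|
      elb.deregister(instance)          # connection draining happens here
      if deployer.deploy(instance)
        elb.register(instance)          # healthy again: back behind the ELB
      else
        deployer.rollback(instance)     # restore the previous version
        failed << instance              # reduced capacity until re-deployed
      end
    end
  end
  failed                                # instances needing re-deployment
end
```

Failed instances are returned rather than retried, matching the con noted above: a failed rolling deployment leaves you at reduced capacity until you re-deploy to the affected instances.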

Managing Database Updates - We need to ensure:

  • Every transaction is recorded, and there are no race conditions between the old and new versions
  • The transition does not impact performance

Approach 1 - Both applications connect to the same database:

  • Pros
    • Prevents downtime
    • Prevents having to sync data
  • Cons
    • Both Apps have access to the database so we need to manage access and prevent data corruption
    • Changing the database schema can prevent the old app from working

Approach 2 - Use a clone of the database:

  • Pros
    • We don’t run into the cons of approach 1
  • Cons
    • We have to find a way to sync data
    • We have to sync data without performance issues or downtime


  • RDS instances can only be registered with 1 stack at a time
  • A stack can have multiple RDS instances registered with it
  • An RDS database does not have to be attached to a stack in order to use it with the application in that stack


Defining Stacks in CloudFormation makes it easy to create multiple environments

Pre-baked AMIs and Docker images:

We can bake custom AMIs and Docker images to speed up deployments and updates

Deploy App

  • Deploy
  • Undeploy
    • When you delete an app this is run on all instances.
    • For example, to get the vhost working again on each instance IP I had to manually undeploy the second app and deploy the first app again.
  • Rollback
  • Start Web Server
  • Stop Web Server
  • Restart Web Server


opsworks-cookbooks deploy recipes don’t have a php-rollback.rb recipe, so doing a rollback results in this error:

ERROR: could not find recipe php-rollback for cookbook deploy

In theory though, OpsWorks keeps 5 versions, so you can roll back 4 times.

Run Command

  • Update Dependencies - Upgrades OS packages
  • Install Dependencies - Installs selected OS packages
  • Update Custom Cookbooks
  • Execute Recipes
  • Setup
  • Configure
  • Upgrade Operating System - Allow Reboot (defaults to yes)


DataBags are stored in /var/lib/aws/opsworks/data/nodes

The DataBags can be accessed within the recipes:

app = search(:aws_opsworks_app).first
git app_path do
  repository app["app_source"]["url"]
  revision app["app_source"]["revision"]
end

instance = search("aws_opsworks_instance").first
Chef::Log.info("********** The instance's hostname is '#{instance['hostname']}' **********")
Chef::Log.info("********** The instance's ID is '#{instance['instance_id']}' **********")

search("aws_opsworks_instance").each do |instance|
  Chef::Log.info("********** The instance's hostname is '#{instance['hostname']}' **********")
  Chef::Log.info("********** The instance's ID is '#{instance['instance_id']}' **********")
end

Other DataBags include aws_opsworks_[…]:

  • app
  • command
  • elastic_load_balancer
  • rds
  • stack
  • user
  • etc

Custom JSON

Custom JSON will appear in the DataBag.

Custom JSON can be set for:

  • Stack
  • Layer
  • Deploy

For the stack I set:

{"CustomStack": "custom"}

For the layer I set:

{"CustomLayer": "custom"}

The variables set are available within Ruby in the node map, i.e. node[:CustomStack] and node[:CustomLayer]
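The way the three scopes combine can be sketched as a plain hash merge. I’m assuming the documented precedence here (deployment JSON overrides layer JSON, which overrides stack JSON); the keys mirror the examples above:

```ruby
# Sketch of how the custom JSON scopes combine into node attributes.
# Assumed precedence: deployment overrides layer, which overrides stack.
stack_json  = { 'CustomStack' => 'custom', 'Shared' => 'from-stack' }
layer_json  = { 'CustomLayer' => 'custom', 'Shared' => 'from-layer' }
deploy_json = { 'CustomCmd'   => 'Custom Run Command Message' }

node_attrs = stack_json.merge(layer_json).merge(deploy_json)
# node_attrs['Shared'] is 'from-layer': the layer value wins over the stack's.
```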

And the entries exist when you get the JSON using the opsworks-agent-cli:

opsworks-agent-cli get_json | jq 'with_entries(select(.key | startswith("Custom")))'
{
  "CustomStack": "custom",
  "CustomLayer": "custom"
}

FYI: jq with_entries is short for:

jq 'to_entries | map(select(.key | startswith("Custom"))) | from_entries'

i.e. convert object to array of key value pairs, filter, then convert back to an object

For example, using the arbitrary configure example above, and some Custom Chef JSON …

… the output file is:

# cat /srv/www/mediawiki/current/peers.txt
php-app: '', '', ''; memcached: ''
customStack: custom
customLayer: custom
customCmd: Custom Run Command Message


Users

You can import users into the OpsWorks region from IAM or from OpsWorks in another region.

Users can have an SSH key set.

Then, on a per-stack basis you can set options to SSH/RDP and sudo/admin and their permissions for specific stacks …

… which can be:

  • IAM policies only - based on the IAM policies
  • Show - combine IAM with show only permissions
  • Deploy - combine IAM with deploy only permissions
  • Manage - combine IAM with full stack control permissions
  • Deny

In this way, they’re similar to resource permissions e.g. Bucket Policy


Security Group Cleanup

The OpsWorks security groups have various dependencies on each other, so if you try to delete one that is referred to by another you get an error:

An error occurred (DependencyViolation) when calling the DeleteSecurityGroup operation: resource sg-xyzddd has a dependent object

To get around this, I ran the following multiple times. It deletes the security groups that it can, removing the dependent ones and allowing more to be deleted on each pass.

for sg in $(aws ec2 describe-security-groups | jq '.SecurityGroups[] | select(.GroupName | startswith("AWS-OpsWorks")) | .GroupId' | tr \" " "); do aws ec2 delete-security-group --group-id $sg; done