AWS and other interesting stuff

CloudWatch

In this post I’ll investigate various features of CloudWatch in preparation for the DevOps Professional exam.

Concepts

  • Metrics
    • A time-ordered set of data points
    • Only exist in the region in which they are created
    • They can’t be deleted
    • They expire after 14 days
    • Update 2016
      • 1 minute datapoints available for 15 days
      • 5 minute datapoints available for 63 days
      • 1 hour datapoints available for 455 days
    • Services can have multiple different metrics, and you can have metrics for applications, devices, or services outside of AWS
    • Each metric has data points that are organised by time and has unique identifiers:
      • Name
      • Namespace
      • One or more dimensions
    • Datapoints have a timestamp and an optional unit of measurement
    • There are 5 statistics - aggregations of our data over specific periods of time
      • Average
      • Minimum
      • Maximum
      • Sum
      • SampleCount
    • Periods - 1 minute, 5 minutes, 15 minutes, 1 hour, 6 hours, 1 day
    • When we create an alarm we specify the time period that we want to compare a threshold value to
  • Dimensions
    • A name / value pair that uniquely identifies a metric
    • e.g. InstanceId, ImageId, LoadBalancerName
  • Namespaces
    • A name used to isolate different application and service metrics
    • e.g. AWS/EBS, AWS/ELB, AWS/EC2
    • We can create custom namespaces for custom metrics
  • Logs
    • Log Event - an event has a timestamp and a raw message
    • Log Stream - a sequence of log events from the same source (e.g. the same application)
    • Log Group - a grouping of log streams that have the same properties, policies and access controls
    • Metric Filters - define which metrics to extract and send to CloudWatch
    • Retention Policies - logs don’t expire by default
    • Log Agent - the agent we can install on EC2 instances to automatically publish log events to CloudWatch
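
To make the five statistics concrete, here is a minimal sketch that calculates them locally for one period's worth of data points. The function name and the sample values are made up for illustration; CloudWatch does this aggregation for you.

```shell
# Compute the five CloudWatch statistics for the data points of one period.
# Reads one numeric value per line on stdin.
period_statistics() {
  awk '
    NR == 1 { min = $1; max = $1 }
    { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
    END {
      if (NR == 0) { print "no data"; exit }
      printf "SampleCount=%d Sum=%g Minimum=%g Maximum=%g Average=%g\n", NR, sum, min, max, sum / NR
    }'
}

# Hypothetical data points that all fall inside a single period
printf '%s\n' 2 9 4 5 | period_statistics
```

A large gap between Maximum and Average in the output is the kind of spike the "Average, Minimum and Maximum" comparison further down is looking for.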

CloudWatch Metrics

Elastic Load Balancing - Classic Load Balancer

For example, if the RequestCount increases and SurgeQueueLength does too, but CPU usage is constant then RequestCount is a better metric to use for scale up events.

  • BackendConnectionErrors
  • HealthyHostCount, UnHealthyHostCount
  • HTTPCode_Backend_XXX - a count of the 2XX, 3XX, 4XX or 5XX responses
  • HTTPCode_ELB_4XX
  • HTTPCode_ELB_5XX - no healthy instances or the request rate is more than the instances (or load balancer) can handle
  • Latency - the time from when the ELB sends the request to when the backend responds with headers
  • RequestCount
    • HTTP listener - the number of requests received and routed (including errors)
    • TCP listener - the number of requests made to the instances
  • SurgeQueueLength - up to 1024
  • SpilloverCount - rejected due to the surge queue being full

Statistics - aggregations of metrics over a specified time

  • Example uses:
    • Average, Minimum and Maximum
      • You can compare the maximum value to the average and if there are large regular spikes it indicates there may be scheduled tasks causing problems.
    • Sum
      • Useful for metrics like RequestCount, SpilloverCount, HTTPCode*
    • SampleCount
      • The number of samples measured in a time period

You can filter on load balancer dimensions:

  • AvailabilityZone
  • LoadBalancerName

Auto Scaling and EC2 Metrics

Exam Tip: the exam will try and trick you by referencing a metric that does not exist, or a metric from another service.

Auto Scaling Metrics:

  • GroupMinSize - The minimum size of the Auto Scaling group.
  • GroupMaxSize - The maximum size of the Auto Scaling group.
  • GroupDesiredCapacity - The number of instances that the Auto Scaling group attempts to maintain.
  • GroupInServiceInstances - The number of instances that are running as part of the Auto Scaling group. This metric does not include instances that are pending or terminating.
  • GroupPendingInstances - The number of instances that are pending. A pending instance is not yet in service. This metric does not include instances that are in service or terminating.
  • GroupStandbyInstances - The number of instances that are in a Standby state. Instances in this state are still running but are not actively in service.
  • GroupTerminatingInstances - The number of instances that are in the process of terminating. This metric does not include instances that are in service or pending.
  • GroupTotalInstances - The total number of instances in the Auto Scaling group. This metric identifies the number of instances that are in service, pending, and terminating.

EC2 Instances:

  • CPUUtilization - the percentage of allocated EC2 compute units that are currently in use
  • DiskReadOps - the completed read operations from all instance store volumes available to an instance
  • DiskWriteOps - the completed write operations to all instance store volumes available to an instance
  • DiskReadBytes - bytes read from all instance store volumes available to an instance
  • DiskWriteBytes - bytes written to all instance store volumes available to an instance
  • NetworkIn - bytes received on all network interfaces of a single instance
  • NetworkOut - bytes sent out on all network interfaces of a single instance
  • NetworkPacketsIn - number of packets received on all network interfaces of a single instance
  • NetworkPacketsOut - number of packets sent out on all network interfaces of a single instance
  • StatusCheckFailed_Instance
  • StatusCheckFailed_System
  • StatusCheckFailed - 1 if either of the above 2 checks failed, otherwise 0

Detailed monitoring gives you metrics with 1 minute periods instead of the default 5.

If you use detailed monitoring you get dimensions for metrics:

  • AutoScalingGroupName (available with basic monitoring)
  • ImageId
  • InstanceId
  • InstanceType

Using The Metrics

  • We can create CloudWatch alarms around the metrics and those alarms trigger when certain conditions are met
  • If the alarms are associated with scaling policies then those policies are carried out
  • Scaling policies are Auto Scaling properties that specify whether to scale a group up or down, and by how much.

Auto Scaling adjustment types:

  • ChangeInCapacity
  • PercentChangeInCapacity
  • ExactCapacity
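
As a rough sketch of how the three adjustment types differ, the function below works out the capacity a policy would ask for. The function name and argument order are my own, it assumes a current capacity of at least 1, and the rounding for PercentChangeInCapacity follows my reading of the Auto Scaling docs (fractions are dropped, but a non-zero percentage always moves the group by at least one instance).

```shell
# new_capacity ADJUSTMENT_TYPE ADJUSTMENT CURRENT MIN_SIZE MAX_SIZE
new_capacity() {
  adj_type=$1
  adjustment=$2
  current=$3
  min=$4
  max=$5
  case $adj_type in
    ChangeInCapacity) new=$((current + adjustment)) ;;
    ExactCapacity) new=$adjustment ;;
    PercentChangeInCapacity)
      # Shell division truncates toward zero, e.g. 2.5 -> 2 and -2.5 -> -2
      delta=$((current * adjustment / 100))
      # A non-zero percentage changes the capacity by at least one instance
      if [ "$delta" -eq 0 ] && [ "$adjustment" -gt 0 ]; then delta=1; fi
      if [ "$delta" -eq 0 ] && [ "$adjustment" -lt 0 ]; then delta=-1; fi
      new=$((current + delta)) ;;
    *) echo "unknown adjustment type: $adj_type" >&2; return 1 ;;
  esac
  # The group's minimum and maximum size bound the result
  if [ "$new" -lt "$min" ]; then new=$min; fi
  if [ "$new" -gt "$max" ]; then new=$max; fi
  echo "$new"
}

new_capacity ChangeInCapacity 1 1 1 3
new_capacity PercentChangeInCapacity -25 10 1 20
new_capacity ExactCapacity 5 2 1 3
```

The last call shows the clamp: asking for exactly 5 instances in a group with a maximum size of 3 gives 3.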

Examples:

  • SQS queue with worker nodes polling for jobs. The number of jobs fluctuates, and we need to dynamically change the number of instances.
    • Set an alarm on the ApproximateNumberOfMessagesVisible metric and scaling policies to change the number of instances based on this.
  • You have an Auto Scaling group of EC2 instances. The ASG is adding too many instances. What can you do?
    • Publish custom metrics on the time elapsed between launch and when the instances start responding
    • Adjust the Cooldown property to be greater than the value of that metric. Perform this adjustment periodically.
      • Remembering to adjust the PauseTime in the UpdatePolicy in CloudFormation too if you use it.

Monitoring

  • Basic Monitoring for Amazon EC2 instances: Seven pre-selected metrics at five-minute frequency and three status check metrics at one-minute frequency, for no additional charge.

  • Detailed Monitoring for Amazon EC2 instances: All metrics available to Basic Monitoring at one-minute frequency, for an additional charge. Instances with Detailed Monitoring enabled allow data aggregation by Amazon EC2 AMI ID and instance type.

If you use Auto Scaling or Elastic Load Balancing, Amazon CloudWatch will also provide Amazon EC2 instance metrics aggregated by Auto Scaling group and by Elastic Load Balancer, regardless of whether you have chosen Basic or Detailed Monitoring.

Amazon CloudWatch automatically monitors Elastic Load Balancers for metrics such as request count and latency; Amazon EBS volumes for metrics such as read/write latency; Amazon RDS DB instances for metrics such as freeable memory and available storage space; Amazon SQS queues for metrics such as number of messages sent and received; and Amazon SNS topics for metrics such as number of messages published and delivered. No additional software needs to be installed to monitor other AWS resources.

  • Auto Scaling groups: seven pre-selected metrics at one-minute frequency, optional and for no additional charge.
  • Elastic Load Balancers: thirteen pre-selected metrics at one-minute frequency, for no additional charge.
  • Amazon Route 53 health checks: One pre-selected metric at one-minute frequency, for no additional charge.

Custom metrics

Benefits for troubleshooting:

  • We don’t have to SSH into an instance to check logs
  • If an instance is terminated we still have access to the logs
  • We can create alarms and plug-in third-party tools for reporting and visualising

No Dimensions

Metrics with no dimensions are grouped together

aws cloudwatch put-metric-data --namespace my-awesome-metric --metric-name WidgetCount --value=2

With Dimensions

Metrics with the same dimensions are grouped together

i=0
while true; do
  for factory in North South East West; do
    aws cloudwatch put-metric-data --namespace my-awesome-metric \
      --dimensions Country="Fiji",Factory=$factory --metric-name WidgetCount \
      --value=$(echo $RANDOM | sed 's/\(^[0-9]\).*/\1/');
  done;
  i=$((i+1))
  echo "Loop count: $i"
  sleep 60;
done

The graph shows SUM at 5 minute intervals, so the line doesn’t always go up; if a subsequent interval has a lower SUM it’ll go down. e.g. to see the SUM for a day, change to a daily interval.
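
The same numbers can be pulled from the CLI instead of the graph. This is a small sketch using get-metric-statistics against the namespace and dimensions from the loop above; the wrapper function and the timestamps in the usage comment are placeholders of mine.

```shell
# widget_sum FACTORY START_TIME END_TIME [PERIOD_SECONDS]
# Prints the Sum of WidgetCount for one factory, one data point per period.
widget_sum() {
  aws cloudwatch get-metric-statistics \
    --namespace my-awesome-metric --metric-name WidgetCount \
    --dimensions Name=Country,Value=Fiji "Name=Factory,Value=$1" \
    --start-time "$2" --end-time "$3" \
    --period "${4:-300}" --statistics Sum
}

# Usage (placeholder timestamps), a daily SUM instead of the default 5 minutes:
#   widget_sum North 2016-12-14T00:00:00Z 2016-12-15T00:00:00Z 86400
```

Note that get-metric-statistics takes dimensions in the Name=...,Value=... form, unlike the key=value shorthand put-metric-data accepts.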

Statistics Set

You can upload statistics as a set by providing SampleCount, Sum, Minimum and Maximum values.

$ aws cloudwatch put-metric-data --namespace my-awesome-metric --dimensions Country=France,Factory=Central --metric-name WidgetCount --statistic-values SampleCount=10,Sum=50,Minimum=2,Maximum=9

$ aws cloudwatch put-metric-data --namespace my-awesome-metric --dimensions Country=France,Factory=Central --metric-name WidgetCount --statistic-values SampleCount=100,Sum=500,Minimum=1,Maximum=10

...

CloudWatch Alarms

Create an alarm:

$ aws cloudwatch put-metric-alarm --namespace my-awesome-metric \
  --dimensions '[{"Name":"Country","Value":"Canada"},{"Name":"Factory","Value":"North"}]' \
  --metric-name WidgetCount --alarm-name WidgetAlarm \
  --period 60 --evaluation-periods 1 --threshold 7 \
  --comparison-operator GreaterThanOrEqualToThreshold --statistic Maximum
  • statistic: the aggregation calculation to do on the metric
  • threshold: the value to compare the statistic with
  • comparison-operator: how to compare the threshold with the statistic
  • period: the length in seconds of each period the statistic is calculated over
  • evaluation-periods: number of periods the threshold needs to be breached for
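
To show how these options interact, here is a deliberately simplified model of the evaluation for the GreaterThanOrEqualToThreshold alarm above. It is only a sketch: the function is hypothetical, it handles whole numbers only, and the real service has more rules (for example around missing data) than this captures.

```shell
# alarm_state THRESHOLD EVALUATION_PERIODS [VALUE...]
# VALUE... are the statistic values for consecutive periods, oldest first.
alarm_state() {
  threshold=$1
  periods=$2
  shift 2
  # Not enough data points to cover the evaluation window
  if [ "$#" -lt "$periods" ]; then
    echo INSUFFICIENT_DATA
    return
  fi
  # Only the most recent $periods values are evaluated
  while [ "$#" -gt "$periods" ]; do
    shift
  done
  for value in "$@"; do
    # One period under the threshold is enough to keep the alarm OK
    if [ "$value" -lt "$threshold" ]; then
      echo OK
      return
    fi
  done
  echo ALARM
}

alarm_state 7 1          # no data published yet
alarm_state 7 1 0 0 8    # latest period breaches the threshold
alarm_state 7 1 0 8 0    # latest period is back under the threshold
```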

The alarm will have an INSUFFICIENT_DATA state as no data points have been published for the metric yet.

To get it to OK state I created 5 minutes of 0 history …

$ for i in $(seq 5 -1 1); do
  TIMESTAMP=$(docker run --rm ubuntu date -u -d "$i minutes ago" +%Y-%m-%dT%H:%M:%SZ);
  aws cloudwatch put-metric-data --namespace my-awesome-metric \
    --dimensions Country="Canada",Factory="North" \
    --metric-name WidgetCount --value 0 \
    --timestamp "$TIMESTAMP"
  echo $TIMESTAMP;
done

… then ongoing 0 values:

$ while true; do
  aws cloudwatch put-metric-data --namespace my-awesome-metric \
    --dimensions Country="Canada",Factory="North" \
    --metric-name WidgetCount --value 0
  sleep 60;
done

From another terminal, a value below the threshold leaves the alarm in the OK state:

$ aws cloudwatch put-metric-data --namespace my-awesome-metric \
  --dimensions Country="Canada",Factory="North" \
  --metric-name WidgetCount --value 5

Note: a line graph is misleading here, as the line drawn between data points suggests non-zero values at times when none were published. i.e. the data points are discrete points-in-time (1 min intervals) rather than continuous values.

Now trigger the alarm:

$ aws cloudwatch put-metric-data --namespace my-awesome-metric \
  --dimensions Country="Canada",Factory="North" \
  --metric-name WidgetCount --value 8

Note: The state goes back to OK for the next period as the value is zero. You’d configure another alarm to catch the recovery and trigger another action.

Auto Scaling Example

Trigger auto scaling group scaling on WidgetCount change.

Launch Configuration:

$ aws autoscaling create-launch-configuration \
  --launch-configuration-name widget-launch-configuration \
  --image-id ami-db704cb8 --key-name SHTestKey --instance-type t2.micro

Auto Scaling Group:

$ aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name widget-auto-scaling-group \
  --launch-configuration-name widget-launch-configuration \
  --min-size 1 --max-size 3 \
  --availability-zones ap-southeast-2a ap-southeast-2b ap-southeast-2c

There aren’t any actions for the alarm yet:

$ aws cloudwatch describe-alarms | jq '.MetricAlarms[] | select(.AlarmName=="WidgetAlarm") | with_entries(select(.key | endswith("Actions")))'
{
  "AlarmActions": [],
  "InsufficientDataActions": [],
  "OKActions": []
}
  • AlarmActions = actions for transition to ALARM state
  • InsufficientDataActions = actions for transition to INSUFFICIENT_DATA state
  • OKActions = actions for transition to OK state

Let’s add one:

Get the ASG ARN:

$ aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names widget-auto-scaling-group \
  | jq '.AutoScalingGroups[].AutoScalingGroupARN'
"arn:aws:autoscaling:ap-southeast-2:<REDACTED>:autoScalingGroup:f2085569-ddd2-444a-a2c8-9905841f32bd:autoScalingGroupName/widget-auto-scaling-group"

Create a simple scaling policy:

$ aws autoscaling put-scaling-policy --auto-scaling-group-name widget-auto-scaling-group --policy-name widget-scale-out-policy --adjustment-type ChangeInCapacity --scaling-adjustment 1 --cooldown 30
{
    "PolicyARN": "arn:aws:autoscaling:ap-southeast-2:<REDACTED>:scalingPolicy:17c0fa2b-65d8-4934-83b3-04cc6d56e16b:autoScalingGroupName/widget-auto-scaling-group:policyName/widget-scale-out-policy"
}
$ aws cloudwatch put-metric-alarm --namespace my-awesome-metric \
  --dimensions '[{"Name":"Country","Value":"Canada"},{"Name":"Factory","Value":"North"}]' \
  --metric-name WidgetCount --alarm-name WidgetAlarm \
  --period 60 --evaluation-periods 1 --threshold 7 \
  --comparison-operator GreaterThanOrEqualToThreshold --statistic Maximum \
  --alarm-actions "arn:aws:autoscaling:ap-southeast-2:<REDACTED>:scalingPolicy:17c0fa2b-65d8-4934-83b3-04cc6d56e16b:autoScalingGroupName/widget-auto-scaling-group:policyName/widget-scale-out-policy"

Note: the only change to the original alarm setting is the --alarm-actions option.

Create a scale in policy and new WidgetOK alarm:

$ aws autoscaling put-scaling-policy --auto-scaling-group-name widget-auto-scaling-group --policy-name widget-scale-in-policy --adjustment-type ChangeInCapacity --scaling-adjustment -1 --cooldown 30
$ aws cloudwatch put-metric-alarm --namespace my-awesome-metric \
  --dimensions '[{"Name":"Country","Value":"Canada"},{"Name":"Factory","Value":"North"}]' \
  --metric-name WidgetCount --alarm-name WidgetOK \
  --period 60 --evaluation-periods 3 --threshold 7 \
  --comparison-operator LessThanThreshold --statistic Maximum \
  --alarm-actions "arn:aws:autoscaling:ap-southeast-2:<REDACTED>:scalingPolicy:6154ea5b-c01a-47e7-a006-6dd9c694070c:autoScalingGroupName/widget-auto-scaling-group:policyName/widget-scale-in-policy"

The alarm is in state ALARM:

State changed to ALARM at 2016/12/14. Reason: Threshold Crossed: 3 datapoints were less than the threshold (7.0). The most recent datapoints: [0.0, 0.0].

$ aws cloudwatch put-metric-data --namespace my-awesome-metric   --dimensions Country="Canada",Factory="North"   --metric-name WidgetCount --value 9

This gives the following sequence of state changes:

  • WidgetOK is ALARM as the condition is true: WidgetCount < 7 for 3 minutes
  • WidgetOK is OK as the condition is no longer true
  • WidgetAlarm is ALARM as the condition is true: WidgetCount >= 7 for 1 minute
  • WidgetAlarm is OK as the condition is no longer true
  • WidgetOK is ALARM as the condition is true again: WidgetCount < 7 for 3 minutes

Notes:

  • You want scale in alarms to be in ALARM state i.e. invert the logic
    • I’d be inclined to rename these so the normal state would be:
      • WidgetLowUsage: ALARM
      • WidgetHighUsage: OK
  • This uses simple scaling. I’ll investigate stepped scaling when I review Auto Scaling in the future
    • e.g. --cooldown vs --estimated-instance-warmup + --metric-aggregation-type
  • Note: Adding an Auto Scaling action from the CloudWatch console didn’t work as the group dropdown didn’t have any values. The solution was to add the scaling actions via the EC2 console rather than CloudWatch’s.

CloudWatch Logs

Introduction

CloudWatch Logs allows you to:

  • Send your existing logs to CloudWatch
  • Create patterns to look for in your Logs
  • Alert based on the finding of those patterns

There is a free agent for Ubuntu, Amazon Linux and Windows.

You can:

  • Monitor logs from EC2 instances in realtime
  • Monitor AWS CloudTrail logged events
  • Archive logs to S3

Terminology

  • Log Events: a record sent to CloudWatch logs to be stored. Timestamp and message
  • Log Streams: this is a sequence of log events that share the same source. Streams are automatically deleted when the last piece of data in the stream is 2 months old.
  • Log Groups: groups of log streams that share the same retention, monitoring and access control settings. Log Streams have to belong to a group.
  • Metric Filters - define how a service extracts metric observations from events and turns them into data points for a CloudWatch metric. Metric Filters are applied to Log Groups and their Log Streams.
    • They only work on data after they’ve been created
    • They will only return the first 50 results
    • They’re made of:
      • Filter pattern
      • Metric Name, Metric Namespace, Metric Value
        • For example, to count 404s we could use a value of “1” for each 404 found
        • For example, we could extract the error message the application sent to the log
    • Once we have the metrics in CloudWatch we can:
      • Retrieve statistics
      • Stream log data into Amazon Elasticsearch in near real-time with CloudWatch Log Subscriptions
      • Stream log data into Amazon Kinesis for processing
      • Send the log data to AWS Lambda for custom processing
  • Retention Settings - applied to a Log Group and in turn their Log Streams.

e.g.

  • Count the number of 404 errors our webserver returns
  • Report how many jobs failed on an instance
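
Since logs don't expire by default, the retention setting from the terminology list is worth setting explicitly. A minimal sketch using put-retention-policy follows; the wrapper function and the 30 day value are just examples, and the service only accepts certain day counts, so check the allowed values.

```shell
# set_log_retention LOG_GROUP_NAME DAYS
# Applies a retention period to a Log Group (and so to all of its Log Streams).
set_log_retention() {
  aws logs put-retention-policy --log-group-name "$1" --retention-in-days "$2"
}

# Usage, keeping a month of events for an example group:
#   set_log_retention /var/log/secure 30
```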

Install

User Data:

#!/bin/bash
curl https://s3.amazonaws.com/aws-cloudwatch/downloads/latest/awslogs-agent-setup.py -O
chmod +x ./awslogs-agent-setup.py
./awslogs-agent-setup.py -n -r ap-southeast-2 -c s3://myawsbucket/my-config-file

my-config-file

[general]
state_file = /var/awslogs/state/agent-state

[/var/log/messages]
file = /var/log/messages
log_group_name = /var/log/messages
log_stream_name = {instance_id}
datetime_format = %b %d %H:%M:%S

OpsWorks:

Use a Chef recipe to install, configure and run the log agent

CloudFormation:

CloudFormation can install the agent, and create Log Groups and Alarms

Examples

EC2 Logs

Setup

Build EC2 CloudWatchLogs Role And An Instance Profile

Create a new role that an EC2 instance can assume:

$ cat > ec2-cloudwatch-logs-role-trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
$ aws iam create-role --role-name EC2-CloudWatch-Logs-Role --assume-role-policy-document file://ec2-cloudwatch-logs-role-trust-policy.json

Get the ARN for the managed policies …

$ aws iam list-policies \
  | jq '.Policies[] | select(.PolicyName | startswith("CloudWatchLogs")) | .Arn'
"arn:aws:iam::aws:policy/CloudWatchLogsReadOnlyAccess"
"arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"

… and attach it to the role …

$ aws iam attach-role-policy --role-name EC2-CloudWatch-Logs-Role --policy-arn "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"

… create an instance profile …

$ aws iam create-instance-profile --instance-profile-name EC2-CloudWatch-Logs-Instance-Profile

… and add the role to it:

$ aws iam add-role-to-instance-profile --instance-profile-name EC2-CloudWatch-Logs-Instance-Profile --role-name EC2-CloudWatch-Logs-Role

EC2 Instance Setup

For the EC2 instance, I already have a security group I want to use:

$ aws ec2 describe-security-groups --group-names SSHFromMyIp | jq '.SecurityGroups[].GroupId'
"sg-3d76975a"

Set up the CloudWatch Logs agent on the EC2 instance using user data:

cat > ec2-cloudwatch-agent-user-data.txt <<EOF
#!/bin/bash

yum update -y
yum install -y awslogs

echo "[plugins]
cwlogs = cwlogs
[default]
region = ap-southeast-2" > /etc/awslogs/awscli.conf

echo "[/var/log/secure]
datetime_format = %b %d %H:%M:%S
file = /var/log/secure
buffer_duration = 5000
log_stream_name = {instance_id}
initial_position = start_of_file
log_group_name = /var/log/secure " >> /etc/awslogs/awslogs.conf

chkconfig awslogs on
service awslogs start

EOF

Launch an instance:

$ aws ec2 run-instances --image-id ami-db704cb8 --key-name SHTestKey \
  --instance-type t2.micro --security-group-ids sg-3d76975a \
  --iam-instance-profile Name=EC2-CloudWatch-Logs-Instance-Profile \
  --user-data file://ec2-cloudwatch-agent-user-data.txt

Metric Filter - Invalid Login

I set up another instance with the same configuration.

Metric Filters apply to a Log Group …

$ aws logs put-metric-filter --log-group-name /var/log/secure \
  --filter-name invalid-login --filter-pattern "ssh Invalid user" \
  --metric-transformations metricName=InvalidLogin,metricNamespace=LogMetrics,metricValue=1

… so that means even though there are 2 Log Streams (for the 2 instances), the metric is for both combined as they belong to the same /var/log/secure group.
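
As I understand it, a filter pattern made of several plain terms like "ssh Invalid user" only matches events that contain every term (case sensitive). The sketch below is a local approximation of that rule for trying out sample lines; the function is hypothetical and treats each term as a simple substring, which is an assumption rather than the service's exact matching rules.

```shell
# matches_all_terms MESSAGE TERM...
# Succeeds only when every TERM appears somewhere in MESSAGE.
matches_all_terms() {
  message=$1
  shift
  for term in "$@"; do
    case $message in
      *"$term"*) ;;     # term found, check the next one
      *) return 1 ;;    # a missing term means no match
    esac
  done
  return 0
}

if matches_all_terms "sshd[8487]: Invalid user Morison from 202.180.123.151" ssh Invalid user; then
  echo "counted as InvalidLogin"
fi
```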

I created a random word docker image to generate usernames.

Dockerfile:

FROM ubuntu
RUN apt-get update && apt-get install -y wamerican
ENTRYPOINT shuf -n 1 /usr/share/dict/american-english

Then did 100 invalid logins, selecting an instance randomly:

i=0
SERVERS[0]="13.55.21.61"
SERVERS[1]="13.55.16.151"

while [[ $i -lt 100 ]]; do
  RAND=$(( RANDOM % 2 ))
  SERVER=${SERVERS[$RAND]}

  USER=$(docker run stevehogg/random-word | sed "s/'//");
  echo "User: "$USER;
  ssh -i ~/Downloads/SHTestKey.pem $USER@$SERVER

  i=$((i+1));

  SLEEP=$(echo $RANDOM | sed 's/\(^[0-9]\).*/\1/');
  echo "Sleeping for: "$SLEEP
  sleep $SLEEP;
done

Then the InvalidLogin metric, 1 minute period, SUM statistic graph looks like this:

Important: The Minimum, Maximum and Average aggregate values would all be 1 as each metric value is always 1. i.e. The aggregations apply to each event in an interval.

Metric Filters That Extract Fields - Web Log

Using the example to extract bytes transferred from a web server log file.

Create a Log Group:

$ aws logs create-log-group --log-group-name MyApp/access.log

Create a Log Stream:

$ aws logs create-log-stream --log-group-name MyApp/access.log --log-stream-name A-New-Stream

Create a filter on the Log Group …

$ aws logs put-metric-filter \
  --log-group-name MyApp/access.log \
  --filter-name BytesTransferred \
  --filter-pattern '[ip, id, user, timestamp, request, status_code=4*, size]' \
  --metric-transformations \
  metricName=BytesTransferred,metricNamespace=MyNamespace,metricValue=\$size

… note how the filter-pattern breaks the line into fields, only matching 4xx errors (status_code=4* matches any status code starting with 4).
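
To see what that filter would extract before sending anything to CloudWatch, here is a rough local equivalent using sed. It is a sketch only: the function name and the sample log lines are made up, and it assumes the same seven space-delimited fields as the filter pattern.

```shell
# Reads access log lines on stdin and prints the size field of every line
# whose status code starts with 4 (the status_code=4* part of the filter).
sizes_of_4xx() {
  sed -n 's/^[^ ]* [^ ]* [^ ]* \[[^]]*] "[^"]*" 4[0-9]* \([0-9][0-9]*\)$/\1/p'
}

# Hypothetical log lines: only the 404 and the 403 contribute to the metric
sizes_of_4xx <<'LOG' | awk '{ total += $1 } END { print "BytesTransferred Sum:", total + 0 }'
127.0.0.1 - - [14/12/2016:10:00:00 +1200] "GET /index.html HTTP/1.1" 404 512
127.0.0.1 - - [14/12/2016:10:00:01 +1200] "GET /index.html HTTP/1.1" 200 123
127.0.0.1 - - [14/12/2016:10:00:02 +1200] "GET /missing HTTP/1.1" 403 88
LOG
```

The bracketed timestamp and the quoted request each count as a single field, which is why the pattern needs only seven names even though the line contains more spaces than that.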

Create a log line script with random status, timestamp and size …

cat > create-log-line.sh <<EOF
#!/bin/bash
STATUSES=(200 303 404)
STATUS_INDEX=\$((RANDOM % \${#STATUSES[@]}))
STATUS=\${STATUSES[\$STATUS_INDEX]}
TS=\$(date +%s)
DATE=\$(date +%d/%m/%Y:%H:%M:%S)
LINE='127.0.0.1 - - ['\$DATE' +1200] \\"GET /index.html HTTP/1.1\\" '\$STATUS' '\${RANDOM:0:3}
echo "{\"timestamp\": \${TS}000, \"message\": \"\$LINE\"}"
EOF
chmod +x create-log-line.sh

… and use it to PUT a fake log entry:

NEXT_TOKEN=SET-THE-TOKEN-HERE

while true; do
  echo \[ > fake-apache.log && \
    ./create-log-line.sh >> fake-apache.log && \
    echo \] >> fake-apache.log
  NEXT_TOKEN=$(aws logs put-log-events --log-group-name MyApp/access.log --log-stream-name A-New-Stream --log-events file://fake-apache.log --sequence-token "$NEXT_TOKEN" | jq -r '.nextSequenceToken');
  echo $NEXT_TOKEN
  sleep ${RANDOM:0:1}
done

That gives a metric with spiky data samples:

… and 1 minute SUMs.

As expected, changing to 15 minute SUMs smooths the graph:

Metric Filters That Extract Fields - JSON

The documentation uses this example:

{
  "eventType": "UpdateTrail",
  "sourceIPAddress": "111.111.111.111",
  "arrayKey": [
        "value",
        "another value"
  ],
  "objectList": [
       {
         "name": "a",
         "id": 1
       },
       {
         "name": "b",
         "id": 2
       }
  ],
  "SomeObject": null,
  "ThisFlag": true
}

You can then filter like so…

{ $.eventType = "UpdateTrail" }

… or …

{ $.sourceIPAddress != 123.123.* }

… or:

{ $.arrayKey[0] = "value" }

etc…
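
A quick way to sanity check this kind of condition against a sample event is jq, which is already used throughout this post. The jq expressions below are only rough equivalents of the three filter patterns, not the CloudWatch syntax itself, and event.json is a hypothetical file holding the example document.

```shell
# event_matches JQ_CONDITION
# Reads one JSON event on stdin and succeeds when the condition is true for it.
event_matches() {
  jq -e "$1" > /dev/null
}

# Rough jq equivalents of the three filter patterns above:
#   event_matches '.eventType == "UpdateTrail"' < event.json
#   event_matches '.sourceIPAddress | startswith("123.123.") | not' < event.json
#   event_matches '.arrayKey[0] == "value"' < event.json
```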

Real-Time Processing Of Log Data With Subscriptions

You can do Real-Time Processing of log data by creating subscriptions. The subscriptions can be to Kinesis Streams, Lambda and Elasticsearch. Subscriptions can also be cross-account.

Note: You can only have one subscription filter per log group

Create a stream:

$ aws kinesis create-stream --stream-name WebLogs --shard-count 1

Create a new role for CloudWatch to use to get data into Kinesis. First the trust policy …

$ cat > CloudWatchLogsTrustPolicy.json <<EOF
{
  "Statement": {
    "Effect": "Allow",
    "Principal": { "Service": "logs.ap-southeast-2.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }
}
EOF

… then the role:

$ aws iam create-role --role-name CloudWatch-Logs-To-Kinesis-Role --assume-role-policy-document file://CloudWatchLogsTrustPolicy.json

Get the ARNs for the Kinesis Stream and the role:

$ KINESIS_ARN=$(aws kinesis describe-stream --stream-name WebLogs | jq '.StreamDescription.StreamARN' | sed 's/"//g')
$ ROLE_ARN=$(aws iam list-roles | jq '.Roles[] | select(.RoleName=="CloudWatch-Logs-To-Kinesis-Role") | .Arn' | sed 's/"//g')

Define the permissions for the role:

$ cat > Permissions.json <<EOF
{
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "kinesis:PutRecord",
      "Resource": "$KINESIS_ARN"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "$ROLE_ARN"
    }
  ]
}
EOF

Note: The iam:PassRole permission allows CloudWatch Logs to pass that role on to another service. I’m not sure why it’s needed in this context, but it is a requirement. If the PassRole permission is set for an IAM user, for example, that user is able to “pass” the defined role (or any role matching a wildcard) to EC2 for use in an instance profile. i.e. it is a security mechanism that restricts the roles a user can assign, to prevent privilege escalation.

Attach the permissions to the role:

$ aws iam put-role-policy --role-name CloudWatch-Logs-To-Kinesis-Role --policy-name CloudWatch-Logs-To-Kinesis-Permissions-Policy --policy-document file://Permissions.json

Create a subscription filter:

$ aws logs put-subscription-filter \
  --log-group-name /var/log/secure \
  --filter-name invalid-login-subscription \
  --filter-pattern "ssh Invalid user" \
  --destination-arn "$KINESIS_ARN" \
  --role-arn "$ROLE_ARN"

The Kinesis Stream name “WebLogs” is shown in the console:

Generate some login failures:

i=0
SERVERS[0]="13.54.169.68"
SERVERS[1]="13.55.33.5"

while [[ $i -lt 30 ]]; do
  RAND=$(( RANDOM % 2 ))
  SERVER=${SERVERS[$RAND]}

  USER=$(docker run stevehogg/random-word | sed "s/'//");
  echo "User: "$USER;
  ssh -i ~/Downloads/SHTestKey.pem $USER@$SERVER

  i=$((i+1));

  SLEEP=$(echo $RANDOM | sed 's/\(^[0-9]\).*/\1/');
  echo "Sleeping for: "$SLEEP
  sleep $SLEEP;
done

Get the first iterator for the shard …

$ SHARD_ITERATOR=$(aws kinesis get-shard-iterator --stream-name WebLogs --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON | jq '.ShardIterator' | sed 's/"//g')

… then loop through the records:

$ while true; do
  LINE=( $(aws kinesis get-records --limit 1 --shard-iterator $SHARD_ITERATOR | jq '.Records[].Data+" "+.NextShardIterator' | sed 's/"//g') )
  if [[ "${LINE[0]}" == "" ]] || [[ "${LINE[1]}" == "" ]]; then
    break
  fi
  echo ${LINE[0]} | base64 -D | zcat | jq '.logEvents'  # Data
  SHARD_ITERATOR=${LINE[1]}                 # NextShardIterator
done
[
  {
    "id": "",
    "timestamp": 1481751148411,
    "message": "CWL CONTROL MESSAGE: Checking health of destination Kinesis stream."
  }
]
[
  {
    "id": "33044178415667403998847152428148375368631703406701903873",
    "timestamp": 1481752207000,
    "message": "Dec 14 21:50:07 ip-172-31-1-174 sshd[8487]: Invalid user Morison from 202.180.123.151"
  }
]
[
  {
    "id": "33044178504870384792969645000411188046733309766248300545",
    "timestamp": 1481752211000,
    "message": "Dec 14 21:50:11 ip-172-31-1-174 sshd[8489]: Invalid user audibilitys from 202.180.123.151"
  }
]
... etc
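
The decode step in that loop is easy to miss: the Data field of each record is base64 encoded, gzip compressed JSON. Here is a small sketch of just that step as a round trip, with a made-up payload standing in for a real record.

```shell
# decode_record reads the base64 Data field of a Kinesis record on stdin
# and prints the JSON that CloudWatch Logs put into it.
decode_record() {
  base64 -d | gunzip -c   # older macOS versions of base64 need -D, as used above
}

# Round trip with a made-up payload: compress and encode it the way a record
# arrives, then decode it again
printf '%s' '{"logEvents":[]}' | gzip -c | base64 | decode_record
```

Using gunzip -c rather than zcat is a portability choice on my part; both read the compressed stream from stdin on Linux.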

Creating more login failures and running the iterator loop again (without the initial SHARD_ITERATOR setting) shows the new logs:

[
  {
    "id": "33044216326934241500906495094622893769935692875572051969",
    "timestamp": 1481753907000,
    "message": "Dec 14 22:18:27 ip-172-31-1-174 sshd[9097]: Invalid user cadaverous from 202.180.123.151"
  }
]

awslabs/cloudwatch-logs-subscription-consumer

After removing the subscription filter and the Kinesis Stream above, I set up awslabs/cloudwatch-logs-subscription-consumer. It creates an Elasticsearch cluster and Kibana dashboard:

Exporting Logs To S3

You can create an S3 bucket to export logs to. The bucket needs a policy like this:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Principal": {
				"Service": "logs.ap-southeast-2.amazonaws.com"
			},
			"Action": "s3:GetBucketAcl",
			"Resource": "arn:aws:s3:::h4-tmp"
		},
		{
			"Effect": "Allow",
			"Principal": {
				"Service": "logs.ap-southeast-2.amazonaws.com"
			},
			"Action": "s3:PutObject",
			"Resource": "arn:aws:s3:::h4-tmp/*",
			"Condition": {
				"StringEquals": {
					"s3:x-amz-acl": "bucket-owner-full-control"
				}
			}
		}
	]
}

Then you can export for a desired time range:

$ aws logs create-export-task --log-group-name /var/log/secure --from $(date -v-24H +"%s000") --to $(date +"%s000") --destination h4-tmp --destination-prefix secure

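Note: the --from and --to values are in milliseconds since the epoch, hence padding the seconds from $(date) with 000. The -v flag is the macOS (BSD) date syntax.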
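
If you want the same time range calculation without relying on date -v, plain shell arithmetic on epoch seconds does the job. A minimal sketch, with a hypothetical helper name:

```shell
# epoch_ms_hours_ago HOURS [NOW_SECONDS]
# Prints the time HOURS ago as milliseconds since the epoch.
# NOW_SECONDS defaults to the current time and is mainly there for testing.
epoch_ms_hours_ago() {
  now=${2:-$(date +%s)}
  echo "$(( (now - $1 * 3600) * 1000 ))"
}

# Usage with the export command above:
#   --from "$(epoch_ms_hours_ago 24)" --to "$(epoch_ms_hours_ago 0)"
epoch_ms_hours_ago 24
```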

Exporting to S3 is also a good way to share logs with people that don’t have access to your CloudWatch.

The awslogs Python script is even better! https://github.com/jorgebastida/awslogs

$ awslogs get /var/log/secure --start '1d ago'

CloudTrail Logs

A CloudTrail Trail can be configured to send logs to CloudWatch Logs:

It requires a role to be set that allows logs:CreateLogStream and logs:PutLogEvents for the log stream e.g.

arn:aws:logs:ap-southeast-2:<REDACTED>:log-group:CloudTrail/DefaultLogGroup:log-stream:<REDACTED>_CloudTrail_ap-southeast-2*

When you configure this, you are prompted to use a CloudFormation template to configure alarms for security events:

AWS CloudTrail API Activity Alarm Template for CloudWatch Logs

For example, changing a Network ACL results in a metric being recorded …

… an alarm being triggered:

You can see the logs that triggered the alarm:

$ awslogs get CloudTrail/DefaultLogGroup --start '1d ago' | \
    grep -i acl-3e19d95a | cut -d \{ -f 2- | awk '{print "{"$0}' | \
    jq '. | {eventName, requestParameters}'
{
  "eventName": "DeleteNetworkAclEntry",
  "requestParameters": {
    "networkAclId": "acl-3e19d95a",
    "ruleNumber": 900,
    "egress": false
  }
}
{
  "eventName": "CreateNetworkAclEntry",
  "requestParameters": {
    "networkAclId": "acl-3e19d95a",
    "ruleNumber": 100,
    "egress": false,
    "ruleAction": "allow",
    "icmpTypeCode": {},
    "portRange": {
      "from": -1,
      "to": -1
    },
    "aclProtocol": "-1",
    "cidrBlock": "0.0.0.0/0"
  }
}
{
  "eventName": "CreateNetworkAclEntry",
  "requestParameters": {
    "networkAclId": "acl-3e19d95a",
    "ruleNumber": 100,
    "egress": true,
    "ruleAction": "allow",
    "icmpTypeCode": {},
    "portRange": {
      "from": -1,
      "to": -1
    },
    "aclProtocol": "-1",
    "cidrBlock": "0.0.0.0/0"
  }
}
{
  "eventName": "DeleteNetworkAclEntry",
  "requestParameters": {
    "networkAclId": "acl-3e19d95a",
    "ruleNumber": 900,
    "egress": true
  }
}

CloudWatch Events

Introduction

Similar to CloudTrail, but FASTER. AWS refer to it as the Central Nervous System of AWS. It is a near real-time stream of events that can be routed to Lambda functions, Kinesis streams, SNS topics and other built-in targets (Snapshot EBS Volume, Stop, Start, Terminate an EC2 instance).

Terminology

  • Events - created in 3 ways
    • State change e.g. EC2 goes from pending to running
    • API call (delivered via CloudTrail)
    • Your own code - your application generates an event which you publish
  • Rules
    • Match incoming events and route them to one or more targets for processing
    • They’re not ordered
    • Rules can customise JSON and elect to only pass certain keys, or a literal string
  • Targets
    • Lambda function
    • Kinesis streams
    • SNS topics
    • SQS queue
    • Built-in (Snapshot EBS Volume, Stop, Start, Terminate an EC2 instance)

Event rule options:

On the next page you specify a unique name.

Example

Then, when I create a cloudwatch-events-s3-bucket I get an email:

cat email.json | jq '{"detail": .["detail-type"], "eventName": .detail.eventName, "bucketName": .detail.requestParameters.bucketName}'
{
  "detail": "AWS API Call via CloudTrail",
  "eventName": "CreateBucket",
  "bucketName": "cloudwatch-events-s3-bucket"
}
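
For reference, a rule like the one behind this example could also be created from the CLI. This is a sketch under assumptions: the function, rule name and topic ARN are placeholders, the event pattern is my guess at what matches the CreateBucket event above, and the SNS topic's policy would also need to allow CloudWatch Events to publish to it.

```shell
# create_bucket_rule RULE_NAME SNS_TOPIC_ARN
# Creates a rule matching S3 CreateBucket API calls and sends them to an SNS topic.
create_bucket_rule() {
  aws events put-rule --name "$1" --event-pattern '{
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
      "eventSource": ["s3.amazonaws.com"],
      "eventName": ["CreateBucket"]
    }
  }' &&
  aws events put-targets --rule "$1" --targets "Id=1,Arn=$2"
}

# Usage (placeholder names):
#   create_bucket_rule s3-create-bucket arn:aws:sns:ap-southeast-2:<ACCOUNT>:my-topic
```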