AWS and other interesting stuff

DevOps And Responsibility

· by Steve Hogg · Read in about 8 min · (1550 Words)
AWS DevOps Serverless Concepts

Responsibility - What Does It Mean?

The words responsible and fault are often used interchangeably, for example:

Who is responsible for this mess?

Whose fault is this?

While fault is used to attribute blame, the definition I like for responsible is the ability to respond to a given situation.

It is worth being accurate with our language, as it allows us to take a more powerful position. When asking the question:

Who is responsible?

I hear:

Who is able to respond?

In a team, I may not be at fault for a situation, but I am responsible.

For example:

  • If my child is misbehaving at school, my child is responsible, as am I, as are his teachers.
  • If someone falls over in front of me, I’m responsible for helping.

Once we clarify that responsibility is not about fault-finding, those responsible can take aligned action.

DevOps And Responsibility

How does responsibility fit in with a DevOps culture?

In the bad old days, developers were responsible for writing the software, not for running it. Operations were responsible for running the software, not for writing it. This was inefficient and frustrating: a problem that would have been simple to fix in development required complicated workarounds in operations.

We learned that developers “throwing software over the fence” to operations does not lead to a quality product and service.

The goal of DevOps is to break down the traditional development and operations team silos; to create a cohesive team that is responsible for a common purpose: reliable software running in production.

DevSecOps expands this approach to include the security team. The thinking is that security should never be an afterthought: it should be baked in during development and remain a focus in operations.

A core tenet for DevSecOps is empathy within a cross-functional team, so why not expand the team across other business units?

Mark Schwartz talks about DevSecFinBizOps, where the Biz part indicates that all units should be incentivised to achieve, and responsible for, business objectives. At each intersection between functions in the team, there are common goals. For example, with FinOps, operations and finance are jointly responsible for setting and achieving IT budgets.

People who can work effectively in this sort of environment are described as being T-shaped: they have broad knowledge across a team and deep knowledge within one area.

DevOps Tools For Success

Cross-functional teams need the right conditions for success, for example working in the same office, socialising together, and sharing a clear mission and values. They also need the right tools to facilitate working together.

With the invention of Infrastructure-as-Code, Operations has learned from successful Development practice. Infrastructure-as-Code is a declarative way of describing and updating infrastructure. Now that infrastructure is code, we can take advantage of other proven tools in the developers' toolkit, like Git version control. Infrastructure can now be updated with change approval, versioning, and a complete history, just like any other code. Infrastructure-as-Code has been followed by Network-as-Code, Security-as-Code, and so on.

Now that we have Everything-as-Code, it makes sense to continue to learn from and apply other DevOps successes. With Continuous Integration (CI), changes are automatically tested for quality. Developers use Unit Tests for their software, to catch problems before software reaches production. All as-Code functionality should benefit from the same automated testing. I have included a practical example of this at the end of this blog post.

Lessons From React - Separation Of Concerns and Separation Of Duties

Separation Of Concerns is a computer science concept, where you organise software into modules that are focussed on one thing. It allows humans to more easily reason about software, as they only need to focus on one bit of functionality at a time.

We used to think that keeping HTML, JS, and CSS in different files was Separation Of Concerns. In reality, this was just separation of technology; a single bit of functionality was split across 3 file types, meaning you have to jump between files when making a single edit.

React combined HTML, JS, and CSS together for a single component, making it a single place to go to make an edit. React makes the concern the component, not the technology used.

In a similar vein, for DevOps, a microservice may be thought of as a combination of development, operations, security, finance, and business objectives. While these could all be defined separately, I think it makes sense to have them defined together, in a single place to coordinate changes. Using Infrastructure-as-Code you can define Operational resources like monitoring, Security resources like access policies, and Finance resources like billing tags. User Stories that achieve Business objectives can be included as BDD (Behaviour Driven Development) tests.

With this single coordination point, there is still a Separation of Duties: the T-shaped people from before have broad knowledge and empathy across the functional teams, but their duty/responsibility is to the function they have deep knowledge of.

A Practical Example

Operations

This is an AWS CloudFormation template with a single resource.

AWSTemplateFormatVersion: '2010-09-09'
Description: DevOps Example 1

Resources:
  JobInitQueue:
    Type: "AWS::SQS::Queue"
    Properties:
      QueueName: !Sub "${Env}InitQueue"
      VisibilityTimeout: 30

For each queue, it is best practice to have a secondary Dead Letter Queue (DLQ). The DLQ is where messages are moved if they repeatedly fail to be processed from the primary queue. The idea is that if there is something wrong with a message, it should be taken out of the primary queue so that it does not hold up other messages. For Amazon SQS, the setting that controls this is the RedrivePolicy.

Operations decide to enforce this rule. They lint the template using cfn-python-lint and a custom rule. The rule runs in the CI pipeline, and errors if the RedrivePolicy setting does not exist:

➜ cfn-lint -a cfn-rules/operations -t 1.template.yaml
E9002 Missing RedrivePolicy Property for Resources/JobInitQueue/Properties
1.template.yaml:16:5

The default rules also pick up a parameter that is not used:

W2001 Parameter ServiceName not used.
1.template.yaml:5:3

While the pipeline will enforce this rule, they also document how to add cfn-lint support to your favorite editor and announce the change to the team. Using in-editor linting gives the team more timely feedback when rules fail.

The template is modified to add a DLQ. The custom rule introduces a team convention that queues with a DLQ suffix in their name do not themselves need to have a DLQ.

AWSTemplateFormatVersion: '2010-09-09'
Transform: 'AWS::Serverless-2016-10-31'
Description: DevOps Example 2

Parameters:
  ServiceName:
    Type: String
    Default: devops-example
    Description: The service name.
  Env:
    Type: String
    Description: The environment name. e.g. dev, prod.

Globals:
  Function:
    Runtime: nodejs8.10
    Timeout: 30
    MemorySize: 128
    Environment:
      Variables:
        SERVICE_NAME: !Ref ServiceName
        ENV: !Ref Env
        REGION: !Ref AWS::Region

Resources:
  JobInitQueue:
    Type: "AWS::SQS::Queue"
    Properties:
      QueueName: !Sub "${Env}InitQueue"
      VisibilityTimeout: 30
      RedrivePolicy:
        deadLetterTargetArn:
          Fn::GetAtt: [ JobInitDLQ, Arn ]
        maxReceiveCount: 5

  JobInitDLQ:
    Type: "AWS::SQS::Queue"
    Properties:
      QueueName: !Sub "${Env}InitDLQ"
      VisibilityTimeout: 30
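
The custom rule itself is not shown in this post. As a rough sketch of the logic only, the check could look like the plain Python function below, which works on a template that has already been parsed into a dictionary. This is not the cfn-lint rule API: the function name `find_queues_missing_dlq`, the sample dictionaries, and the choice to test the DLQ suffix on the logical ID (rather than on QueueName) are all assumptions made for illustration.

```python
def find_queues_missing_dlq(template):
    """Return logical IDs of SQS queues that have no RedrivePolicy.

    Assumption: the "DLQ suffix" convention is checked against the
    logical ID of the resource, so a queue called JobInitDLQ is exempt.
    """
    missing = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::SQS::Queue":
            continue
        if logical_id.endswith("DLQ"):
            # Dead letter queues do not need their own DLQ.
            continue
        if "RedrivePolicy" not in resource.get("Properties", {}):
            missing.append(logical_id)
    return missing


# Hypothetical, cut-down equivalents of the two templates above.
example_1 = {
    "Resources": {
        "JobInitQueue": {
            "Type": "AWS::SQS::Queue",
            "Properties": {"VisibilityTimeout": 30},
        }
    }
}

example_2 = {
    "Resources": {
        "JobInitQueue": {
            "Type": "AWS::SQS::Queue",
            "Properties": {
                "VisibilityTimeout": 30,
                "RedrivePolicy": {
                    "deadLetterTargetArn": {"Fn::GetAtt": ["JobInitDLQ", "Arn"]},
                    "maxReceiveCount": 5,
                },
            },
        },
        "JobInitDLQ": {
            "Type": "AWS::SQS::Queue",
            "Properties": {"VisibilityTimeout": 30},
        },
    }
}

print(find_queues_missing_dlq(example_1))  # ['JobInitQueue']
print(find_queues_missing_dlq(example_2))  # []
```

A real cfn-lint rule would wrap the same test in a rule class and report a file position; the sketch only shows the decision being made.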

Operations extend the custom rules to require a CloudWatch Alarm on every DLQ; messages in a DLQ are an error condition that they need to investigate. Running cfn-lint against the template now generates an error:

➜ cfn-lint -a cfn-rules/operations -t 2.template.yaml
E91001 Missing CloudWatch Alarm for Resources/JobInitDLQ
2.template.yaml:36:3

The alarm is added, and the rule succeeds:

...
JobInitDLQAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmDescription: "Job init DLQ has messages"
    Namespace: "AWS/SQS"
    MetricName: ApproximateNumberOfMessagesVisible
    Dimensions:
      - Name: QueueName
        Value:
          !GetAtt JobInitDLQ.QueueName
    Statistic: Sum
    Period: 300
    Threshold: 0
    EvaluationPeriods: 1
    ComparisonOperator: GreaterThanThreshold
    AlarmActions:
      - !Ref AlarmSNSTopic
    OKActions:
      - !Ref AlarmSNSTopic
    TreatMissingData: missing

AlarmSNSTopic:
  Type: AWS::SNS::Topic
  Properties:
    Subscription:
      - Endpoint: steve@gmail.com
        Protocol: email

➜ cfn-lint -a cfn-rules/operations -t 3.template.yaml
➜ echo $?
0
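
The alarm check can be sketched in the same spirit. The function below is a simplified stand-in, not the actual rule: it assumes a DLQ counts as covered when any CloudWatch alarm has a dimension whose value is an `Fn::GetAtt` on that queue, and it only recognises that one way of referencing the queue. The names `find_dlqs_missing_alarm`, `without_alarm`, and `with_alarm` are hypothetical.

```python
def find_dlqs_missing_alarm(template):
    """Return logical IDs of DLQs that no CloudWatch alarm refers to."""
    resources = template.get("Resources", {})

    # Collect the logical IDs that alarm dimensions point at via Fn::GetAtt.
    alarmed = set()
    for resource in resources.values():
        if resource.get("Type") != "AWS::CloudWatch::Alarm":
            continue
        for dimension in resource.get("Properties", {}).get("Dimensions", []):
            value = dimension.get("Value")
            if not (isinstance(value, dict) and "Fn::GetAtt" in value):
                continue
            target = value["Fn::GetAtt"]
            if isinstance(target, str):
                # Short form, e.g. "JobInitDLQ.QueueName".
                target = target.split(".")
            if target:
                alarmed.add(target[0])

    return [
        logical_id
        for logical_id, resource in resources.items()
        if resource.get("Type") == "AWS::SQS::Queue"
        and logical_id.endswith("DLQ")
        and logical_id not in alarmed
    ]


# Hypothetical, cut-down resources based on the template above.
queues = {
    "JobInitDLQ": {
        "Type": "AWS::SQS::Queue",
        "Properties": {"VisibilityTimeout": 30},
    }
}
alarm = {
    "JobInitDLQAlarm": {
        "Type": "AWS::CloudWatch::Alarm",
        "Properties": {
            "Namespace": "AWS/SQS",
            "MetricName": "ApproximateNumberOfMessagesVisible",
            "Dimensions": [
                {"Name": "QueueName", "Value": {"Fn::GetAtt": "JobInitDLQ.QueueName"}}
            ],
        },
    }
}

without_alarm = {"Resources": dict(queues)}
with_alarm = {"Resources": {**queues, **alarm}}

print(find_dlqs_missing_alarm(without_alarm))  # ['JobInitDLQ']
print(find_dlqs_missing_alarm(with_alarm))     # []
```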

Finance

cfn-lint includes example rules that require resources to have tags. Using these rules, the template above fails:

E9001 Missing Tags Properties for Resources/JobInitQueue/Properties
3.template.yaml:28:5

E9001 Missing Tags Properties for Resources/JobInitDLQ/Properties
3.template.yaml:38:5

Adding tags stops these errors showing, but the rules require tags that are specific for tracking billing, so additional errors are shown:

E9000 Missing Tag CostCenter at Resources/JobInitQueue/Properties/Tags
4.template.yaml:35:7

E9000 Missing Tag ApplicationName at Resources/JobInitQueue/Properties/Tags
4.template.yaml:35:7

E9000 Missing Tag CostCenter at Resources/JobInitDLQ/Properties/Tags
4.template.yaml:48:7

E9000 Missing Tag ApplicationName at Resources/JobInitDLQ/Properties/Tags
4.template.yaml:48:7

Adding these tags to all resources makes the lint errors go away, and ensures their charges are associated with the application they are a part of.
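
To make the two-stage behaviour above concrete, here is a minimal sketch of a tag check in plain Python. It is not the code of the cfn-lint example rules: the set of resource types in `TAGGABLE_TYPES`, the function name `find_tag_problems`, and the sample resources are assumptions, and the messages simply mirror the wording of the errors shown above.

```python
REQUIRED_TAGS = ("CostCenter", "ApplicationName")

# Assumption: only these resource types are checked in this sketch.
TAGGABLE_TYPES = {"AWS::SQS::Queue", "AWS::S3::Bucket"}


def find_tag_problems(template):
    """Return a list of messages for missing Tags or missing required tags."""
    problems = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") not in TAGGABLE_TYPES:
            continue
        tags = resource.get("Properties", {}).get("Tags")
        if tags is None:
            # Stage one: the resource has no Tags property at all.
            problems.append(
                f"Missing Tags Properties for Resources/{logical_id}/Properties"
            )
            continue
        # Stage two: Tags exists, so check each required key.
        present = {tag.get("Key") for tag in tags}
        for key in REQUIRED_TAGS:
            if key not in present:
                problems.append(
                    f"Missing Tag {key} at Resources/{logical_id}/Properties/Tags"
                )
    return problems


def queue_with_tags(tags):
    """Build a one-queue template; pass None to omit the Tags property."""
    properties = {"VisibilityTimeout": 30}
    if tags is not None:
        properties["Tags"] = tags
    return {
        "Resources": {
            "JobInitQueue": {"Type": "AWS::SQS::Queue", "Properties": properties}
        }
    }


untagged = queue_with_tags(None)
partly_tagged = queue_with_tags([{"Key": "Env", "Value": "dev"}])
fully_tagged = queue_with_tags(
    [
        {"Key": "CostCenter", "Value": "1234"},
        {"Key": "ApplicationName", "Value": "devops-example"},
    ]
)

for template in (untagged, partly_tagged, fully_tagged):
    print(find_tag_problems(template))
```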

Security

Here is a simple example rule that requires, as a convention, that public buckets end with a Public suffix.

For example:

...
JobBucket:
  Type: AWS::S3::Bucket
  Properties:
    BucketName: my-domain.nz
    WebsiteConfiguration:
      IndexDocument: index.html
      ErrorDocument: 404.html
      RoutingRules:  # single page app configuration
        - RedirectRule:
            ReplaceKeyWith: index.html
          RoutingRuleCondition:
            KeyPrefixEquals: /
    Tags:
      - Key: Env
        Value: !Ref Env
      - Key: ApplicationName
        Value: !Ref ServiceName
      - Key: CostCenter
        Value: !Ref CostCenter

JobBucketPolicy:
  Type: AWS::S3::BucketPolicy
  Properties:
    Bucket: !Ref JobBucket
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Sid: PublicReadAccess
          Effect: Allow
          Principal: "*"
          Action:
            - s3:GetObject
          Resource:
            !Sub "arn:aws:s3:::${JobBucket}/*"

➜ cfn-lint -a cfn-rules/security -t 6.template.yaml
E92001 Bucket Policy for non-Public suffix bucket Resources/JobBucket open to the world
6.template.yaml:88:3
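
As with the operations rules, the security rule is not shown here, so the following is only a sketch of what such a check might do. It assumes the Public suffix is tested on the bucket's logical ID, that a policy counts as "open to the world" when an Allow statement has a wildcard principal, and that the policy refers to its bucket with `Ref`. Statement conditions are ignored. The function names and sample templates are hypothetical, and none of this is the cfn-lint rule API.

```python
def _as_list(value):
    """Policy fields may be a single item or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def is_public_statement(statement):
    """True for an Allow statement whose principal is a wildcard."""
    if statement.get("Effect") != "Allow":
        return False
    principal = statement.get("Principal")
    if principal == "*":
        return True
    if isinstance(principal, dict):
        return "*" in _as_list(principal.get("AWS", []))
    return False


def find_unexpected_public_buckets(template):
    """Return logical IDs of buckets with a public policy but no Public suffix."""
    flagged = []
    for resource in template.get("Resources", {}).values():
        if resource.get("Type") != "AWS::S3::BucketPolicy":
            continue
        properties = resource.get("Properties", {})
        bucket = properties.get("Bucket")
        # Only the {"Ref": LogicalId} form is resolved in this sketch.
        if not (isinstance(bucket, dict) and "Ref" in bucket):
            continue
        bucket_id = bucket["Ref"]
        if bucket_id.endswith("Public"):
            continue
        statements = _as_list(properties.get("PolicyDocument", {}).get("Statement", []))
        if any(is_public_statement(statement) for statement in statements):
            flagged.append(bucket_id)
    return flagged


def template_for(bucket_id):
    """Build a bucket plus a world-readable policy under the given logical ID."""
    return {
        "Resources": {
            bucket_id: {"Type": "AWS::S3::Bucket", "Properties": {}},
            bucket_id + "Policy": {
                "Type": "AWS::S3::BucketPolicy",
                "Properties": {
                    "Bucket": {"Ref": bucket_id},
                    "PolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [
                            {
                                "Sid": "PublicReadAccess",
                                "Effect": "Allow",
                                "Principal": "*",
                                "Action": ["s3:GetObject"],
                            }
                        ],
                    },
                },
            },
        }
    }


print(find_unexpected_public_buckets(template_for("JobBucket")))        # ['JobBucket']
print(find_unexpected_public_buckets(template_for("JobBucketPublic")))  # []
```

Under these assumptions, renaming the resource to carry the Public suffix is what makes the intent explicit and lets the check pass.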

In addition to linting, the security team can add further security checks via Infrastructure-as-Code. For example, they can define AWS Config rules that trigger CloudWatch rules when liberal permissions are set. However, catching errors with linting before they happen is preferable.

If required, they can set a blanket ban on public buckets.

Conclusion

Successful development patterns and tooling can apply to all changes a cross-functional team makes. Version control and CI/CD can act as a centralised place to coordinate, document, and test changes, and to notify the team about them; this shared tooling encourages openness, inclusiveness, and team responsibility.
