AWS Managed Prometheus goes GA - Playing with alertmanager

On 29-09-2021, AWS announced the general availability of its managed Prometheus service named Amazon Managed Service for Prometheus or AMP. We already had a look at this service on this blog when it was still in preview.

Now that it is GA, the most visible addition to the service is the capability to define recording and alerting rules. You can read the announcement in this blog post.

In this post, we will evaluate the new alerting feature by configuring an alert rule and notifications to a Slack channel.

Architecture

figure 01 - architecture

The diagram above describes the components we will use in this test.

  1. Metrics are scraped (using OpenTelemetry-Collector) and pushed to the AMP workspace.

  2. An alerting rule is set up in the AMP workspace to alert on a specific condition based on the collected metrics. We will see how to deploy the alerting rule using the AWS CLI (unfortunately there is no public API to do this yet).

  3. When an alert is triggered, a notification is sent to a specific AWS SNS topic. SNS is the only notification channel supported by the alertmanager that comes with an AMP workspace.

  4. A Lambda function is triggered when a notification arrives in the SNS topic and sends it to a Slack channel using a webhook.

Deploy the components

Prerequisites

  • A valid AWS account in which to deploy the AWS components (AMP workspace, SNS topic and Lambda function).
  • A source of metrics. In this case I will use an Azure Kubernetes Service (AKS) cluster.
  • A Slack channel to post the notifications. I will use a personal test Slack workspace / channel to do this.

The following sections describe how to create the components as Terraform resources. For ease of use, put the Terraform code in a main.tf file.

Create the AMP workspace

Create the following components

  • Amazon Managed Service for Prometheus Workspace
  • AWS IAM users for query, remote write and rules management of the workspace and their access keys (3 users + 3 keys)
  • AWS IAM groups for query, remote write and rules management of the workspace (3 groups)
  • AWS IAM policies for query, remote write access and rules management on the workspace (3 policies)
  • Assign policies to corresponding groups (3 policy to group assignments)
  • Assign users to corresponding groups (3 user to group assignments)
# Configure the AWS provider
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      version = "3.61.0"
    }
  }
}

provider "aws" {
  # Configuration options
}

# Create the AMP workspace
resource "aws_prometheus_workspace" "test-amp-alerts" {
  alias = "test-amp-alerts"
}

Add some IAM resources to keep access to the workspace separated by purpose (query, remote write and rules management).

# Create IAM policy for read access to the AMP workspace
resource "aws_iam_policy" "query-test-amp-policy" {
  name        = "AMP-query-test-amp-alerts"
  path        = "/amp/"
  description = "IAM policy to enable query access to AMP workspace test-amp-alerts"

  # Terraform's "jsonencode" function converts a
  # Terraform expression result to valid JSON syntax.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
            "Action": [
                "aps:GetLabels",
                "aps:GetMetricMetadata",
                "aps:GetSeries",
                "aps:QueryMetrics",
            ],
            "Effect": "Allow",
            "Resource": "${aws_prometheus_workspace.test-amp-alerts.arn}"
      }
    ]
  })
}

# Create IAM policy for remote write access to the AMP workspace
resource "aws_iam_policy" "remotewrite-test-amp-policy" {
  name        = "AMP-remotewrite-test-amp-alerts"
  path        = "/amp/"
  description = "IAM policy to enable remote write access to AMP workspace test-amp-alerts"

  # Terraform's "jsonencode" function converts a
  # Terraform expression result to valid JSON syntax.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
            "Action": [
                "aps:RemoteWrite",
            ],
            "Effect": "Allow",
            "Resource": "${aws_prometheus_workspace.test-amp-alerts.arn}"
      }
    ]
  })
}

# Create IAM policy for alerting and recording rules access to the AMP workspace
resource "aws_iam_policy" "rules-test-amp-policy" {
  name        = "AMP-rules-test-amp-alerts"
  path        = "/amp/"
  description = "IAM policy to enable rules management of AMP workspace test-amp-alerts"

  # Terraform's "jsonencode" function converts a
  # Terraform expression result to valid JSON syntax.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
            "Action": [
                "aps:CreateAlertManagerAlerts",
                "aps:CreateAlertManagerDefinition",
                "aps:CreateRuleGroupsNamespace",
                "aps:DeleteAlertManagerDefinition",
                "aps:DeleteAlertManagerSilence",
                "aps:DescribeAlertManagerDefinition",
                "aps:DescribeRuleGroupsNamespace",
                "aps:GetAlertManagerSilence",
                "aps:GetAlertManagerStatus",
                "aps:ListAlertManagerAlertGroups",
                "aps:ListAlertManagerAlerts",
                "aps:ListAlertManagerReceivers",
                "aps:ListAlertManagerSilences",
                "aps:ListAlerts",
                "aps:ListRules",
                "aps:ListRuleGroupsNamespaces",
                "aps:PutAlertManagerDefinition",
                "aps:PutAlertManagerSilences",
                "aps:PutRuleGroupsNamespace",
            ],
            "Effect": "Allow",
            "Resource": "${aws_prometheus_workspace.test-amp-alerts.arn}"
      }
    ]
  })
}

# Create IAM groups
resource "aws_iam_group" "amp-remotewrite-group" {
  name = "amp-rw"
  path = "/amp/"
}

resource "aws_iam_group" "amp-query-group" {
  name = "amp-query"
  path = "/amp/"
}

resource "aws_iam_group" "amp-rules-group" {
  name = "amp-rules"
  path = "/amp/"
}

# Assign policies to groups
resource "aws_iam_group_policy_attachment" "amp-remotewrite-attach" {
  group      = aws_iam_group.amp-remotewrite-group.name
  policy_arn = aws_iam_policy.remotewrite-test-amp-policy.arn
}

resource "aws_iam_group_policy_attachment" "amp-query-attach" {
  group      = aws_iam_group.amp-query-group.name
  policy_arn = aws_iam_policy.query-test-amp-policy.arn
}

resource "aws_iam_group_policy_attachment" "amp-rules-attach" {
  group      = aws_iam_group.amp-rules-group.name
  policy_arn = aws_iam_policy.rules-test-amp-policy.arn
}

# Create IAM users and access keys and assign them to correct groups

# Create user for remote write
resource "aws_iam_user" "user-amp-remotewrite" {
    name = "test-amp-user-remotewrite"
    path = "/amp/"
}

resource "aws_iam_access_key" "user-amp-remotewrite" {
  user = aws_iam_user.user-amp-remotewrite.name
}

resource "aws_iam_group_membership" "amp-remotewrite-assign" {
    name = "test-amp-remotewrite-membership"
    users = [
        aws_iam_user.user-amp-remotewrite.name,
    ]
    group = aws_iam_group.amp-remotewrite-group.name
}

# Create user for query
resource "aws_iam_user" "user-amp-query" {
    name = "test-amp-user-query"
    path = "/amp/"
}

resource "aws_iam_access_key" "user-amp-query" {
  user = aws_iam_user.user-amp-query.name
}

resource "aws_iam_group_membership" "amp-query-assign" {
    name = "test-amp-query-membership"
    users = [
        aws_iam_user.user-amp-query.name,
    ]
    group = aws_iam_group.amp-query-group.name
}

# Create user for recording and alerting rules management
resource "aws_iam_user" "user-amp-rules" {
    name = "test-amp-user-rules"
    path = "/amp/"
}

resource "aws_iam_access_key" "user-amp-rules" {
  user = aws_iam_user.user-amp-rules.name
}

resource "aws_iam_group_membership" "amp-rules-assign" {
    name = "test-amp-rules-membership"
    users = [
        aws_iam_user.user-amp-rules.name,
    ]
    group = aws_iam_group.amp-rules-group.name
}

Create the SNS Topic

# Create SNS topic
resource "aws_sns_topic" "test-amp-alerts" {
  name = "test-amp-alerts-topic"
}

Configure the Slack webhook

Follow the official Slack documentation on Sending messages using Incoming Webhooks.

Create the Lambda function

The following sample Python code is deployed in a Lambda function.

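#!/usr/bin/python3.9
import json

import urllib3
import yaml

# Reuse one connection pool across invocations of a warm Lambda container.
http = urllib3.PoolManager()


def lambda_handler(event, context):
    # Replace with the Slack incoming webhook URL created earlier.
    url = "<webhook_url>"
    # The SNS message body is parsed as YAML, then re-encoded as JSON for Slack.
    msg = yaml.safe_load(event['Records'][0]['Sns']['Message'])
    encoded_msg = json.dumps(msg).encode('utf-8')
    resp = http.request('POST', url, body=encoded_msg)
    # Print the notification and the Slack response so they end up in the function logs.
    print({
        "SNS": event['Records'][0]['Sns'],
        "message": event['Records'][0]['Sns']['Message'],
        "status_code": resp.status,
        "response": resp.data
    })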

PS: this sample comes from the blog post how to integrate amp with Slack.
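
To try the parsing logic without deploying anything, a minimal local sketch can help. It assumes the SNS-to-Lambda event shape already used by the handler above (Records, then Sns, then Message). The helper names make_sns_event and slack_payloads are hypothetical, and the sketch swaps yaml.safe_load for the standard-library json module so that it runs without extra packages. Unlike the handler above, it loops over every record, and any message that is not a JSON object with a "text" key is wrapped in a {"text": ...} payload, which Slack incoming webhooks accept.

```python
import json


def make_sns_event(*messages):
    """Build a fake SNS-to-Lambda event with one record per message (test helper)."""
    return {"Records": [{"Sns": {"Message": m}} for m in messages]}


def slack_payloads(event):
    """Return one Slack webhook payload (a dict) per SNS record in the event."""
    payloads = []
    for record in event.get("Records", []):
        message = record["Sns"]["Message"]
        try:
            parsed = json.loads(message)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict) and "text" in parsed:
            payloads.append(parsed)  # already shaped like a Slack payload
        else:
            payloads.append({"text": message})  # wrap plain text
    return payloads


# Hypothetical sample messages, only used to exercise the helpers.
event = make_sns_event('{"text": "PodHighMemoryUsage firing"}', "plain alert text")
for payload in slack_payloads(event):
    print(json.dumps(payload))
```

Wrapping plain text this way is a design choice for the sketch only; with a message template configured in alert manager, the original YAML-based parsing may be all that is needed.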

As we use the pyyaml Python module, we need to install it in a virtual environment and package it for deployment to Lambda.

# Create a folder to contain your python code
mkdir test-amp-alerts
cd test-amp-alerts

# Create virtual environment
python3 -m venv amp_alerts_venv

# Activate the virtual environment
chmod +x amp_alerts_venv/bin/activate
source amp_alerts_venv/bin/activate

# Install pyyaml
pip3 install pyyaml

# Deactivate the virtual environment
deactivate

# Create a zip file with the virtual environment dependencies
cd amp_alerts_venv/lib/<your python version here>/site-packages
zip -r ../../../../test-amp-alerts-slack.zip .

Next add the python code of your function to the zip file to prepare it for deployment through Terraform.

cd ../../../../
zip -g test-amp-alerts-slack test-amp-alerts-slack.py
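
If you would rather script the packaging, for example in a pipeline, the two zip commands above can be sketched with the Python standard library. This is only an illustration: the function name build_lambda_zip is hypothetical, and the usage example runs on throwaway files in a temporary directory instead of a real virtual environment.

```python
import tempfile
import zipfile
from pathlib import Path


def build_lambda_zip(site_packages, handler_file, out_zip):
    """Zip the contents of site-packages plus the handler file at the archive root."""
    site_packages = Path(site_packages)
    handler_file = Path(handler_file)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        # Equivalent of "zip -r" run from inside site-packages.
        for path in sorted(site_packages.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(site_packages).as_posix())
        # Equivalent of "zip -g": add the function code next to the dependencies.
        archive.write(handler_file, handler_file.name)
    with zipfile.ZipFile(out_zip) as archive:
        return sorted(archive.namelist())


# Usage example on placeholder files, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "site" / "yaml").mkdir(parents=True)
    (root / "site" / "yaml" / "__init__.py").write_text("")
    handler = root / "test-amp-alerts-slack.py"
    handler.write_text("def lambda_handler(event, context):\n    pass\n")
    names = build_lambda_zip(root / "site", handler, root / "out.zip")

print(names)
```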

Extend the Terraform main.tf file to deploy the Lambda function and give the SNS topic permission to invoke it.


# Give permission to invoke Lambda function to SNS Topic
resource "aws_lambda_permission" "with_sns" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.amp-alerts-to-slack.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.test-amp-alerts.arn
}

# Create Lambda IAM role
resource "aws_iam_role" "iam_for_lambda" {
  name = "iam_for_lambda"

  # Terraform's "jsonencode" function converts a
  # Terraform expression result to valid JSON syntax.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Sid    = ""
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      },
    ]
  })
}

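# Create the Lambda function
resource "aws_lambda_function" "amp-alerts-to-slack" {
  # Zip archive built in the previous step; adjust the path relative to main.tf
  filename      = "test-amp-alerts-slack.zip"
  function_name = "lambda_called_from_sns"
  role          = aws_iam_role.iam_for_lambda.arn
  # Handler format is <file name without .py>.<function name>
  handler       = "test-amp-alerts-slack.lambda_handler"
  runtime       = "python3.9"
}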

# Assign AWSLambdaBasicExecutionRole to Lambda function to allow writing CloudWatch Logs
data "aws_iam_policy" "lambda-logs" {
  name = "AWSLambdaBasicExecutionRole"
}

resource "aws_iam_role_policy_attachment" "lambda-logs" {
  role       = aws_iam_role.iam_for_lambda.name
  policy_arn = data.aws_iam_policy.lambda-logs.arn
}

# Subscribe Lambda to Topic
resource "aws_sns_topic_subscription" "lambda" {
  topic_arn = aws_sns_topic.test-amp-alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.amp-alerts-to-slack.arn
}

terraform apply

Run terraform apply to deploy the components in your AWS account. It should only take a few seconds to deploy.

Metrics shipping and visualization

I won’t go into detail on this step, as I have already covered it in other posts. I configured an opentelemetry-collector deployment running in an AKS test cluster to send metrics to the AMP workspace using the IAM access key dedicated to remote write. Then I configured a Grafana instance to connect to the AMP workspace using the IAM access key for query only.

I set up two simple queries to check the memory used (in bytes) per container and the CPU used (in seconds) per host.

figure 02 - sample queries

Create alert rule

An alert rule defines alert conditions based on a PromQL expression and a threshold. When the threshold is exceeded, the alert is sent to alert manager, which forwards a notification to an Amazon SNS topic.

Remark: Amazon SNS is the only alert notification receiver available in the alert manager of an AMP workspace.

Alert rules are described in rule files using YAML syntax. For more info on alert rules, read the Prometheus documentation on alerting rules.

Create a basic alert rule file test-alert.yaml that triggers an alert on container memory usage.

groups:
  - name: sample
    rules:
      - alert: PodHighMemoryUsage
        expr: sum(container_memory_usage_bytes{container!="POD",container!=""}) by (namespace, pod) > 600000000
        for: 1m
        labels:
          severity: error
        annotations:
          summary: "Pod memory consumption error."
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} uses more than 600 MB of memory."
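
The threshold in the rule is written in raw bytes, so it is easy to mix up decimal megabytes (MB) and binary mebibytes (MiB) when writing the description. The small sketch below, with hypothetical helper names, shows the conversion in both directions for the 600000000 bytes used in the expression.

```python
MIB = 1024 ** 2  # bytes in one mebibyte
MB = 1000 ** 2   # bytes in one (decimal) megabyte


def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


def mib_to_bytes(mib):
    """Convert mebibytes to a whole number of bytes."""
    return int(mib * MIB)


threshold = 600_000_000  # value used in the alert expression
print(threshold / MB)                     # 600.0 (MB)
print(round(bytes_to_mib(threshold), 1))  # 572.2 (MiB)
print(mib_to_bytes(580))                  # 608174080 (bytes)
```

In other words, a threshold of 600000000 bytes is 600 MB or roughly 572 MiB; a threshold of exactly 580 MiB would be 608174080 bytes.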

This alert rule file must be uploaded to the AMP workspace. You can do this either through the AWS console or with AWS CLI.

Remark: you could also use the API CreateRuleGroupsNamespace to upload the rule file.

We will use the AWS CLI because it’s easier to include in a CI / CD pipeline.

First, we need to encode the YAML file in base64.

base64 input-file > output-file
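
The same encoding step can be sketched in Python if the upload is driven from a script. The function name encode_definition and the tiny rules_yaml sample are placeholders for illustration, not the rule file used in this post.

```python
import base64


def encode_definition(text):
    """Return the base64 encoding of a YAML definition as a single ASCII line."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Placeholder content; in practice, read the text of test-alert.yaml instead.
rules_yaml = "groups:\n  - name: sample\n    rules: []\n"
encoded = encode_definition(rules_yaml)
print(encoded)
```

Unlike some base64 command-line tools, b64encode does not wrap its output, so the result is always a single line.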

Then run the following command to upload it to your workspace (this is an AWS CLI version 2 command):

# Upload the file
aws amp create-rule-groups-namespace --data file://path_to_base_64_output_file \
--name namespace-name --workspace-id my-workspace-id --region region

# Check the status
aws amp describe-rule-groups-namespace --workspace-id my-workspace-id --name namespace-name --region region
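
For a CI / CD pipeline, it can be convenient to build the upload command from variables instead of typing it by hand. The sketch below only assembles the argument list with the flags shown above; the function name and the sample values (workspace ID, namespace name, file name) are made up, and actually running the command (for example with subprocess.run) is left to the pipeline.

```python
import shlex


def create_rule_groups_namespace_cmd(workspace_id, namespace, data_file, region):
    """Build the argument list for 'aws amp create-rule-groups-namespace'."""
    return [
        "aws", "amp", "create-rule-groups-namespace",
        "--data", f"file://{data_file}",
        "--name", namespace,
        "--workspace-id", workspace_id,
        "--region", region,
    ]


# Hypothetical values, for illustration only.
cmd = create_rule_groups_namespace_cmd("ws-EXAMPLE", "test-rules", "rules.b64", "eu-west-1")
print(shlex.join(cmd))
```

Passing the command as a list avoids shell quoting problems and the line-continuation pitfalls of a multi-line shell command.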

One important thing to keep in mind while designing the rule groups is the service limits of an AMP workspace. Have a look at the official documentation to correctly design your alert rules.

Create alert manager configuration

The last part of the setup is to configure alert manager to send notifications to the SNS topic.

alertmanager_config: |
  route:
    receiver: 'default'
  receivers:
    - name: 'default'
      sns_configs:
      - topic_arn: arn:aws:sns:eu-west-1:123456789012:My-Topic # Put your SNS Topic here
        sigv4:
          region: eu-west-1
        attributes:
          key: severity
          value: test  
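
In this configuration the region appears twice: once inside the topic ARN and once under sigv4. As far as I understand, the sigv4 region is the one used to sign the requests to SNS, so it should be the region of the topic. A small sanity check can be sketched as follows; the function names are hypothetical, and the ARN is the placeholder from the configuration above.

```python
def parse_sns_topic_arn(arn):
    """Split an SNS topic ARN (arn:partition:sns:region:account:topic) into its parts."""
    parts = arn.split(":")
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError(f"not an SNS topic ARN: {arn!r}")
    return {
        "partition": parts[1],
        "region": parts[3],
        "account_id": parts[4],
        "topic": parts[5],
    }


def sigv4_region_matches(topic_arn, sigv4_region):
    """Return True when the sigv4 region equals the region embedded in the ARN."""
    return parse_sns_topic_arn(topic_arn)["region"] == sigv4_region


arn = "arn:aws:sns:eu-west-1:123456789012:My-Topic"
print(parse_sns_topic_arn(arn)["region"])
print(sigv4_region_matches(arn, "eu-west-1"))
```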

Encode the file in base64 and upload it with the AWS CLI.

base64 input-file > output-file

aws amp create-alert-manager-definition --data file://path_to_base_64_output_file \
--workspace-id my-workspace-id --region region

After a few seconds, the alert manager configuration is active. If everything works, an alert should show up in the Slack channel within the next 5 minutes.

And indeed, a few minutes later the following message arrived in the Slack channel:

figure 03 - Slack alert notification

Conclusion

This is once again quite a long post with a lot of code, and it was also a “premiere” for me because it is the first time I have played with SNS and Lambda functions and deployed everything through Terraform.

I am pretty pleased with the outcome of the test, but some improvements are still needed to make this ready for production use, such as using a neat alert template, setting up proper logging for SNS and maybe also configuring a dead letter queue for the Lambda function in case of issues.
