The Best re:Invent CloudWatch Launches

There were a lot of CloudWatch launches this year. In the launch session, they were summarized as:

  1. More Coverage

  2. Easier Correlation

  3. Less silos, more analytics

  4. Deeper distributed tracing

  5. Aided investigations

We’ve grouped them into our own categories; let’s go through each of them.

Infographic about the top 10 CloudWatch launches

CloudWatch Unified Navigation (easier correlation)

Let’s start with something very cool: the CloudWatch Unified Navigation.

This feature aims to bring CloudWatch into almost every service console on AWS. Basically, it is a new sidebar that you can open.

You will mostly see this feature under the name of the Explore related button (yes, naming is hard).

The new feature should help you find things that belong together. Often you find yourself looking at a certain trace and know that something else belongs to it as well, e.g. another trace, a log, or some metrics. That is what this feature is meant for.

Finding this feature was harder than I thought. The documentation states that it is available on different pages of CloudWatch. In the launch session, there was also a compass icon labeled “explore related”. Somehow, that wasn’t the case for me.

You need to look for it in the top right corner. It is not the compass icon described in the documentation 🤷🏽‍♂️ but a laptop with a wrench - I already submitted feedback.

Screenshot of a dashboard interface with an "Untitled graph" showing data, a toolbar for time selection, and various actions. A sidebar titled "Operational troubleshooting" includes a warning about retaining context in an AWS service console.

The pages you can access it from:

  • CloudWatch Metrics (navigation, legend, data points)

  • Console toolbar

  • In different services (e.g. Lambda → Monitoring → … → Explore related)

A screenshot showing a CloudWatch dashboard with various metrics for a Lambda function. The left side displays graphs for duration, error count, success rate, and other metrics. The right side features an operational troubleshooting interface with a topology map of related AWS services.

Once you open up this pane, you will see additional information. This is quite neat! First of all, the tracing overview page got a nice overhaul. Let’s hope this comes to the general trace map as well.

From this pane, you can see all related metrics, logs, and traces. You can also go further by clicking on the connected resources, for example another service or API that is used by these services. Then you can see the metrics, logs, and traces of that resource.

For everybody who knows how hard it can be to even find the correct log group name, this can be a lifesaver.

Here is a list of supported services within the explore related page. For some services that are mentioned, it somehow didn’t work for me anyway - for example, my Step Function.

Overall, a very cool feature in my opinion, especially for quickly finding related logs, traces, and components.

Logs Insights News (less silos, more analytics)

We love Logs Insights. And if you use CloudWatch as your main observability solution, you will use Logs Insights daily. There were a couple of launches for Logs Insights itself. I’ll summarize them here.

New Languages to analyze logs - SQL and PPL

You can now use two more languages to analyze logs. Piped Processing Language (PPL) and SQL.

PPL follows a typical pipe approach, like you’re used to from Linux:

fields `@timestamp`, `@message`, `@logStream`, `@log`, `logLevel`, `event.path` 
| where `correlationIds.httpMethod` = "POST"
| stats count() by `event.path`

And SQL, well, is SQL.

SELECT A.`message` as Message, A.`event.path` as Path, A.`lambdaFunction.coldStart` as IsColdStart, B.`status` as ApiGwStatus
FROM `/aws/lambda/dev-ApiStack-RestApiConstructRestApiHandler8040241-nX7hVTvM1SMY` as A
INNER JOIN `dev-ApiStack-RestApiConstructRestApiAccessLogGroup6FB97884-Jru7iSV5QAsj` as B
ON A.`correlationIds.requestId` = B.`requestId`

In SQL, you can use features like

  • join

  • aggregations

and all the other stuff SQL has to offer 😉

I’m not sure if I will use PPL a lot, but I will definitely start using SQL to analyze my logs. In the example query above, I join my Lambda logs with my API Gateway logs based on the request ID to get further data like the integration status 😎

I like that a lot!
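If you prefer the CLI over the console, you can run such queries with aws logs start-query. Here is a minimal sketch that runs the PPL query from above, assuming the newer --query-language parameter (check aws logs start-query help for the exact name and values in your CLI version); the log group name is just a placeholder:

# Run the PPL query from above via the CLI.
# Assumption: start-query accepts a --query-language parameter (e.g. PPL) -
# verify with `aws logs start-query help` before relying on it.
QUERY='fields `@timestamp`, `@message` | where `correlationIds.httpMethod` = "POST" | stats count() by `event.path`'

QUERY_ID=$(aws logs start-query \
  --log-group-names "/aws/lambda/my-function" \
  --query-language PPL \
  --start-time "$(date -d '1 hour ago' +%s)" \
  --end-time "$(date +%s)" \
  --query-string "$QUERY" \
  --output text --query 'queryId')   # note: date -d is GNU date

# Poll until the query is complete, then print the results.
aws logs get-query-results --query-id "$QUERY_ID"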

10,000 Log Groups

There was a limit of 50 log groups per query. This has been raised to 10,000 if you select log groups by a name prefix or query all available log groups.

A dropdown menu for selecting log groups with options for "Log group name" and "Name prefix," along with a selected option for "All log groups."
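Before you fire off a query against a broad prefix, it can be handy to check what that prefix would actually match. This is a small, safe check with long-standing CLI commands (the prefix is the one from our Lambda functions):

# How many log groups would a prefix-based query cover?
aws logs describe-log-groups \
  --log-group-name-prefix "/aws/lambda/dev" \
  --query 'length(logGroups)'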

Field Indexes

You can now also index fields of the logs you are analyzing. This will improve the performance of queries and hence reduce costs.

Screenshot of an interface for configuring index policy details. The section includes fields for policy name and log group selection, with options for "All standard log groups" or "Select log group(s) by prefix match." A text box for entering a prefix name is filled with "/aws/lambda/dev." It also shows field index details, with "correlationIds.requestId" as the field path. Options to add or remove field paths are present. Buttons for "Cancel" and "Save changes" are at the bottom.

For example, here I’ve created a new index on all my Lambda log groups (/aws/lambda/dev prefix) on the request ID in my correlation IDs.
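The same kind of index can presumably be created from the CLI as well. A minimal sketch, assuming the put-index-policy command and a {"Fields": [...]} policy document (verify both with aws logs put-index-policy help; the console flow above is the documented path) - this variant targets a single log group rather than a prefix:

# Index the correlation request ID for one Lambda log group.
# Assumption: command name and policy document shape - double-check with the CLI help.
aws logs put-index-policy \
  --log-group-identifier "/aws/lambda/dev-ApiStack-RestApiConstructRestApiHandler8040241-nX7hVTvM1SMY" \
  --policy-document '{"Fields": ["correlationIds.requestId"]}'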

OpenSearch ❤️ CloudWatch (less silos, more analytics)

OpenSearch now natively integrates with CloudWatch. You can create dashboards for some pre-defined use cases like:

  • VPC Flow Logs

  • CloudTrail Logs

  • WAF Logs

The idea is quite cool. You can use it anywhere you can use OpenSearch Direct Query, which is kind of a serverless variant of OpenSearch. You only pay for usage (though not exactly a little).

Their pricing still seems a bit harsh and hard to calculate. Here is a pricing example from their landing page:

Total monthly charges: $732

  • $3 (Direct Query OCU)

  • $350 (Serverless Indexing)

  • $29 (Serverless Storage)

  • $350 (Serverless Search)

This is with a monthly ingest of over 1 TB!

Great feature, especially for getting an ELK stack-like experience. Let’s see if we can build dashboards ourselves soon without the need to use a pre-defined dashboard.

Transaction Search (deeper distributed tracing)

Transaction Search is another very interesting piece! Once you enable it, it will transform your X-Ray traces into OpenTelemetry spans. These spans help you gain visibility into your application.

For me, this simply looks like distributed tracing for now. But maybe this is AWS’s way of supporting more OpenTelemetry instead of only X-Ray. Maybe this will even replace X-Ray at some point? 🤔
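We enabled it through the console, but it should also be possible from the CLI. A minimal sketch, assuming the X-Ray update-trace-segment-destination command (the operation that redirects spans to CloudWatch Logs) exists in your CLI version:

# Switch span delivery from classic X-Ray storage to CloudWatch Logs (Transaction Search).
# Assumption: command names - verify with `aws xray help` on a current CLI.
aws xray update-trace-segment-destination --destination CloudWatchLogs

# Check the current destination and whether the switch is still pending.
aws xray get-trace-segment-destination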

View of the visual editor for spans

We’ve enabled Transaction Search for our GitHub repository tracker (our example CloudWatch Book application) and got a few spans:

Screenshot of a log analysis interface displaying spans with filters applied. The table lists 15 records for the duration, environment, status code, and service related to API requests. HTTP status code 200 is highlighted with visualizations on the right.

Once you open one of those, you will be redirected to the actual X-Ray trace.

You can also do some basic aggregations:

A screenshot of a web interface showing span query results with a visualization. It includes search filters, a query section, and a horizontal bar graph displaying counts of HTTP response status codes 403 and 202.

But for us, some services are missing, so that needs further investigation.

Application Signals

This one required some thought, because Application Signals already exists as a category of services.

Services like Evidently (RIP), RUM, and Synthetics fall into the Application Signals category. However, this launch refers to the specific service (or feature) called Application Signals. Yes, naming things is hard.

The feature itself already existed - it was launched at last year’s re:Invent.

Application Signals aims to give you an overall view of your application and full visibility into it. The launch post promises three main features for developers:

  1. Developers can answer any question related to performance through an interactive visual editor

  2. Developers can diagnose rarely occurring issues

  3. Logs offer advanced features for transaction spans

With Application Signals, you can also define Service Level Objectives (SLOs). These help you understand whether you meet the goals you’ve set for yourself, for example around availability, latency, or errors.

Application Signals works on the level of whole services. You can enable it for:

  • ECS

  • EKS

  • Lambda

But you can also enable it (I think) for everything the CloudWatch agent can run on. You enable it by installing the CloudWatch Agent or the AWS Distro for OpenTelemetry (ADOT).

AWS Distro for OpenTelemetry Lambda layer diagram flow converting X-Ray tracing to OpenTelemetry and X-Ray again.
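For Lambda, enabling it boils down to attaching the AWS-managed ADOT layer and switching on the wrapper script. A minimal sketch for a Python function - the function name and the layer’s account, region, and version are placeholders you need to look up for your region:

# Attach the AWS-managed OpenTelemetry layer and enable the wrapper.
# Placeholders: function name and the account/region/version parts of the layer ARN.
# Careful: --layers and --environment replace the existing values, they don't merge.
aws lambda update-function-configuration \
  --function-name my-function \
  --layers "arn:aws:lambda:<region>:<aws-account>:layer:AWSOpenTelemetryDistroPython:<version>" \
  --environment 'Variables={AWS_LAMBDA_EXEC_WRAPPER=/opt/otel-instrument}'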

We’ve activated Transaction Search for our example web application from the CloudWatch Book, and an Application Signals service was automatically created as well:

Dashboard displaying metrics for "dev-ApiStack-WebsocketApiConstructwshandlerC4E7E85-JGOU7NayoSKs" over a 3-hour period. Sections include operations, dependencies, Synthetics Canaries, and client pages. Graphs show metrics such as latency, request count, availability, fault rate, and error rate by time. No faults or unhealthy states are reported.

The canaries (we have one) are not connected yet, but we already get an overview like the one above.

If you want to learn more about Application Signals, make sure to check out the amazing One Observability workshop.

X-Ray to OTEL

I think one main insight from all of these launches is that AWS now supports more and more OpenTelemetry! It seems that AWS is basing its new services on OTEL spans instead of its own format. This is quite cool because it allows you to use third-party software for traces as well.

AI Investigation

Investigations is the first 👆🏽 AI feature of CloudWatch from this re:Invent. The idea is to help you debug and investigate any issues you have. You can connect it to your chat applications via SNS, and it also allows you to connect your ticketing system - Linear, Jira, or whatever you use.

You can trigger a sample investigation to get an idea of what it looks like:

Dashboard showing a sample investigation in Amazon Q. The left pane contains a feed with observations noting high latency in PutItem operations on DynamoDB. A chart shows availability and latency over time, indicating possible throttling. The right pane includes suggestions and observations related to DynamoDB deployment and traffic throttling.

There are different panes you can see:

  • Feed: The feed is the overview you may know from ticketing systems. You can see what your other developers posted to this investigation.

  • Suggestions: Suggestions are auto-generated by Q. It looks at recent deployments, configs, and much more to give you an idea of how you can improve. This looks quite nice!

Overall, the idea is amazing. It heavily depends on how well it works in practice. I’m impressed by it and will make use of it. Let’s see how well it works in a production app with lots of traffic!

Auditing Tracing Configuration

CloudWatch gives you a new overview of your tracing settings. You can turn it on for your whole account or organization. Once activated, it will search for resources in your account.

It then shows you an overview of the tracing coverage of the following resource types:

  • EC2 Instances

  • VPCs

  • Lambda Functions

The idea here is to give you an overview of all the different tracing settings within your infrastructure. You don’t want to miss traces of a crucial application. Especially since AWS clearly recommends sampling 100% of your traces for the OTEL spans, this will help you with that!

Screenshot of a dashboard showing resource metrics. It includes sections for AWS EC2 Instances, VPC, and Lambda Functions, all with 0% coverage in logs, metrics, and traces. The update was 0 minutes ago.

Unfortunately, it didn’t work for our accounts yet and couldn’t find any resources.

Synthetics

Synthetics also got two minor updates. With Synthetics, you can build E2E web tests. Typically, you use a headless browser for that - a browser that you can control from code. There is now a new Playwright runtime for this. Quite nice! What comes with it as well: you can store your logs directly in CloudWatch Logs instead of as text files in S3. That’s quite cool!

Synthetics will now also finally delete Lambda resources when canaries are removed. This was always quite a hassle: if you removed a canary, you had to delete the CloudWatch log group, the Lambda function, and everything else yourself. This is now automated!

New Metrics (more coverage)

CloudWatch announced several new metrics for some services.

Event Source Mapping Metrics for Lambda

There are now metrics available for the actual event source mapping (ESM) in Lambda. This is quite useful. If you connect SQS with a Lambda function, for example, the main magic happens within the event source mapping. Until now, this was kind of a black box. Now you can see metrics like:

  • PolledEventCount (events read by ESM)

  • InvokedEventCount (events invoking Lambda function)

  • FilteredOutEventCount (events filtered out)

  • FailedInvokeEventCount (events failing to invoke)
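These metrics are opt-in per event source mapping. A minimal sketch, assuming the newer metrics-config parameter on update-event-source-mapping (verify with the CLI help); <esm-uuid> is the UUID of your mapping:

# Opt a single event source mapping into the new ESM metrics.
# Assumption: the --metrics-config parameter and the EventCount value - check the CLI help.
aws lambda update-event-source-mapping \
  --uuid <esm-uuid> \
  --metrics-config Metrics=EventCount

# Afterwards, the metrics should show up in the AWS/Lambda namespace.
aws cloudwatch list-metrics \
  --namespace AWS/Lambda \
  --metric-name PolledEventCount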

ECS Container Insights enhanced observability

ECS now has an additional mode called enhanced observability. Before, there was only ECS Container Insights; the enhanced observability mode gives you some more metrics.

You can set it up very easily: aws ecs put-account-setting --name containerInsights --value enhanced

Some more metrics are:

  • ContainerMemoryUtilization

  • ContainerCpuUtilization

  • ContainerCpuReserved
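After switching the account setting above to enhanced, you can quickly check whether the per-container metrics arrive; Container Insights publishes them to the ECS/ContainerInsights namespace:

# List the new per-container CPU metric after enabling enhanced observability.
aws cloudwatch list-metrics \
  --namespace ECS/ContainerInsights \
  --metric-name ContainerCpuUtilization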

Database Insights

Screenshot of CloudWatch Database Insights for Amazon Aurora Databases. Features highlighted include unified views, SQL query metrics, dependency mapping, pre-built dashboards, and a fully managed experience. The interface shows database load details, top SQL queries, and performance metrics.

Database Insights gives you more insights into your database (🥁). Only Aurora MySQL and Aurora PostgreSQL are supported right now. It mainly summarizes logs and metrics from your DB in a dashboard.

There are two modes: Standard and Advanced.

Comparison table of database features showing support in Standard and Advanced modes. Advanced mode supports more features, such as visualizing per-query statistics and analyzing slow SQL queries, while both modes support defining access control policies and analyzing DB load contributors.
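Switching a cluster to the Advanced mode should also be possible from the CLI. A minimal sketch, assuming the newer --database-insights-mode flag on modify-db-cluster (verify with aws rds modify-db-cluster help); the cluster identifier is a placeholder, and Advanced mode builds on Performance Insights:

# Move an Aurora cluster to Database Insights Advanced mode.
# Assumption: the --database-insights-mode flag - double-check with the CLI help.
aws rds modify-db-cluster \
  --db-cluster-identifier my-aurora-cluster \
  --database-insights-mode advanced \
  --enable-performance-insights \
  --apply-immediately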

Network Flow Monitoring

Network Flow Monitoring allows you to get network data into CloudWatch. You need to install an agent for that. If you do, you get near real-time information about your network traffic. While this is a bit bigger than “we’ve added some new metrics”, in the end, you’ll have new metrics 😉

Summary

This re:Invent had some amazing launches. The CloudWatch launches alone were amazing!

TLDR;

  1. More Coverage: More Metrics

  2. Easier Correlation: CloudWatch Unified Navigation

  3. Less silos, more analytics: OpenSearch integration

  4. Deeper distributed tracing: X-Ray → OTEL spans

  5. Aided investigations: AI Q Developer Assistant

Improving the user experience of CloudWatch should be one of the top priorities for AWS, in my opinion. CloudWatch is often the only reason developers still log into the console a lot. The unified navigation is a great first step.

Making use of OTEL spans instead of their own X-Ray format is a great idea as well, from my perspective. It allows AWS to support more observability tools and gives customers the ability to export traces into third-party tools and correlate them with more systems.

Let’s see what the future brings!

Resources

AWS News was a great help for that!

OpenTelemetry on AWS: Observability at Scale with Open-Source