Monitoring Kafka

Monitoring Kafka with Kafka exporter + Prometheus + Grafana

Daniel Rosa
May 30, 2018

Hi all,
If you've read my article about installing and configuring Kafka, or you're in charge of building or maintaining a Kafka cluster, you've probably asked yourself: how can I monitor my Kafka cluster using open-source tools?
Everyone knows that monitoring a Kafka cluster with open-source tools is not so easy, and monitoring only basic resources like disk space, CPU usage, and memory consumption is not enough.

Well, it's my pleasure to share with you a solution to monitor Kafka brokers using Kafka Exporter, JMX Exporter, Prometheus, and Grafana.

If you are used to working with containers, the steps to build this monitoring platform will be easy. If not, I will do my best to describe how the components relate to each other so you can decide whether to go ahead with containers or do a standalone installation from the sources.

Scenario

Basically, if you are using a container cluster to run your apps, you can add three new containers (kafka-exporter, prometheus, and grafana) to monitor your Kafka cluster. In this example, I'm using a Docker cluster in swarm mode to demonstrate.

Components detailed

Kafka Exporter
A Kafka exporter for Prometheus. For more information about exporters, see the Prometheus documentation on exporters.

Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community.

Grafana
Grafana allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. Create, explore, and share dashboards with your team and foster a data-driven culture.

JMX Exporter
JMX to Prometheus exporter: a collector that can configurably scrape and expose mBeans of a JMX target.
This exporter is intended to be run as a Java Agent, exposing an HTTP server and serving metrics of the local JVM. It can also be run as an independent HTTP server and scrape remote JMX targets, but this has various disadvantages, such as being harder to configure and being unable to expose process metrics (e.g., memory and CPU usage). Running the exporter as a Java Agent is thus strongly encouraged.

A Pause for reflection

The most important part here is understanding how the components relate to each other. In a few words, this is what will happen:
1. Kafka Exporter and JMX Exporter collect broker metrics from the Kafka cluster.
2. Prometheus scrapes these metrics and stores them in its time series database.
3. Grafana connects to Prometheus to show some beautiful dashboards.
Cool, huh?

Let's get started!

If you intend to use containers, take a look at this Docker Compose file.
Note: I will skip the details of the Docker swarm setup, the overlay network (monitoring), and other Docker concepts like secrets that I'm using, OK? There is a short sketch of these prerequisites below.

I will assume that you already have your own Docker cluster up and running :-)
As I said before, you can decide to install these components outside of a Docker cluster, but I really encourage you to use Docker because it makes things much easier.
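
If your swarm doesn't have these prerequisites yet, a minimal sketch might look like this (the node hostnames are placeholders for your own machines):

docker network create --driver overlay monitoring
docker node update --label-add prometheus=true <prometheus-node>
docker node update --label-add grafana=true <grafana-node>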

version: '3.4'

services:
  kafka_exporter:
    image: danielqsj/kafka-exporter
    networks:
      - monitoring
    command: --kafka.server=kafka01.foo.bar:9092 --kafka.server=kafka02.foo.bar:9092 --kafka.server=kafka03.foo.bar:9092
    deploy:
      mode: replicated
      resources:
        limits:
          memory: 128M
        reservations:
          memory: 64M
      replicas: 1
      endpoint_mode: vip

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring
    volumes:
      - /prometheus:/prometheus
    secrets:
      - prometheus.yml
    command: --config.file=/run/secrets/prometheus.yml --storage.tsdb.path=/prometheus --storage.tsdb.retention=168h
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 4096M
        reservations:
          memory: 1024M
      replicas: 1
      endpoint_mode: vip
      placement:
        constraints:
          - "node.labels.prometheus == true"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    networks:
      - monitoring
    volumes:
      - /var/lib/grafana:/var/lib/grafana
    deploy:
      mode: replicated
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 128M
      replicas: 1
      endpoint_mode: vip
      placement:
        constraints:
          - "node.labels.grafana == true"

secrets:
  prometheus.yml:
    file: config/prometheus.yml
  alertmanager.yml:
    file: config/alertmanager.yml

networks:
  monitoring:
    external: true
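
With prometheus.yml saved under config/ (as referenced by the secrets section above), deploying the stack is a single command. Keep in mind that swarm prefixes service names with the stack name, so the tasks.prometheus / tasks.kafka_exporter DNS names used later in prometheus.yml may need that prefix too:

docker stack deploy -c docker-compose.yml monitoring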

Prometheus.yml

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'prometheus-swarm'

rule_files:
  - "alert.rules_nodes"
  - "alert.rules_tasks"
  - "alert.rules_service-groups"

scrape_configs:
  - job_name: 'prometheus'
    dns_sd_configs:
      - names:
          - 'tasks.prometheus'
        type: 'A'
        port: 9090

  - job_name: 'kafka_exporter'
    dns_sd_configs:
      - names:
          - 'tasks.kafka_exporter'
        type: 'A'
        port: 9308

  - job_name: 'kafka'
    static_configs:
      - targets:
          - kafka01.foo.bar:7072
          - kafka02.foo.bar:7072
          - kafka03.foo.bar:7072

Prometheus.yml explanation:

This is the Prometheus configuration file, and as you can see we have three jobs (prometheus itself, kafka_exporter, and kafka).

kafka_exporter. This job scrapes the exporter container described earlier, which collects Kafka broker metrics. Prometheus connects to this container on TCP port 9308 to fetch the collected metrics and store them in its time series database.
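
To sanity-check the exporter before Prometheus scrapes it, you can query its metrics endpoint directly; the hostname below is a placeholder for whatever reaches the kafka_exporter service in your setup:

curl -s http://<kafka-exporter-host>:9308/metrics | grep ^kafka_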

kafka. This job scrapes the JMX Exporter that runs on the Kafka brokers as a Java agent. It listens on TCP port 7072, and Prometheus connects to this port to fetch metrics and store them in its time series database.

JMX Exporter. To install this exporter, I followed the instructions in the jmx_exporter project documentation.

After downloading jmx_prometheus_javaagent-0.9.jar and kafka-0-8-2.yml, edit the kafka-server-start.sh file on each Kafka broker: export the Prometheus port and append the -javaagent flag to the final exec line (see the full example below):

export PROMETHEUS_PORT=${PROMETHEUS_PORT:-7072}
-javaagent:/opt/kafka/libs/jmx_prometheus_javaagent-0.9.jar=$PROMETHEUS_PORT:/opt/kafka/libs/kafka-0-8-2.yml

Then restart the Kafka service on each broker.

Example:

#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [ $# -lt 1 ];
then
  echo "USAGE: $0 [-daemon] server.properties [--override property=value]*"
  exit 1
fi
base_dir=$(dirname $0)
if [ "x$KAFKA_LOG4J_OPTS" = "x" ]; then
  export KAFKA_LOG4J_OPTS="-Dlog4j.configuration=file:$base_dir/../config/log4j.properties"
fi
if [ "x$KAFKA_HEAP_OPTS" = "x" ]; then
  export KAFKA_HEAP_OPTS="-Xmx4G -Xms4G"
fi
EXTRA_ARGS=${EXTRA_ARGS-'-name kafkaServer -loggc'}
export JMX_PORT=${JMX_PORT:-9999}
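# Port where the jmx_exporter javaagent will expose metrics for Prometheus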
export PROMETHEUS_PORT=${PROMETHEUS_PORT:-7072}
COMMAND=$1
case $COMMAND in
  -daemon)
    EXTRA_ARGS="-daemon "$EXTRA_ARGS
    shift
    ;;
  *)
    ;;
esac
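# -javaagent loads the jmx_exporter agent on $PROMETHEUS_PORT using the kafka-0-8-2.yml config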
exec $base_dir/kafka-run-class.sh $EXTRA_ARGS -javaagent:/opt/kafka/libs/jmx_prometheus_javaagent-0.9.jar=$PROMETHEUS_PORT:/opt/kafka/libs/kafka-0-8-2.yml kafka.Kafka "$@"
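
After restarting, it is worth confirming that each broker really exposes metrics on the agent port; kafka01.foo.bar below is just the first example broker from the Prometheus configuration:

curl -s http://kafka01.foo.bar:7072/metrics | head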

Now you need to connect to Grafana on port 3000, configure Prometheus as a data source, and import a dashboard.
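
If you prefer to script the data source creation instead of clicking through the UI, Grafana's HTTP API can do it. The admin:admin credentials, the localhost address, and the prometheus:9090 URL below are just the defaults of this example setup, so adjust them to your environment:

curl -X POST http://admin:admin@localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://prometheus:9090","access":"proxy","isDefault":true}'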

Pause for reflection again

There are many broker metrics worth monitoring, and two of the most important are consumer group lag and under-replicated partitions. The first needs attention because if the lag keeps increasing, you should check whether your consumers are running properly, or maybe you need to scale them. The second means that your cluster is unstable, due to network issues, a broker being down, etc. Messages in per topic is also interesting because it shows whether your cluster is receiving a traffic spike.

To see which metrics are available, you can run a query in the Prometheus web interface on port 9090. I encourage you to take a look at the PromQL language to write queries and extract what really matters to you.

Once you have the right query, you can create a graph in Grafana. Example for the Kafka consumer group lag:

sum(kafka_consumergroup_lag) by (consumergroup)
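
Following the same idea, here are a couple of other starting points; the metric names are the ones this kafka-exporter image exposes, so double-check them against what your Prometheus instance actually scrapes:

sum(kafka_topic_partition_under_replicated_partition) by (topic)
sum(rate(kafka_topic_partition_current_offset[5m])) by (topic)

The first shows which topics currently have under-replicated partitions; the second approximates the messages-in rate per topic.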


Daniel Rosa

Cloud infrastructure specialist with more than 19 years of experience with critical production systems. https://www.linkedin.com/in/danielmartinsrosa/