Monitoring stack for Kubernetes

The module installs a monitoring stack in which every component can be disabled or customized.
The default setup deploys a complete stack with minimal configuration: Grafana is exposed via Ingress, Loki and Prometheus are available only inside the cluster, Tempo is disabled, persistent storage is configured for all components, and the main exporters are enabled.
When enabled, Tempo works in distributed (microservice) mode.

Tempo core components are: compactor, distributor, ingester, querier, query-frontend, memcached.
Grafana and Prometheus use the Recreate update strategy, which causes a short downtime between deleting the old pod and creating the new one but lets volumes be re-attached properly.
All main components expect Nginx as the Ingress class, so an Nginx Ingress controller is a dependency of this module.
If Prometheus or Loki is enabled, the corresponding local datasource for Grafana is created.
Pushgateway is installed as part of the Prometheus Helm chart and is disabled by default. Use the pushgateway values from the variables table to install and configure it. The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus.
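As a minimal sketch of enabling Pushgateway, the block below uses only inputs listed in the variables table. The module label `monitoring_stack`, the hostname and the volume size are placeholder assumptions:

```hcl
module "monitoring_stack" {
  source  = "solutions.corewide.com/kubernetes/tf-k8s-monitoring-stack/helm"
  version = "~> 4.0"

  grafana = {
    ingress_host = "testmon.example.com" # placeholder hostname
  }

  # Pushgateway is disabled by default, so it has to be enabled explicitly
  pushgateway = {
    enabled     = true
    volume_size = "5Gi" # assumption: overrides the documented 2Gi default
  }
}
```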
Nginx is used as ingress_class in Ingress annotations by default for all main monitoring-stack components. If a custom ingress_class is used together with ingress_auth_enabled, the specific auth annotations must be provided through loki.custom_values and prometheus.custom_values.

NOTE: Enable loki.bind_memberlist_endpoint if you face the following issue during deployment:

```bash
failed: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided
```
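A minimal sketch of the workaround for the memberlist error, assuming the rest of the stack keeps its defaults. The module label and hostname are placeholders:

```hcl
module "monitoring_stack" {
  source  = "solutions.corewide.com/kubernetes/tf-k8s-monitoring-stack/helm"
  version = "~> 4.0"

  grafana = {
    ingress_host = "testmon.example.com" # placeholder hostname
  }

  loki = {
    # Explicitly binds the pod IP to the Loki memberlist service
    # (per the variables table, needed only on EKS clusters)
    bind_memberlist_endpoint = true
  }
}
```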

Once you have a Corewide Solutions Portal account, this one-time action will use your browser session to retrieve credentials:

```shell
terraform login solutions.corewide.com
```

Provision instructions

Initialize mandatory providers:
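A minimal sketch of such a provider setup. The version constraints follow the dependency table of this module, while access through a local kubeconfig file and the `config_path` value are assumptions to adjust for your cluster:

```hcl
terraform {
  required_version = ">= 1.3"

  required_providers {
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.9"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.22"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.3"
    }
  }
}

# Assumption: the cluster is reachable through a local kubeconfig file
provider "kubernetes" {
  config_path = "~/.kube/config"
}

# The Helm provider (2.x) takes its cluster access from a nested kubernetes block
provider "helm" {
  kubernetes {
    config_path = "~/.kube/config"
  }
}
```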

Copy and paste into your Terraform configuration and insert the variables:

```hcl
module "tf_k8s_monitoring_stack" {
  source  = "solutions.corewide.com/kubernetes/tf-k8s-monitoring-stack/helm"
  version = "~> 4.0.1"

  # specify module inputs here or try one of the examples below
  ...
}
```

Initialize the setup:

```shell
terraform init
```

Define update strategy

The Corewide DevOps team strictly follows the Semantic Versioning Specification to provide our clients with products that have predictable upgrades between versions. We recommend pinning patch versions of our modules using the pessimistic constraint operator (~>) to prevent breaking changes during upgrades.

To get new features during upgrades (without breaking compatibility), use ~> 4.0 and run terraform init -upgrade.

For the safest setup, use strict pinning with version = "4.0.1".
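The three pinning options can be compared side by side in a sketch like the following. Only one `version` argument may be active at a time, and the `grafana` input is a placeholder:

```hcl
module "tf_k8s_monitoring_stack" {
  source = "solutions.corewide.com/kubernetes/tf-k8s-monitoring-stack/helm"

  # Patch-level pinning: accepts >= 4.0.1 and < 4.1.0
  version = "~> 4.0.1"

  # Alternatives (keep exactly one `version` argument):
  # version = "~> 4.0" # accepts >= 4.0 and < 5.0, i.e. new backward-compatible features
  # version = "4.0.1"  # strict pinning, nothing changes until the value is edited

  grafana = {
    ingress_host = "testmon.example.com" # placeholder hostname
  }
}
```

After loosening or changing a constraint, `terraform init -upgrade` is what actually fetches the newer module version.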


v4.0.1 released 4 months, 3 weeks ago

New version approx. every 8 weeks

Changelog

All notable changes to this project are documented here.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

v4.0.2 - 2025-03-31

Fixed

alerts management error of Loki data source in Grafana

v4.0.1 - 2025-02-17

Fixed

Grafana instance label selector and access type definition for the external Grafana datasources' custom resources
conditions of supplying basic auth credentials for the external Grafana datasources

v4.0.0 - 2024-12-11

BREAKING CHANGE: Grafana roll-out is changed from the Helm-based approach to Grafana Operator, so its settings and data management aren't compatible with the previous version.

NOTE: After the module upgrade to v4.0, all previously collected data in Grafana will be lost unless an external database was used. For details on the upgrade to Grafana Operator, see the Upgrade Notes section.

Added

tf-k8s-crd module dependency
tf-k8s-grafana module dependency

Changed

migrated Grafana from the Helm-based approach to Grafana Operator managed by Grafana Operator Terraform module
moved Grafana datasources' management to custom resources that utilize the CRD Terraform module

v3.2.2 - 2024-12-03

Fixed

issue with Loki log rotation not working by adding delete_request_store parameter in compactor service

v3.2.1 - 2024-11-25

Fixed

issue with Loki log rotation not working by adding retention_enabled and retention_period parameters in compactor service

v3.2.0 - 2024-10-23

Added

validation of compatible Helm chart versions for each of the managed components' Helm releases

Changed

Grafana Helm chart (from 6.50.7 to 8.5.6) and application (from 9.5.1 to 11.2.2) versions, used new values
Prometheus Helm chart (from 15.8.7 to 18.4.0) and application (from v2.34.0 to v2.39.1) versions, used new values
Loki Helm chart (from 5.10.0 to 6.18.0) and application (from 2.8.3 to 3.2.0) versions, used new values
Promtail Helm chart (from 2.1.0 to 6.16.6) and application (from 2.5.0 to 3.0.0) versions, used new values
Eventrouter Helm chart (from 1.4.15 to 3.2.14) and application (from 0.11.0 to 1.7.0) versions, used new values
Tempo Helm chart (from 1.7.0 to 1.20.0) and application (from 2.3.0 to 2.6.0) versions, used new values

Fixed

issue with ignored persistent volume parameters for Pushgateway

v3.1.1 - 2024-03-20

Fixed

issue with passing the Storage class name for Loki

v3.1.0 - 2023-11-15

Added

option to enable ingress for Tempo service

Changed

updated Tempo chart and application versions

v3.0.0 - 2023-10-10

BREAKING CHANGE: Loki settings and data chunks management aren't compatible with the previous version.

NOTE: After the module upgrade to v3.0, all previously collected data in Loki will be lost. For details on the upgrade to the new Loki Helm chart, see the Upgrade Notes section.

Changed

Loki Helm chart source repository, chart (from 2.13.3 to 5.10.0) and application (from 2.6.1 to 2.8.3) versions, used new values
Promtail Helm chart source repository and version (from 2.0.2 to 2.1.0)
disabled PodSecurityPolicy Kubernetes resource creation for Grafana Helm release in order to keep compatibility with Kubernetes v1.25 and newer

v2.1.0 - 2023-05-26

Added

Optional Tempo creation/management
Traces to logs Tempo feature integration
Defining Loki UID parameter for the Tempo and Loki data sources linking purpose

Changed

Grafana server version to 9.5.1

v2.0.0 - 2023-02-21

BREAKING CHANGE: all kubernetes provider resources now use versioned resources, which aren't compatible with the previous version.

NOTE: After the module upgrade to v2.0, all kubernetes provider resources and other resources that depend on them will be recreated, unless they are re-imported into the Terraform state manually; see the Upgrade Notes section.

Changed

use versioned Kubernetes provider resources instead of standard ones in order to satisfy requirements of future updates of providers
default grafana application version to 9.3.6
default grafana helm chart version to 6.50.7
code refactoring
move all default app and chart versions from the locals to corresponding optional variables' values

v1.1.1 - 2023-01-13

Fixed

bug with rendering basic_auth credentials with special characters by disabling special characters

v1.1.0 - 2022-12-14

Added

option to toggle Alertmanager installation (prometheus.alertmanager_enabled, off by default)

v1.0.0 - 2022-12-08

Added

Grafana monitoring stack creation/management (Grafana, Loki, Prometheus, Node exporter, Promtail, Pushgateway and Eventrouter) inside the Kubernetes cluster
all components are optional
ability to create additional namespace for stack components or use already existing one
Ingress Nginx basic authentication, with a K8s secret created for this purpose
ability to add custom Grafana datasources
local datasources for Loki and Prometheus created by default (for each one that is enabled)

Upgrade Notes

From `v1.x` to `v2.x`

All kubernetes provider resources now use versioned resources. According to the kubernetes provider's suggestions, the simplest, non-destructive way to migrate is to remove the old resource from the state and import it as a versioned one, like so:

```bash
# If the Kubernetes namespace was managed by the module, it must be re-imported
# (addresses with an index are quoted so the shell does not expand the brackets)
terraform state rm 'module.monitoring.kubernetes_namespace.monitoring[0]'
terraform import 'module.monitoring.kubernetes_namespace_v1.monitoring[0]' monitoring
# If the Kubernetes secret with basic auth credentials was created, it must be re-imported
terraform state rm module.monitoring.kubernetes_secret.ingress_basic_auth
terraform import module.monitoring.kubernetes_secret_v1.ingress_basic_auth monitoring/ingress-monitoring-basic-auth
# Re-import the Cluster Role Binding for the Event Router
terraform state rm module.monitoring.kubernetes_cluster_role_binding.eventrouter
terraform import module.monitoring.kubernetes_cluster_role_binding_v1.eventrouter eventrouter
```
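As an alternative to the `terraform import` commands, the same imports could be sketched declaratively with `import` blocks. This is not part of the module's documented procedure: it assumes Terraform 1.5 or newer (the module itself only requires 1.3), the blocks would live in the root module, and the old resources still have to be removed from the state with `terraform state rm` first. The addresses and IDs are taken from the commands above:

```hcl
# Assumption: Terraform >= 1.5, blocks placed in the root module
import {
  to = module.monitoring.kubernetes_namespace_v1.monitoring[0]
  id = "monitoring"
}

import {
  to = module.monitoring.kubernetes_secret_v1.ingress_basic_auth
  id = "monitoring/ingress-monitoring-basic-auth"
}

import {
  to = module.monitoring.kubernetes_cluster_role_binding_v1.eventrouter
  id = "eventrouter"
}
```

With these blocks in place, the imports are planned and applied by a regular `terraform plan` and `terraform apply` run, and the blocks can be deleted afterwards.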

From `v2.x` to `v3.x`

Starting with v3.0, the module uses a different chart repository and chart version for Loki. Since Loki is deployed as a StatefulSet, its spec can't be updated in place, so Loki must be redeployed manually. First, update the module version reference and re-init the module, then uninstall the already deployed Loki Helm chart in one of two ways:

by means of Terraform

```bash
terraform destroy -target 'module.monitoring.helm_release.loki[0]'
```

by means of Helm

```bash
helm -n monitoring uninstall loki
```

Then, the Loki Helm chart can be re-installed:

```bash
terraform apply -target 'module.monitoring.helm_release.loki[0]'
```

NOTE: Because the new Loki version processes data chunks differently, the data collected before the update will be lost.

From `v3.x` to `v4.x`

Starting with v4.0, the module uses Grafana Operator instead of the Helm-based approach, so Grafana will be completely re-deployed, and the attached volume with all the managed data (users, dashboards, alert rules, etc.) will be lost.
To upgrade, update the module version reference, re-init the module, update module inputs following the documentation, and apply changes.

NOTE: Custom configuration parameters can be provided by means of Grafana environment variables via the grafana.env_vars input (previously the module utilized custom Helm values for the managed Helm release of Grafana).

NOTE: Grafana dashboards can be covered by Terraform only as custom resources.
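To illustrate the move from custom Helm values to environment variables, here is a sketch that relies on Grafana's standard `GF_<SECTION>_<KEY>` naming for configuration overrides. The two settings chosen, the hostname and the module label are illustrative assumptions, not values required by the module:

```hcl
module "monitoring_stack" {
  source  = "solutions.corewide.com/kubernetes/tf-k8s-monitoring-stack/helm"
  version = "~> 4.0"

  grafana = {
    ingress_host = "testmon.example.com" # placeholder hostname

    # Each key maps to a grafana.ini option: GF_<SECTION>_<KEY>
    env_vars = {
      GF_USERS_ALLOW_SIGN_UP    = "false" # [users] allow_sign_up
      GF_AUTH_ANONYMOUS_ENABLED = "false" # [auth.anonymous] enabled
    }
  }
}
```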

Deploy the complete stack with only mandatory values:

```hcl
module "monitoring_stack" {
  source  = "solutions.corewide.com/kubernetes/tf-k8s-monitoring-stack/helm"
  version = "~> 4.0"

  grafana = {
    ingress_host = "testmon.example.com"
  }
}
```

Deploy a partial stack with some customization (Prometheus and Node Exporter disabled, Tempo enabled):

```hcl
module "monitoring_stack" {
  source  = "solutions.corewide.com/kubernetes/tf-k8s-monitoring-stack/helm"
  version = "~> 3.2"

  grafana = {
    ingress_host  = "testmon.example.com"
    admin_pass    = "YYYY-YYYY-YYYY"
    storage_class = "standard"
  }

  prometheus = {
    enabled               = false
    node_exporter_enabled = false
  }

  tempo = {
    enabled = true

    node_selector = {
      "kubernetes\\.azure\\.com/agentpool" = "maintenance"
    }
  }
}
```

Deploy the full stack with an additional datasource.

Set node selectors for the components, add Pushgateway from the Prometheus chart, add a custom environment variable for Grafana, and set the basic authentication credentials:

```hcl
module "monitoring_stack" {
  source  = "solutions.corewide.com/kubernetes/tf-k8s-monitoring-stack/helm"
  version = "~> 4.0"

  name_prefix = "dev"

  auth_credentials = {
    password = "XXXX-XXXX-XXXX"
  }

  grafana = {
    ingress_host = "testmon.example.com"
    admin_pass   = "YYYY-YYYY-YYYY"

    node_selector = {
      "cloud\\.google\\.com/gke-nodepool" = "maintenance"
    }

    env_vars = {
      GF_PLUGIN_GRAFANA_IMAGE_RENDERER_RENDERING_IGNORE_HTTPS_ERRORS = true
    }
  }

  prometheus = {
    node_selector = {
      "cloud\\.google\\.com/gke-nodepool" = "maintenance"
    }
  }

  pushgateway = {
    enabled         = true
    ingress_enabled = true
    ingress_host    = "pushgw.example.com"
    volume_size     = "5Gi"
  }

  loki = {
    node_selector = {
      "cloud\\.google\\.com/gke-nodepool" = "maintenance"
    }
  }

  grafana_datasources = [
    {
      name               = "Prometheus Dev"
      type               = "prometheus"
      url                = "https://devprom.example.com"
      basic_auth_enabled = true
      basic_auth_pass    = "XXXX-XXXX-XXXX"
      basic_auth_user    = "monitoring"
    },
  ]
}
```

Variable	Description	Type	Default	Required	Sensitive
`grafana`	Grafana parameters	`object`		yes	no
`auth_credentials`	Ingress Nginx basic auth login credentials	`object`	`{}`	no	yes
`auth_credentials.password`	Ingress Nginx basic auth login password (will be randomly generated if it's not set)	`string`		no	yes
`auth_credentials.username`	Ingress Nginx basic auth login username	`string`	`monitoring`	no	yes
`create_namespace`	Indicates creation of dedicated namespace for monitoring components	`bool`	`true`	no	no
`grafana.admin_pass`	Grafana admin password (will be randomly generated if it's not set)	`string`		no	no
`grafana.admin_user`	Grafana admin username	`string`	`admin`	no	no
`grafana.enabled`	Toggle Grafana installation	`bool`	`true`	no	no
`grafana.env_vars`	Environment variables for Grafana container in key-value format	`map(any)`	`{}`	no	no
`grafana.grafana_version`	Grafana server version	`string`	`11.2.2`	no	no
`grafana.ingress_host`	Hostname to use with Ingress (required if `enabled` is `true`)	`string`		no	no
`grafana.log_level`	Grafana log level (Supported levels: `trace`, `debug`, `info`, `warn`, `error` or `critical`)	`string`	`warn`	no	no
`grafana.node_selector`	Node selector to place Grafana pods in	`map(any)`	`{}`	no	no
`grafana.operator_app_version`	Grafana operator image version	`string`	`v5.9.2`	no	no
`grafana.operator_chart_version`	Grafana operator Helm chart version	`string`	`v5.9.2`	no	no
`grafana.recreate_on_changes`	Whether the Grafana CRD should be recreated and not updated during `apply` phase	`bool`	`false`	no	no
`grafana.storage_class`	Storage class name	`string`		no	no
`grafana.volume_size`	Volume data size	`string`	`5Gi`	no	no
`grafana_datasources`	Grafana datasources for datasource provisioning	`list(object)`	`[]`	no	yes
`grafana_datasources[*].basic_auth_enabled`	Toggle Ingress basic auth	`bool`		no	yes
`grafana_datasources[*].basic_auth_pass`	Ingress basic auth password	`string`		no	yes
`grafana_datasources[*].basic_auth_user`	Ingress basic auth user	`string`		no	yes
`grafana_datasources[*].name`	Name of the datasource	`string`		no	yes
`grafana_datasources[*].type`	Type of the datasource	`string`		no	yes
`grafana_datasources[*].url`	URL of the datasource	`string`		no	yes
`ingress_cert_issuer`	Ingress TLS certificate issuer	`string`	`letsencrypt`	no	no
`ingress_class`	Ingress Class definition	`string`	`nginx`	no	no
`loki`	Loki parameters	`object`	`{}`	no	no
`loki.app_version`	Loki server version	`string`	`3.2.0`	no	no
`loki.bind_memberlist_endpoint`	Toggle explicit binding of the pod IP to the `Loki-Memberlist` Kubernetes service. Required only for deployment into an EKS cluster in order to bind the endpoint IP accurately	`bool`	`false`	no	no
`loki.chart_version`	Helm chart version (compatible chart version is `6.0.0` and newer)	`string`	`6.18.0`	no	no
`loki.custom_values`	Custom Helm chart values in key-value format: `"persistence.enabled" = true`	`map(any)`	`{}`	no	no
`loki.enabled`	Toggle Loki installation	`bool`	`true`	no	no
`loki.eventrouter_app_version`	Eventrouter app version	`string`	`1.7.0`	no	no
`loki.eventrouter_chart_version`	Eventrouter Helm chart version (compatible chart version is `3.0.0` and newer)	`string`	`3.2.14`	no	no
`loki.eventrouter_enabled`	Toggle Eventrouter installation	`bool`	`true`	no	no
`loki.ingress_auth_enabled`	Toggle Ingress basic auth (effective if `ingress_enabled` is `true`)	`bool`	`false`	no	no
`loki.ingress_enabled`	Toggle Ingress	`bool`	`false`	no	no
`loki.ingress_host`	Hostname to use with Ingress (effective if `ingress_enabled` is `true`)	`string`		no	no
`loki.node_selector`	Node selector to place Loki pods in	`map(any)`	`{}`	no	no
`loki.promtail_app_version`	Promtail app version	`string`	`3.0.0`	no	no
`loki.promtail_chart_version`	Promtail Helm chart version (compatible chart version is `6.0.0` and newer)	`string`	`6.16.6`	no	no
`loki.promtail_enabled`	Toggle Promtail installation	`bool`	`true`	no	no
`loki.retention_period`	Data retention period	`string`	`93d`	no	no
`loki.storage_class`	Storage class name	`string`		no	no
`loki.volume_size`	Volume data size	`string`	`100Gi`	no	no
`name_prefix`	Name prefix for resources creation	`string`		no	no
`namespace`	Monitoring stack namespace	`string`	`monitoring`	no	no
`prometheus`	Prometheus parameters	`object`	`{}`	no	no
`prometheus.alertmanager_enabled`	Toggle Prometheus alertmanager installation	`bool`	`false`	no	no
`prometheus.app_version`	Prometheus server version	`string`	`v2.39.1`	no	no
`prometheus.chart_version`	Helm chart version (compatible chart version must be from `18.0.0` and up to `19.0.0`)	`string`	`18.4.0`	no	no
`prometheus.custom_values`	Custom Helm chart values in key-value format: `"persistentVolume.enabled" = true`	`map(any)`	`{}`	no	no
`prometheus.enabled`	Toggle Prometheus installation	`bool`	`true`	no	no
`prometheus.ingress_auth_enabled`	Toggle Nginx basic auth (effective if `ingress_enabled` is `true`)	`bool`	`false`	no	no
`prometheus.ingress_enabled`	Toggle Ingress	`bool`	`false`	no	no
`prometheus.ingress_host`	Hostname to use with Ingress (effective if `ingress_enabled` is `true`)	`string`		no	no
`prometheus.node_exporter_enabled`	Toggle Node exporter installation	`bool`	`true`	no	no
`prometheus.node_selector`	Node selector to place Prometheus pods in	`map(any)`	`{}`	no	no
`prometheus.retention_period`	Data retention period	`string`	`93d`	no	no
`prometheus.storage_class`	Storage class name	`string`		no	no
`prometheus.volume_size`	Volume data size	`string`	`100Gi`	no	no
`pushgateway`	Pushgateway parameters	`object`	`{}`	no	no
`pushgateway.app_version`	Pushgateway server version	`string`	`v1.4.3`	no	no
`pushgateway.enabled`	Toggle Pushgateway installation	`bool`	`false`	no	no
`pushgateway.ingress_auth_enabled`	Toggle Nginx basic auth (effective if `ingress_enabled` is `true`)	`bool`	`false`	no	no
`pushgateway.ingress_enabled`	Toggle Ingress	`bool`	`false`	no	no
`pushgateway.ingress_host`	Hostname to use with Ingress (effective if `ingress_enabled` is `true`)	`string`		no	no
`pushgateway.volume_size`	Volume data size	`string`	`2Gi`	no	no
`tempo`	Tempo parameters	`object`	`{}`	no	no
`tempo.app_version`	Tempo components version	`string`	`2.6.0`	no	no
`tempo.azure_remote_storage`	Azure blob storage for Tempo data	`object`		no	no
`tempo.azure_remote_storage.container_name`	Storage account container name	`string`		no	no
`tempo.azure_remote_storage.storage_account_key`	Storage account key	`string`		no	no
`tempo.azure_remote_storage.storage_account_name`	Storage account name	`string`		no	no
`tempo.chart_version`	Tempo distributed Helm chart version (compatible chart version is `1.0.0` and newer)	`string`	`1.20.0`	no	no
`tempo.custom_values`	Custom Helm chart values in key-value format: `"persistence.type" = "pvc"`	`map(any)`	`{}`	no	no
`tempo.enabled`	Toggle Tempo installation	`bool`	`false`	no	no
`tempo.ingress_auth_enabled`	Toggle Nginx basic auth (effective if `ingress_enabled` is `true`)	`bool`	`false`	no	no
`tempo.ingress_enabled`	Toggle Ingress	`bool`	`false`	no	no
`tempo.ingress_host`	Hostname to use with Ingress (effective if `ingress_enabled` is `true`)	`string`		no	no
`tempo.node_selector`	Node selector to place Tempo components in	`map(any)`	`{}`	no	no
`tempo.span_end_time_shift`	Shifts the end time for the logs query, based on the span's end time	`string`	`-1h`	no	no
`tempo.span_start_time_shift`	Shifts the start time for the logs query, based on the span's start time	`string`	`1h`	no	no
`tempo.traces_tags`	Define additional tags when provisioning traces to logs feature	`map(any)`	`{}`	no	no
`tempo.traces_to_logs`	Toggle traces to logs feature	`bool`	`false`	no	no
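The Tempo-related inputs from the table can be combined as in the sketch below, which enables Tempo with the traces to logs feature and Azure blob storage. The container name, storage account name, the `tempo_storage_account_key` variable and the module label are hypothetical placeholders; whether all three storage fields must be set together is an assumption:

```hcl
# Hypothetical variable that keeps the storage key out of the configuration
variable "tempo_storage_account_key" {
  type      = string
  sensitive = true
}

module "monitoring_stack" {
  source  = "solutions.corewide.com/kubernetes/tf-k8s-monitoring-stack/helm"
  version = "~> 4.0"

  grafana = {
    ingress_host = "testmon.example.com" # placeholder hostname
  }

  tempo = {
    enabled        = true # Tempo is disabled by default
    traces_to_logs = true # links traces to logs in the Loki datasource

    # Placeholder storage details, replace with your own
    azure_remote_storage = {
      container_name       = "tempo-traces"
      storage_account_name = "examplestorageaccount"
      storage_account_key  = var.tempo_storage_account_key
    }
  }
}
```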

Output	Description	Type	Sensitive
`basic_auth_credentials`	Ingress basic auth credentials	`computed`	yes
`grafana_admin_credentials`	Contains admin user credentials for Grafana web UI	`map`	yes
`ingress_hosts`	Ingress exposed hosts	`map`	no
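A short sketch of re-exporting these outputs from a root module, assuming the module is declared as `monitoring_stack` as in the examples. The output names on the left are arbitrary:

```hcl
output "monitoring_ingress_hosts" {
  description = "Hosts exposed via Ingress by the monitoring stack"
  value       = module.monitoring_stack.ingress_hosts
}

output "grafana_admin_credentials" {
  description = "Admin credentials for the Grafana web UI"
  value       = module.monitoring_stack.grafana_admin_credentials
  sensitive   = true # required because the module marks this output as sensitive
}
```

Sensitive values are masked in the plan and apply summaries; `terraform output -json grafana_admin_credentials` prints them when needed.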

Dependency	Version	Kind
`terraform`	`>= 1.3`	CLI
`hashicorp/helm`	`~> 2.9`	provider
`hashicorp/kubernetes`	`~> 2.22`	provider
`hashicorp/random`	`~> 3.3`	provider
`tf-k8s-crd`	`~> 2.0`	module
`tf-k8s-grafana`	`~> 1.1`	module