Updating my homelab monitoring stack

When I started building my homelab, I deployed a very rudimentary monitoring stack consisting mostly of Uptime Kuma and Beszel. Eventually I opted for a more powerful stack and replaced Beszel with Grafana for dashboards, Prometheus as the time series database (TSDB) for metrics, and various exporters such as Node Exporter and cAdvisor to collect those metrics.

Lately, however, I felt the urge to improve on that. Until now I had shied away from making any modifications, and I just did not have the mental capacity to dive deep into a new topic. There is also always the question of whether I need something as sophisticated and professional as Grafana, Prometheus, etc. in my small personal homelab. Ultimately I decided it would be fun to learn something new.

So when I read an article about Alloy last week, I decided to adopt it for my homelab as well. With Alloy I can replace most of the exporters I was running to collect metrics. It comes bundled with a variety of integrations; Node Exporter and cAdvisor, for example, are already included. Something I want to look at in the future is replacing the PVE Exporter I have for my Proxmox host with the OpenTelemetry collector, as an integration exists as of PVE 9. I went ahead and replaced the exporters with Alloy across 6 different machines, which went mostly smoothly thanks to the great examples in the alloy-scenarios repository. Since I was already working on the stack, I decided to go even further:

  1. I added a mailcow exporter for my mail server to get better visibility into what's going on without logging into the admin interface.

  2. I added Loki as a database for logs, since collecting logs is pretty easy with Alloy as well and, after the latest changes, everything needed was already in place.

Since configuring Loki and Alloy correctly is a bit involved, here is a quick rundown of the configuration I currently have deployed:

Loki Configuration:

---
auth_enabled: false

server:
  http_listen_port: 3100

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: "7d"
  ingestion_rate_mb: 4
  ingestion_burst_size_mb: 6
  max_streams_per_user: 10000
  max_line_size: 256000
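
One caveat on the 7-day retention above: to my understanding, Loki only enforces retention_period when the compactor runs with retention enabled. Without it, old chunks simply stay on disk. A minimal sketch of the extra block, assuming a recent Loki 3.x release with the tsdb store as configured above (the working directory is my assumption and not part of the deployed config):

```yaml
# Assumed addition to the Loki configuration; not part of the original deployment.
compactor:
  working_directory: /loki/compactor   # assumed path below path_prefix
  retention_enabled: true              # apply limits_config.retention_period
  delete_request_store: filesystem     # expected by Loki 3.x once retention is enabled
```

Deletion is not instant. The compactor first marks expired chunks and removes them after a configurable delay, so disk usage drops a little later than the 7-day mark.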

Alloy Metrics (metrics.alloy):

// Collectors
prometheus.exporter.unix "integrations_node_exporter" {
  disable_collectors = ["ipvs", "btrfs", "infiniband", "xfs", "zfs"]
  enable_collectors = ["meminfo"]

  filesystem {
    fs_types_exclude     = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|tmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$"
    mount_points_exclude = "^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+)($|/)"
    mount_timeout        = "5s"
  }

  netclass {
    ignored_devices = "^(veth.*|cali.*|[a-f0-9]{15})$"
  }

  netdev {
    device_exclude = "^(veth.*|cali.*|[a-f0-9]{15})$"
  }
}

prometheus.exporter.cadvisor "integrations_cadvisor" {
  docker_only = true
}

// Relabeling

discovery.relabel "integrations_node_exporter" {
  targets = prometheus.exporter.unix.integrations_node_exporter.targets

  rule {
    target_label = "instance"
    replacement  = constants.hostname
  }

  rule {
    target_label = "job"
    replacement  = "integrations/node_exporter"
  }
}

discovery.relabel "integrations_cadvisor" {
  targets = prometheus.exporter.cadvisor.integrations_cadvisor.targets

  rule {
    target_label = "job"
    replacement  = "integrations/docker"
  }

  rule {
    target_label = "instance"
    replacement  = constants.hostname
  }
}

// Scrapers

prometheus.scrape "integrations_node_exporter" {
  scrape_interval = "30s"
  targets         = discovery.relabel.integrations_node_exporter.output
  forward_to      = [prometheus.remote_write.local.receiver]
}

prometheus.scrape "integrations_cadvisor" {
  scrape_interval = "30s"
  targets         = discovery.relabel.integrations_cadvisor.output
  forward_to      = [prometheus.remote_write.local.receiver]
}

// Targets

prometheus.remote_write "local" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}
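
The remote_write endpoint above only works if Prometheus accepts pushed samples. That receiver is off by default and is switched on with the --web.enable-remote-write-receiver flag. Here is a sketch of what the Prometheus service could look like in Docker Compose. The service name prometheus matches the URL above, while the image tag, volume paths and network membership are assumptions for illustration:

```yaml
# Hypothetical Prometheus service; image tag, paths and network are assumptions.
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    restart: unless-stopped
    command:
      # Overriding the command drops the image defaults, so repeat them here.
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      # Accept samples pushed to /api/v1/write by Alloy.
      - --web.enable-remote-write-receiver
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    networks:
      - socket-network

volumes:
  prometheus-data:

networks:
  socket-network:
    external: true
```

With the receiver enabled, Prometheus no longer needs scrape jobs for these hosts, because Alloy scrapes locally and pushes the results.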

Alloy Logs (logs.alloy):

loki.source.journal "logs_integrations_integrations_node_exporter_journal_scrape" {
  max_age       = "24h0m0s"
  relabel_rules = discovery.relabel.logs_integrations_integrations_node_exporter_journal_scrape.rules
  forward_to    = [loki.write.local.receiver]
  path          = "/var/log/journal"
  labels        = {component = string.format("%s-journal", constants.hostname)}
}

local.file_match "logs_integrations_integrations_node_exporter_direct_scrape" {
  path_targets = [
    {
      __address__ = "localhost",
      __path__    = "/var/log/{syslog,messages,*.log}",
      instance    = constants.hostname,
      job         = string.format("%s-system", constants.hostname),
    },
    {
      __address__ = "localhost",
      __path__    = "/var/log/tasks/**/*.log",
      instance    = constants.hostname,
      job         = string.format("%s-tasks", constants.hostname),
      log_type    = "tasks",
    },
    {
      __address__ = "localhost",
      __path__    = "/var/log/traefik/**/*.log",
      instance    = constants.hostname,
      job         = string.format("%s-traefik", constants.hostname),
      log_type    = "traefik",
    },
  ]
}

discovery.relabel "logs_integrations_integrations_node_exporter_journal_scrape" {
  targets = []

  rule {
    source_labels = ["__journal__systemd_unit"]
    target_label  = "unit"
  }

  rule {
    source_labels = ["__journal__boot_id"]
    target_label  = "boot_id"
  }

  rule {
    source_labels = ["__journal__hostname"]
    target_label  = "instance"
  }

  rule {
    source_labels = ["__journal__machine_id"]
    target_label  = "machine_id"
  }

  rule {
    source_labels = ["__journal__transport"]
    target_label  = "transport"
  }

  rule {
    source_labels = ["__journal_priority_keyword"]
    target_label  = "level"
  }
}

loki.source.file "logs_integrations_integrations_node_exporter_direct_scrape" {
  targets    = local.file_match.logs_integrations_integrations_node_exporter_direct_scrape.targets
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

Alloy docker compose configuration:

---
services:
  alloy:
    image: grafana/alloy:v1.14.2@sha256:eadfe35ea52b26cbec4d4d780fbcc31edb31108c1c9e537ca59557a5a102c712
    hostname: srv-prod-01
    container_name: alloy
    restart: unless-stopped
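    volumes:
      # Directory containing metrics.alloy and logs.alloy
      - /path/to/alloy/configs/directory:/etc/alloy/config
      # Host filesystem, sysfs and disks for the node exporter and cAdvisor integrations
      - "/:/rootfs:ro"
      - /var/run:/var/run:ro
      - "/sys:/sys:ro"
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
      # Log files and the systemd journal for the Loki sources
      - /var/log:/var/log
      - /run/log/journal:/run/log/journal:ro
      - /etc/machine-id:/etc/machine-id:ro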
    env_file:
      - .env
    networks:
      - socket-network
    command: run --server.http.listen-addr=0.0.0.0:12345 --storage.path=/var/lib/alloy/data /etc/alloy/config

networks:
  socket-network:
    external: true

Again, the alloy-scenarios repository was of tremendous help in coming up with the right configuration. You might also have noticed that I split the Alloy configuration into separate files for metrics and logs. It is possible to pass a directory instead of a single config file to Alloy, and it will automatically pick up all files with the .alloy extension. I think this makes the various config files much easier to read and modify. Once everything is set up correctly, you can start visualizing the data in your Grafana instance:

An image showing a Grafana dashboard with metrics for a Mailcow mail server.

An image showing the Grafana logs drilldown.
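
To make Prometheus and Loki show up in Grafana without clicking through the UI, the data sources can be provisioned from a YAML file. This is a minimal sketch, assuming Grafana can reach both services under the names used in the configs above. The file name and the data source names are placeholders of mine:

```yaml
# Assumed file: /etc/grafana/provisioning/datasources/homelab.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```

Grafana reads this directory at startup, so the data sources survive a rebuild of the container.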

Not only am I now able to look at my various dashboards and get a quick overview of the overall health of my systems, I can also dig into a problem further without having to grep log files on my different hosts. I can also enrich my dashboards with metrics aggregated from logs, for example to visualize failed SSH attempts or requests from the access log of my Traefik reverse proxy.

One thing still on my to-do list is to look at the Grafana Alert Manager integration. I already have alerting set up in my homelab through ntfy and mailrise, but now that logs are available in Grafana I might be able to extract even more valuable data from them and be notified immediately in case of an emergency.
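
As an illustration of what a log-based alert could look like, here is a sketch of a rule file for Loki's built-in ruler. Note that this is a different mechanism from Grafana-managed alerts, and that the ruler additionally needs an Alertmanager URL configured before it can send anything. The file location, rule name and threshold are all assumptions. The unit and instance labels come from the journal relabel rules above, and the unit name differs between distributions (ssh.service or sshd.service):

```yaml
# Hypothetical rule file, e.g. /loki/rules/fake/ssh.yaml
# ("fake" is the tenant Loki uses when auth_enabled is false).
groups:
  - name: ssh
    rules:
      - alert: ManyFailedSshLogins
        # Count "Failed password" journal lines per host over 10 minutes.
        expr: |
          sum by (instance) (
            count_over_time({unit=~"sshd?.service"} |= "Failed password" [10m])
          ) > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Many failed SSH logins on {{ $labels.instance }}"
```

The inner expression without the threshold comparison also works as a panel query in Grafana, which covers the failed SSH attempts visualization mentioned earlier.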

Currently I store logs for 7 days. I will monitor the disk usage over the coming days to get a sense of how much data I am producing and whether it's worth increasing or decreasing this retention limit. In conclusion, I am pretty happy with how things currently stand, and whenever the need comes up I will have all the required information readily at hand. Along the way I learned a bunch of new things, and I count that as the biggest win of all.
