We have a machine running some stuff on Docker, and little by little it has become important to keep an eye on it. However, whenever I look for information on monitoring a Docker server, it always seems to assume you’re running in Swarm mode, which is not and WILL NOT be the case for this machine. Swarm adds a layer of complexity that isn’t needed here.
What do you recommend for this case? I for one would love if the thing didn’t just give you a view of the things running on it but also gave you notifications if something went wrong (like if a container had to be restarted, or if one suddenly started eating all the CPU or something unusual).
I will be keeping an eye on this thread to see what other people do, but what I have done in the past is to have a couple different health checking strategies.
- For web-accessible services I am running, I usually run something like Uptime Kuma or Gatus on a different box checking to make sure those web endpoints are available and performant. I lately have been really digging how Gatus can check more than just the response header, but also latency and certificate validity.
- For the host machine, you can set up custom alerts within Netdata for stuff like CPU utilization and memory, with custom thresholds. The only other solution I have used for this in the past is setting up alerts through my VPS provider (if it is a VPS, that is).
- On really low-spec machines I have had trouble with Netdata though, so I don’t have a good solution in those cases. There I have resorted to just using dashdot as a PWA so that I can check it easily on my phone if I am on the go. Interested to see if there are less demanding options.
- For some custom services in the past that run on set schedules, I have used healthchecks.io (which you can selfhost) to send alerts in the case that they don’t run for some reason.
- As for the containers being restarted, I actually don’t have experience with that, so I am interested to see what others have done.
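For the restart case, one option that doesn't need Swarm is to listen to `docker events`, which streams container lifecycle events from the local daemon. Below is a minimal sketch, assuming the JSON output of `docker events --format '{{json .}}'` with its `Type`, `Action` and `Actor.Attributes.name` fields. The set of actions to alert on and the `print` call standing in for a real notification are my own assumptions, not something from the posts above.

```python
import json
import subprocess

# Container actions treated as alert-worthy (an assumption; adjust to taste).
ALERT_ACTIONS = {"die", "oom", "restart", "health_status: unhealthy"}


def parse_event(line):
    """Return (container_name, action) for alert-worthy events, else None."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict) or event.get("Type") != "container":
        return None
    action = event.get("Action", "")
    if action not in ALERT_ACTIONS:
        return None
    name = event.get("Actor", {}).get("Attributes", {}).get("name", "unknown")
    return name, action


def watch():
    """Follow the local Docker event stream and report alert-worthy events."""
    cmd = ["docker", "events", "--filter", "type=container",
           "--format", "{{json .}}"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            alert = parse_event(line)
            if alert:
                # Placeholder: swap print for a mail/webhook notification.
                print(f"ALERT: container {alert[0]} reported {alert[1]}")
```

Calling `watch()` at the end of the script starts the loop. A container that is restarted by its restart policy typically shows up as a `die` event followed by a `start`, which is why `die` is in the list.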
Gatus sounds pretty cool, I’ll definitely give it a closer look later. Maybe it’s the push I needed to go ahead and look into proper observability as a whole, log ingestion and whatnot. My homelab setup is sorely lacking in that department if I’m being honest lol
Uptime Kuma for web monitoring.
I’m experimenting with both Zabbix and Netdata to see which one I want to keep for monitoring resources on my hosts.
I use healthchecks.io to monitor backup scripts and cronjobs.
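In case it helps, this is roughly what the wrapper around a backup script or cronjob can look like. It is a sketch that assumes the usual healthchecks.io ping scheme, where requesting the check's ping URL signals success and the same URL with `/fail` appended signals failure. The function names are made up for illustration.

```python
import subprocess
import sys
import urllib.request


def ping_url(base_url, exit_code):
    """Bare ping URL on success, the same URL plus '/fail' on failure."""
    base = base_url.rstrip("/")
    return base if exit_code == 0 else base + "/fail"


def run_and_report(command, base_url):
    """Run the job, then report its outcome to the ping URL."""
    exit_code = subprocess.run(command).returncode
    try:
        urllib.request.urlopen(ping_url(base_url, exit_code), timeout=10)
    except OSError as exc:
        # A failed ping should not hide the job's own exit code.
        print(f"could not reach the ping URL: {exc}", file=sys.stderr)
    return exit_code
```

Something like `run_and_report(["/usr/local/bin/backup.sh"], "https://hc-ping.com/<your-check-uuid>")` would then go in the cron entry's script, with both the path and the UUID being placeholders. If the job never runs at all, no ping arrives and the check goes overdue, which is what triggers the alert.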
I’m using Autoheal to restart containers that are in an unhealthy state. For some containers this means I need to write my own health check. I mostly did this to resolve a rare issue where Plex would lock up but it’s helped in other scenarios too.
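A custom health check is just a command that exits 0 when the container is healthy and 1 when it is not. Here is a minimal sketch, assuming the service answers HTTP on a local port. The URL is a hypothetical placeholder, not the real endpoint of Plex or any other app, and the container image needs Python available for this to work.

```python
import http.client
import sys
import urllib.request

# Hypothetical endpoint; replace with whatever the service actually exposes.
CHECK_URL = "http://localhost:8080/health"


def is_healthy(url, timeout=5.0):
    """True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # Refused connections, timeouts, HTTP errors and bad URLs all count
        # as unhealthy.
        return False


def exit_code(healthy):
    """Docker treats exit 0 as healthy and exit 1 as unhealthy."""
    return 0 if healthy else 1


def main():
    sys.exit(exit_code(is_healthy(CHECK_URL)))
```

With `main()` called at the bottom, the script can be wired in through a `HEALTHCHECK CMD` line in the Dockerfile or the `healthcheck` section of a compose file. A locked-up service stops answering, the request times out, and the container gets marked unhealthy so a tool like Autoheal can act on it.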
Have started experimenting with OpenTelemetry (https://opentelemetry.io/docs/what-is-opentelemetry/) to add observability to different parts of the stack running inside a Docker container.
I haven’t gotten far enough to recommend anything specific, but there is a big ecosystem of open source collectors and analytics tools out there.