Monitoring infrastructure¶
Tip
If you're looking for the dashboard, it's at monitor.acsalab.com.
Our monitoring infrastructure uses a typical suite consisting of:
Component | Function | Where |
---|---|---|
InfluxDB | Storing metrics | monitor VM |
Telegraf | Collecting metrics | Each monitored machine |
Grafana | Visualization | monitor VM, inside a Docker container |
MariaDB | Database for Grafana | monitor VM |
Monitor VM¶
The monitor VM is a Debian VM, with rootfs and InfluxDB data directory on two separate volumes:
- rootfs: 4 GB
- InfluxDB data: 32 GB (may increase as needed), mounted on /mnt/data
  - /var/lib/mysql is symlinked to /mnt/data/mysql
  - /var/lib/influxdb is symlinked to /mnt/data/influxdb
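For reference, a layout like this can be produced by moving a data directory onto the data volume and leaving a symlink behind. The sketch below is an assumption about how that could be done, not a record of the original setup. `relocate_dir` is a hypothetical helper name, and the owning service should be stopped before running it.

```shell
#!/bin/sh
# Hypothetical helper: move a data directory onto the data volume
# and leave a symlink at the original path.
relocate_dir() {
    src=$1   # e.g. /var/lib/influxdb
    dst=$2   # e.g. /mnt/data/influxdb
    if [ -L "$src" ]; then
        return 0                      # already a symlink, nothing to do
    fi
    if [ -e "$dst" ]; then
        echo "refusing to overwrite existing $dst" >&2
        return 1
    fi
    if [ -d "$src" ]; then
        mv "$src" "$dst"              # keep any existing data
    else
        mkdir -p "$dst"               # fresh install: start with an empty directory
    fi
    ln -s "$dst" "$src"
}

# Example (with the service stopped):
# relocate_dir /var/lib/influxdb /mnt/data/influxdb
```

Running the helper a second time is harmless, because it returns early when the source path is already a symlink.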
Note
The following steps document the installation procedure. They do not need to be repeated, but may be of interest to maintainers.
Install InfluxDB¶
InfluxDB 1.8 is installed from the official APT repository. Note that we use InfluxDB v1 for better compatibility with other components.
wget -O /etc/apt/trusted.gpg.d/influxdb.asc https://mirrors.ustc.edu.cn/influxdata/influxdata-archive_compat.key
echo "deb https://mirrors.ustc.edu.cn/influxdata/debian bullseye stable" > /etc/apt/sources.list.d/influxdb.list
apt update
apt install --no-install-recommends influxdb
Notes on the commands:
- apt recognizes GPG keys in armored format (i.e. ASCII), but the file name must end with .asc.
- At the time of writing, InfluxData has not introduced the bookworm distribution yet, so we use the bullseye repository instead.
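To find out whether a newer distribution has since been published, one option is to probe the mirror for that distribution's Release file. This is a sketch that assumes the mirror follows the standard APT layout (`dists/<codename>/Release`). `influx_repo_has_dist` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the InfluxData mirror publishes the
# given distribution. Assumes the standard dists/<codename>/Release layout.
influx_repo_has_dist() {
    base=https://mirrors.ustc.edu.cn/influxdata/debian
    # -f: fail on HTTP errors, -s: quiet, -I: request headers only
    curl -fsI -o /dev/null "$base/dists/$1/Release"
}

# Example:
# influx_repo_has_dist bookworm && echo "bookworm is available"
```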
Then we create the databases and users as needed:
CREATE USER admin WITH PASSWORD 'redacted' WITH ALL PRIVILEGES;
CREATE DATABASE monitor;
CREATE USER telegraf WITH PASSWORD 'redacted';
CREATE USER grafana WITH PASSWORD 'redacted';
GRANT WRITE ON monitor TO telegraf;
GRANT READ ON monitor TO grafana;
Then we enable authentication (which is off by default) by uncommenting the auth-enabled line in /etc/influxdb/influxdb.conf and setting it to auth-enabled = true, followed by systemctl restart influxdb.
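The same edit can be scripted. The sketch below assumes GNU sed (as shipped with Debian) and that the setting appears as an `auth-enabled = ...` line, commented out or not. `enable_influx_auth` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: turn on HTTP authentication in an InfluxDB 1.x config file.
enable_influx_auth() {
    conf=${1:-/etc/influxdb/influxdb.conf}
    # Match "auth-enabled = ..." with or without a leading "#" and keep the indentation.
    # Keys such as "ping-auth-enabled" are left alone, because the pattern is
    # anchored at the start of the line.
    sed -i -E 's/^([[:space:]]*)#?[[:space:]]*auth-enabled[[:space:]]*=.*$/\1auth-enabled = true/' "$conf"
    # Show the result for a quick visual check.
    grep -n 'auth-enabled' "$conf"
}

# Example:
# enable_influx_auth && systemctl restart influxdb
```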
In case we need to recover the admin user, change auth-enabled = false in influxdb.conf and restart InfluxDB. Don't forget to re-enable authentication after recovery work.
«Admin mode» script
I left a script at /root/run-influx.sh to log in as the admin user conveniently. Its content is as simple as it needs to be:
#!/bin/sh
exec influx -username admin -password redacted
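With admin credentials at hand, the grants created earlier can be double-checked from the command line. This is a minimal sketch: `show_monitor_grants` and the `INFLUX_ADMIN_PASSWORD` environment variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the grants of the two non-admin users.
# INFLUX_ADMIN_PASSWORD is an assumed environment variable holding the admin password.
show_monitor_grants() {
    for user in telegraf grafana; do
        echo "== $user =="
        influx -username admin -password "$INFLUX_ADMIN_PASSWORD" \
            -execute "SHOW GRANTS FOR $user"
    done
}

# Example:
# INFLUX_ADMIN_PASSWORD=... show_monitor_grants
```

If the setup above was applied, telegraf should show WRITE and grafana READ on the monitor database.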
Install MariaDB¶
MariaDB installation is easier, as Debian provides the package mariadb-server in the official repository.
apt install --no-install-recommends mariadb-server
Then run the following in the mysql shell:
CREATE DATABASE grafana;
GRANT ALL PRIVILEGES ON grafana.* TO 'grafana'@'127.0.0.1' IDENTIFIED BY 'redacted';
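A quick way to confirm that the new account works is to connect as it over TCP, which is how Grafana will connect. This is a sketch, and `check_grafana_db` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: log in as the grafana user and report the matched account.
check_grafana_db() {
    # -h 127.0.0.1 forces a TCP connection, matching the host part of the grant.
    # -p without a value makes the client prompt for the password.
    mysql -h 127.0.0.1 -u grafana -p -e 'SELECT CURRENT_USER()' grafana
}

# Example:
# check_grafana_db
```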
Install Grafana¶
Grafana, however, is installed and managed with Docker.
First we create the config file:
[server]
protocol = http
http_addr = 127.0.0.1
http_port = 3000
root_url = https://monitor.acsalab.com/
[database]
url = mysql://grafana:grafana@127.0.0.1:3306/grafana
ssl_mode = false
[analytics]
reporting_enabled = false
[security]
cookie_secure = true
[auth.anonymous]
enabled = true
org_name = ACSA
[log]
mode = console
level = debug
Then we launch the Docker container with a small script, kept at /root/docker-grafana.sh:
#!/bin/sh
NAME=grafana
docker rm -f "$NAME"
docker run -itd \
--name="$NAME" \
--restart=always \
--net=host \
-v /srv/grafana:/etc/grafana:ro \
grafana/grafana:latest
Updating Grafana
Because we store Grafana data in MariaDB, the container can be destroyed and recreated at any time without data loss. This makes updating Grafana very easy: Just pull the latest image and recreate the container.
docker pull grafana/grafana:latest
/root/docker-grafana.sh
# optional
docker image prune -f
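After recreating the container, it can be useful to wait until Grafana answers before declaring the update done. The sketch below polls Grafana's `/api/health` endpoint on the address from the config file above. `wait_for_grafana`, the retry count and the one-second delay are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: poll Grafana's health endpoint until it answers or we give up.
wait_for_grafana() {
    url=${1:-http://127.0.0.1:3000/api/health}
    tries=${2:-30}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors, so a non-2xx answer counts as "not ready".
        if curl -fsS -o /dev/null "$url"; then
            echo "Grafana is up"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "Grafana did not become healthy after $tries attempts" >&2
    return 1
}

# Example:
# /root/docker-grafana.sh && wait_for_grafana
```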
Install Cloudflared¶
Cloudflared is used to expose Grafana to the Internet. It is installed from the official APT repository. For brevity, the rest of the steps are omitted; just follow the official guide.
monitor.acsalab.com is the public URL for Grafana.
Configure Grafana¶
Grafana needs to be configured in two parts: the data source and the dashboard.
Add the InfluxDB datasource:
- Go to Connections → Add new connection → Select InfluxDB from datasources
- Fill in information as required. Use the grafana user for InfluxDB, and set the HTTP method to POST
Create a dashboard:
- Go to Dashboard → New dashboard → Import dashboard
- Enter 92820268 so we'll import this one, then press Load
- Enter a suitable UID, which will become the URL segment as in https://monitor.acsalab.com/d/UID
Clients¶
Telegraf is the collector agent to be installed on monitored machines. It is also a project from InfluxData so it shares the same APT repository as InfluxDB.
wget -O /etc/apt/trusted.gpg.d/influxdb.asc https://mirrors.ustc.edu.cn/influxdata/influxdata-archive_compat.key
echo "deb https://mirrors.ustc.edu.cn/influxdata/debian bullseye stable" > /etc/apt/sources.list.d/influxdb.list
apt update
apt install --no-install-recommends telegraf
The only difference from installing InfluxDB is that we install the telegraf package. All other commands remain identical.
We need to add our custom configuration file set, which is stored in the GitHub repository ACSAlab/telegraf-config. To apply the configuration from the repository:
- Clear the default configuration file /etc/telegraf/telegraf.conf. Use truncate -s 0 or :> (if you know what this does) to clear the file content without deleting it.
- Clone the repository to /etc/telegraf/repo. You can configure a Deploy Key for the repository (more on that later) for convenience.
- Look at the files in the repository, and symlink the ones you need to /etc/telegraf/telegraf.d.
  - Every host should include base.conf, disk-default.conf and influxdb-acsa.conf.
  - The NFS server should additionally include disk-nfs.conf.
  - Any GPU server should additionally include nvidia.conf.
  - Example steps:
        cd /etc/telegraf/telegraf.d
        ln -sf ../repo/{base,disk-default,influxdb-acsa}.conf .
- Restart Telegraf with systemctl restart telegraf.
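The symlink step can also be wrapped in a small function that refuses to link files missing from the repository. This is a sketch: `link_telegraf_confs` is a hypothetical helper name. Unlike the relative links in the example steps, it creates absolute symlinks.

```shell
#!/bin/sh
# Hypothetical helper: symlink the named config fragments from the cloned
# repository into telegraf.d. Usage: link_telegraf_confs REPO DEST NAME...
link_telegraf_confs() {
    repo=$1
    dest=$2
    shift 2
    for name in "$@"; do
        if [ ! -f "$repo/$name.conf" ]; then
            echo "missing: $repo/$name.conf" >&2
            return 1
        fi
        # -f replaces an existing link, so the function can be re-run safely.
        ln -sf "$repo/$name.conf" "$dest/$name.conf"
    done
}

# Example for a GPU server:
# link_telegraf_confs /etc/telegraf/repo /etc/telegraf/telegraf.d \
#     base disk-default influxdb-acsa nvidia
```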
You should now see stats from the host in Grafana after a refresh.
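If nothing shows up, Telegraf's test mode helps separate configuration problems from connectivity problems. With `--test`, Telegraf gathers metrics once, prints them to stdout and does not send them to any output. The sketch below uses the paths from this page, and `telegraf_dry_run` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: collect metrics once and print them, without writing to InfluxDB.
telegraf_dry_run() {
    telegraf --config /etc/telegraf/telegraf.conf \
        --config-directory /etc/telegraf/telegraf.d \
        --test
}

# Example:
# telegraf_dry_run | head
```

If the dry run prints metrics but Grafana stays empty, the output side (credentials or network access to InfluxDB) is the more likely culprit.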
ASC¶
There are a few Docker containers running monitoring software for ASC. See ASC for more details.