ASC Resources¶
Power monitoring¶
Because ASC typically sets a power limit of 3,000 W, we need to monitor our servers' power draw during training and testing. Our monitoring consists of the following components.
Dashboard: https://monitor.acsalab.com/d/asc
Database and Visualization¶
For this part we reuse our existing monitoring infrastructure, namely InfluxDB and Grafana.
We create a separate user asc and a separate database asc in InfluxDB, and a dedicated Grafana dashboard named ASC.
IB switch adjustment
The APC PDU powers an InfiniBand switch in addition to the five compute servers. To account for the switch, the "Power" graph has an extra line that adds an adjustment (135 W by default for ASC 2023) to the raw power reading.
Whole-server power monitoring¶
We collect power data from the whole server through IPMI with iBug's wrapper daemon for ipmitool: ipmi-sdr.
This daemon can collect IPMI sensor readings from multiple servers at once, so we run a single instance on our monitor VM, inside a Docker container named ipmi-sdr.
FROM alpine:latest
RUN apk add --no-cache ipmitool
#!/bin/sh
NAME=ipmi-sdr
SRC="$(realpath "$(dirname "$0")")"
docker build -t ipmitool --network=host "$SRC"
docker rm -f "$NAME"
docker run -itd --name="$NAME" --restart=always \
--net=host \
-w /app \
-v "$SRC":/app:ro \
ipmitool \
/app/ipmi-sdr
For details on how this tool works, consult its README.
APC PDU current reading¶
Our enterprise-grade server PDU, produced by APC, provides current readings through SNMP. iBug wrote another Go daemon to continuously pull data into InfluxDB: apc-monitor.
Similarly, we run this daemon on our monitor VM, inside a Docker container named apc-monitor, but without a dedicated image (since it doesn't need ipmitool):
#!/bin/sh
NAME=apc-monitor
SRC="$(realpath "$(dirname "$0")")"
docker rm -f "$NAME"
docker run -itd --name="$NAME" --restart=always \
--net=host \
-w /app \
-v "$SRC":/app:ro \
alpine \
/app/apc-monitor
The "input current" reading has an SNMP OID of 1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1. To read it one-off from the shell, use snmpget:
$ snmpget -v2c -c public <IPaddress> 1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1
iso.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1 = Gauge32: 137
Note that the data is given as a 32-bit integer in units of 1/10 A (100 mA), so the above example represents a reading of 13.7 A.
Voltage
The datacenter's supply voltage is around 235 V, so the "APC power" in the Grafana dashboard is calculated as the current reading multiplied by 235 V.
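The conversion described above (a raw gauge in tenths of an ampere, multiplied by the nominal 235 V) can be sketched as a small shell helper. This is only an illustration under those stated figures, not how apc-monitor or the Grafana dashboard actually implement it; the function name apc_power_watts and its optional voltage argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of apc-monitor): convert a raw APC
# "input current" gauge value (unit: 0.1 A) into an estimated power in watts.
#   $1: raw gauge value, e.g. 137 for 13.7 A
#   $2: supply voltage in volts (optional; defaults to the nominal 235 V)
apc_power_watts() {
    awk -v raw="$1" -v volts="${2:-235}" \
        'BEGIN { printf "%.1f\n", raw * volts / 10 }'
}

# Usage with the example reading above:
#   apc_power_watts 137        # prints 3219.5
# Or with a live reading (-Oqv makes snmpget print only the value):
#   apc_power_watts "$(snmpget -v2c -c public -Oqv <IPaddress> 1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1)"
```

Because the voltage is only approximately 235 V, the result is an estimate; pass a second argument to try a different nominal voltage.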
NVIDIA GPU power monitoring¶
iBug wrote yet another wrapper for the command nvidia-smi dmon -s pm to collect power data from NVIDIA GPUs: nvidia-dmon.
Unlike the other two daemons, this one runs on each compute host, so it is currently deployed on icarus0-4 under the systemd service ibug-nvidia-dmon.service.
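To give a rough idea of what such a wrapper has to deal with, here is a minimal sketch that sums the power column from nvidia-smi dmon output. It is not the nvidia-dmon implementation; the function name sum_gpu_power is hypothetical, and the sketch assumes the usual dmon layout in which a "#"-prefixed header row names the columns (including one called pwr) and data rows follow in the same order.

```shell
#!/bin/sh
# Hypothetical helper (not the nvidia-dmon daemon): read `nvidia-smi dmon`
# output on stdin and print the summed "pwr" column in watts.
# Assumption: a "#"-prefixed header row names the columns, and data rows
# list their values in the same order as that header.
sum_gpu_power() {
    awk '
        /^#/ {
            # Header or units row: strip the leading "#" so field positions
            # line up with the data rows, then locate the pwr column.
            sub(/^#[ \t]*/, "")
            if ($1 == "gpu")
                for (i = 1; i <= NF; i++)
                    if ($i == "pwr") col = i
            next
        }
        # Data row: add the power value, skipping "-" (reading unavailable).
        col && $col ~ /^[0-9.]+$/ { total += $col }
        END { print total + 0 }
    '
}

# Usage: take a single sample (-c 1) so that the stream ends and awk
# reaches its END block:
#   nvidia-smi dmon -s pm -c 1 | sum_gpu_power
```

Looking the column up by name rather than by a fixed position keeps the sketch tolerant of drivers that add or reorder columns.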
Troubleshooting¶
Chinese characters under Linux VT¶
Install fbterm and a suitable font:
apt install fbterm fonts-noto-cjk
Then log in as root and start fbterm:
fbterm -s 16 # 16 is font size, change if you want
Chinese characters now display correctly in any program, e.g. Vim.
Note that fbterm is itself a terminal emulator, so it must be started again after every TTY login.
Debug info¶
For applications using Intel MPI, set these environment variables to produce debug output:
I_MPI_PLATFORM=auto
I_MPI_DEBUG=10
You may increase I_MPI_DEBUG to 20 if you need more detail.
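One way to apply these variables to a single run without exporting them in your login shell is a small wrapper function. The sketch below is only a suggestion: mpi_debug_run and MPI_DEBUG_LEVEL are hypothetical names, and the mpirun arguments in the usage comments are placeholders.

```shell
#!/bin/sh
# Hypothetical wrapper: run the given command with the Intel MPI debug
# variables set for that command only.
# MPI_DEBUG_LEVEL is an illustrative knob (default 10), not an Intel MPI variable.
mpi_debug_run() {
    I_MPI_PLATFORM=auto I_MPI_DEBUG="${MPI_DEBUG_LEVEL:-10}" "$@"
}

# Usage (the process count and ./your_app are placeholders):
#   mpi_debug_run mpirun -n 4 ./your_app
#   MPI_DEBUG_LEVEL=20 mpi_debug_run mpirun -n 4 ./your_app
```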