FAQ¶

General¶

Restoring files

If you accidentally delete a file from the compute servers, you may try retrieving it from our twice-a-week ZFS snapshots.

For details, see Snapshots.

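Language of this documentation?

This documentation was originally written in English. We later found that 99% of students cannot read English as fluently as they read Chinese, so we have gradually been switching the documentation back to Chinese (runs away)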

MPI¶

Specify backend compiler for mpicc and mpic++

For MPICH, set the environment variables MPICH_CC and MPICH_CXX. For example, if you want to use GCC 9 as backend:

export MPICH_CC=gcc-9 MPICH_CXX=g++-9
mpicc [...]

For Open MPI, the corresponding environment variables are OMPI_CC and OMPI_CXX:

export OMPI_CC=gcc-9 OMPI_CXX=g++-9
mpicc [...]
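
To confirm that the wrapper picked up the override, you can ask it to print the underlying command instead of running it. The sketch below assumes the MPICH wrapper option -show and the Open MPI wrapper option --showme, and it skips the check when mpicc is not on PATH. gcc-9 is only the example version used above.

```shell
# Set the backend for both implementations; each one ignores the other's variables.
export MPICH_CC=gcc-9 MPICH_CXX=g++-9
export OMPI_CC=gcc-9 OMPI_CXX=g++-9

# Print the command line the wrapper would run (nothing is compiled).
if command -v mpicc >/dev/null 2>&1; then
    mpicc -show 2>/dev/null || mpicc --showme || echo "could not query mpicc"
fi
```

The first word of the printed command should be the compiler you selected.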

CUDA¶

Specify exact libcudart version for linking

On systems with multiple CUDA versions installed, the linker by default selects the latest minor version (e.g. 11.8 for libcudart.so.11).

This version selection is done at runtime by the dynamic linker. To specify an exact version, use LD_LIBRARY_PATH. For example, to specify CUDA 11.3:

export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64
./my-app

Make sure to change the path to point to the actual CUDA path.
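
A slightly more careful variant, as a sketch: prepend the CUDA directory instead of overwriting LD_LIBRARY_PATH, so existing entries survive, then use ldd to see which libcudart the dynamic linker would pick. The path and the ./my-app binary are the example names from above, and the check is skipped when the binary does not exist.

```shell
# Example path from above; adjust it to the CUDA version you actually want.
cuda_lib=/usr/local/cuda-11.3/lib64

# Prepend, keeping any existing entries (no trailing ':' when the variable was empty).
export LD_LIBRARY_PATH="${cuda_lib}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Show which libcudart the dynamic linker resolves for the binary.
if [ -x ./my-app ]; then
    # nvcc links libcudart statically unless told otherwise, so no match is possible.
    ldd ./my-app | grep libcudart || echo "libcudart is not a dynamic dependency"
fi
```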

How to change my password?

Log in to any host and run passwd. You will be prompted for your current password, then your new password twice.

LDAP users normally have no password when the account is created, since lab servers now accept SSH key login only. If you need one (e.g. to change your default shell or to use the OpenVPN service), ask an admin to set an initial password for you, then change it yourself afterwards.

Your new password is effective on ALL hosts, including our Synology DiskStation.

The change may require up to 10 minutes to propagate to other hosts due to nscd caching. This does not apply to the Synology DiskStation and our OpenVPN service (which runs on Synology DS).
For non-LDAP users, your new password is specific to the host you performed the password change on.
Note that our Kunpeng 920 nodes (brainiac1 and brainiac2) are not currently enrolled into LDAP.

If you forget your password, an administrator can reset it for you.

How to change my login shell?

LDAP users: Use chsh.ldap on any Ubuntu 22.04 host. This command is broken on Ubuntu 20.04 hosts (because of upstream bugs).

Non-LDAP users: Use chsh.
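
Both answers above depend on whether your account lives in LDAP. If you are unsure, a rough check is to see whether the account is listed in the host's own /etc/passwd. The helper name account_source is hypothetical, and "directory" only means that the account came from some source other than /etc/passwd, which on these hosts is presumably LDAP.

```shell
# Classify an account by where the host finds it.
account_source() {
    if awk -F: -v u="$1" '$1 == u { found = 1 } END { exit !found }' /etc/passwd; then
        echo local        # defined in this host's /etc/passwd
    elif getent passwd "$1" >/dev/null 2>&1; then
        echo directory    # resolved through another NSS source (assumed: LDAP)
    else
        echo unknown
    fi
}

account_source "$(id -un)"
```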

Public key is registered in authorized_keys but still can't log in

Ask an administrator to inspect /var/log/auth.log. One of the lines should contain this:

userauth_pubkey: key type ssh-rsa not in PubkeyAcceptedAlgorithms [preauth]

Client fix: Update your SSH client software.

Known incompatible clients include OpenSSH (8.2 and older) and Xshell (all versions).

Server fix: Add PubkeyAcceptedAlgorithms +ssh-rsa to /etc/ssh/sshd_config.d/acsa.conf and reload the SSH service.

This is because OpenSSH, starting with version 8.8, disables the old, insecure ssh-rsa signature algorithm (which uses SHA-1 hashes) by default. This algorithm is not the same thing as the ssh-rsa key type. (Confusingly, the keyword ssh-rsa refers to several similar things in SSH.)

Other key types are not affected (ECDSA and Ed25519 keys).
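
To see whether any of your own keys could be involved, you can look at the key type stored in each public key file. This is a minimal sketch: pubkey_type is a hypothetical helper, and it assumes your keys use the default id_*.pub names under ~/.ssh.

```shell
# Print the key type (the first field) of an OpenSSH public key file.
pubkey_type() {
    kind=
    read -r kind _rest < "$1" || [ -n "$kind" ] || return 1
    printf '%s\n' "$kind"
}

for f in "$HOME"/.ssh/id_*.pub; do
    [ -f "$f" ] || continue
    if [ "$(pubkey_type "$f")" = "ssh-rsa" ]; then
        echo "$f: RSA key, affected if the client still signs with SHA-1"
    else
        echo "$f: not affected"
    fi
done
```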

References: 1, 2

The scp command does not copy any file and gives no output

scp uses SSH only as a transport layer, and transmits data in its own representation. If your .bashrc or .zshrc is invoked on non-interactive sessions (which it shouldn't), it produces output that interferes with scp's data stream, causing scp to fail.

Fix: Check your .bashrc. Make sure it starts with [[ $- == *i* ]] || return or something similar. Prepend your .bashrc with that line if it doesn't.
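
The effect of the guard can be tried out without touching your real .bashrc. The sketch below writes a throwaway rc file (a stand-in, not your actual configuration) that uses the portable case form of the same check, then sources it from a non-interactive shell and counts the characters it printed.

```shell
# Hypothetical stand-in for ~/.bashrc: guard first, then a line that prints.
rc=$(mktemp)
cat > "$rc" <<'EOF'
case $- in
    *i*) ;;        # interactive shell: keep going
    *) return ;;   # non-interactive shell: stop before any output
esac
echo "welcome banner"
EOF

# Source it from a non-interactive shell and capture whatever it prints.
out=$(sh -c '. "$1"' _ "$rc")
echo "characters printed: ${#out}"
rm -f "$rc"
```

With the guard in place the count is 0, which is what scp needs.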

NFS Mounting¶

Just create an appropriate /etc/fstab entry. Example:

192.0.2.1:/home /home nfs4 soft,nofail 0 0

The soft and nofail mount options are important so as to prevent system halting when NFS mounts fail.

Do not try to work around this. Systemd will automatically handle the ordering and dependencies of NFS mounts and, in most cases, systemd is smarter than your crafted mount-acsa-nfs.service or what have you.
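
As a quick sanity check, the following sketch lists NFS mount points in an fstab file that lack either soft or nofail. The function name nfs_missing_opts is hypothetical, and the check only looks at the fourth field of entries whose type starts with nfs.

```shell
# Print the mount point of every NFS entry that lacks soft or nofail.
# Returns non-zero when at least one such entry exists.
nfs_missing_opts() {
    awk '
        $1 !~ /^#/ && $3 ~ /^nfs/ {
            soft = 0; nofail = 0
            n = split($4, opt, ",")
            for (i = 1; i <= n; i++) {
                if (opt[i] == "soft")   soft = 1
                if (opt[i] == "nofail") nofail = 1
            }
            if (!soft || !nofail) { print $2; bad = 1 }
        }
        END { exit bad + 0 }
    ' "$1"
}

if [ -r /etc/fstab ]; then
    nfs_missing_opts /etc/fstab || echo "add soft,nofail to the mount points listed above"
fi
```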

NFS mount appears laggy on one host but not others¶

If the mount point works normally on other hosts, then the problem most likely lies with that particular host.

Try dropping the system's local directory cache (as root):

echo 3 > /proc/sys/vm/drop_caches

NFS-over-RDMA kernel module¶

Install the latest MLNX_OFED package. Sometimes the mlnx-nfsrdma-dkms package isn't included in a downloaded ISO or TGZ file but don't worry, it can still be retrieved from https://linux.mellanox.com/public/repo/mlnx_ofed/latest/(version)/(arch)/mlnx-nfsrdma-dkms_*.deb. Just download the file and install. Make sure the version of mlnx-nfsrdma-dkms matches that of mlnx-ofed-kernel-dkms or the DKMS build will fail.
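
One way to compare the two versions is to ask dpkg for both packages side by side. This is a sketch: ofed_dkms_versions is a hypothetical helper, and on a host without dpkg or without the packages it simply reports them as not installed.

```shell
# Print the installed version of both DKMS packages, one per line.
ofed_dkms_versions() {
    for pkg in mlnx-nfsrdma-dkms mlnx-ofed-kernel-dkms; do
        dpkg-query -W -f='${Package} ${Version}\n' "$pkg" 2>/dev/null \
            || echo "$pkg (not installed)"
    done
}

ofed_dkms_versions
```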

After successful installation, try modprobe svcrdma and modprobe xprtrdma. Both should produce no output and an exit code of 0 (success).

In case of unsuccessful installation, follow the logs as directed by DKMS, e.g. /var/lib/dkms/mlnx-ofed-kernel/5.7/build/make.log. Identify where the build failed and try to fix it.

Log keyword Module*.symvers

On Ubuntu and other Debian-derived distributions, the Module.symvers file is available at

/usr/src/linux-headers-$(uname -r)/Module.symvers

Just copy this file into /usr/src/<mlnx-ofed directory>.

Log keyword __PEDIT_CMD_MAX

Context

In file included from /var/lib/dkms/mlnx-ofed-kernel/5.7/build/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.c:7:
/var/lib/dkms/mlnx-ofed-kernel/5.7/build/drivers/net/ethernet/mellanox/mlx5/core/en/tc_priv.h:41:35: error: ‘__PEDIT_CMD_MAX’ undeclared here (not in a function); did you mean ‘__DEVLINK_CMD_MAX’?
   41 |  struct pedit_headers_action hdrs[__PEDIT_CMD_MAX];
      |                                   ^~~~~~~~~~~~~~~
      |                                   __DEVLINK_CMD_MAX
make[3]: *** [scripts/Makefile.build:297: /var/lib/dkms/mlnx-ofed-kernel/5.7/build/drivers/net/ethernet/mellanox/mlx5/core/en/tc/meter.o] Error 1

Diagnosis: With help from Google we can learn that __PEDIT_CMD_MAX is defined in <kernel header>/include/uapi/linux/tc_act/tc_pedit.h.

Solution: Add #include <linux/tc_act/tc_pedit.h> to <mlnx-ofed source>/drivers/net/ethernet/mellanox/mlx5/core/en_tc.h.

MLNX OFED suite download

URL format:

https://content.mellanox.com/ofed/MLNX_OFED-<version>/MLNX_OFED_LINUX-<version>-<distro version>-<arch>.tgz

For example, https://content.mellanox.com/ofed/MLNX_OFED-23.10-1.1.9.0/MLNX_OFED_LINUX-23.10-1.1.9.0-ubuntu22.04-x86_64.tgz
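
The same URL can be assembled from its parts, which is handy when scripting installs for several hosts. The variable names below are illustrative, and the values are the ones from the example above.

```shell
# Components of the download URL (values taken from the example above).
version=23.10-1.1.9.0
distro=ubuntu22.04
arch=x86_64

url="https://content.mellanox.com/ofed/MLNX_OFED-${version}/MLNX_OFED_LINUX-${version}-${distro}-${arch}.tgz"
echo "$url"

# To download it, something like: curl -fLO "$url"
```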

NFS over RDMA: Protocol error¶

mount.nfs4: Protocol error

Try adding -v to the mount command and look for the lines containing "trying text-based options". Double-check that the values make sense.

Also examine /etc/netplan/*.yaml and the output of ip a. Fixing any discrepancy should help.
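
The option string is easier to check when printed one option per line. The sketch below does this for a sample line copied from a mount -a -v log elsewhere in this FAQ. split_mount_opts is a hypothetical helper.

```shell
# Sample line copied from a `mount -a -v` log elsewhere in this FAQ.
line="mount.nfs4: trying text-based options 'soft,proto=rdma,port=2050,vers=4.2,addr=10.1.13.1,clientaddr=10.1.13.59'"

# Keep only the text between the single quotes, then print one option per line.
split_mount_opts() {
    opts=${1#*\'}
    opts=${opts%\'}
    printf '%s\n' "$opts" | tr ',' '\n'
}

split_mount_opts "$line"
```

Here addr is the server's address and clientaddr is this host's own address, so these are the two values to compare against the netplan configuration and ip a.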

NFS mount.nfs4: an incorrect mount option was specified¶

The NFS mount fails with the following log:

root@icarus0:~# mount -a -v
mount.nfs4: timeout set for Tue Jan 23 14:46:45 2024
mount.nfs4: trying text-based options 'soft,proto=rdma,port=2050,vers=4.2,addr=10.1.13.1,clientaddr=10.1.13.59'
mount.nfs4: mount(2): Invalid argument
mount.nfs4: trying text-based options 'soft,proto=rdma,port=2050,vers=4,minorversion=1,addr=10.1.13.1,clientaddr=10.1.13.59'
mount.nfs4: mount(2): Invalid argument
mount.nfs4: trying text-based options 'soft,proto=rdma,port=2050,vers=4,addr=10.1.13.1,clientaddr=10.1.13.59'
mount.nfs4: mount(2): Invalid argument
mount.nfs4: an incorrect mount option was specified

The Mellanox driver is probably incompatible with the new kernel. Try installing a newer Mellanox driver. See Install InfiniBand drivers.

SSH Troubleshooting¶

Client side¶

You can try ssh -v to learn what SSH is doing. The output includes which public key files are available and which are rejected for what reason, and this is usually enough to identify the problem.

One -v is enough

It's very rare that debug1 doesn't provide enough information. Extra verbosity from -vv and -vvv is only really useful to OpenSSH developers, or if you have modified the OpenSSH source code.

If ssh -v doesn't reveal the problem (or you believe something's wrong on the other side), ask an admin to investigate the server.

Server side¶

Edit /etc/ssh/sshd_config, change LogLevel to DEBUG1 (default INFO), and reload the SSH service. Ask the user to make another login attempt, then check either journalctl -eu ssh or /var/log/auth.log.

Don't forget to restore LogLevel afterwards, as DEBUG1 tends to bloat the system log.

Common problems¶

Permission denied (publickey): If you've confirmed that your public key is valid but still get this error, it's likely that the NFS mount on the server is broken. Ask an administrator to fix it.

InfiniBand cards¶

On the NFS server, the IB interface may show NO-CARRIER even if it's otherwise fine. This should be solved by starting the opensmd service somewhere on the network (not necessarily on the host with the problem).

Permanent fix

OpenSM only needs to run once somewhere in the network, so we're running it on the NFS server. No other servers need to run OpenSM.

Miscellaneous quirks¶

NVIDIA drivers

The NVIDIA driver should preferably be installed from the Ubuntu repository, i.e. apt install linux-modules-nvidia-xxx-generic (or on older OS, nvidia-driver-xxx). This is both easier and more reliable than installing from NVIDIA's official installer, particularly across kernel and OS upgrades.

Mellanox InfiniBand drivers

After upgrading the kernel to 5.15.0-83-generic, the InfiniBand DKMS driver failed to build. This is because the Mellanox driver is not compatible with the new kernel.

The DKMS build log indicated an unknown field xpo_release_rqst for struct svc_xprt_ops. We inspected the source and found a breaking change in the upstream kernel (compare svc_xprt.h between v5.15.112 and v5.15.113).

The solution is to revert to the old kernel 5.15.0-79-generic and abstain from upgrading until the Mellanox driver is updated.

Server going unresponsive after a while

If the server has a desktop environment installed, chances are that "Automatic Suspend" is still enabled (it is turned on by default). The desktop manager handles these ACPI triggers. Disabling them can fix the problem, but they may be re-enabled automatically after a software upgrade.

Fix: Log in to the graphical interface via IPMI/KVM, open the Settings app, select Power on the left, and turn off Automatic Suspend.

Also disable sleep-related systemd targets:

systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

Permanent fix: Change the system boot target. Use systemctl get-default to see the current setting, then change graphical.target to multi-user.target by executing systemctl set-default multi-user.target. The system should then no longer start the desktop manager after a reboot.

CUDA nvcc: unsupported GNU version! gcc [*] and up are not supported!

Install a compatible GCC version (e.g. for CUDA 11.3, install GCC 9 or 10).

Then head to the CUDA directory (e.g. /usr/local/cuda-11.3/bin) and symlink the desired GCC version to gcc and g++:

cd /usr/local/cuda-11.3/bin
sudo ln -s /usr/bin/gcc-9 gcc
sudo ln -s /usr/bin/g++-9 g++

Ref: Stack Overflow
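
If you would rather not modify the CUDA directory, a per-invocation alternative is nvcc's -ccbin option, which names the host compiler explicitly. This is a sketch, not the method described above: the wrapper name nvcc_gcc9 and the file kernel.cu are hypothetical, and /usr/bin/g++-9 assumes GCC 9 is installed at that path.

```shell
# Wrapper that always passes GCC 9 as the host compiler to nvcc.
nvcc_gcc9() {
    nvcc -ccbin /usr/bin/g++-9 "$@"
}

# Example use, only when nvcc and a source file are actually present.
if command -v nvcc >/dev/null 2>&1 && [ -f kernel.cu ]; then
    nvcc_gcc9 -o kernel kernel.cu
fi
```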