Server Maintenance Log: Post-Power Restoration Issue Analysis
Following the resumption of power after campus-wide circuit maintenance, several server-related issues emerged during system checks. Below is a detailed breakdown of the problems encountered and their resolutions.
1. Intermittent Server Monitoring Failures
Symptoms
The monitoring system (which polls GPU status via nvidia-smi every 15 seconds) frequently timed out.
Root Cause
NVIDIA GPUs default to non-persistent mode: when no client holds the driver open, GPU state is torn down, so each nvidia-smi call on an idle machine pays a re-initialization cost. The resulting latency spike caused the monitoring script to time out.
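The setting and its latency impact can be confirmed directly; the query fields below are standard nvidia-smi options, though output formatting may vary slightly across driver versions.

# Report whether persistence mode is currently enabled (one line per GPU)
nvidia-smi --query-gpu=persistence_mode --format=csv

# Time a single status query; multi-second latency on an idle GPU is
# consistent with re-initialization overhead in non-persistent mode
time nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader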
Solutions
Temporary Persistence Mode Activation
sudo nvidia-smi --persistence-mode=1
Permanent Daemon Configuration
sudo systemctl enable --now nvidia-persistenced.service
Reference: NVIDIA/Tips and tricks - ArchWiki
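For context, the kind of poll the monitoring system performs can be approximated with the sketch below, using an explicit timeout so a slow nvidia-smi is logged rather than hanging the check. The 10-second limit and the query fields are illustrative assumptions, not the production configuration.

#!/usr/bin/env bash
# Hypothetical monitoring poll: query GPU status every 15 s with a hard timeout.
INTERVAL=15   # polling period described above
LIMIT=10      # assumed per-query timeout, kept shorter than the interval

while true; do
    if ! timeout "$LIMIT" nvidia-smi --query-gpu=name,utilization.gpu,temperature.gpu --format=csv,noheader; then
        echo "$(date -Is) nvidia-smi timed out or failed" >&2
    fi
    sleep "$INTERVAL"
done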
2. Unstable SSH Connectivity
Symptoms
SSH connections intermittently failed despite successful pings, occasionally triggering host key mismatch warnings.
Diagnosis
- Unauthorized devices from other departments were connected to the same switch.
- IP conflicts caused SSH requests to route to incorrect hosts (a quick check is sketched below).
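One way to confirm such a conflict is duplicate address detection with arping, which reports any other host already answering for the address; the interface name and IP below are placeholders.

# A reply in -D (duplicate address detection) mode means another host owns the IP.
sudo arping -D -I eth0 -c 3 10.0.0.42

# Cross-check which MAC the local ARP cache currently maps the address to.
ip neigh show 10.0.0.42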
Immediate Actions
- Physically inspected switch ports and removed unauthorized devices.
- Enforced MAC address binding and revised IP allocation policies (a sample static DHCP reservation follows).
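Binding can also be enforced at the DHCP layer so an address is only ever leased to the registered NIC. The excerpt below is a minimal ISC dhcpd reservation with a hypothetical host name, MAC, and address; the switch-side port security configuration is vendor-specific and omitted.

# /etc/dhcp/dhcpd.conf (excerpt) -- hypothetical reservation
host gpu-node-01 {
    hardware ethernet aa:bb:cc:dd:ee:01;  # placeholder MAC of the server NIC
    fixed-address 10.0.0.42;              # address leased only to this MAC
}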
Long-Term Proposal
Implementing centralized SSH certificate signing for server authentication (deferred due to the complexity of rolling it out across all users).
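For the record, the proposal amounts to signing each server's host key with an internal CA so clients trust the CA once instead of accumulating individual host keys. A minimal OpenSSH sketch, with hypothetical file names and domain, looks like this:

# One-time: create the host CA key pair (store ca_host_key offline/securely).
ssh-keygen -t ed25519 -f ca_host_key -C "internal host CA"

# Sign an existing server host key; the identity and hostname are placeholders.
ssh-keygen -s ca_host_key -I gpu-node-01 -h -n gpu-node-01.example.edu /etc/ssh/ssh_host_ed25519_key.pub

# On each server, advertise the certificate in sshd_config and reload sshd:
#   HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub

# On clients, trust the CA for the server domain via known_hosts:
#   @cert-authority *.example.edu <contents of ca_host_key.pub>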
3. Post-Reboot NVIDIA Driver Failure
Symptoms
GPU drivers failed to load after reboot.
Cause Analysis
- A kernel update installed earlier only took effect at this reboot.
- The NVIDIA kernel modules had not been rebuilt for the new kernel, so no compatible driver module was available (the mismatch can be confirmed as shown below).
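The mismatch is straightforward to confirm; nvidia is the module name the driver package normally registers with DKMS.

uname -r                    # kernel actually running after the reboot
dkms status                 # lists which kernels have an nvidia module built and installed
modinfo -F vermagic nvidia  # fails, or disagrees with uname -r, when no matching module exists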
Resolution
Rebuilt drivers via DKMS:
sudo dkms autoinstall
Verified functionality:
nvidia-smi
Key Takeaway
Validate NVIDIA driver compatibility immediately after kernel updates, before rebooting. Registering the driver with DKMS, so modules are rebuilt automatically for each new kernel, is strongly recommended.
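A small pre-reboot check along these lines could be scheduled after kernel updates; the module name and paths are common defaults and should be treated as assumptions.

#!/usr/bin/env bash
# Hypothetical pre-reboot check: warn if the newest installed kernel has no
# NVIDIA module built for it yet.
set -euo pipefail

# Newest kernel present under /lib/modules (typically the one the next reboot will use)
newest_kernel=$(ls -1 /lib/modules | sort -V | tail -n 1)

if dkms status nvidia | grep -q "${newest_kernel}.*installed"; then
    echo "OK: NVIDIA module built for ${newest_kernel}"
else
    echo "WARNING: no NVIDIA module for ${newest_kernel}; run 'sudo dkms autoinstall' before rebooting" >&2
    exit 1
fi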