Server Maintenance Log: Post-Power Restoration Issue Analysis
Following the resumption of power after campus-wide circuit maintenance, several server-related issues emerged during system checks. Below is a detailed breakdown of the problems encountered and their resolutions.
1. Intermittent Server Monitoring Failures
Symptoms
The monitoring system (which polls GPU status via nvidia-smi every 15 seconds) frequently timed out.
Root Cause
NVIDIA GPUs default to non-persistent mode: when no client holds the driver open, GPU state is torn down, so each nvidia-smi call on an idle machine pays a re-initialization cost. The resulting latency spike caused the monitoring script to time out.
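The setting and its latency impact can be confirmed directly; the query fields below are standard nvidia-smi options, though output formatting may vary slightly across driver versions.

# Report whether persistence mode is currently enabled (one line per GPU)
nvidia-smi --query-gpu=persistence_mode --format=csv

# Time a single status query; multi-second latency on an idle GPU is
# consistent with re-initialization overhead in non-persistent mode
time nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader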
Solutions
Temporary Persistence Mode Activation
sudo nvidia-smi --persistence-mode=1
Permanent Daemon Configuration
sudo systemctl enable --now nvidia-persistenced.service
Reference: NVIDIA/Tips and tricks - ArchWiki
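For context, the kind of poll the monitoring system performs can be approximated with the sketch below, using an explicit timeout so a slow nvidia-smi is logged rather than hanging the check. The 10-second limit and the query fields are illustrative assumptions, not the production configuration.

#!/usr/bin/env bash
# Hypothetical monitoring poll: query GPU status every 15 s with a hard timeout.
INTERVAL=15   # polling period described above
LIMIT=10      # assumed per-query timeout, kept shorter than the interval

while true; do
    if ! timeout "$LIMIT" nvidia-smi --query-gpu=name,utilization.gpu,temperature.gpu --format=csv,noheader; then
        echo "$(date -Is) nvidia-smi timed out or failed" >&2
    fi
    sleep "$INTERVAL"
done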
2. Unstable SSH Connectivity
Symptoms
SSH connections intermittently failed despite successful pings, occasionally triggering host key mismatch warnings.
Diagnosis
- Unauthorized devices from other departments were connected to the same switch.
- IP conflicts caused SSH requests to route to incorrect hosts (a quick check is sketched below).
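One way to confirm such a conflict is duplicate address detection with arping, which reports any other host already answering for the address; the interface name and IP below are placeholders.

# A reply in -D (duplicate address detection) mode means another host owns the IP.
sudo arping -D -I eth0 -c 3 10.0.0.42

# Cross-check which MAC the local ARP cache currently maps the address to.
ip neigh show 10.0.0.42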
Immediate Actions
- Physically inspected switch ports and removed unauthorized devices.
- Enforced MAC address binding and revised IP allocation policies (a sample static DHCP reservation follows).
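Binding can also be enforced at the DHCP layer so an address is only ever leased to the registered NIC. The excerpt below is a minimal ISC dhcpd reservation with a hypothetical host name, MAC, and address; the switch-side port security configuration is vendor-specific and omitted.

# /etc/dhcp/dhcpd.conf (excerpt) -- hypothetical reservation
host gpu-node-01 {
    hardware ethernet aa:bb:cc:dd:ee:01;  # placeholder MAC of the server NIC
    fixed-address 10.0.0.42;              # address leased only to this MAC
}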
Long-Term Proposal
Implementing centralized SSH certificate signing for server authentication (deferred due to the complexity of rolling it out across all users).
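For the record, the proposal amounts to signing each server's host key with an internal CA so clients trust the CA once instead of accumulating individual host keys. A minimal OpenSSH sketch, with hypothetical file names and domain, looks like this:

# One-time: create the host CA key pair (store ca_host_key offline/securely).
ssh-keygen -t ed25519 -f ca_host_key -C "internal host CA"

# Sign an existing server host key; the identity and hostname are placeholders.
ssh-keygen -s ca_host_key -I gpu-node-01 -h -n gpu-node-01.example.edu /etc/ssh/ssh_host_ed25519_key.pub

# On each server, advertise the certificate in sshd_config and reload sshd:
#   HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub

# On clients, trust the CA for the server domain via known_hosts:
#   @cert-authority *.example.edu <contents of ca_host_key.pub>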
3. Post-Reboot NVIDIA Driver Failure
Symptoms
GPU drivers failed to load after reboot.
Cause Analysis
- A kernel update installed earlier only took effect at this reboot.
- The NVIDIA kernel modules had not been rebuilt for the new kernel, so no compatible driver module was available (the mismatch can be confirmed as shown below).
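The mismatch is straightforward to confirm; nvidia is the module name the driver package normally registers with DKMS.

uname -r                    # kernel actually running after the reboot
dkms status                 # lists which kernels have an nvidia module built and installed
modinfo -F vermagic nvidia  # fails, or disagrees with uname -r, when no matching module exists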
Resolution
Rebuilt drivers via DKMS:
sudo dkms autoinstall
Verified functionality:
nvidia-smi
Key Takeaway
Validate NVIDIA driver compatibility immediately after kernel updates, before rebooting. Registering the driver with DKMS, so modules are rebuilt automatically for each new kernel, is strongly recommended.
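A small pre-reboot check along these lines could be scheduled after kernel updates; the module name and paths are common defaults and should be treated as assumptions.

#!/usr/bin/env bash
# Hypothetical pre-reboot check: warn if the newest installed kernel has no
# NVIDIA module built for it yet.
set -euo pipefail

# Newest kernel present under /lib/modules (typically the one the next reboot will use)
newest_kernel=$(ls -1 /lib/modules | sort -V | tail -n 1)

if dkms status nvidia | grep -q "${newest_kernel}.*installed"; then
    echo "OK: NVIDIA module built for ${newest_kernel}"
else
    echo "WARNING: no NVIDIA module for ${newest_kernel}; run 'sudo dkms autoinstall' before rebooting" >&2
    exit 1
fi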