After power was restored following campus-wide circuit maintenance, several server issues surfaced during system checks. Below is a breakdown of each problem and how it was resolved.
1. Intermittent Server Monitoring Failures
Symptoms
The monitoring system (polling server status via nvidia-smi every 15 seconds) frequently timed out.
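For illustration, a polling loop of this kind might look like the following minimal sketch. It is not the actual monitoring code: the query fields, the 10-second timeout, and the names poll_once and monitor are assumptions. Only the 15-second interval and the use of nvidia-smi come from the description above.

```python
import subprocess
import time

POLL_INTERVAL_S = 15  # from the post: status is polled every 15 seconds
QUERY_TIMEOUT_S = 10  # assumption: the post does not state the timeout value

# Assumed query; the real monitor may request different fields.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader",
]


def poll_once(cmd=NVIDIA_SMI_CMD, timeout_s=QUERY_TIMEOUT_S):
    """Run one status query and return (ok, detail, elapsed_seconds)."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The symptom described above: the query did not return in time.
        return False, "timeout", time.monotonic() - start
    except FileNotFoundError:
        return False, "command not found", time.monotonic() - start
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        return False, result.stderr.strip(), elapsed
    return True, result.stdout.strip(), elapsed


def monitor(cycles, interval_s=POLL_INTERVAL_S):
    """Poll a fixed number of times, logging how long each query took."""
    for _ in range(cycles):
        ok, detail, elapsed = poll_once()
        print(f"ok={ok} elapsed={elapsed:.2f}s detail={detail!r}")
        time.sleep(interval_s)
```

Logging the elapsed time of every query, not just the failures, makes it easier to tell whether nvidia-smi is slow on every call or stalls only occasionally.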
3/28/25