(EN) Monitoring Alerts and Metrics on Large Power Systems Clusters
In this talk we’ll introduce an open source project used to monitor large Power Systems clusters. One example is Summit, built through the IBM collaboration with the Oak Ridge and Lawrence Livermore laboratories: a large deployment of custom, GPU-augmented AC922 Power Systems nodes that work in tandem to form what is (currently) the largest supercomputer in the world.
Data is collected out-of-band directly from the firmware layer and then redistributed to various components by an open source component called Crassd. In-band operating-system and service-level metrics, logs and alerts can also be collected and used to enrich the visualization dashboards. Open source components such as the Elastic Stack (Elasticsearch, Logstash, Kibana and select Beats) and Netdata are each used for the monitoring scenarios that suit their strengths, and other components such as Prometheus and Grafana are in the process of being implemented. We’ll briefly discuss our experience putting these components together and the decisions we had to make to automate their deployment and configuration for our goals. Finally, we lay out collaboration possibilities and future directions for making the project a convenient starting point for others in the open source community who want to monitor their own Power Systems environments.
Marcelo Perazolo, IBM Systems
Marcelo Perazolo is the Lead Software Architect for Operational Management in the IBM Systems Cloud Solutions team. He is located in RTP, NC, and received his MSEE and BSEE degrees from UNICAMP. He started his career at IBM in 1990 and has more than 25 years of experience in Infrastructure and Platform Management solutions. He drives planning and strategy to exploit Open Software to build Converged and Hyperconverged infrastructures. He is active in multiple organizations, such as OpenPOWER, OASIS and DMTF, and focuses on furthering IBM’s Open Systems agenda in the marketplace.