10 years of OSMC: Why does my monitoring still look the same?
A little over 10 years ago I packed my brand new Sony Ericsson W910i and my iRiver MP3 player to fly to Nuremberg and attend my very first OSMC as an amazingly proud Nagios administrator…
Now 10 years later I am back: To share with you how my monitoring has changed over the course of the last decade!
Thruk 2½ – Current state of development
Thruk is a web interface that shows status information from multiple monitoring systems, e.g. Icinga, Naemon, …, at a glance. In this talk I will show recent development milestones like dashboards and the business process plugin, as well as clustering and distributed monitoring features, along with small but useful helpers for daily sysadmin life.
Smart Home Monitoring mit openHAB 2
We continuously collect values and state changes, graph them, trigger processes based on them, and get notified about them in many different ways – in short, our Smart Homes are effectively “monitoring our own homes”. So let’s do a live demo of openHAB 2, covering questions like “why use something like this?” and “where to start?”, and giving you some ideas about persisting and representing the collected data.
Refocus: The One Stop Shop For Monitoring System Health
We will share why we created Refocus: our open source, internally developed, self-service tool for monitoring computing systems. We will cover how it is extensible and describe its tech stack, including Node.js, and its data model. Within Salesforce, Refocus serves 2000+ users and 20+ teams. https://github.com/salesforce/refocus
Scaling Icinga2 with many heterogeneous projects – and still preserving configurability
The main objective of this talk is to give a detailed real-world example of how we use Icinga 2 at large scale, with all its pros and cons. SysEleven monitors several hundred heterogeneous projects. To migrate our Icinga 1 setup to a highly available Icinga 2 setup we developed icingadiff. The new cluster is fully automated with Puppet, deploys over 60,000 checks and enables our engineers to fine-tune every check if necessary. To integrate further information and custom workflows we modified Icingaweb2.
Windows Monitoring 3.0 – Jetzt erst recht!
In his movies, Bruce Willis regularly has to fight crime, and his opponents always face an inevitable end – just as it is inevitable having to monitor individual Windows machines or complex Windows environments. Icinga 2 offers various options by means of WMI, SNMP, PowerShell, the Icinga 2 Agent or even the NSClient++.
Of course, there are plenty of other options, but isn’t there a simpler solution? Perhaps something which will help get the most out of every hidden byte?
Microsoft has invented something: it is called PowerShell. You are going to like it!
Monitoring with Sensu 2.0
Applications are complex systems. Their many moving parts, component and dependency services, may span any number of infrastructure technologies and platforms, from bare metal to serverless. As the number of services increases, teams responsible for them will naturally develop their own preferences, such as how they instrument their code or how and when they receive alerts. Sean will demonstrate how Sensu 2.0 is designed to monitor these ever changing heterogeneous environments. Sensu 2.0 is the next release of the open source monitoring framework, rewritten in Go, with new capabilities and reduced operational overhead. Sean will go over various patterns of data collection, including scraping Prometheus metrics, and show how Sensu enables self-service monitoring and alerting for service owners.
Netzwerkmonitoring mit Prometheus
Prometheus and Grafana are referred to as a dynamic duo and are currently the talk of the town. But the questions that beg to be answered are:
* How can metrics be collected and processed via SNMP?
* How could Grafana visualize tons of network ports in a simple way?
* Is there a way to collect ad hoc metrics for troubleshooting?
In my talk I try to address these questions, and at the same time I will talk about the requirements and pitfalls of implementing the SNMP Exporter for Prometheus using OMD in a big network environment for a global corporation.
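As a taste of the kind of data involved, here is a stdlib-only sketch of querying interface metrics that an SNMP Exporter setup would feed into Prometheus. It is an illustration under assumptions: the metric and label names (`ifHCInOctets`, `ifDescr`, the device address and the `prom:9090` URL) follow common IF-MIB conventions in snmp_exporter, but your environment may name them differently, and the response here is canned rather than fetched.

```python
import json
from urllib.parse import urlencode

def build_range_query(base_url, device, port_regex="Gi.*"):
    """Build a Prometheus HTTP API URL for per-port inbound throughput (bits/s)."""
    promql = (f'rate(ifHCInOctets{{instance="{device}",'
              f'ifDescr=~"{port_regex}"}}[5m]) * 8')
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

def ports_over_threshold(api_response_json, threshold_bps):
    """Extract ports whose current inbound rate exceeds a threshold."""
    result = json.loads(api_response_json)["data"]["result"]
    return {r["metric"]["ifDescr"]: float(r["value"][1])
            for r in result if float(r["value"][1]) > threshold_bps}

# Canned response in the shape the Prometheus query API returns:
sample = json.dumps({"data": {"result": [
    {"metric": {"ifDescr": "Gi0/1"}, "value": [0, "9.5e8"]},
    {"metric": {"ifDescr": "Gi0/2"}, "value": [0, "1.2e6"]},
]}})

print(build_range_query("http://prom:9090", "switch1:161"))
print(ports_over_threshold(sample, 1e8))  # only Gi0/1 exceeds 100 Mbit/s
```

In a real setup the same PromQL expression would back a Grafana panel repeated over the `ifDescr` label, which is how "tons of network ports" stay manageable.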
Tailored SNMP monitoring – Your own SNMP MIB and sub-agent with Python and python-netsnmpagent
SNMP continues to be an essential component in monitoring, where the information being made available is structured in so-called Management Information Base (MIB) modules. The standard net-snmp distribution comes with a variety of standard MIBs implemented by its snmpd, but sometimes there is the need to make your own information available via SNMP. Luckily, snmpd can be dynamically extended by so-called subagents implementing the AgentX protocol (RFC 2741). The net-snmp API, however, pretty much focuses on the C programming language only, setting the entry barrier rather high, especially for non-developers. In this talk Pieter will not only demonstrate the creation of a MIB, which, being a text file, is the easier part, but also how easy it is to implement a simple subagent in Python using his python-netsnmpagent module. python-netsnmpagent is a shim wrapper over the net-snmp C API that tries to implement just enough abstraction. Licensed under the GPL, it is available on PyPI.
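The data model a subagent serves can be sketched without net-snmp at all: a MIB subtree is an ordered map from OIDs to values, and GETNEXT (the operation behind `snmpwalk`) returns the first object after a given OID in lexicographic OID order. The stdlib-only sketch below illustrates just that contract; the OIDs and values are invented, and the real AgentX plumbing is what python-netsnmpagent handles for you.

```python
def oid_key(oid):
    """Parse a dotted OID string into an int tuple for lexicographic ordering."""
    return tuple(int(part) for part in oid.split("."))

class MibSubtree:
    def __init__(self):
        self.objects = {}

    def register(self, oid, value):
        self.objects[oid_key(oid)] = value

    def get(self, oid):
        return self.objects.get(oid_key(oid))

    def get_next(self, oid):
        """Return (oid, value) of the first object after 'oid' - the GETNEXT op."""
        key = oid_key(oid)
        for k in sorted(self.objects):
            if k > key:
                return ".".join(map(str, k)), self.objects[k]
        return None  # end of MIB view

subtree = MibSubtree()
subtree.register("1.3.6.1.4.1.99999.1.1", 42)           # e.g. an Unsigned32
subtree.register("1.3.6.1.4.1.99999.1.2", "backup ok")  # e.g. a DisplayString

print(subtree.get("1.3.6.1.4.1.99999.1.1"))
print(subtree.get_next("1.3.6.1.4.1.99999.1.1"))  # next object, as snmpwalk sees it
```

With python-netsnmpagent, registering such objects is a handful of method calls against a running snmpd instead of a hand-rolled dictionary.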
Building a healthy on-call culture
Paging people just creates a series of problems unless you put enough resources into building a healthy “on-call” culture. Nobody wants to be buried in alerts or woken up at 2 am.
There are several points you have to take into account to make on-call suck less. At the center of each of these items, there are people. If you put your people at the center and design your incident response thinking about them in the first place, on-call becomes a competitive advantage.
In this presentation, Serhat will start by defining on-call and why we need a robust on-call culture. At this point, he’ll mention the impact of downtime and performance degradation, such as direct revenue and credibility losses. Then he will continue by listing 6 must-haves:
- Be transparent
- Share responsibilities
- Get ready for wartime
- Build resilient and sustainable systems
- Create actionable alerts
- Learn from your experiences
For each of these steps, there will be crucial points and pieces of advice for both developers and management. In the end, Serhat will show that our efforts in building a better on-call culture will pay off in the happiness of our people and users.
SLA Monitoring mit Icinga & Prometheus
Our hosted customer services are committed to a strict SLA, so we need a monitoring system which is highly available and able to distinguish specific regional network outages from SLA violations while still producing useful monitoring data. The satellites use diverse upstream providers and are located at different locations.
Therefore we use several open source tools to scrape and aggregate metrics from a customer perspective.
We built a setup with multiple Icinga2 nodes combined with a variety of Prometheus exporters and instances which can operate completely independently. For reporting purposes we decided on a fully redundant monitoring setup with consistent data from all distributed satellites.
Integrating Check_MK agent into Thruk – Windows monitoring made easy
Did you ever get a headache configuring Windows monitoring using NSClient++? We did, too! While using the Check_MK Windows agent we were impressed by the plethora of checks that had already been implemented there. With the goal in mind of creating a fast and easy-to-use Windows monitoring solution, we set out to integrate the Check_MK agent into Thruk and tap into its capabilities. Combining them even allows for fully automatic service discovery!
Logging is coming to Grafana
Grafana is an OSS dashboarding platform with a focus on visualising time series data as beautiful graphs. Now we’re adding support to show your logs inside Grafana as well. Adding support for log aggregation makes Grafana an even better tool for incident response: first, the metric graphs help in visually zoning in on the issue. Then you can seamlessly switch over to view and search related log files, allowing you to better understand what your software was doing while the issue was occurring. The main part of this talk shows how to deploy the necessary parts for this integrated experience. In addition I’ll show the latest features of Grafana, both for creating dashboards and maintaining their configuration. The last 10-15 minutes will be reserved for a Q&A.
Ha ha, fooled you. This talk will be in English, but I did learn a new German noun. Back in 2015 the OpenNMS Project split the code into two main branches: Horizon and Meridian. To compare it to Red Hat, think of Horizon as Fedora and Meridian as RHEL. Doing this allowed the team to greatly increase the release cycle of OpenNMS, as Horizon major releases occur once every three to four months. This presentation will cover the changes to OpenNMS introduced since the last OSMC, which includes releases Horizon 21, 22 and 23. The major new features include support for Minions (H21). No, not the little yellow guys shaped like medicine capsules, but a technology to both distribute OpenNMS functionality and make it redundant, perfect for large scale or IoT deployments. Another feature is “telemetryd” (H22), which adds the ability to collect and store “flow” data from protocols such as Netflow, JFlow and Sflow. And finally there is Project Sextant (H23), which adds high-level alarm correlation to OpenNMS. Also new this year is support for the MQTT protocol, which allows OpenNMS to talk to IoT gateways such as Eclipse Kura. While it is hoped this talk will be informative, as an American I assure you it will be loud (unless the organizers stick me as the first speaker on the second day, then it might be quieter).
Handling messages and notifications from software and gadgets with MQTT and mqttwarn
MQTT is a simple, lightweight, TCP-based protocol which can be used for having “things” such as sensors report data, but it can also be used in your software projects as a flexible method of distributing messages and notifications. In this talk we will show you what MQTT is, what its Last Wills and Testaments can do for monitoring your long-running processes, and how you can route MQTT messages through brokers. We will show you some examples of real-world MQTT use, and we discuss the mqttwarn utility which allows you to decouple MQTT message notifications from your systems. Be warned: there will be blinkenlights!
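The heart of mqttwarn is routing: mapping incoming topics to notification targets. The sketch below is a stdlib-only illustration of that idea, using a simplified version of MQTT's topic-filter wildcards (`+` for one level, `#` for the rest); the topic names and target names are invented, and real mqttwarn configuration looks different.

```python
def topic_matches(pattern, topic):
    """Simplified MQTT topic matching: '+' = one level, '#' = remainder."""
    p_parts, t_parts = pattern.split("/"), topic.split("/")
    for i, p in enumerate(p_parts):
        if p == "#":                      # multi-level wildcard: matches the rest
            return True
        if i >= len(t_parts):
            return False
        if p != "+" and p != t_parts[i]:  # '+' matches exactly one level
            return False
    return len(p_parts) == len(t_parts)

# Hypothetical routing table: topic pattern -> notification targets.
ROUTES = {
    "sensors/+/temperature": ["pushover"],
    "monitoring/lwt/#": ["mail", "slack"],  # Last Will topics -> loud channels
}

def targets_for(topic):
    return sorted(t for pat, ts in ROUTES.items()
                  if topic_matches(pat, topic) for t in ts)

print(targets_for("sensors/kitchen/temperature"))  # ['pushover']
print(targets_for("monitoring/lwt/worker-3"))      # ['mail', 'slack']
```

The Last Will route shows why MQTT suits monitoring long-running processes: the broker publishes the will message on the client's behalf when it dies, so a dead worker announces itself without any poller.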
Distributed Tracing FAQ
Microservices, containers and, more generally, distributed systems have opened up a different point of view on our systems and applications. We need to understand how a single event or request crosses our app, jumping over networks, containers, virtual machines and sometimes cloud providers. There is a specific practice called distributed tracing to increase the observability of systems like that. After this talk, you will have a solid idea of what tracing means and how you can instrument your applications, and you will be ready to trace your application across many languages using open source technologies like OpenTracing, OpenCensus, Zipkin, Jaeger and InfluxDB. You will ask yourself how you survived until today!
ALEXANDER WERT / MARIUS OEHLER
OpenAPM – Application Performance Management tailored to your needs
There is no ‘one-size-fits-all’ approach to Application Performance Management. Recently, the open source community has been providing an increasing number of tools that excellently address different aspects of APM. Combining the aspects and strengths of these tools offers great potential for building tailored APM solutions. Get to know the OpenAPM initiative and see examples of how to integrate open source tools for future-proof and tailored APM solutions!
From Monitoring to ITSM
Having monitoring is good. Having proper alerting is better. Integrating monitoring with ticketing is awesome.
Today’s open source ticketing landscape ranges from longtime players such as BestPractical’s RequestTracker to newcomers such as Zammad. Thanks to modern APIs on all sides, a set of tools is easily transformed into an ITSM stack – routing incidents from monitoring into ticketing, keeping everyone in the loop and laying the baseline for being able to offer better service level agreements.
While this talk gives a broad overview of the available options, it concludes with concrete examples as well as recommendations on how to implement and deploy such a setup.
Why we recommend PMM to our clients
As service providers, one of our responsibilities is helping clients understand which causes contributed to a production downtime incident, and how to prevent them (as much as possible) from happening again. We do this with Incident Reports, and one common recommendation we make is to have a historical monitoring system in place. All our clients have point-in-time monitoring solutions in place, solutions that can alert them when a system is down or behaving in unacceptable ways. But historical monitoring is still not common, and we believe a lot of companies can benefit from deploying one. In most cases, we have recommended Percona Monitoring and Management (PMM) as a good, open source solution for this problem. In this session, we will talk about the reasons why we recommend PMM as a way to prevent incidents, and also to investigate their possible causes when one has happened.
Monitoring Distributed Systems
From an external observer’s perspective, components of distributed systems start and terminate in an unpredictable manner, which makes monitoring challenging. Components can also start multiple times on a single server as well as on multiple machines. The Hadoop ecosystem is one example of such a distributed application and the primary example of this talk. The fundamental question to be addressed is: how can such unpredictable distributed systems be monitored? This talk presents a general analysis of the problem and its existing solutions. Based on this analysis, a new theoretical concept is developed and realized in a practical solution. A fully automated monitoring solution for distributed systems will be demonstrated. The solution is flexible and portable and can therefore be applied outside the Hadoop environment as well. This new solution is an efficient and promising contribution to the monitoring community.
Make IT monitoring ready for cloud-native systems
Modern cloud-native infrastructures consist of a large number of components to monitor (hardware and software). You have to manage hundreds to thousands of components (e.g. container-based microservices). Effective monitoring in such an environment is complex and challenging, notably to avoid false alerts and to concentrate your efforts on the incidents that matter. Monitoring should not just be a matter of getting raw metrics through your favorite monitoring tool (e.g. Prometheus) to draw nice graphs in Grafana. It’s important that, for each incident, you are able to quickly assess how it really impacts your business. Hence you need to rely on incident aggregation, along with dashboards and notification systems that take into account the real impact of every incident on your business services. In other words, you should for example avoid firing a critical alert when a failed container does not critically impact your business services (e.g. a service you provide to your end users). Assessing the real impact of each incident allows operations staff to prioritize issues based on those that have the highest impact on services.
That’s where this talk comes into play. The intent is to discuss advanced aggregation and visualization capabilities aimed at helping operations staff deal with monitoring with a focus on business applications. We’ll also introduce RealOpInsight, an open source business application dashboard management tool that works on top of various existing monitoring systems, including Nagios/Icinga and derivatives, Zabbix, and more. We’ll finish by showing some examples and a demo based on RealOpInsight and Icinga.
Among others, RealOpInsight provides the following features:
- Service Impact & Root Cause Analysis
- Business Service Management Map & Advanced Event Correlation
- Multiple, Distributed & Heterogeneous Data Sources
- Graphical Configuration Manager
- Cross Operating Systems
- Multi-Tenant/Service Provider Ready
Monitor your application performance using inspectIT APM
Along with fast-paced technological evolution, software applications have also evolved to become more complex, more distributed and more dynamic. Cloud computing has pushed this evolution a step further. It is therefore now more difficult to properly monitor application performance and ensure that end users get the high-quality experience they expect. Luckily, great tools such as inspectIT APM are taking care of the problem efficiently. In this presentation I will introduce you to inspectIT APM and then show you how to integrate it into your Spring Boot application and how to diagnose, analyse and monitor your application. You will become familiar with inspectIT’s three main components: the agent, the server and the user interface. You will also learn how to analyse user requests from the invocation sequence and detect the root causes of potential problems. You will finally see how easy it is to perform real user monitoring with inspectIT in a production-like environment.
ANDREAS WEIGEL / JAKOB FELS
Observability in einer Microservicewelt
Our architecture migration from a monolithic to a distributed system at dmTECH entailed a lot of challenges. One of the main topics was observability. The increasing distribution of the systems demands centralized logging to correlate events and analyze business processes.
Dependencies between services can be tracked by request-based tracing. Metrics and monitoring are prerequisites to capture the reliability of applications and to take action on anomalies. With an established DevOps culture, the product team has to face these challenges, as every member of the team has holistic responsibility for their product.
In our talk we present some tools to handle the aforementioned challenges, especially those which fit best with our Spring-Boot stack.
Learnings, patterns and Uber’s metrics platform M3, open sourced as a Prometheus long term storage backend
At Uber we use high cardinality monitoring to observe and detect issues with our 4,000 microservices running on Mesos and across our infrastructure systems and servers. We’ll cover how we put the resulting 6 billion plus time series to work in a variety of different ways, auto-discovering services and their usage of other systems at Uber, setting up and tearing down alerts automatically for services, sending smart alert notifications that rollup different failures into individual high level contextual alerts, and more. We’ll also talk about how we accomplish all this with a global view of our systems with M3, our open source metrics platform. We’ll take a deep dive look at how we use M3DB, now available as an open source Prometheus long term storage backend, to horizontally scale our metrics platform in a cost efficient manner with a system that’s still sane to operate with petabytes of metrics data.
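The rollup idea mentioned above – folding many raw failures into one contextual alert – can be sketched in a few lines. This is a stdlib-only illustration of the concept, not M3's actual notification pipeline; the service and check names are invented.

```python
from collections import defaultdict

def rollup(raw_alerts):
    """Group raw alerts by service and emit one summary notification per service.

    raw_alerts: list of (service, check, detail) tuples.
    """
    grouped = defaultdict(list)
    for service, check, detail in raw_alerts:
        grouped[service].append(f"{check}: {detail}")
    # One contextual notification per service, listing every failing check,
    # instead of one page per individual failure.
    return {svc: f"{len(items)} failures - " + "; ".join(sorted(items))
            for svc, items in grouped.items()}

alerts = [
    ("payments", "latency_p99", "900ms"),
    ("payments", "error_rate", "4.2%"),
    ("search", "heartbeat", "missed"),
]
for service, summary in sorted(rollup(alerts).items()):
    print(service, "->", summary)
```

The on-call engineer for "payments" then receives one page with both symptoms, which is usually far more diagnosable than two independent alerts arriving seconds apart.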
Scoring a Technical Cyber Defense Exercise with Nagios and Selenium
This talk is about building an availability scoring system for the annual international cyber defence exercise Locked Shields. The scoring solution is built around Nagios Core, accompanied by several tools (e.g., Selenium WebDriver) and custom scripts. In 2018, we had to monitor 3,080 hosts with 31,350 services with checks performed at least once per minute. During 16 hours of game-play, 34 million scoring checks were performed and logged, which averages at about 35,000 active checks per minute.
Mit KI zu mehr Automatisierung bei der Fehleranalyse
With the proliferation of artificial intelligence, new IT monitoring approaches and the detection of anomalies in data traffic have been receiving particular attention. Thanks to the insights gained by anomaly detection based on machine learning methods, error analysis can be notably accelerated and partly automated. An entire register of supervised and unsupervised learning methods is available as open source to help answer a wide range of questions where standard methods are stressed to their limits. Of course, it takes the necessary expertise and experience to use the right approach for a specific problem. Challenges that can arise when implementing such solutions, and concrete advantages that have already been confirmed in practice, are going to be shown by means of the Würth-Phoenix NetEye monitoring solution. Several more advances are expected in the months to come.
Monitoring of Software Defined Networks
Software Defined Networking has become a major part in modern datacenter environments and cloud computing. This talk covers some of the basic concepts and key challenges creating an automated monitoring for highly dynamic and fast changing virtual network environments. Examples and demos will be explained by showing the monitoring setup used by the hosting department of NETWAYS.
What We Should All Worry About When Monitoring Serverless Applications
Worried about problems or performance issues in your user experience? Discover the metrics that are critical when monitoring your serverless applications. A serverless expert will take you through the evolution of application architectures and their monitoring challenges. Participants will learn how to better monitor and troubleshoot issues and better understand how the events in their serverless applications are connected. This session is helpful for people who have a basic understanding of serverless environments as well as seasoned serverless developers.
The fine folks attending will get a short and accurate review of the evolution that has happened regarding infrastructure and the way we develop software; I will guide and enlighten them on the acute troubleshooting and operations problems that they experience every day while showing them how it SHOULD be. In addition, I will present some available solutions to overcome these issues.
Monitoring evolution at scale: From closed to open source in 13 years
The story starts 13 years ago with brave heroes mastering the IT monitoring of a worldwide company with more than 20k employees. Their adventure begins with a closed source tool, and over the years it has been replaced with many open source tools to solve challenges within the growing IT landscape. This talk highlights the changes and challenges involved with the tools and their functionality, in contrast to today’s requirements of business processes and daily sysadmin workflows. Changes were sometimes rough, but the trust earned from SEC and NOC operators using an automated, highly distributed Icinga 2 monitoring solution proves it was the right decision. This story has everything: pain, cuts, happiness, eat-your-keyboard and finally the feeling of success on the road to reliable unified monitoring with applied workflows and happy users.
Katzeninhalt mit ein wenig Einhornmagie
Embedding performance data as appealing graphs in Icinga Web 2 with the help of Grafana: from the installation and configuration of the Grafana module to creating your own dashboards/panels, as well as annotations from data sources like the Icinga2 IDO or Elasticsearch. To conclude, a short excursion into the world of themes for Icinga Web 2 – anyone can create themes, and monitoring is allowed to be fun, too.
Icinga2 Scale-Out – Monitoring großer Umgebungen
Scale-out isn’t easy, especially if you want to monitor your assets in a Europe-wide network spread across WAN connections. This talk will guide you through the past, present and future of building and managing such a project – from the requirements back in 2002 to the fully managed Icinga2 implementation of today.
DANIEL NEUBERGER – THOMAS WIDHALM
Fokus Log-Management: Wähle dein Werkzeug weise!
This talk compares the Elastic Stack and Graylog with a strong focus on how they can be used for log management.
The view on both toolsets is narrowed down to the essentials of log management to get a more significant insight into how you can use them in this specific scenario.
The goal of this talk is to help you choose which of these powerful stacks best fits your specific logging needs. You will learn about their functionality, their handling, what you need to know and how much time you will have to invest before you get a working solution.
Stream connector: Easily sending events and/or metrics from the Centreon open-source solution to any data platform
Since Centreon 2.8.18, Centreon Broker provides a new connector called “Stream connector”. With it, users can create an output to any tool of their choice. This talk presents the connector and its use through several examples.
Centralized Logging Patterns
Most organizations feel the need to centralize their logs — once you have more than a couple of servers or containers, SSH and tail will not serve you well any more. However, the common question or struggle is how to achieve that. This talk presents multiple approaches and patterns with their advantages and disadvantages, so you can pick the one that fits your organization best:
- Parse: Take the log files of your applications and extract the relevant pieces of information.
- Send directly: Add a log appender to send out your events directly without persisting them to a log file.
- Structured file: Write your events in a structured file, which you can then centralize.
- From a container: Keep track of short lived containers and configure their logging correctly.
- In Kubernetes: Stay on top of your logs even when services are short lived and dynamically allocated.
Each pattern has its own demo, so you can easily try out the different approaches in your environment.
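The "structured file" pattern above can be sketched with Python's stdlib `logging`: emit each event as one JSON object per line, which a shipper can then centralize without any parsing grammar. The field names are only an example, and a real formatter would add timestamps and exception info.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON event."""
    def format(self, record):
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(event)

# Log to an in-memory stream here; in practice this would be a file or stdout.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("order %s shipped", "A-123")
line = stream.getvalue().strip()
print(line)                          # one self-describing JSON event
print(json.loads(line)["message"])   # fields stay queryable downstream
```

The contrast with the "parse" pattern is exactly this: here the structure is decided at write time, so the centralization side needs no brittle regex.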
Eliminating Alerts or “Operation Forest”
Our goal is to eliminate alerts, and a common mantra is “alert only if absolutely needed”. But that seems to hold only for monitoring-related people, whereas the mantra for everyone else appears to be “whatever”. Excessive alerts are like Stephen King’s Langoliers – they eat our time. Decommissioned servers, flapping alerts, mysterious emails.
Let’s bring on “Operation Forest” and see what – and even more importantly, how – something can be done about that. The monitoring tool you choose is just a part of the solution. What good is your tool if nobody pays attention to it?
We’ll talk about the use of weapons like Terminology, Persistence and Involvement, as well as Common Sense and Policy as potential problem-solvers. Case studies (aka anecdotal examples) will illustrate ways alerts can become useless, be ignored and lead to serious outages.
Also find out: what’s “Operation Forest”.
It’s all about the… containers!
After virtualization, we’re now living through the next migration trend: “containerization”.
Almost everyone is talking about containers. But what are containers and can they be monitored the same way as classical systems?
This presentation gives insights about System Containers (LXC) and Application Containers (Docker) and how they can be monitored using Icinga 2 with check_lxc and check_rancher2.
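Plugins like check_lxc and check_rancher2 follow the classic monitoring plugin protocol: exit code 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN plus a status line with optional perfdata after a pipe. Here is a stdlib-only sketch of that contract for a hypothetical container memory check (the thresholds and metric name are invented, not taken from either plugin):

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_container_memory(used_mb, warn_mb=400, crit_mb=480, limit_mb=512):
    """Return (exit_code, status_line) in monitoring plugin format.

    The text after '|' is perfdata: label=value;warn;crit;min;max.
    """
    perfdata = f"mem={used_mb}MB;{warn_mb};{crit_mb};0;{limit_mb}"
    if used_mb >= crit_mb:
        return CRITICAL, f"CRITICAL - memory {used_mb}MB | {perfdata}"
    if used_mb >= warn_mb:
        return WARNING, f"WARNING - memory {used_mb}MB | {perfdata}"
    return OK, f"OK - memory {used_mb}MB | {perfdata}"

code, line = check_container_memory(430)
print(code, line)
```

Because the interface is just an exit code and a line of text, the same check works unchanged under Nagios, Naemon or Icinga 2 – which is why container monitoring can reuse so much of the classical tooling.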
Visualization of your distributed infrastructure
In times of industrial IoT devices and cloud providers like AWS, which allow small and medium-sized companies to distribute their infrastructure around the world, it is becoming increasingly important to keep an overview. Here a few instances in Amsterdam, there a few in Tokyo and not to forget the satellites in Moscow. The Map Addon with its filters and dashboards helps to keep the overview in this constantly changing landscape and to recognize patterns and anomalies at an early stage.
CARLOS ALBERTO CORTEZ
Introduction to OpenTracing
OpenTracing is an effort to provide a vendor-neutral API for distributed tracing, which can be used to trace both applications and OSS packages. Behind it stands the concept of distributed tracing: measuring the performance of application requests spanning many microservices. Using the OpenTracing API, developers can instrument their code without binding to any particular vendor (Jaeger, ZipKin, LightStep, etc). The API supports Span (request) management, inter-process propagation and active Span management, to name a few. Moreover, the specification works on a cross-language basis. This presentation is intended mostly for engineers working on distributed systems, and besides a full introduction to OpenTracing and its architecture, attendees will understand the elements that make up a successful tracing implementation. Finally, a small demo will be shown to showcase OpenTracing at work using Jaeger, an open source tracing system released by Uber.
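The span and active-span concepts named above can be sketched without any tracing library. This stdlib-only toy keeps a stack of active spans so that a new span automatically becomes a child of the current one; the real OpenTracing API (and its scope managers) differs in detail, and the class and method names here are simplified for illustration.

```python
import itertools

class Span:
    _ids = itertools.count(1)

    def __init__(self, operation, parent=None):
        self.span_id = next(Span._ids)
        self.operation = operation
        self.parent_id = parent.span_id if parent else None

class Tracer:
    def __init__(self):
        self.stack = []      # active span stack, as a scope manager keeps
        self.finished = []

    def start_active_span(self, operation):
        parent = self.stack[-1] if self.stack else None
        span = Span(operation, parent)  # child of whatever is currently active
        self.stack.append(span)
        return span

    def finish(self, span):
        self.stack.remove(span)
        self.finished.append(span)     # a real tracer would report it here

tracer = Tracer()
req = tracer.start_active_span("http_request")
db = tracer.start_active_span("db_query")   # becomes a child of http_request
tracer.finish(db)
tracer.finish(req)
print([(s.operation, s.parent_id) for s in tracer.finished])
```

Inter-process propagation – the other pillar of the API – amounts to serializing the active span's identifiers into carrier headers so the next service can continue the same trace.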
How to be a Monitoring Consultant
If you are wondering what it takes to get paid to build monitoring solutions, this is the talk for you. From getting started, to design, implementation and support of monitoring systems, I will uncover the secrets of how the professionals do it: ideas on how to engage with customers, how to sell the idea to management and various stakeholders, what to look out for when building systems, and how to actually get paid to do it. Even if you aren’t looking to branch out as a monitoring professional, you will benefit from some of the softer skills and advice, which can increase adoption and ultimately the success of your monitoring project.
MARIANNE SPILLER – JAN-PIET MENS – THOMAS WIDHALM – LENNART BETZ
A moderated panel discussion with the authors Marianne Spiller, Jan-Piet Mens, Thomas Widhalm and Lennart Betz on the subject of “how to write a book”, hosted by the experienced talk-master Bernd Erk. Thomas Widhalm and Lennart Betz together published a book on Icinga 2, while Marianne wrote a book on openHAB 2 on her own. A relaxed round-table discussion with prepared questions from the talk-master, but also with spontaneous questions from the audience – some subject-related, some on authorship and some regarding the contents of the literary works.