mirror of
https://github.com/opsschool/curriculum.git
synced 2026-01-15 12:15:03 +00:00
Monitoring, Notifications, and Metrics 101
******************************************

History: How we used to monitor, and how we got better (monitors as tests)
==========================================================================

Perspective (end-to-end) vs Introspective monitoring
====================================================

Metrics: what to collect, what to do with them
==============================================

Common tools
============

Syslog (basics!)
----------------

Syslog-ng
---------

Nagios
------
Nagios is a monitoring tool created by Ethan Galstad.
It was originally released as an open-source project named NetSaint in 1999.
Due to trademark issues it was later renamed Nagios.
Since then it has become one of the most common monitoring tools for production environments, if not the most common.

Nagios is primarily an alerting tool that can notify an administrator or a group of administrators when a service enters a warning or critical state.
It has a basic web interface that allows for acknowledging issues and scheduling downtime.

Nagios is highly configurable and extensible because it relies on external commands to perform almost every task.
For example, every availability test in Nagios is a standard executable, and notifications are generated by an external command.
Because of this, Nagios does not restrict the administrator to a particular language for extending the system, and plug-ins and tests are often written in a variety of languages.

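Because each check is just an executable, a plug-in can be sketched in a few lines.
The example below is a minimal illustration, not an official plug-in.
It assumes the widely used plug-in convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN), with a one-line status message on standard output.
The function names and the 80/90 percent thresholds are hypothetical.

```python
import shutil

# Conventional plug-in exit codes (assumed convention, see the text above).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_percent, warn=80.0, crit=90.0):
    """Map a disk-usage percentage to (exit_code, status_line)."""
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def main(path="/"):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself failed, so the state is unknown.
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, line = evaluate_disk(100.0 * usage.used / usage.total)
    print(line)
    return code
```

Installed as a plug-in, the script would end with ``sys.exit(main())`` so that the exit code reaches Nagios.
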
Nagios has an extensive feature set.
It supports service and host alerting hierarchies via parenting, so you can reduce the number of alerts when a critical service or host fails.
It supports active and passive checks.
It has basic scheduling for on-call periods, and supports time periods during which alerting is disabled.

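The effect of parenting on alert volume can be illustrated with a small sketch.
This is not how Nagios is implemented; it is a simplified model that assumes each host has at most one parent and that the parent relationships contain no cycles.
The host names and the ``hosts_to_alert`` function are hypothetical.

```python
def hosts_to_alert(parents, down):
    """Return the down hosts that are not hidden behind a down ancestor.

    parents maps a host name to its parent host name (or is missing the
    host entirely if it has no parent). down is the set of hosts that
    currently fail their checks.
    """
    def reachable(host):
        # Walk up the parent chain; any down ancestor hides this host.
        parent = parents.get(host)
        while parent is not None:
            if parent in down:
                return False
            parent = parents.get(parent)
        return True

    return sorted(host for host in down if reachable(host))


# Hypothetical topology: two web servers behind one switch.
parents = {
    "switch1": "router1",
    "web1": "switch1",
    "web2": "switch1",
    "db1": "router1",
}

# The switch fails and takes both web servers with it: one alert, not three.
print(hosts_to_alert(parents, {"switch1", "web1", "web2"}))
```

With the switch down, only ``switch1`` is reported; the web servers behind it are treated as unreachable rather than as separate failures.
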
Because Nagios is so configurable, it can be difficult for the uninitiated to configure.
Its configuration can be spread across many files, and a single syntax error will prevent the system from starting.
Additionally, the open-source version does not natively support adding and removing hosts dynamically; the configuration must be modified and the server restarted to add or remove a host.

Graphite
--------

Ganglia
-------

Munin
-----

RRDTool / cacti
---------------

Icinga
------

SNMP
----
Simple Network Management Protocol, or SNMP, is a monitoring and management protocol.
It is the standard way of monitoring switches, routers, and other networking equipment.
SNMP relies on agents which, when contacted by a management system, return the information requested.
The data provided by the agent is addressed by Object Identifiers, or OIDs, each of which identifies a piece of information about the current system.
OIDs can refer to anything from strings identifying the system to the total number of frames received by the Ethernet controller.
Devices and systems are often provided with MIBs (Management Information Bases), which help the management system identify the information contained in an OID.
Lastly, management systems request information by providing a community string, for example ``public``.
These community strings allow the agent to determine what information is appropriate to return to the requester, and whether the requesting system has read-only or read-write access.

There are three commonly used versions of the protocol: SNMPv1, SNMPv2c, and SNMPv3.
SNMPv3 is the only cryptographically secure version of the protocol.
Most devices support at least two versions of SNMP.

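The relationship between OIDs and MIBs can be sketched without any network access.
The example below uses a tiny hand-written table as a stand-in for a MIB and translates numeric OIDs into readable names.
The four entries are standard MIB-2 objects, but the ``MIB`` dictionary and the ``translate`` function are hypothetical; a real management system loads full MIB files rather than a hard-coded table.

```python
# A tiny stand-in for a MIB: numeric OID prefix -> object name.
MIB = {
    "1.3.6.1.2.1.1.1": "sysDescr",
    "1.3.6.1.2.1.1.3": "sysUpTime",
    "1.3.6.1.2.1.1.5": "sysName",
    "1.3.6.1.2.1.2.2.1.10": "ifInOctets",
}


def translate(oid):
    """Replace the longest known numeric prefix of an OID with its name.

    Any remaining components (the instance suffix, such as an interface
    index) are kept. Unknown OIDs are returned unchanged.
    """
    parts = oid.strip(".").split(".")
    for cut in range(len(parts), 0, -1):
        name = MIB.get(".".join(parts[:cut]))
        if name is not None:
            return ".".join([name] + parts[cut:])
    return oid


print(translate("1.3.6.1.2.1.1.1.0"))        # system description, instance 0
print(translate("1.3.6.1.2.1.2.2.1.10.2"))   # inbound octets on interface 2
```

Without the table, the management system still receives the value for each OID, but has no readable name to attach to it.
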
Collectd
--------

`Collectd <https://collectd.org>`_ collects system-level metrics on
each machine. It works by loading a list of plugins and polling data
from various sources. The data is sent to different backends
(Graphite, Riemann) and can be used to trigger alerts with Nagios.

Sensu
-----
`Sensu <https://github.com/sensu>`_ was written as a highly
configurable Nagios replacement. Sensu can be described as a
"monitoring router", since it connects check scripts across any number
of systems with handler scripts run on one or more Sensu servers. It
is compatible with existing Nagios checks, and additional checks can be
written in any language, much like Nagios checks. Check
scripts can send alert data to one or more handlers for flexible
notifications. Sensu provides the server, client, API, and dashboard
needed to build a complete monitoring system.

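The "monitoring router" idea can be pictured with a conceptual sketch.
This is not Sensu's API or configuration format; it only models the routing step, in which one check result is delivered to every handler registered for that check.
All names here (``route``, ``subscriptions``, the handler names, and the event fields) are hypothetical.

```python
def route(event, handlers, subscriptions):
    """Deliver one check result to every handler subscribed to its check.

    event is a dict with at least a "check" key. handlers maps a handler
    name to a callable. subscriptions maps a check name to a list of
    handler names. Returns the names of the handlers that were called.
    """
    delivered = []
    for name in subscriptions.get(event["check"], []):
        handlers[name](event)
        delivered.append(name)
    return delivered


# Hypothetical handlers that just record what they were asked to send.
outbox = []
handlers = {
    "email": lambda event: outbox.append(("email", event["host"])),
    "pager": lambda event: outbox.append(("pager", event["host"])),
}
subscriptions = {
    "check_disk": ["email"],
    "check_http": ["email", "pager"],
}

# Status 2 follows the Nagios-style convention for a critical result.
result = {"check": "check_http", "host": "web1", "status": 2}
print(route(result, handlers, subscriptions))
```

Here the same critical result for ``check_http`` reaches both the email and the pager handler, while ``check_disk`` results would reach email only.
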
Diamond
-------
`Diamond <https://github.com/BrightcoveOS/Diamond>`_ is a Python daemon
that collects system metrics and publishes them to Graphite
(and others). It is capable of collecting CPU, memory, network, I/O,
load, and disk metrics. Additionally, it features an API for implementing
custom collectors for gathering metrics from almost any source.

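Diamond's own collector API is not shown here.
As a rough illustration of what publishing to Graphite involves, the sketch below formats metrics for Graphite's plaintext protocol, in which each line carries a metric path, a value, and a Unix timestamp.
The function names, the ``servers.web1`` prefix, and the sample values are hypothetical.

```python
def graphite_line(path, value, timestamp):
    """Format one metric as a Graphite plaintext line: path value timestamp."""
    return f"{path} {value} {int(timestamp)}\n"


def graphite_lines(prefix, metrics, timestamp):
    """Format a dict of metrics under a common prefix, sorted by name."""
    return "".join(
        graphite_line(f"{prefix}.{name}", value, timestamp)
        for name, value in sorted(metrics.items())
    )


# Hypothetical sample values; a collector would read these from the system.
sample = {"load.01": 0.5, "cpu.user": 12}
print(graphite_lines("servers.web1", sample, 1700000000), end="")
```

A collector daemon would gather fresh values on an interval and write the resulting lines to the Graphite server over a network connection.
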