
Monitoring, Notifications, and Metrics 101
******************************************

History: How we used to monitor, and how we got better (monitors as tests)
==========================================================================

Perspective (end-to-end) vs Introspective monitoring
====================================================

Metrics: what to collect, what to do with them
==============================================

Common tools
============

Syslog (basics!)
----------------

Syslog-ng
---------

Nagios
------
Nagios is a monitoring tool created by Ethan Galstad.
It was originally released as an open-source project named NetSaint in 1999.
Due to trademark issues it was later renamed Nagios.
Since then it has become one of the most common monitoring tools for production environments, if not the most common.

Nagios is primarily an alerting tool: if desired, it will notify an administrator or group of administrators when a service enters a critical or warning state.
It has a basic web interface that allows for acknowledgment of issues and scheduling downtime.

Nagios is highly configurable and extensible due to its reliance on external commands to perform almost every task.
For example, every availability test in Nagios is a standard executable, and notifications are generated by an external command.
Because of this, Nagios does not restrict the administrator to a particular language for extending the system, and plug-ins and tests are often written in a variety of languages.
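
As an illustration of "every test is a standard executable", here is a minimal sketch of a disk usage check in Python.
It assumes the conventional plugin contract: one status line on standard output, and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN).
The function names and the 80%/90% thresholds are made up for this example.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to an (exit_code, status_line) pair.

    The thresholds are illustrative defaults, not Nagios settings.
    """
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def main(path="/"):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself could not run, so the state is unknown.
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    if usage.total == 0:
        print("DISK UNKNOWN - filesystem reports zero size")
        return UNKNOWN
    code, line = evaluate_disk(100.0 * usage.used / usage.total)
    print(line)
    return code
```

To run this as a real plugin, the script would finish by passing the return value of ``main()`` to ``sys.exit()``, since Nagios reads the state from the process exit code.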

Nagios has an extensive feature set.
It supports service and host alerting hierarchies via parenting, so you can reduce the number of alerts when a critical service or host fails.
It supports active and passive checks.
It has basic scheduling for on-call periods, and supports time periods in which you can disable alerting.
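
With a passive check, an outside process submits the result to Nagios instead of Nagios running the test itself.
The sketch below formats such a result as a ``PROCESS_SERVICE_CHECK_RESULT`` external command line.
The host and service names are placeholders, and the helper function is hypothetical rather than part of Nagios.

```python
import time


def passive_service_result(host, service, return_code, output, now=None):
    """Format one passive service check result as an external command line.

    Layout: [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    timestamp = int(time.time() if now is None else now)
    # A newline terminates the command, so keep the output on one line.
    clean_output = output.replace("\n", " ")
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{clean_output}\n"
    )
```

The resulting line would be written to the Nagios external command file, whose location is set by the ``command_file`` option in the main configuration.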

Because Nagios is so configurable, it can be difficult for the uninitiated to configure.
It can use many files for configuration, and a single syntax error will prevent the system from starting.
Additionally, the open-source version does not natively support adding and removing hosts dynamically; the configuration needs to be modified, and the server restarted to add or remove a host.


Graphite
--------

Ganglia
-------

Munin
-----

RRDTool / Cacti
---------------

Icinga
------

SNMP
----
Simple Network Management Protocol, or SNMP, is a monitoring and management protocol.
It is the standard way of monitoring switches, routers, and other networking equipment.
SNMP relies on agents which, when contacted by a management system, return the information requested.
The data provided by an agent is addressed by Object Identifiers, or OIDs, each of which identifies one piece of information about the current system.
The values behind OIDs can be anything from strings identifying the system to the total number of frames received by the Ethernet controller.
Devices and systems are often provided with MIBs (Management Information Bases), which help the management system identify the information contained in an OID.
Lastly, management systems request information by providing a community string, for example ``public``.
These community strings allow the agent to determine what information is appropriate to return to the requester, and whether the requesting system has read-only or read-write access.
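
To make the relationship between OIDs and MIBs concrete, the sketch below translates numeric OIDs into readable names using a tiny hand-written lookup table.
The three entries come from the standard MIB-2 ``system`` group.
The table and function names are invented for this example, and a real management system would load full MIB files instead.

```python
# A toy stand-in for a MIB: numeric OID prefix -> symbolic name.
TOY_MIB = {
    (1, 3, 6, 1, 2, 1, 1, 1): "sysDescr",
    (1, 3, 6, 1, 2, 1, 1, 3): "sysUpTime",
    (1, 3, 6, 1, 2, 1, 1, 5): "sysName",
}


def parse_oid(text):
    """Turn '1.3.6.1' (with or without a leading dot) into a tuple of ints."""
    return tuple(int(part) for part in text.strip(".").split("."))


def describe_oid(text, mib=TOY_MIB):
    """Return a symbolic name for an OID, or the input if nothing matches."""
    oid = parse_oid(text)
    # Try the longest prefix first; whatever is left is the instance index.
    for length in range(len(oid), 0, -1):
        name = mib.get(oid[:length])
        if name is not None:
            suffix = ".".join(str(number) for number in oid[length:])
            return f"{name}.{suffix}" if suffix else name
    return text
```

For example, ``describe_oid("1.3.6.1.2.1.1.5.0")`` yields ``sysName.0``, the instance an agent typically returns when asked for the device's name.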

There are three commonly used versions of the protocol: SNMPv1, SNMPv2c, and SNMPv3.
SNMPv3 is the only cryptographically secure version of the protocol.
Most devices support at least two versions of SNMP.

Collectd
--------

`Collectd <https://collectd.org>`_ collects system-level metrics on
each machine.  It works by loading a list of plugins and polling data
from various sources.  The data are sent to different backends
(Graphite, Riemann) and can be used to trigger alerts with Nagios.

Sensu
-----
`Sensu <https://github.com/sensu>`_ was written as a highly
configurable Nagios replacement. Sensu can be described as a
"monitoring router", since it connects check scripts run across any
number of systems with handler scripts run on one or more Sensu
servers. It is compatible with existing Nagios checks, and additional
checks can be written in any language, much like Nagios checks. Check
scripts can send alert data to one or more handlers for flexible
notifications. Sensu provides the server, client, API, and dashboard
needed to build a complete monitoring system.
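
The "monitoring router" idea can be sketched in a few lines of Python.
This is a toy model, not Sensu's API: the result dictionary layout, the
``route_result`` function, and the handler names are all assumptions
made for illustration.

```python
def route_result(result, handlers, default=("default",)):
    """Deliver one check result to every handler the result names.

    `handlers` maps a handler name to a callable. Names that are not
    registered are skipped in this sketch instead of raising an error.
    """
    delivered = []
    for name in result.get("handlers", default):
        handler = handlers.get(name)
        if handler is None:
            continue
        handler(result)
        delivered.append(name)
    return delivered
```

A result such as ``{"check": "disk", "status": 2, "handlers": ["email",
"pager"]}`` would be passed to both the ``email`` and ``pager``
callables, which is the fan-out behaviour the router description refers
to.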

Diamond
-------
`Diamond <https://github.com/BrightcoveOS/Diamond>`_ is a Python daemon
that collects system metrics and publishes them to Graphite
(and others). It is capable of collecting CPU, memory, network, I/O,
load, and disk metrics. Additionally, it features an API for implementing
custom collectors for gathering metrics from almost any source.
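
To show what "publishing to Graphite" amounts to on the wire, here is a
minimal sketch that formats one metric in Graphite's plaintext protocol
(metric path, value, and Unix timestamp separated by spaces). It is not
Diamond's collector API; the function name and the metric path used
below are placeholders.

```python
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric as a Graphite plaintext protocol line."""
    if not path or any(ch.isspace() for ch in path):
        raise ValueError("metric path must be non-empty and contain no whitespace")
    ts = int(time.time() if timestamp is None else timestamp)
    # Fields are space-separated and each metric ends with a newline.
    return f"{path} {value} {ts}\n"
```

A collector would send the encoded line to Carbon's plaintext listener,
which by default accepts TCP connections on port 2003.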