Appendix D: Nagios Monitoring integration

A) Introduction 

Lark Router Gateway contains hooks for monitoring its key internal components (elements). Each kind of element records performance data while the Gateway is running, and this data can be obtained via the Lark admin port. The kind of performance data depends on the kind of element. 

Nagios is an Open Source system and network monitoring application. It watches hosts and services that you specify, alerting you when things go bad and when they get better. Nagios supports service plugins, enabling it to monitor non-standard services. This document describes Nagios monitoring facilities and the necessary plugin for Lark MMS Gateway. 

B) What can be Monitored 

Lark has a number of key internal elements that can be monitored externally. Each key element records statistics/data about its performance, which are available on request via the Lark admin port of the relevant component (i.e. mmsbox or mmsc). Depending on the component, there are three main kinds of data available: 

1. Liveness: Whether the element is active. This applies to ports and binds (see below). An element is inactive if, for example, it was not started or has been temporarily suspended or shut down. An error is triggered if the selected entity is not active. 

2. Counters: These are generalised counters, such as the number of messages in the queue, or number of threads. 

3. Frequency Counters: These are usage counters averaged over pre-set time intervals (see the stats-interval configuration directive in Lark). For instance, Lark can be configured to track the average number of MT MMS sent per second on a carrier bind over the last 5 seconds, 60 seconds and 1 hour. These data are then made available on request on the admin port of mmsbox.
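
To make the idea of a frequency counter concrete, the following is a minimal Python 3 sketch of a per-second rate averaged over several trailing intervals. It only illustrates the concept and does not show how Lark computes these values internally; the class name FrequencyCounter and its methods are hypothetical.

```python
from collections import deque


class FrequencyCounter:
    """Average event rate (per second) over several trailing intervals.

    Hypothetical illustration only; not Lark's internal implementation.
    """

    def __init__(self, intervals=(5, 60, 3600)):
        self.intervals = tuple(intervals)
        self.events = deque()  # event timestamps in seconds, oldest first

    def record(self, now):
        self.events.append(now)

    def rates(self, now):
        # Discard events older than the longest interval.
        horizon = now - max(self.intervals)
        while self.events and self.events[0] < horizon:
            self.events.popleft()
        # Events seen in the interval divided by its length in seconds.
        return {
            interval: sum(1 for t in self.events if t >= now - interval) / interval
            for interval in self.intervals
        }


counter = FrequencyCounter()
for second in range(100):  # one event per second for 100 seconds
    counter.record(float(second))
rates = counter.rates(now=100.0)
```

With one event per second, the 5-second and 60-second averages are both 1.0, while the 1-hour average is much lower because only 100 events fall within that interval.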

C) What are the Entities/Elements 

The Elements or Entities that can be monitored are as follows:

1. Port: These are general, incoming request ports for MM7 (VASP and Carrier) and MM1 (Carrier only). Each port maintains the following statistics: 

i. Queue: This is the size of the queue of pending (incoming) requests. That is, the requests that are pending processing (i.e. HTTP requests received that have not yet been processed/replied to). 

ii. WaitingThread: This is the number of waiting (i.e. sleeping) processor threads on the port at the instant. That is, the number of processors that are not busy processing requests, but are waiting for new requests. 

iii. Throughput: These are the frequency counters measuring the rate (per second) at which received requests have been processed, measured over the pre-set time intervals (see above). 

iv. AuthErrors: These are the rates (per second) at which authentication errors have been recorded on the port over the pre-set periods. 

v. IPErrors: These are the rates (per second) at which IP address access check errors have been recorded on the port over the pre-set periods. (IP check errors are requests which are denied because the client request was received from a blacklisted or non-whitelisted IP address as per the port’s configuration.) 

vi. Errors: These are the rates (per second) at which all types of errors have been recorded on the port over the pre-set periods. 

2. Bind: These are carrier or VASP binds. Each bind is identified by its *id* field. Each bind maintains the following statistics: 

i. MT: These are the frequency counters measuring the rate (per second) at which MT MMS have been processed, measured over the pre-set time intervals (see above). 

ii. MO: These are the frequency counters measuring the rate (per second) at which MO (including DLR) MMS have been processed, measured over the pre-set time intervals (see above). 

iii. DLR: These are the frequency counters measuring the rate (per second) at which DLR MMS have been processed, measured over the pre-set time intervals (see above). 

iv. Forward: These are the frequency counters measuring the rate (per second) at which MM1 notifications have been processed/sent (for MM1 binds) or messages have been sent (i.e. MM7/MM4 forward requests), measured over the pre-set time intervals (see above). For non-MM1 binds this is the same as the MT counters. 

v. ForwardAck: These are the frequency counters measuring the rate (per second) at which MM1 notifications are acknowledged by the SMSC for MM1 binds. That is, counts of notification SMS DLRs measured over the pre-set time intervals (see above). For MM7 and MM4 binds this is the frequency of delivery reports (i.e. it is the same as the DLR counters).

vi. Fetched and Retrieved: These are the frequency counters measuring the rate (per second) at which *Fetched* DLR MMS have been received (and MMS have been fetched on MM1 binds), measured over the pre-set time intervals (see above). 

vii. Mm1Notify: These are the frequency counters measuring the rate (per second) at which MM1 notifications have been processed/sent, measured over the pre-set time intervals (see above). Applies only to MM1 carrier binds. 

viii. Mm1Fetched: These are the MM1 fetched counters. This is the same as the MT counters for MM1 binds. 

ix. Mm1NotifyAck: These are the frequency counters measuring the rate (per second) at which MM1 notifications are acknowledged by the SMSC. That is, counts of notification SMS DLRs measured over the pre-set time intervals (see above). 

x. Mm1NotifyAverages: These are *mm1notifyack* divided by *mm1notify* frequency counters (see above). 

xi. Throughput: These are the frequency counters measuring the rate (per second) at which combined MT and MO requests have been processed, measured over the pre-set time intervals (see above). 

xii. AuthErrors: These are the rates (per second) at which authentication errors have been recorded on the bind over the pre-set periods. 

xiii. IPErrors: These are the rates (per second) at which IP address access check errors have been recorded on the bind over the pre-set periods. Note that this metric is only available/valid if the bind has a specific/dedicated receive port. For instance, carrier binds that receive MO (and/or DLR) MMS on the general MM7 port will naturally not record any IP check errors, since those will be reported against the main MM7 port. 

xiv. ParseErrors: These are the rates (per second) at which MM7/MM1 protocol errors have been recorded on the bind over the pre-set periods. These are errors such as invalid XML or SOAP received on an MM7 interface, or an MM1 fetch with an invalid queue ID in the request URL. 

xv. HTTPErrors: These are the rates (per second) at which HTTP errors, such as an HTTP request failure or socket failure, have occurred. 

xvi. OtherErrors: These are the rates (per second) at which other or uncategorised errors have been recorded on the bind over the pre-set periods. These errors include HTTP request failure (e.g. due to a TCP/IP failure) as well as message rejection via SOAP status codes (i.e. status 4XXX), etc. 

xvii. Errors: These are the rates (per second) at which all types of errors have been recorded on the bind over the pre-set periods. 

xviii. Sockets: This is the number of active connections (sockets), incoming and outgoing, at the instant.

xix. WaitingThread: This is the number of waiting (i.e. sleeping) processor threads on the bind at the instant. That is, the number of processors that are not busy processing requests, but are waiting for new requests. 

xx. DR: This is the deliverability rate on the bind or bind group. 

xxi. RR: This is the rejection rate on the bind or bind group. 

3. Queue: This is the queue management sub-system of Lark. It monitors and can report on the queue sizes for any bind. Each bind (VASP or Carrier) is identified by its *id* field. Two kinds of metrics are reported: 

i. Queue: This reports the size of the in-memory queue of incoming messages pending processing. This queue holds the messages received on the bind from the carrier or VASP, that have not yet been parsed or stored to the database. 

ii. DBQueue: This reports the size of the on-database queue (at that instant) for the selected bind. Two counters are reported: IN and OUT, which are, respectively, the number of messages received (and not yet forwarded) on the bind, and the number of messages waiting to be sent on the bind. Note that these data always include messages queued for retry. They do not, however, include messages deemed sent (that is, MM1 MT MMS awaiting fetch by mobile devices for which the notification has already been sent). 

4. Threads: This is the threads monitoring sub-system in Lark. Only one value is reported: The number of processing threads in the entire component (i.e. *mmsbox* or *mmsc*). 
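
As an illustration of the derived Mm1NotifyAverages statistic described above, the Python 3 sketch below divides the *mm1notifyack* rate by the *mm1notify* rate for each interval. The function name, the dictionary layout and the handling of intervals with no notifications are assumptions made for this example, not Lark's actual behaviour.

```python
def notify_averages(notify_ack, notify):
    """Per-interval ratio of acknowledged to sent MM1 notifications.

    Both arguments map an interval length in seconds to a per-second rate.
    An interval in which no notifications were sent yields None instead of
    raising ZeroDivisionError (an assumption of this sketch).
    """
    return {
        interval: (notify_ack.get(interval, 0.0) / sent if sent else None)
        for interval, sent in notify.items()
    }


# Made-up sample rates for the intervals 5, 60 and 3600 seconds.
mm1notify = {5: 4.0, 60: 2.0, 3600: 0.0}
mm1notifyack = {5: 3.0, 60: 2.0, 3600: 0.0}
averages = notify_averages(mm1notifyack, mm1notify)
```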

D) The Nagios Plugin 

The Nagios plugin script check_lark.py is a Python script that links Nagios to Lark. The script must be called with the following command line arguments: 

--element, --entity or -e: Identifies the element type to query. Must be one of queue, threads, port or bind. 

--id or -b: The id of the entity. For binds this is the bind ID, for ports this is the port name (i.e. mm7 or mm1). This parameter is ignored for all other element types. 

--type: The type of bind or port. Must be one of *vasp* or *carrier*. 

--stats or -m: The data/statistics required. The kinds of data available for each kind of element are as listed above. 

--host or -h: The IP or host address of Lark on which its admin interface can be reached. Defaults to localhost. 

--port or -p: The port on which the Lark admin interface can be reached. 

--password or -a: The admin password required to authenticate to the Lark component admin interface.

--ssl or -s: Whether to use HTTPS to access the admin interface or not. Default is HTTP. 

--warning or -w: The range of values used to determine whether the data received should trigger a warning in Nagios or not. See below for more on the format of these values. 

--critical or -c: The range of values used to determine whether the data received should trigger an error in Nagios or not. See below for more on the format of these values. 

--verbose or -v: Turn on verbose output. 
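
The option list above could be declared with Python's standard argparse module roughly as follows. This is a sketch written for Python 3, not the source of check_lark.py. It assumes that --ssl and --verbose are simple flags, that the port is an integer, and that only the element type is mandatory. Note that argparse's built-in help must be disabled so that -h can stand for --host.

```python
import argparse


def build_parser():
    # add_help=False frees -h so that it can be used for --host.
    parser = argparse.ArgumentParser(prog="check_lark.py", add_help=False)
    parser.add_argument("--element", "--entity", "-e", dest="element",
                        required=True,
                        choices=["queue", "threads", "port", "bind"])
    parser.add_argument("--id", "-b", dest="id")
    parser.add_argument("--type", dest="type", choices=["vasp", "carrier"])
    parser.add_argument("--stats", "-m", dest="stats")
    parser.add_argument("--host", "-h", dest="host", default="localhost")
    parser.add_argument("--port", "-p", dest="port", type=int)
    parser.add_argument("--password", "-a", dest="password")
    # Assumed to be on/off flags; HTTP is used unless --ssl is given.
    parser.add_argument("--ssl", "-s", dest="ssl", action="store_true")
    parser.add_argument("--verbose", "-v", dest="verbose", action="store_true")
    # Thresholds are kept as strings and interpreted later as ranges.
    parser.add_argument("--warning", "-w", dest="warning")
    parser.add_argument("--critical", "-c", dest="critical")
    return parser


args = build_parser().parse_args([
    "--entity=queue", "--stats=queue", "--port=8001", "--password=test",
    "--id=mm7test", "--type=vasp", "--warning=10", "--critical=0:200",
])
```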

E) Warning and Error Range Specifications 

Nagios plugins report warning and error conditions based on thresholds and ranges. When service data falls outside a configured range, the plugin returns an error code to Nagios, which in turn triggers a configured action. 

The Lark plugin expects warning and error values (the --warning and --critical parameters above) to be supplied using the format described in the Nagios Plugins Development Document. So for instance: 

check_lark.py --entity=queue --stats=queue --port=8001 --password=test --id=mm7test --type=vasp --warning=10 --critical=0:200 

will trigger a warning if the mm7test bind has more than 10 items pending processing in its internal queue, and a CRITICAL error if it has more than 200 items in queue. 

A missing range/threshold is always skipped, and no threshold check is performed. 
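
The following Python 3 sketch shows how a single threshold in the Nagios range format ([@]start:end) can be evaluated and mapped to the standard Nagios exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). The function names are hypothetical and this is not the plugin's own code.

```python
import math

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def parse_range(spec):
    """Parse '[@]start:end' into (start, end, invert)."""
    invert = spec.startswith("@")
    if invert:
        spec = spec[1:]
    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
    else:
        start_text, end_text = "0", spec  # a bare 'N' means 0:N
    if start_text == "~":
        start = -math.inf  # '~' stands for negative infinity
    else:
        start = float(start_text) if start_text else 0.0
    end = float(end_text) if end_text else math.inf  # 'N:' has no upper bound
    if start > end:
        raise ValueError("range start exceeds end: %r" % spec)
    return start, end, invert


def alerts(value, spec):
    """True if value breaches the threshold; a missing spec never alerts."""
    if not spec:
        return False
    start, end, invert = parse_range(spec)
    inside = start <= value <= end  # the end points are inclusive
    return inside if invert else not inside


def status(value, warning=None, critical=None):
    # The critical threshold takes precedence over the warning threshold.
    if alerts(value, critical):
        return CRITICAL
    if alerts(value, warning):
        return WARNING
    return OK
```

With the thresholds from the example above, status(5, "10", "0:200") gives OK, status(11, "10", "0:200") gives WARNING, and status(201, "10", "0:200") gives CRITICAL.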

For interval-based frequency counters, the ranges are supplied as a list of comma separated ranges. If a range is not supplied for an interval, then the data in that interval are not checked. 

For example, assuming the pre-set intervals in Lark are 5,60,3600, then: 

check_lark.py --stats=throughput --entity=port --port=8001 --password=test -c '2:4,1:5' --type=carrier --id=mm7 

will issue an error if the average throughput on the MM7 port over the last 5 seconds is less than 2 or greater than 4, or if the average throughput over the last minute is less than 1 or greater than 5. The throughput over the last hour is not checked. 
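
The per-interval check described above can be sketched in Python 3 as follows. For brevity this example only handles ranges written explicitly as low:high, which is a simplification of the full Nagios range format; the function name and argument layout are hypothetical.

```python
def breached_intervals(rates, ranges, intervals=(5, 60, 3600)):
    """Return the intervals whose rate lies outside its 'low:high' range.

    rates  -- one measured rate per interval, in the same order
    ranges -- comma-separated list of 'low:high' ranges, e.g. '2:4,1:5'
    """
    breached = []
    specs = ranges.split(",") if ranges else []
    # zip() stops at the shortest sequence, so intervals without a range
    # are not checked; an empty entry (as in '2:4,,0:9') is skipped too.
    for interval, rate, spec in zip(intervals, rates, specs):
        if not spec.strip():
            continue
        low, high = (float(part) for part in spec.split(":"))
        if not low <= rate <= high:
            breached.append(interval)
    return breached


# The 5-second rate is within 2:4, the 1-minute rate is below 1:5,
# and the 1-hour rate has no range and is therefore not checked.
result = breached_intervals([3.0, 0.5, 100.0], "2:4,1:5")
```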

F) Performance Data 

The Lark plugin also reports the raw performance data on each request. This can then be displayed/plotted using other Nagios plugins. The example Lark Nagios config included shows how to use Nagios Graphs to plot the statistics received.

Frequency counters are each labelled with the time interval (i.e. “5 seconds”, “1 minute”, “5 days”, etc.), whereas other basic data (e.g. number of threads) are unlabelled. 
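
Nagios plugins conventionally append performance data to their output line after a | character, as space-separated label=value pairs, with single quotes around labels that contain spaces. The Python 3 sketch below formats interval-labelled frequency counters in that way. The summary text and the sample values are made up for illustration, and the exact output of the Lark plugin may differ.

```python
def perfdata(values):
    """Format {label: value} as Nagios performance data."""
    parts = []
    for label, value in values.items():
        # Labels containing spaces must be wrapped in single quotes.
        name = "'%s'" % label if " " in label else label
        parts.append("%s=%s" % (name, value))
    return " ".join(parts)


def plugin_output(summary, values):
    # Human-readable text and performance data are separated by '|'.
    return "%s | %s" % (summary, perfdata(values))


line = plugin_output(
    "LARK OK - throughput",
    {"5 seconds": 3.2, "1 minute": 2.5, "1 hour": 1.0},
)
```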

G) Integrating the Plugin into Nagios 

Nagios integration is beyond the scope of this document. However, a sample (representative, not complete) Nagios configuration for Lark is provided (see mbuni example.cfg). The configuration assumes Nagios Graphs is installed and in use, but removing this dependency is fairly straightforward. 

H) Pre-requisites 

The Nagios plugin requires Python v2.x. The sample Nagios configuration assumes Nagios v3.x or greater.