Reducing Data Volume

The last chapter discussed how data collected at a remote point could be enriched before being sent to a central collector. Sometimes, however, there is a lot of data relative to the available bandwidth, and we would like to use the distributed processing power of the nodes to reduce the amount of data sent to the collector. This can be an effective way to minimize licensing costs when using Splunk, for instance, and it also helps when working with edge sites that only have mobile data available.


One strategy is to discard events which are not considered important. filter has three variations. The first matches patterns against field values:

- filter:
    - severity: high
    - source: ^GAUTENG-

The second uses conditional expressions:

- filter:
    condition: speed > 1

A very useful trick is to only send changed values by using stream to watch for changes.

- stream:
    operation: delta
    watch: throughput
- filter:
    condition: delta != 0
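
To make the idea concrete outside of the pipe syntax, here is a minimal Python sketch of the same "send only changes" logic. It is an illustration, not how stream is implemented: the helper name changed_only is hypothetical, and it assumes that the first event is always passed through because it has no predecessor to compare against.

```python
from typing import Dict, Iterable, Iterator


def changed_only(events: Iterable[Dict], watch: str) -> Iterator[Dict]:
    """Yield only the events whose watched field differs from the previous value."""
    previous = None
    for event in events:
        value = event[watch]
        if previous is None:
            # Assumption: no predecessor, so the first event is passed through.
            delta = None
        else:
            delta = value - previous
        previous = value
        if delta != 0:
            # Attach the computed difference, mirroring the 'delta' field above.
            yield {**event, "delta": delta}


readings = [{"throughput": v} for v in (10, 10, 12, 12, 12, 9)]
sent = list(changed_only(readings, "throughput"))
print(len(readings), "events in,", len(sent), "events sent")
```

Of the six readings, only the first one and the two that actually changed are forwarded.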

Discarding and Renaming

We may be only interested in certain data fields.

The third variation of filter is schema:

- filter:
    schema:
    - source
    - destination
    - sent_kilobytes_per_sec

filter schema will only pass through events with the specified fields, and will discard any other fields.

This is useful both for documenting the event structure and for getting rid of any temporary fields generated during processing.

Individual fields can also be removed explicitly with remove.

When JSON is used as the data transport, field names are a significant part of the payload size, so renaming fields can make a difference:

- rename:
  - source: s
  - destination: d
  - sent_kilobytes_per_sec: sent
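
A quick way to see how much the shorter names save is to serialize the same event both ways and compare sizes. The Python sketch below does this; the event values and the helper rename_fields are made up for illustration, and compact JSON separators are assumed.

```python
import json

# The same mapping as the rename action above.
RENAMES = {"source": "s", "destination": "d", "sent_kilobytes_per_sec": "sent"}


def rename_fields(event: dict, renames: dict) -> dict:
    """Return a copy of the event with keys replaced according to 'renames'."""
    return {renames.get(key, key): value for key, value in event.items()}


def json_size(event: dict) -> int:
    """Size in characters of the event as compact JSON."""
    return len(json.dumps(event, separators=(",", ":")))


# Hypothetical event; the field values are invented for this example.
event = {"source": "GAUTENG-01", "destination": "CAPE-07", "sent_kilobytes_per_sec": 412}

long_size = json_size(event)
short_size = json_size(rename_fields(event, RENAMES))
print("before:", long_size, "after:", short_size)
```

The saving is the same for every event, so it adds up quickly when values are short and events are frequent.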

This naturally leads to the next section:

Using a More Compact Data Format

CSV is a very efficient data transfer format because rows do not repeat the column names.

collapse will convert the fields of a JSON event into CSV data.

However, you would typically need to convert this back into JSON to store it in Elasticsearch, for example. Having to create a Logstash filter to do this is tedious, so collapse provides some conveniences:

With CSV output you can ask for the column names and types to be written to a field. Optionally, you can ask for this header to be written only when the fields change:

# input:
# {"a":1,"b":"hello"}
# {"a":2,"b":"goodbye"}
- collapse:
    output-field: d
    csv:
        header-field: h
        header-field-types: true
        header-field-on-change: true
# output:
# {"d":"1,hello","h":"a:num,b:str"}
# {"d":"2,goodbye"}
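
The behaviour in this example can be approximated in a few lines of Python. This is only a sketch of the idea, not the actual collapse implementation: the function names are hypothetical, only the num and str type tags shown above are modelled, and the header is written whenever it differs from the previous one.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def type_tag(value) -> str:
    """Map a value to one of the two type tags used in the example header."""
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return "num" if is_number else "str"


def to_csv_row(values) -> str:
    """Format values as a single CSV row, quoting where the csv module requires it."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()


def collapse(events: Iterable[Dict], output_field: str = "d",
             header_field: str = "h") -> Iterator[Dict]:
    """Replace each event's fields with one CSV field, plus a header on change."""
    last_header = None
    for event in events:
        header = ",".join(f"{name}:{type_tag(value)}" for name, value in event.items())
        out = {output_field: to_csv_row(event.values())}
        if header != last_header:
            # Only emit the header when the column names or types have changed.
            out[header_field] = header
            last_header = header
        yield out


collapsed = list(collapse([{"a": 1, "b": "hello"}, {"a": 2, "b": "goodbye"}]))
for row in collapsed:
    print(row)
```

The second event carries no header because its columns match the first, which is where most of the saving comes from.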

The reverse operation, expand, can run in a server-side pipe. It takes the output of the remote pipe and restores the original events.

- expand:
    input-field: d
    remove: true
    delim: ','    # default
    csv:
        header-field: h
        header-field-types: true
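
The server-side step can be sketched in Python in the same spirit. Again this is an illustration under assumptions rather than the real expand: the function names are hypothetical, the header is assumed to have the name:type form shown in the collapse example, and the most recent header is remembered for events that arrive without one.

```python
import csv
from typing import Dict, Iterable, Iterator


def convert(text: str, type_tag: str):
    """Turn a CSV cell back into a value; 'num' becomes int or float, else str."""
    if type_tag != "num":
        return text
    try:
        return int(text)
    except ValueError:
        return float(text)


def expand(events: Iterable[Dict], input_field: str = "d",
           header_field: str = "h", delim: str = ",") -> Iterator[Dict]:
    """Rebuild the original fields from the CSV field and the last seen header."""
    columns = []
    for event in events:
        event = dict(event)  # work on a copy so the caller's data is untouched
        header = event.pop(header_field, None)
        if header is not None:
            # Each header entry looks like 'name:type'.
            columns = [tuple(entry.split(":", 1)) for entry in header.split(",")]
        # Popping the input field corresponds to 'remove: true' above.
        row = next(csv.reader([event.pop(input_field)], delimiter=delim))
        for (name, type_tag), cell in zip(columns, row):
            event[name] = convert(cell, type_tag)
        yield event


received = [{"d": "1,hello", "h": "a:num,b:str"}, {"d": "2,goodbye"}]
restored = list(expand(received))
for event in restored:
    print(event)
```

Because the header is sent only on change, a receiver that starts mid-stream has no column names until the next header arrives; this sketch does not handle that case.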

If there is a corresponding pipe on the server, then you can move any enrichments to that pipe as well.