Goals of this course

This course gives an introduction to Spicy, a parser generator for network protocols and file formats which integrates seamlessly with Zeek. The course is complementary to the Spicy reference documentation and is intended to give a more guided (though selective and incomplete) tour of Spicy and its use with Zeek.

After this course you should be comfortable implementing protocol parsers specified in an RFC and integrating them with Zeek.

Why Spicy?

Historically, extending Zeek with new parsers required interacting with Zeek's C++ API, which was a significant barrier to entry for domain experts.

Spicy is a domain-specific language for developing parsers for network protocols or file formats which integrates well with Zeek.

Flexible multi-paradigm language

With Spicy, parsers can be expressed declaratively in a format close to specifications, e.g., the following TFTP ERROR message

#  2 bytes     2 bytes      string    1 byte
#  -----------------------------------------
# | Opcode |  ErrorCode |   ErrMsg   |   0  |
#  -----------------------------------------

can be expressed in Spicy as

type Error = unit {
    code: uint16;
    msg:  bytes &until=b"\x00";
};
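For comparison, a hand-written decoder for the same two fields could look roughly like the following Python sketch (Python is used here only for illustration; it is not Spicy). The helper name `parse_tftp_error` is hypothetical, and the sketch assumes the 2-byte opcode has already been consumed so that the input starts at the error code.

```python
import struct


def parse_tftp_error(data: bytes) -> tuple[int, bytes]:
    """Hypothetical helper: decode ErrorCode and ErrMsg of a TFTP ERROR message.

    Assumes the opcode was consumed already, mirroring the two-field unit.
    """
    if len(data) < 2:
        raise ValueError("truncated error code")
    # `uint16` in network byte order (big-endian).
    (code,) = struct.unpack("!H", data[:2])
    # `bytes &until=b"\x00"`: everything up to, but excluding, the terminator.
    end = data.find(b"\x00", 2)
    if end < 0:
        raise ValueError("missing NUL terminator")
    return code, data[2:end]


print(parse_tftp_error(b"\x00\x01File not found\x00"))
```

All bounds checks and the search for the terminator are spelled out by hand here; in the Spicy unit they are implied by the field declarations.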

Spicy supports procedural code which can be hooked into parsing to support more complex parsing scenarios.

function sum(a: uint64, b: uint64): uint64 { return a + b; }

type Fold = unit {
    a: uint8;
    b: uint8 &convert=sum(self.a, $$);
    c: uint8 &convert=sum(self.b, $$);
};

Incremental parsing

The parsers generated by Spicy automatically support incremental parsing. Data can be fed as it arrives without blocking until all data is available. Hooks allow reacting to parse results.
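What incremental parsing means can be sketched outside of Spicy. The following Python model (an illustration only, not how Spicy implements it) buffers input across `feed` calls and invokes a callback once a complete value is available. The class name `IncrementalU32Parser`, the fixed 4-byte big-endian message, and the callback are all assumptions made for this example.

```python
import struct


class IncrementalU32Parser:
    """Hypothetical model: parses consecutive 4-byte big-endian integers."""

    def __init__(self, on_done):
        self._buffer = b""
        self._on_done = on_done  # plays the role of a hook

    def feed(self, chunk: bytes) -> None:
        """Accept whatever data is available and parse as far as possible."""
        self._buffer += chunk
        while len(self._buffer) >= 4:
            (value,) = struct.unpack("!I", self._buffer[:4])
            self._buffer = self._buffer[4:]
            self._on_done(value)


seen = []
parser = IncrementalU32Parser(seen.append)
for chunk in (b"\x00\x00", b"\x00", b"\xff\x00\x00\x00\x01"):
    parser.feed(chunk)  # returns immediately, even if a value is incomplete

print(seen)
```

The caller never has to wait for a full message; results are reported through the callback as soon as enough bytes have arrived.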

Built-in safety

Spicy code is executed safely so many common errors are rejected, e.g.,

integer under- or overflows
incorrect use of iterators
unhandled switch cases

Integration into Zeek

Spicy parsers can trigger events in Zeek. Parse results can transparently be made available to Zeek script code.

Prerequisites

To follow this course, installing recent versions of Zeek and Spicy is required (at least zeek-6.0 and spicy-1.8). The Zeek documentation shows the different ways Zeek can be installed.

In addition we require:

a text editor to write Spicy code
a C++ compiler to compile Spicy code and Zeek plugins
CMake for developing Zeek plugins with Spicy
development headers for libpcap to compile Zeek plugins

Docker images

The Zeek project provides Docker images.

Zeek playground

A simplified approach for experimentation is to use the Zeek playground repository which offers an environment integrated with Visual Studio Code. Either clone the project, open it locally in Visual Studio Code and install the recommended plugins, or open it directly in a GitHub Codespace from the GitHub repository view.

Spicy language

This chapter gives a coarse overview of the Spicy language.

We show a selection of features from the Spicy reference documentation.

In terms of syntax Spicy borrows many aspects from C-like languages.

Hello world

# hello.spicy
module hello;

print "Hello, world!";

every Spicy source file needs to declare the module it belongs to
global statements are run once when the module is initialized

Compile and run this code with

$ spicyc -j hello.spicy
Hello, world!

Here we have used Spicy's spicyc executable to compile and immediately run the source file hello.spicy.

Hint

Most commands which compile Spicy code support -d to build parsers in debug mode. This is often faster than building production code and useful during parser development.

$ spicyc -j -d hello.spicy
Hello, world!

Basic types

All values in Spicy have a type.

While in some contexts types are required (e.g., when declaring types, or function signatures), types can also be inferred (e.g., for variable declarations).

global N1 = 0;        # Inferred as uint64.
global N2: uint8 = 0; # Explicitly typed.

# Types required in signatures, here: `int64` -> `void`
function foo(arg: int64) {
    local inc = -1; # Inferred as int64.
    print arg + inc;
}

Spicy provides types for e.g.,

integers, booleans
optional
bytes, string
tuples and containers
enums, structs
special purpose types for e.g., network address, timestamps, or time durations

See the documentation for the full list of supported types and their API.

Boolean and integers

Boolean

Booleans have two possible values: True or False.

global C = True;

if (C)
    print "always runs";

Integers

Spicy supports both signed and unsigned integers with widths of 8, 16, 32 and 64 bits:

uint8, uint16, uint32, uint64
int8, int16, int32, int64

Integers are checked against overflows at both compile time and runtime. Overflows are either rejected statically or trigger runtime exceptions.

Integer literals without sign like e.g., 4711 default to uint64; if a sign is given int64 is used, e.g., -47, +12.

If permitted, integer types convert into each other when required; where this is not automatically possible, one can explicitly cast between integer types:

global a: uint8 = 0;
global b: uint64 = 1;

# Modify default: uint8 + uint64 -> uint64.
global c: uint8 = a + cast<uint8>(b);

Optional

Optionals either contain a value or nothing. They are a good choice when one wants to denote that a value can be absent.

optional is a parametric (also sometimes called generic) type in that it wraps a value of some other type.

global opt_set1 = optional(4711);
global opt_set2: optional<uint64> = 4711;

global opt_unset: optional<uint64>;

Optionals implicitly convert to booleans. This can be used to check whether they are set.

assert opt_set1;
assert ! opt_unset;

Assigning Null to an optional empties it.

global x = optional(4711);
assert x;
x = Null;
assert ! x;

To extract the value contained in an optional, dereference it with the * operator.

global x = optional(4711);
assert *x == 4711;

Bytes and strings

The bytes type represents raw bytes, typically from protocol data. Literals for bytes are written with prefix b, e.g., b"\x00byteData\x01".

The string type represents text in a given character set.

Conversions between bytes and string are always explicit, via bytes' decode method or string's encode, e.g.,

global my_bytes = b"abc";
global my_string = "abc";
global my_other_string = my_bytes.decode(); # Default: UTF-8.

print my_bytes, my_string, my_other_string;

bytes can be iterated over.

for (byte in b"abc") {
    print byte;
}

Use the format operator % to compute a string representation of Spicy values. Format strings roughly follow the POSIX format string API.

global n = 4711;
global s = "%d" % n;

The format operator can be used to format multiple values.

global start = 0;
global end = 1024;
print "[%d, %d)" % (start, end);

Collections

Tuples

Tuples are heterogeneous collections of values. Tuple values are immutable.

global xs = (1, "a", b"c");
global ys = tuple(1, "a", b"c");
global zs: tuple<uint64, string, bytes> = (1, "a", b"c");
print xs, ys, zs;

Individual tuple elements can be accessed with subscript syntax.

print (1, "a", b"c")[1];  # Prints "a".

Optionally individual tuple elements can be named, e.g.,

global xs: tuple<first: uint8, second: string> = (1, "a");
assert xs[0] == xs.first;
assert xs[1] == xs.second;

Containers

Spicy provides data structures for lists (vector), and associative containers (set, map).

The element types can be inferred automatically, or specified explicitly. All of the following forms are equivalent:

global a1 = vector(1, 2, 3);
global a2 = vector<uint64>(1, 2, 3);
global a3: vector<uint64> = vector(1, 2, 3);

global b1 = set(1, 2, 3);
global b2 = set<uint64>(1, 2, 3);
global b3: set<uint64> = set(1, 2, 3);

global c1 = map("a": 1, "b": 2, "c": 3);
global c2 = map<string, uint64>("a": 1, "b": 2, "c": 3);
global c3: map<string, uint64> = map("a": 1, "b": 2, "c": 3);

All collection types can be iterated.

for (x in vector(1, 2, 3)) {
    print x;
}

for (x in set(1, 2, 3)) {
    print x;
}

# Map iteration yields a (key, value) `tuple`.
for (x in map("a": 1, "b": 2, "c": 1)) {
    print x, x[0], x[1];
}

Indexing into collections and iterators is checked at runtime.

Use |..| like in Zeek to obtain the number of elements in a collection, e.g.,

assert |vector(1, 2, 3)| == 3;

To check whether a set or map contains a given key use the in operator.

assert 1 in set(1, 2, 3);
assert "a" in map("a": 1, "b": 2, "c": 1);

User-defined types

Enums and structs are user-defined data types which allow you to give data semantic meaning.

Enums

Enumerations map integer values to a list of labels.

By default enum labels are numbered consecutively starting at 0.

type X = enum { A, B, C, };
local b: X = X(1);  # `X::B`.
assert 1 == cast<uint64>(b);

One can override the default label numbering.

Note

Providing values for either all or no labels tends to lead to more maintainable code. Spicy still allows providing values for only a subset of labels.

type X = enum {
    A = 1,
    B = 2,
    C = 3,
};

By default enum values are initialized with the implicit Undef label.

type X = enum { A, B, C, };
global x: X;
assert x == X::Undef;

If an enum value is constructed from an integer not corresponding to a label, an implicit label corresponding to the numeric value is used.

type X = enum { A, B, C, };

global x = X(4711);
assert cast<uint64>(x) == 4711;
print x;  # `X::<unknown-4711>`.

Structs

Structs are similar to tuples but mutable.

type X = struct {
    a: uint8;
    b: bytes;
};

Structs are initialized with Zeek record syntax.

global x: X = [$a = 1, $b = b"abc"];

Struct fields can be marked with an &optional attribute to denote optional fields. The ?. operator can be used to query whether a field was set.

type X = struct {
    a: uint8;
    b: uint8 &optional;
    c: uint8 &optional;
};

global x: X = [$a = 47, $b = 11];
assert x?.a;
assert x?.b : "%s" % x;
assert ! x?.c : "%s" % x;

Additionally, one can provide a &default value for struct fields to denote a value to use if none was provided on initialization. Fields with a &default are always set.

type X = struct {
    a: uint8;
    b: uint8 &default=11;
    c: bytes &optional &default=b"abc";
};

global x: X = [$a = 47];
assert x.b == 11;
assert x.c;

Exercises

What happens at compile time if you try to create a uint8 value outside of its range, e.g., uint8(-1) or uint8(1024)?

What happens at runtime if you perform an operation which leaves the domain of an integer value, e.g.,

global x = 0;
print x - 1;

global y: uint8 = 255;
print y + 1;

global z = 1024;
print cast<uint8>(z);

print 4711/0;

What happens at compile time if you access a non-existing tuple element, e.g.,

global xs = tuple(1, "a", b"c");
print xs[4711];

global xs: tuple<first: uint8, second: string> = (1, "a");
print xs.third;

What happens at runtime if you try to get a non-existing vector element, e.g.,
```
print vector(1,2,3)[4711];
```

What happens at runtime if you try to dereference an invalidated iterator, e.g.,

global xs = vector(1);
global it = begin(xs);
print *it;
xs.pop_back();
print *it;

Can you dereference a collection's end iterator?
What happens at runtime if you dereference an unset optional?

Variables

Variables in Spicy can either be declared at local or module (global) scope.

Local variables live in bodies of functions. They are declared with the local storage qualifier and are always mutable.

function hello(name: string) {
    local message = "Hello, %s" % name;
    print message;
}

Global variables live at module scope. If declared with global they are mutable, or immutable if declared with const.

module foo;

global N = 0;
N += 1;

const VERSION = "0.1.0";

Conditionals and loops

Conditionals

`if`/`else`

Spicy has if statements which can optionally contain else branches.

global x: uint64 = 4711;

if (x > 100) {
    print "%d > 100" % x;
} else if (x > 10) {
    print "%d > 10" % x;
} else if (x > 1) {
    print "%d > 1" % x;
} else {
    print x;
}

Hint

Surrounding bodies with {..} is optional, but often makes code easier to follow.

`switch`

To match a value against a list of possible options the switch statement can be used.

type Flag = enum {
    OFF = 0,
    ON = 1,
};

global flag = Flag::ON;

switch (flag) {
    case Flag::ON: print "on";
    case Flag::OFF: print "off";
    default: print "???";
}

In contrast to its behavior in e.g., C, in Spicy

there is no fall-through in switch, i.e., there is an implicit break after each case,
switch cases are not restricted to literal integer values; they can contain any expression,
if no matching case or default is found, a runtime error is raised.

Loops

Spicy offers two loop constructs:

`for` for loops over collections
`while` for raw loops

global xs = vector("a", "b", "c");

for (x in xs)
    print x;

global i = 0;

while (i < 3) {
    print i;
    ++i;
}

Functions

Functions in Spicy look like this:

function make_string(x: uint8): string {
    return "%d" % x;
}

Functions without a return value can be written either without a return type, or as returning void.

function nothing1() {}
function nothing2(): void {}

By default function arguments are passed as read-only references. To instead pass a mutable value declare the argument inout.

function barify(inout x: string) {
    x = "%s bar" % x;
}

global s = "foo";
assert s == "foo";
barify(s);
assert s == "foo bar";

Warning

While this should work for user-defined types, this still is broken for some builtin types, e.g., it works for passing string values, but is broken for integers.

If support is broken, you need to return a modified copy (use a tuple if you already return a value).

Exercises

Write a function computing values of the Fibonacci sequence, i.e., a function
```
function fib(n: uint64): uint64 { ... }
```
- if n < 2 return n
- else return fib(n - 1) + fib(n - 2)
For testing you can assert fib(8) == 21;.
Add memoization to your fib function. For that change its signature to
```
function fib(n: uint64, inout cache: map<uint64, uint64>): uint64 { ... }
```
This can then be called like so:
```
global m_fib: map<uint64, uint64>;
fib(64, m_fib);
```
For testing you can assert fib(64, m_fib) == 10610209857723;.
Try modifying your fib functions so users do not have to provide the cache themselves.

Modules revisited

Every Spicy file specifies the module it declares.

module foo;

Other modules can be imported with the import keyword.

Typically, to refer to a type, function or variable in another module, it needs to be declared public.

# file: foo.spicy
module foo;

public global A = 47;
public const B = 11;
const C = 42;

# file: bar.spicy
module bar;

import foo;

print foo::A, foo::B;

# Rejected: 'foo::C' has not been declared public
# print foo::C;

Hint

Declaring something public makes it part of the external API of a module. This makes certain optimizations inapplicable (e.g., dead code removal).

Only declare something public if you intend it to be used by other modules.

Parsing

Parsing in Spicy is centered around the unit type which in many ways looks similar to a struct type.

A unit declares an ordered list of fields which are parsed from the input.

If a unit is public it can serve as a top-level entry point for parsing.

module foo;

public type Foo = unit {
    version: uint32;

    on %done { print "The version is %s." % self.version; }
};

The parser for Foo consists of a single subparser which extracts a uint32 with the default network byte order.
The extracted uint32 is bound to a named field to store its value in the unit.
We added a unit hook which runs when the parser is done.

We can run that parser by using a driver which feeds it input (potentially incrementally).

$ printf '\x00\x00\x00\xFF' | spicy-driver -d foo.spicy
The version is 255.

We use spicy-driver as driver. It reads input from its stdin and feeds it to the parser, and executes hooks.

Another driver is spicy-dump which prints the unit after parsing. Zeek includes its own dedicated driver for Spicy parsers.

The major differences to struct are:

unit fields need to have a parsable type,
by default all unit fields are &optional, i.e., a unit value can have any or all fields unset.

Structure of a parser

A parser contains a potentially empty ordered list of subparsers which are invoked in order.

type Version = unit {
    major: uint32;
    minor: uint32;
    patch: uint32;
};

#   4 bytes   4 bytes   4 bytes
#  -----------------------------
# |  Major  |  Minor  |  Patch  |
#  -----------------------------
#
#   Figure 47-11: Version packet

Attributes

The behavior of individual subparsers or units can be controlled with attributes.

type Version = unit {
    major: bytes &until=b".";
    minor: bytes &until=b".";
    patch: bytes &eod;
} &convert="v%s.%s.%s" % (self.major, self.minor, self.patch);

There is a wide range of both generic and type-specific attributes, e.g.,

the &size and &max-size attributes to control how much data should be parsed,
the &parse-from and &parse-at attributes to change where data is parsed from,
&convert to transform the value and/or type of parsed data, or
&requires to enforce post conditions.

Type-specific attributes are documented together with their type.

Extracting data without storing it

If one needs to extract some data but does not need to keep it, one can declare an anonymous field (without name) to avoid storing it. With >=spicy-1.9.0 (>=zeek-6.1.0) one can additionally skip over input data explicitly.

# Parser for a series of digits. When done parsing yields the extracted number.
type Number = unit {
    n: /[[:digit:]]+/;
} &convert=self.n;

public type Version = unit {
    major: Number;
    : b".";
    minor: Number;
    : skip b".";
    patch: Number;
};
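The idea of matching separators without keeping them has a close analogue in regular expressions, where only captured groups are stored. The following Python sketch illustrates this (it is not Spicy; `parse_version` and `VERSION` are hypothetical names, and the ASCII range `[0-9]` is used as an approximation of `[[:digit:]]`).

```python
import re

# The dots are matched but not captured, like the anonymous and skipped fields.
VERSION = re.compile(rb"([0-9]+)\.([0-9]+)\.([0-9]+)")


def parse_version(data: bytes) -> dict:
    """Hypothetical model of the `Version` unit: keep only the three numbers."""
    match = VERSION.match(data)
    if match is None:
        raise ValueError("input is not a version")
    major, minor, patch = match.groups()
    return {"major": major, "minor": minor, "patch": patch}


print(parse_version(b"1.22.333"))
```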

Hooks

We can hook into parsing via unit or field hooks.

In hooks we can refer to the current unit via self, and the current field via $$. We can declare multiple hooks for the same field/unit, even in multiple files.

public type X = unit {
    x: uint8 { print "a=%d" % self.x; }

    on %done { print "X=%s" % self; }
};

on X::x {
    print "Done parsing a=%d" % $$;
}

Conditional parsing

During parsing we often want to decide at runtime what to parse next, e.g., certain fields might only be set if a previous field has a certain value, or the type for the next field might be known dynamically from a previous field.

We can specify that a field should only be parsed if a condition is met.

type Integer = unit {
    width: uint8 &requires=($$ != 0 && $$ < 8);
    u8 : uint8  if (self.width == 1);
    u16: uint16 if (self.width == 2);
    u32: uint32 if (self.width == 3);
    u64: uint64 if (self.width == 4);
};

Alternatively we can express this with a unit switch statement.

type Integer = unit {
    width: uint8 &requires=($$ != 0 && $$ < 8);
    switch (self.width) {
        1 -> u8: uint8;
        2 -> u16: uint16;
        3 -> u32: uint32;
        4 -> u64: uint64;
    };
};

In both cases the unit will include all fields, both set and unset. One can query whether a field has been set with ?., e.g.,

on Integer::%done {
    if (self?.u8) { print "u8 was extracted"; }
}
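The notion of set and unset fields can be modelled in a few lines of Python. The sketch below is not Spicy and only approximates the `if` variant of the `Integer` unit: `parse_integer` and `CASES` are hypothetical names, unset fields are represented as `None`, and integers are read in network byte order.

```python
import struct

# Width tag -> (field name, struct format); mirrors the four cases of the unit.
CASES = {1: ("u8", "!B"), 2: ("u16", "!H"), 3: ("u32", "!I"), 4: ("u64", "!Q")}


def parse_integer(data: bytes) -> dict:
    """Hypothetical model of the `Integer` unit; unset fields stay `None`."""
    if not data:
        raise ValueError("missing width")
    width = data[0]
    # Corresponds to `&requires=($$ != 0 && $$ < 8)`.
    if not (width != 0 and width < 8):
        raise ValueError("&requires failed for width")
    fields = {"width": width, "u8": None, "u16": None, "u32": None, "u64": None}
    if width in CASES:
        name, fmt = CASES[width]
        size = struct.calcsize(fmt)
        if len(data) < 1 + size:
            raise ValueError("not enough data")
        (fields[name],) = struct.unpack(fmt, data[1 : 1 + size])
    return fields


result = parse_integer(b"\x02\x01\x00")
print(result)
```

Checking `result["u8"] is None` in this model plays the role of `! self?.u8` in the hook above.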

Often parsing requires examining input and dynamically choosing a matching parser from the input. Spicy models this with lookahead parsing which is explained in a separate section.

Controlling byte order

The byte order used can be controlled on the module, unit, or field level.

# The 'ByteOrder' type is defined in the built-in Spicy module.
import spicy;

# Switch default from network byte order to little-endian for this module.
%byte-order=spicy::ByteOrder::Little;

# This unit uses big byte order.
type X = unit {
    # Use default byte order (big).
    a: uint8;

    # Use little-endian byte order for this field.
    b: uint8 &byte-order=spicy::ByteOrder::Little;
} &byte-order=spicy::ByteOrder::Big;
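The effect of the byte order on a decoded value can be reproduced with Python's `struct` module. This is only an illustration in Python, not Spicy code; the 4-byte sample input is an arbitrary choice.

```python
import struct

raw = b"\x00\x00\x00\xff"

# Network byte order (big-endian), the default mentioned above.
(big,) = struct.unpack(">I", raw)
# Little-endian interpretation of the very same bytes.
(little,) = struct.unpack("<I", raw)

print(big, little)
```

The same four bytes yield 255 when read as big-endian and 4278190080 (0xff000000) when read as little-endian, which is why a wrong byte order typically shows up as implausibly large values.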

Parsing types

Spicy parsers are built up from smaller parsers, at the lowest level from basic types present in the input.

Spicy supports parsing for a number of basic types; the reference documentation lists them.

Fields not extracting any data can be marked void. They can still have hooks attached.

Since they are pervasive we give a brief overview for vectors here.

Parsing vectors

A common requirement is to parse a vector of elements of the same type, possibly of dynamic length.

To parse a vector of three integers we would write:

type X = unit {
    xs: uint16[3];
};

If the number of elements is not known we can parse until the end of the input data. This will trigger a parse error if the input does not contain enough data to parse all elements.

type X = unit {
    xs: uint16[] &eod;
};

If the vector is followed by e.g., a literal we can dynamically detect with lookahead parsing where the vector ends. The literal does not need to be a field, but could also be in another parser following the vector.

type X = unit {
    xs: uint16[];
      : b"\x00"; # Vector is terminated with null byte.
};

If the terminator is in the domain of the vector elements we can also use the &until attribute.

type X = unit {
    # Vector terminated with a null value
    xs: uint8[] &until=$$==0;
};
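As a rough model of what `&until=$$==0` does, consider this Python sketch (an illustration, not Spicy). It assumes the behavior of `&until`, where the terminating element is consumed but not stored in the vector; `parse_until_zero` is a hypothetical name.

```python
def parse_until_zero(data: bytes) -> tuple[list[int], bytes]:
    """Hypothetical model: collect `uint8` elements until a 0 element is seen.

    Returns the collected elements and the remaining, unparsed input.
    """
    xs = []
    for i, element in enumerate(data):
        if element == 0:
            # The terminator is consumed but not added to the vector.
            return xs, data[i + 1 :]
        xs.append(element)
    raise ValueError("input ended before the terminator was found")


print(parse_until_zero(b"\x01\x02\x00\x03"))
```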

If the vector elements require attributes themselves, we can pass them by grouping them with the element type.

type X = unit {
    # Parse a vector of 8-byte integers less than 1024 until we find a null.
    xs: (uint64 &requires=$$<1024)[] &until=$$==0;
};

Exercises: A naive CSV parser

Assuming the following simplified CSV format:

rows are separated by newlines b"\n"
individual columns are separated by b","
there are no separators anywhere else (e.g., no , in quoted column values)

A sample input would be

1,a,ABC
2,b,DEF
3,c,GHI

For testing you can use the -f flag to spicy-dump or spicy-driver to read input from a file instead of stdin, e.g.,

spicy-driver csv_naive.spicy -f input.csv

Write a parser which extracts the bytes on each row into a list.

Hint 1

Your top-level parser should contain a list of rows which has unspecified length.

Hint 2

Define a new parser for a row which parses bytes until it finds a newline and consumes it.
Solution
```
module csv_naive;

public type CSV = unit {
    rows: Row[];
};

type Row = unit {
    data: bytes &until=b"\n";
};
```
Extend your parser so it also extracts individual columns (as bytes) from each row.

Hint

The &convert attribute allows changing the value and/or type of a field after it has been extracted. This allows you to split the row data into columns.

Is there a builtin function which splits your row data at a separator (consuming the iterator)? Functions on bytes are documented here. You can access the currently extracted data via $$.
Solution
```
module csv_naive;

public type CSV = unit {
    rows: Row[];
};

type Row = unit {
    cols: bytes &until=b"\n" &convert=$$.split(b",");
};
```
Without changing the actual parsing, can you change your grammar so the following output is produced? This can be done without explicit loops.
```
$ spicy-driver csv_naive.spicy -f input.csv
[[b"1", b"a", b"ABC"], [b"2", b"b", b"DEF"], [b"3", b"c", b"GHI"]]
```
Hint 1

You could add a unit hook for your top-level unit which prints the rows.
```
on CSV::%done {
    print self.rows;
}
```
Since rows is a list of units you still need to massage its data though ...
Hint 2

You can use a unit &convert attribute on your row type to transform it to its row data.
Solution
```
module csv_naive;

public type CSV = unit {
    rows: Row[];
};

type Row = unit {
    data: bytes &until=b"\n" &convert=$$.split(b",");
} &convert=self.data;

on CSV::%done {
    print self.rows;
}
```
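To cross-check the rows your parser prints, the same naive splitting can be written in plain Python. This is an independent sketch, not Spicy; `parse_csv` is a hypothetical helper, and unlike the grammar it silently drops empty lines.

```python
def parse_csv(data: bytes) -> list[list[bytes]]:
    """Hypothetical helper: split naive CSV input into rows of column values."""
    rows = []
    for line in data.split(b"\n"):
        if line:  # the final newline leaves an empty trailing chunk
            rows.append(line.split(b","))
    return rows


print(parse_csv(b"1,a,ABC\n2,b,DEF\n3,c,GHI\n"))
```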

Adding additional parser state

We might want to add additional state to parsers, e.g.,

share or modify data outside of our parser, or
to locally aggregate data while parsing.

Sharing state across multiple units in the same Zeek connection with unit contexts will be discussed separately in a later section.

Passing outside state into units

We might want to pass additional state into a unit, e.g., to parameterize the unit's behavior, or to give the unit access to external state. This can be accomplished with unit parameters.

type X = unit(init: uint64 = 64) {
    var sum: uint64;

    on %init { self.sum = init; }

    : uint8 { self.sum += $$; }
    : uint8 { self.sum += $$; }
    : uint8 { self.sum += $$; }

    on %done { print self.sum; }
};

A few things to note here:

Unit parameters look a lot like function parameters.
Unit parameters can have default values which are used if the parameter was not passed.
We refer to unit parameters by directly using their name; self is not used.

Unit parameters can also be used to give a unit access to its parent units and their state.

public type X = unit {
    var sum: uint8;

    : (Y(self))[];
};

type Y = unit(inout outer: X) {
    : uint8 { outer.sum += $$; }
};

Unit variables

Unit variables allow adding additional data to units. Their data can be accessed like other unit fields.

type X = unit {
    var sum: uint8;

    : uint8 { self.sum += $$; }
    : uint8 { self.sum += $$; }
    : uint8 { self.sum += $$; }

    on %done { print self.sum; }
};

By default unit variables are initialized with the default value of the type, e.g., for a uint8 with 0.

Info

If you want to capture whether a unit variable (or any other variable) was set, use a variable of optional type instead of a dummy value.

To start with a different value, assign the variable in the unit's %init hook, e.g.,

on %init { self.sum = 100; }

Lookahead parsing

Lookahead parsing is a core Spicy concept. Leveraging lookahead makes it possible to build concise grammars which remain comprehensible and maintainable as the grammar grows.

Deep dive: Parsing of lists of unknown size

We have already seen how we can use lookahead parsing to dynamically detect the length of a list.

type X = unit {
    : (b"A")[]; # Extract unknown number of literal 'A' bytes.
    x: uint8;
};

We can view the generated parser by requesting grammar debug output from Spicy's spicyc compiler.

$ spicyc -D grammar x.spicy -o /dev/null -p
#        ~~~~~~~~~~ ~~~~~~~ ~~~~~~~~~~~~ ~~
#        |          |          |          |
#        |          |          |          - emit generated IR
#        |          |          |
#        |          |          - redirect output of generated code to /dev/null
#        |          |
#        |          - compile file 'x.spicy'
#        |
#        - emit 'grammar' debug stream to stderr

[debug/grammar] === Grammar foo::X
[debug/grammar]         Epsilon: <epsilon> -> ()
[debug/grammar]           While: anon -> while(<look-ahead-found>): anon_2 [field: anon (*)] [item-type: vector<bytes>] [parse-type: vector<bytes>]
[debug/grammar]            Ctor: anon_2 -> b"A" (bytes) (container 'anon') [field: anon_2 (*)] [item-type: bytes] [parse-type: bytes]
[debug/grammar]       LookAhead: anon_l1 -> {uint<8> (not a literal)}: <epsilon> | {b"A" (bytes) (id 1)}: anon_l2
[debug/grammar]        Sequence: anon_l2 -> anon_2 anon_l1
[debug/grammar]  (*)       Unit: foo_X -> anon x
[debug/grammar]        Variable: x   -> uint<8> [field: x (*)] [item-type: uint<8>] [parse-type: uint<8>]
[debug/grammar]
[debug/grammar]   -- Epsilon:
[debug/grammar]      anon = true
[debug/grammar]      anon_l1 = true
[debug/grammar]      anon_l2 = false
[debug/grammar]      foo_X = false
[debug/grammar]
[debug/grammar]   -- First_1:
[debug/grammar]      anon = { anon_2 }
[debug/grammar]      anon_l1 = { anon_2 }
[debug/grammar]      anon_l2 = { anon_2 }
[debug/grammar]      foo_X = { anon_2, x }
[debug/grammar]
[debug/grammar]   -- Follow:
[debug/grammar]      anon = { x }
[debug/grammar]      anon_l1 = { x }
[debug/grammar]      anon_l2 = { x }
[debug/grammar]      foo_X = {  }
[debug/grammar]

In the above debug output the entry point of the grammar is marked (*).

parsing a unit consists of parsing the anon field (corresponding to the anonymous list), and x
to parse the list lookahead is used.
lookahead inspects a uint8 (as epsilon) or literal b"A"
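The lookahead loop from the grammar can be imitated by hand. In the following Python sketch (not Spicy; `parse_a_list` is a hypothetical name) we peek at the next byte, keep consuming while it is the literal `A`, and otherwise fall through to parsing `x`.

```python
def parse_a_list(data: bytes) -> tuple[int, int]:
    """Hypothetical model of unit `X`: a run of b"A" literals followed by a uint8.

    Returns the number of list elements and the value of `x`.
    """
    pos = 0
    # Lookahead: continue the list only while the next token is the literal.
    while data[pos : pos + 1] == b"A":
        pos += 1
    if pos >= len(data):
        raise ValueError("input ended before field x")
    return pos, data[pos]


print(parse_a_list(b"AAAAA\x01"))
```

Note that in this model a value of `x` equal to the byte `A` would be taken as another list element, since the decision is made purely from the next byte.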

Types for lookahead

In addition to literals, lookahead also works with units which start with a literal. Spicy transparently detects such units and will use them for lookahead if possible.

Example

Confirm this yourself by wrapping the literal in the above unit in its own unit, and validating by parsing an input like AAAAA\x01. Are there any major differences in the generated grammar?

Using lookahead for conditional parsing

We have seen previously how we can use unit switch for conditional parsing. Another instance of conditional parsing occurs when a protocol message holds one of multiple possible sub-messages (a union). The sub-messages often contain a tag to denote what kind of sub-message is transmitted.

With a unit switch statement we could model this like so.

public type X = unit {
    tag: uint8;
    switch (self.tag) {
        1 -> m1: Message1;
        2 -> m2: Message2;
        * -> : skip bytes &eod; # For unknown message types simply consume all data.
    };
};

type Message1 = unit {
    payload: bytes &eod;
};

type Message2 = unit {
    payload: bytes &eod;
};

The unit switch statement has a form without control variable which instead uses lookahead. With this we can push parsing of the tag variable into the units concerned with the particular messages so we keep all pieces related to a particular message together.

public type X = unit {
    switch {
        -> m1: Message1;
        -> m2: Message2;
        -> : skip bytes &eod; # For unknown message types, simply consume all data.
    };
};

type Message1 = unit {
    : skip uint8(1);
    payload: bytes &eod;
};

type Message2 = unit {
    : skip uint8(2);
    payload: bytes &eod;
};
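Conceptually, the lookahead form of `switch` selects a branch by inspecting the upcoming input instead of a previously parsed field. A minimal Python model of that decision (an illustration, not Spicy; `parse_message` is a hypothetical name) could look like this:

```python
def parse_message(data: bytes) -> tuple[str, bytes]:
    """Hypothetical model of the lookahead `switch`: pick a branch by peeking."""
    tag = data[:1]  # look at the next byte without committing to a branch
    if tag == b"\x01":
        return "m1", data[1:]  # Message1: skip the tag, rest is the payload
    if tag == b"\x02":
        return "m2", data[1:]  # Message2: skip the tag, rest is the payload
    return "unknown", b""  # fallback branch: consume and discard everything


print(parse_message(b"\x02abc"))
```

The knowledge about each tag lives next to the handling of the corresponding message, which is the point of moving the literal into `Message1` and `Message2`.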

Example

Do the generated grammars for the above two ways to express the protocol differ?

Error recovery

Even with a grammar perfectly modelling a specification, parsing of real data can fail due to e.g.,

endpoints not conforming to spec, or
gaps in the input data due to capture loss.

Instead of aborting parsing altogether, we would like to recover gracefully from parse errors, i.e., when the parser encounters a parse error we would like to skip input until it can parse again.

Spicy includes support for expressing such recovery with the following model:

To resynchronize the input, potential synchronization points are annotated, e.g., to synchronize input at the sequence b"POST" the grammar might contain a field
```
: b"POST" &synchronize;
```
All constructs supporting lookahead parsing can be synchronization points, e.g., literals or fields with unit type with a literal at a fixed offset.
On a parse error the unit enters a synchronization trial mode.

Once the input could be synchronized a %synced hook is invoked. The implementation of the hook can examine the data up to the &synchronize field, and either confirm it to leave trial mode and continue normal parsing, or reject it to look for a later synchronization point.
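The recovery model can be illustrated with a small Python sketch (not Spicy, and not how Spicy implements recovery). It assumes a made-up record format, the literal `POST` followed by a single payload byte, and always accepts the first synchronization point it finds, which corresponds to a `%synced` hook that confirms unconditionally. `parse_records` and `SYNC` are hypothetical names.

```python
SYNC = b"POST"  # literal used as synchronization point


def parse_records(data: bytes) -> tuple[list[bytes], int]:
    """Hypothetical model: parse records, resynchronizing on `SYNC` after errors.

    Returns the payload bytes of all parsed records and the number of resyncs.
    """
    records = []
    resyncs = 0
    pos = 0
    while pos < len(data):
        if data.startswith(SYNC, pos) and pos + len(SYNC) < len(data):
            # Regular parsing: literal followed by one payload byte.
            start = pos + len(SYNC)
            records.append(data[start : start + 1])
            pos = start + 1
        else:
            # Parse error: skip input until the next synchronization point.
            nxt = data.find(SYNC, pos + 1)
            if nxt < 0:
                break  # nothing left to synchronize on
            resyncs += 1
            pos = nxt
    return records, resyncs


print(parse_records(b"POST1POST2xxPOST3"))
```

The two garbage bytes `xx` are skipped, and parsing continues with the third record instead of aborting.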

Exercises

Let's assume we are parsing a protocol where valid messages are always the sequence AB, i.e., the byte sequence b"AB". We will use the following contrived grammar:

module foo;

public type Messages = unit {
    : Message[];
};

type Message = unit {
    a: b"A";
    b: b"B";
};

on Message::%done { print self; }

Validate that this grammar can parse the input
```
ABABAB
```
```
$ printf ABABAB | spicy-driver foo.spicy
[$a=b"A", $b=b"B"]
[$a=b"A", $b=b"B"]
[$a=b"A", $b=b"B"]
```
Info

We used printf to avoid inserting a newline which our grammar does not expect.
What do you see if you pass misspelled input, like with the second A changed to 1, i.e., the input
```
AB1BAB
```
Why is this particular source range shown as error location?
Solution
```
[$a=b"A", $b=b"B"]
[error] terminating with uncaught exception of type spicy::rt::ParseError: no expected look-ahead token found (foo.spicy:3:30-4:17)
```
We first see the result of parsing the first Message from AB, and then encounter an error for the second element.

The error corresponds to parsing the vector inside Messages. The grammar expects either A to start a new Message, or end of data to signal the end of the input; 1 matches neither so lookahead parsing fails.
What are the potential synchronization points in this grammar we could use so we can extract the remaining data?
Solution

In this case parsing failed at the first field of Message, Message::a. We could

a. Synchronize on Message::b by changing it to
```
b: b"B" &synchronize;
```
b. Synchronize on Message::a in the next message, i.e., abandon parsing the remaining fields in Message and start over. For that we would synchronize on the vector elements in Messages,
```
: (Message &synchronize)[];
```
Info

A slight modification of this grammar seems to fail to synchronize and run into an edge case, https://github.com/zeek/spicy/issues/1594.
If you had to choose one, which one would you pick? What are the trade-offs?
Solution
- If we synchronize on Message::b it would seem that we should be able to recover at its data.
  
  This however does not work since the vector uses lookahead parsing, so we would fail already in Messages before we could recover in Message.
- We need to synchronize on the next vector element.
  
  In larger units synchronizing high up (e.g., on a vector in the top-level unit) allows recovering from more general errors at the cost of not extracting some data, e.g., we would be able to also handle misspelled Bs in this example.

Add a single &synchronize attribute to the grammar so you can handle all possible misspellings. Also add a %synced hook to confirm the synchronization result (on which unit?). Can you parse inputs like these?

ABABAB
AB1BAB
A11BAB

You can enable the spicy-verbose debug stream to show parsing progress.

printf AB1BAB | HILTI_DEBUG=spicy-verbose spicy-driver -d foo.spicy

Solution

module foo;

public type Messages = unit {
    : (Message &synchronize)[];
};

type Message = unit {
    a: b"A";
    b: b"B";
};

on Message::%done { print self; }
on Messages::%synced { confirm; }
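
To build intuition for what this solution does, here is a small Python model of the recovery behavior. It is only a sketch under simplifying assumptions, not how Spicy implements synchronization: the function name `parse_messages` and its `synchronize` flag are made up for illustration, and the "synchronization point" is modelled as a plain search for the next literal b"A".

```python
def parse_messages(data: bytes, synchronize: bool = True) -> list[tuple[bytes, bytes]]:
    """Toy model of parsing a list of messages which each are b"AB"."""
    messages = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 2] == b"AB":
            # Normal parsing: both fields of the message matched.
            messages.append((b"A", b"B"))
            pos += 2
            continue

        if not synchronize:
            # Without a synchronization point the error aborts parsing.
            raise ValueError(f"parse error at offset {pos}")

        # Trial mode: skip input until the next candidate synchronization
        # point, modelled here as the literal b"A" starting a new message.
        pos = data.find(b"A", pos + 1)
        if pos == -1:
            break  # No further synchronization point, give up.
        # Continuing the loop here plays the role of `%synced` plus `confirm`.

    return messages


for data in (b"ABABAB", b"AB1BAB", b"A11BAB"):
    print(data, len(parse_messages(data)))
```

In this model the three example inputs yield three, two, and one message: everything between the error and the next b"A" is skipped, which is the trade-off of synchronizing on the vector element.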

Zeek integration

Zeek supports writing packet, protocol or file analyzers with Spicy. In addition to allowing inclusion of unmodified Spicy grammars, additional features include:

automatic generation of Zeek analyzers from Spicy parsers via interface definition (EVT) files,
ability to trigger Zeek events from Spicy unit hooks,
(automatic) exporting of types defined in Spicy as Zeek record types,
a Spicy module to control Zeek from Spicy code.

Getting started

The recommended approach to integrate a Spicy parser with Zeek is to use the default Zeek package template.

We can create Zeek packet, protocol or file analyzers by selecting the appropriate template feature. E.g., to create a new Zeek package for a protocol analyzer and interactively provide required user variables,

zkg create --packagedir my_analyzer --features spicy-protocol-analyzer

Warning

zkg uses Git to track package information. When running in a VM, this can cause issues if the package repository is in a mounted directory. If you run into this, try creating the package in a directory which is not mounted from the host.

Example

Use the template to create a Spicy protocol analyzer for analyzing TCP traffic now to follow along with later examples.

This will create a protocol analyzer from the template. Items which need to be updated are marked TODO. It will generate e.g.,

zkg.meta: package metadata describing the package and setting up building and testing
analyzer/
- *.evt: interface definition for exposing Spicy parser as Zeek analyzer
- *.spicy: Spicy grammar of the parser
- zeek_*.spicy: Zeek-specific Spicy code
scripts/
- main.zeek: Zeek code for interacting with the analyzer
- dpd.sig: Signatures for dynamic protocol detection (DPD)
testing/tests: BTest test cases

Info

You can use zkg to install the package into your Zeek installation.

zkg install <package_dir>

To run its tests, e.g., during development:

zkg test <package_dir>

The generated project uses CMake for building and BTest for testing. You can build manually, e.g., during development. The test scaffolding assumes that the CMake build directory is named build.

# Building.
mkdir build
(cd build && cmake .. && make)

# Testing.
(cd testing && btest)

We can show available template features with zkg template info.

$ zkg template info
API version: 1.0.0
features: github-ci, license, plugin, spicy-file-analyzer, spicy-packet-analyzer, spicy-protocol-analyzer
origin: https://github.com/zeek/package-template
provides package: true
user vars:
    name: the name of the package, e.g. "FooBar" or "spicy-http", no default, used by package, spicy-protocol-analyzer, spicy-file-analyzer, spicy-packet-analyzer
    namespace: a namespace for the package, e.g. "MyOrg", no default, used by plugin
    analyzer: name of the Spicy analyzer, which typically corresponds to the protocol/format being parsed (e.g. "HTTP", "PNG"), no default, used by spicy-protocol-analyzer, spicy-file-analyzer, spicy-packet-analyzer
    protocol: transport protocol for the analyzer to use: TCP or UDP, no default, used by spicy-protocol-analyzer
    unit: name of the top-level Spicy parsing unit for the file/packet format (e.g. "File" or "Packet"), no default, used by spicy-file-analyzer, spicy-packet-analyzer
    unit_orig: name of the top-level Spicy parsing unit for the originator side of the connection (e.g. "Request"), no default, used by spicy-protocol-analyzer
    unit_resp: name of the top-level Spicy parsing unit for the responder side of the connection (e.g. "Reply"); may be the same as originator side, no default, used by spicy-protocol-analyzer
    author: your name and email address, Benjamin Bannier <benjamin.bannier@corelight.com>, used by license
    license: one of apache, bsd-2, bsd-3, mit, mpl-2, no default, used by license
versions: v0.99.0, v1.0.0, v2.0.0, v3.0.0, v3.0.1, v3.0.2

Protocol analyzers

For a TCP protocol analyzer the template generated the following declaration in analyzer/*.evt:

protocol analyzer Foo over TCP:
    parse originator with foo::Request,
    parse responder with foo::Response;

Here we declare a Zeek protocol analyzer Foo which uses two different parsers for the originator (client) and responder (server) side of the connection, Request and Response. To use the same parser for both sides we would declare

    parse with foo::Messages;

Message and connection semantics: UDP vs. TCP

The parsers have these stub implementations:

module foo;

public type Request = unit {
    payload: bytes &eod;
};

public type Response = unit {
    payload: bytes &eod;
};

We have used &eod to denote that we want to extract all data. The semantics of all data differ between TCP and UDP parsers:

UDP: UDP has no connection concept, so Zeek synthesizes UDP "connections" from flows by grouping UDP messages with the same 5-tuple in a time window. UDP has no reassembly, so a new parser instance is created for each UDP packet; &eod means until the end of the current packet.
TCP: TCP supports connections and packet reassembly, so both sides of a connection are modelled as streams with reassembled data; &eod means until the end of the stream. The stream is unbounded.

For this reason one usually wants to model parsing of a TCP connection as a list of protocol messages, e.g.,

public type Requests = unit {
    : Request[];
};

type Request = unit {
    # TODO: Parse protocol message.
};

the length of the list of messages is unspecified so it is detected dynamically
to avoid storing an unbounded list of messages we use an anonymous field for the list
parsing of the protocol messages is responsible for detecting when a message ends
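
The three points above can be illustrated outside of Spicy with a small Python sketch. It assumes a hypothetical framing (one length byte followed by that many payload bytes) which is not taken from any protocol in this course; the function name `messages` is likewise made up. The point is only that message boundaries come from the message format itself, not from how the stream happens to be split into segments, and that messages are handed out one at a time instead of being collected.

```python
from typing import Iterable, Iterator


def messages(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield message payloads from a stream delivered in arbitrary chunks.

    Assumed (hypothetical) framing: 1 length byte, then that many bytes.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Emit every message which is complete in the data seen so far.
        while buffer and len(buffer) >= 1 + buffer[0]:
            size = buffer[0]
            yield buffer[1:1 + size]
            buffer = buffer[1 + size:]


# Two messages whose bytes are split across segment boundaries.
segments = [b"\x03ab", b"c\x02", b"hi"]
print(list(messages(segments)))
```

Because the generator yields each message as soon as it is complete, nothing but the currently incomplete message needs to be buffered, similar in spirit to using an anonymous field for the list.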

Analyzer lifecycle

In Zeek's model all eligible analyzers would see the traffic.

If analyzers detect traffic not matching their protocol, they should signal an analyzer violation to Zeek so they stop receiving data. During protocol detection this is expected and not an error.
To signal matching traffic, analyzers should signal an analyzer confirmation to Zeek. This leads, e.g., to the protocol/service being associated with the connection.

flowchart TD
    N((fa:fa-cloud)) -->|data| Z(Zeek)
    Z -->|looks up| Reg[Analyzers registered for port]
    Z --> |forwards for matching| dpd[Analyzers with matching signatures]

    Reg -->|data| A1
    Reg --> |data|A2

    dpd -->|data| B1
    dpd --> |data|B2
    dpd --> |data|B3

    AC(fa:fa-bell analyzer_confirmation)
    style AC fill:lightgreen

    AV(fa:fa-bell analyzer_violation)
    style AV fill:red

    B1 -->|triggers| AV
    B2 -->|triggers| AV
    B3 -->|triggers| AC

    A1 -->|triggers| AV
    A2 -->|triggers| AV

To integrate the parser into this lifecycle, the template generated the following stub implementations in analyzer/zeek_*.spicy:

# TODO: Protocol analyzers should confirm once they are reasonably sure that
# they are indeed parsing the right protocol. Pick a unit that's a little bit
# into the parsing process here.
#
# on Foo::SUITABLE_UNIT::%done {
#     zeek::confirm_protocol();
# }

# Any error bubbling up to the top unit will trigger a protocol rejection.
on Foo::Request::%error {
    zeek::reject_protocol("error while parsing Foo request");
}

on Foo::Response::%error {
    zeek::reject_protocol("error while parsing Foo reply");
}

We can use zeek::confirm_protocol and zeek::reject_protocol to signal Zeek.
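
The confirmation/violation lifecycle can be modelled in a few lines of Python. This is a deliberately simplified sketch and not Zeek's implementation: the analyzers `http_like` and `tftp_like`, the `Violation` exception and the `dispatch` function are all invented for illustration. Raising `Violation` stands in for `zeek::reject_protocol`, returning `True` for `zeek::confirm_protocol`.

```python
class Violation(Exception):
    """Raised by a toy analyzer when data does not match its protocol."""


def http_like(data: bytes) -> bool:
    if not data.startswith((b"GET ", b"POST ")):
        raise Violation("not an HTTP-like request")
    return True  # Reasonably sure about the protocol: confirm.


def tftp_like(data: bytes) -> bool:
    if data[:2] not in (b"\x00\x01", b"\x00\x02"):
        raise Violation("not a TFTP-like request")
    return True


def dispatch(data: bytes, analyzers: dict) -> tuple[set, set]:
    """Feed data to all eligible analyzers; return (confirmed, removed)."""
    confirmed, removed = set(), set()
    for name, analyze in analyzers.items():
        try:
            if analyze(data):
                confirmed.add(name)
        except Violation:
            # A violation only removes this analyzer; others keep running.
            removed.add(name)
    return confirmed, removed


analyzers = {"http": http_like, "tftp": tftp_like}
print(dispatch(b"GET / HTTP/1.1\r\n", analyzers))
```

Note that a violation from one analyzer does not affect the others, which is why rejecting a protocol during detection is routine rather than an error.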

Passing data to Zeek

Ultimately we want to make the parsed data available to Zeek for analysis and logging.

The handling of events is declared in the EVT file analyzer/*.evt.

# TODO: Connect Spicy-side events with Zeek-side events. The example just
# defines simple example events that forwards the raw data (which in practice
# you don't want to do!).
on Foo::Request -> event Foo::request($conn, $is_orig, self.payload);
on Foo::Response -> event Foo::reply($conn, $is_orig, self.payload);

the LHS specifies a Spicy hook in Spicy syntax
the RHS specifies a (possibly generated) Zeek event in Zeek syntax
we can reference Spicy data via self on the RHS
data for builtin Spicy types are converted automatically to equivalent Zeek types
we can automatically generate Zeek record types from Spicy types
information about the generated analyzer is accessible via magic variables $conn, $file, $packet, $is_orig

The event is handled on the Zeek side in scripts/main.zeek, e.g.,

# Example event defined in foo.evt.
event Foo::request(c: connection, is_orig: bool, payload: string)
    {
    hook set_session(c);

    local info = c$foo;
    info$request = payload;
    }

Passing data to other Zeek analyzers (e.g., for analyzing subprotocols and files) is handled in a later section.

Forwarding to other analyzers

One often wants to forward an extracted payload to other analyzers, e.g.,

HTTP messages with files
compressed files containing PE files
protocols using other sub-protocols

Inside Spicy we can forward data from one parser to another one with sink values, but in a Zeek context we can also forward data to other analyzers (Spicy or not).

Forwarding to file analyzers

Let's assume we are parsing protocol messages which contain bytes corresponding to a file. We want to feed the file data into Zeek's file analysis.

type Message = unit {
    : bytes &chunked &size=512;
};

By using the &chunked attribute on the bytes field, its field hook is invoked as soon as a chunk of data arrives, even if the full data is not yet available. The caveat is that only the final chunk will be stored once parsing is done. This is fine since we usually do not need to store the data.
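
The effect of chunked processing can be sketched in Python as follows. This only models the behavior described above and is not how Spicy implements it; `parse_chunked` and its `hook` parameter are hypothetical names.

```python
from typing import Callable, Iterable


def parse_chunked(chunks: Iterable[bytes], hook: Callable[[bytes], None]) -> bytes:
    """Invoke `hook` for each chunk as it arrives; keep only the last chunk."""
    stored = b""
    for chunk in chunks:
        hook(chunk)     # The hook sees each chunk, like `$$` in a field hook.
        stored = chunk  # Earlier chunks are not accumulated.
    return stored


seen = []
stored = parse_chunked([b"first", b"second", b"last"], seen.append)
print(seen, stored)
```

The hook observes all of the data, while the value kept after parsing is just the final chunk, so memory use stays bounded even for large payloads.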

The protocol for passing data is:

open a handle for a new Zeek file with zeek::file_begin optionally specifying a MIME type
pass information to Zeek, e.g., feed data or gaps, or notify Zeek about the expected size
close the handle with zeek::file_end

E.g.,

import zeek;

public type File = unit {
    var h: string;

    on %init { self.h = zeek::file_begin(); }

    : bytes &chunked &eod {
        zeek::file_data_in($$, self.h);
    }

    on %done { zeek::file_end(self.h); }
};

Danger

File handles need to be closed explicitly.

Not closing them would leak them for the duration of the connection.

Forwarding to protocol analyzers

Forwarding to protocol analyzers follows a similar protocol of opening a handle, interacting with it, and closing it.

Danger

Protocol handles need to be closed explicitly.

For opening a handle, two APIs are supported:

function zeek::protocol_begin(analyzer: optional<string> = Null);
function zeek::protocol_handle_get_or_create(analyzer: string) : ProtocolHandle;

When using zeek::protocol_begin without argument all forwarded data will be passed to Zeek's dynamic protocol detection (DPD).

Otherwise use the Zeek name of the analyzer, e.g.,

local h = zeek::protocol_handle_get_or_create("SSL");

You can inspect the output of zeek -NN for available analyzer names, e.g.,

$ zeek -NN | grep ANALYZER | grep SSL
    [Analyzer] SSL (ANALYZER_SSL, enabled)

We sometimes want to correlate information from the originator and responder side of a connection, and need to share data across the same connection.

Often we can do that in Zeek script land, e.g.,

# Example: Mapping of connections to their request method.
#
# NOTE: FOR DEMONSTRATION ONLY. WE MIGHT E.G., WANT TO ALLOW MULTIPLE REQUESTS
# PER CONNECTION.
global methods: table[conn_id] of string &create_expire=10sec;

event http_request(c: connection, method: string, original_URI: string, unescaped_URI: string, version: string)
    {
    # Store method for correlation.
    methods[c$id] = method;
    }

event http_reply(c: connection, version: string, code: count, reason: string)
    {
    local id = c$id;

    if ( id in methods )
        {
        local method = methods[id];
        print fmt("Saw reply %s to %s request on %s", code, method, id);
        }
    else
        {
        print fmt("Saw reply to unseen request on %s", id);
        return;
        }
    }

Warning

This assumes that we always see requests before replies. Depending on how we collect and process data this might not always hold.

If we need this information during parsing this is too late. Spicy allows sharing information across both sides with unit contexts. When declaring a Spicy analyzer, Zeek automatically sets things up so that originator and responder of a connection share a context.

type Context = tuple<method: string>;

type Request = unit {
    %context = Context;

    # The regular expression yields bytes, so decode before storing as string.
    method: /[^ \t\r\n]+/ { self.context().method = $$.decode(); }

    # ...
};

type Reply = unit {
    %context = Context;
    # ...

    # Assumes a field `code` is parsed in the elided part above.
    on %done { print "Saw reply %s to %s request" % (self.code, self.context().method); }
};

Warning

If we see Reply before Request, method will default to an empty string.
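
As a language-neutral illustration of the same idea, the following Python sketch shares one context object between a request and a reply parser. It is a model only, not the Spicy runtime: the class names mirror the units above, but the parsing logic (splitting on whitespace) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class Context:
    method: str = ""  # Empty until a request has been seen.


class Request:
    def __init__(self, context: Context):
        self.context = context

    def parse(self, data: bytes) -> None:
        # Store the first whitespace-delimited token as the method.
        self.context.method = data.split()[0].decode()


class Reply:
    def __init__(self, context: Context):
        self.context = context

    def parse(self, data: bytes) -> str:
        # The second token of the status line is taken as the code.
        code = data.split()[1].decode()
        return f"Saw reply {code} to {self.context.method} request"


# Both sides of one connection share the same context instance.
ctx = Context()
Request(ctx).parse(b"GET / HTTP/1.1")
print(Reply(ctx).parse(b"HTTP/1.1 200 OK"))
```

If `Reply.parse` runs before any request was parsed, `method` is still the empty default, which corresponds to the warning above.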

Exercise

Starting from the default protocol analyzer template we want to (redundantly) pass the number of bytes for Request to Zeek as well.

In the EVT file pass the number of bytes in self.payload.

Solution

on Foo::Request -> event Foo::request($conn, $is_orig, self.payload, |self.payload|);

Manually build your changed analyzer:
```
mkdir build
cd build/
cmake ..
make
```

Execute the test suite. This runs tests against an included PCAP file. What do you see?

cd testing/
btest -dv

Solution

The test tests.trace fails. Its sources are in testing/tests/trace.zeek.

.. analyzer error in <..>/foo/analyzer/foo.evt, line 16: Event parameter mismatch, more parameters given than the 3 that the Zeek event expects

Fix the signatures of the handlers for Foo::request so tests pass. What type do we need to use on the Zeek side to pass the length (uint64 in Spicy)?

Hint

The type mappings are documented here.
Solution

In both testing/tests/trace.zeek and scripts/main.zeek change the signatures to
```
event Foo::request(c: connection, is_orig: bool, payload: string, len: count) {}
```
Modify testing/tests/trace.zeek to include the length in the baseline, i.e., change the test case for Foo::request to
```
print fmt("Testing Foo: [request] %s %s %d", c$id, payload, len);
```
Rerun tests and update the test baseline with
```
cd testing/
btest -u
```
Make sure all tests pass with these changes.

Stage and commit all changes in the package repository.
```
git add -u
git commit -v -m "Pass payload length to Zeek"
```
Validate that the package also tests fine with zkg. This will require no uncommitted changes or untracked files in the repository.
```
# Make progress more verbose.
zkg -vvv test .
```
Optional: Also add the length to the Zeek log generated from the code in scripts/main.zeek.

Hint

This requires adding a count &optional &log field to the Info record.

Set the field from the event handler for Foo::request.

Update test baselines as needed.

Introduction to Spicy