Goals of this course
This course gives an introduction to Spicy, a parser generator for network protocols and file formats which integrates seamlessly with Zeek. The course is complementary to the Spicy reference documentation and is intended to give a more guided (though selective and incomplete) tour of Spicy and its use with Zeek.
After this course you should be comfortable implementing protocol parsers specified in an RFC and integrating them with Zeek.

Why Spicy?
Historically, extending Zeek with new parsers required interacting with Zeek's C++ API, which was a significant barrier to entry for domain experts.
Spicy is a domain-specific language for developing parsers for network protocols or file formats which integrates well with Zeek.
Flexible multi-paradigm language
With Spicy, parsers can be expressed declaratively in a format close to specifications, e.g., the following TFTP ERROR message
#  2 bytes   2 bytes     string   1 byte
# ---------------------------------------
# | Opcode | ErrorCode | ErrMsg |   0   |
# ---------------------------------------
can be expressed in Spicy as
type Error = unit {
code: uint16;
msg: bytes &until=b"\x00";
};
Spicy supports procedural code which can be hooked into parsing to support more complex parsing scenarios.
function sum(a: uint64, b: uint64): uint64 { return a + b; }
type Fold = unit {
a: uint8;
b: uint8 &convert=sum(self.a, $$);
c: uint8 &convert=sum(self.b, $$);
};
Incremental parsing
The parsers generated by Spicy automatically support incremental parsing. Data can be fed as it arrives without blocking until all data is available. Hooks allow reacting to parse results.
Built-in safety
Spicy code is executed safely so many common errors are rejected, e.g.,
- integer under- or overflows
- incorrect use of iterators
- unhandled switch cases
Integration into Zeek
Spicy parsers can trigger events in Zeek. Parse results can transparently be made available to Zeek script code.
Prerequisites
To follow this course, installing recent versions of Zeek and Spicy is required (at least zeek-6.0 and spicy-1.8). The Zeek documentation shows the different ways Zeek can be installed.
In addition we require:
- a text editor to write Spicy code
- a C++ compiler to compile Spicy code and Zeek plugins
- CMake for developing Zeek plugins with Spicy
- development headers for libpcap to compile Zeek plugins
Docker images
The Zeek project provides Docker images.
Zeek playground
A simplified approach for experimentation is to use the Zeek playground repository which offers an environment integrated with Visual Studio Code. Either clone the project, open it locally in Visual Studio Code and install the recommended plugins, or open it directly in a GitHub Codespace from the GitHub repository view.
Spicy language
This chapter gives a coarse overview over the Spicy language.
We show a selection of features from the Spicy reference documentation.
In terms of syntax Spicy borrows many aspects from C-like languages.
Hello world
# hello.spicy
module hello;
print "Hello, world!";
- every Spicy source file needs to declare the module it belongs to
- global statements are run once when the module is initialized
Compile and run this code with
$ spicyc -j hello.spicy
Hello, world!
Here we have used Spicy's spicyc executable to compile and immediately run the source file hello.spicy.
Most commands which compile Spicy code support -d to build parsers in debug
mode. This is often faster than building production code and useful during
parser development.
$ spicyc -j -d hello.spicy
Hello, world!
Basic types
All values in Spicy have a type.
While in some contexts types are required (e.g., when declaring types, or function signatures), types can also be inferred (e.g., for variable declarations).
global N1 = 0; # Inferred as uint64.
global N2: uint8 = 0; # Explicitly typed.
# Types required in signatures, here: `int64` -> `void`
function foo(arg: int64) {
local inc = -1; # Inferred as int64.
print arg + inc;
}
Spicy provides types for e.g.,
- integers, booleans
- optional
- bytes, string
- tuples and containers
- enums, structs
- special purpose types for e.g., network address, timestamps, or time durations
See the documentation for the full list of supported types and their API.
Boolean and integers
Boolean
Booleans have two possible values: True or False.
global C = True;
if (C)
print "always runs";
Integers
Spicy supports both signed and unsigned integers with widths of 8, 16, 32 and 64 bits:
- uint8, uint16, uint32, uint64
- int8, int16, int32, int64
Integers are checked against overflows at both compile time and runtime. Overflows are either statically rejected or trigger runtime exceptions.
Integer literals without a sign, e.g., 4711, default to uint64; if a sign
is given int64 is used, e.g., -47, +12.
Where permitted, integer types convert into each other when required; for cases
where this is not automatically possible one can explicitly cast integers to
each other:
global a: uint8 = 0;
global b: uint64 = 1;
# Modify default: uint8 + uint64 -> uint64.
global c: uint8 = a + cast<uint8>(b);
Optional
Optionals either contain a value or nothing. They are a good choice when one wants to denote that a value can be absent.
optional is a parametric (also sometimes called generic) type in that it
wraps a value of some other type.
global opt_set1 = optional(4711);
global opt_set2: optional<uint64> = 4711;
global opt_unset: optional<uint64>;
Optionals implicitly convert to booleans. This can be used to check whether they are set.
assert opt_set1;
assert ! opt_unset;
Assigning Null to an optional empties it.
global x = optional(4711);
assert x;
x = Null;
assert ! x;
To extract the value contained in an optional dereference it with the * operator.
global x = optional(4711);
assert *x == 4711;
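Since an optional converts to a boolean, a dereference can be guarded by a check first. The following is a minimal sketch of that pattern; the module name, the function `describe` and the variable `maybe_port` are made up for illustration.

```spicy
module optional_demo;

# Hypothetical helper: only dereference the optional after checking it is set.
function describe(p: optional<uint64>): string {
    if (p)
        return "port %d" % *p;

    return "no port";
}

global maybe_port: optional<uint64>;     # Starts out unset.
assert describe(maybe_port) == "no port";

maybe_port = 443;                         # Now holds a value.
assert describe(maybe_port) == "port 443";
```

This should be runnable like the earlier examples with spicyc -j.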
Bytes and strings
The bytes type represents raw bytes, typically from protocol data. Literals for bytes are written with prefix b, e.g., b"\x00byteData\x01".
The string type represents text in a given character set.
Conversions between bytes and string are always explicit, via bytes' decode method or string's encode, e.g.,
global my_bytes = b"abc";
global my_string = "abc";
global my_other_string = my_bytes.decode(); # Default: UTF-8.
print my_bytes, my_string, my_other_string;
bytes can be iterated over.
for (byte in b"abc") {
print byte;
}
Use the format operator % to compute a string representation of Spicy values. Format strings roughly follow the POSIX format string API.
global n = 4711;
global s = "%d" % n;
The format operator can be used to format multiple values.
global start = 0;
global end = 1024;
print "[%d, %d)" % (start, end);
Collections
Tuples
Tuples are heterogeneous collections of values. Tuple values are immutable.
global xs = (1, "a", b"c");
global ys = tuple(1, "a", b"c");
global zs: tuple<uint64, string, bytes> = (1, "a", b"c");
print xs, ys, zs;
Individual tuple elements can be accessed with subscript syntax.
print (1, "a", b"c")[1]; # Prints "a".
Optionally individual tuple elements can be named, e.g.,
global xs: tuple<first: uint8, second: string> = (1, "a");
assert xs[0] == xs.first;
assert xs[1] == xs.second;
Containers
Spicy provides data structures for lists (vector), and associative containers (set, map).
The element types can be inferred automatically, or specified explicitly. All of the following forms are equivalent:
global a1 = vector(1, 2, 3);
global a2 = vector<uint64>(1, 2, 3);
global a3: vector<uint64> = vector(1, 2, 3);
global b1 = set(1, 2, 3);
global b2 = set<uint64>(1, 2, 3);
global b3: set<uint64> = set(1, 2, 3);
global c1 = map("a": 1, "b": 2, "c": 3);
global c2 = map<string, uint64>("a": 1, "b": 2, "c": 3);
global c3: map<string, uint64> = map("a": 1, "b": 2, "c": 3);
All collection types can be iterated.
for (x in vector(1, 2, 3)) {
print x;
}
for (x in set(1, 2, 3)) {
print x;
}
# Map iteration yields a (key, value) `tuple`.
for (x in map("a": 1, "b": 2, "c": 1)) {
print x, x[0], x[1];
}
Indexing into collections and iterators is checked at runtime.
Use |..| like in Zeek to obtain the number of elements in a collection, e.g.,
assert |vector(1, 2, 3)| == 3;
To check whether a set or map contains a given key use the in operator.
assert 1 in set(1, 2, 3);
assert "a" in map("a": 1, "b": 2, "c": 1);
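The container operations shown above (iteration, the in operator, indexing and |..|) can be combined to build up a map step by step. The following is a small sketch under that assumption; the module and variable names are illustrative only.

```spicy
module containers_demo;

# Count how often each word occurs in a vector.
global words = vector("get", "post", "get");
global counts: map<string, uint64>;

for (w in words) {
    if (w in counts)
        counts[w] = counts[w] + 1;   # Key already present: increment.
    else
        counts[w] = 1;               # First occurrence: insert.
}

assert |counts| == 2;
assert counts["get"] == 2;
assert counts["post"] == 1;
```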
User-defined types
Enums and structs are user-defined data types which allow you to give data semantic meaning.
Enums
Enumerations map integer values to a list of labels.
By default enum labels are numbered 0, ...
type X = enum { A, B, C, };
local b: X = X(1); # `X::B`.
assert 1 == cast<uint64>(b);
One can override the default label numbering.
Providing values for either all or no labels tends to lead to more maintainable code. Spicy still allows providing values for only a subset of labels.
type X = enum {
A = 1,
B = 2,
C = 3,
};
By default enum values are initialized with the implicit Undef label.
type X = enum { A, B, C, };
global x: X;
assert x == X::Undef;
If an enum value is constructed from an integer not corresponding to a label, an implicit label corresponding to the numeric value is used.
type X = enum { A, B, C, };
global x = X(4711);
assert cast<uint64>(x) == 4711;
print x; # `X::<unknown-4711>`.
Structs
Structs are similar to tuples but mutable.
type X = struct {
a: uint8;
b: bytes;
};
Structs are initialized with Zeek record syntax.
global x: X = [$a = 1, $b = b"abc"];
Struct fields can be marked with an &optional attribute to denote optional
fields. The ?. operator can be used to query whether a field was set.
type X = struct {
a: uint8;
b: uint8 &optional;
c: uint8 &optional;
};
global x: X = [$a = 47, $b = 11];
assert x?.a;
assert x?.b : "%s" % x;
assert ! x?.c : "%s" % x;
Additionally, one can provide a &default value for struct fields to denote a
value to use if none was provided on initialization. Fields with a &default
are always set.
type X = struct {
a: uint8;
b: uint8 &default=11;
c: bytes &optional &default=b"abc";
};
global x: X = [$a = 47];
assert x.b == 11;
assert x?.c;
Exercises
- What happens at compile time if you try to create a uint8 value outside of its range, e.g., uint8(-1) or uint8(1024)?
- What happens at runtime if you perform an operation which leaves the domain of an integer value, e.g.,
global x = 0;
print x - 1;
global y: uint8 = 255;
print y + 1;
global z = 1024;
print cast<uint8>(z);
print 4711/0;
- What happens at compile time if you access a non-existing tuple element, e.g.,
global xs = tuple(1, "a", b"c");
print xs[4711];

global xs: tuple<first: uint8, second: string> = (1, "a");
print xs.third;
- What happens at runtime if you try to get a non-existing vector element, e.g., print vector(1,2,3)[4711];
- What happens at runtime if you try to dereference an invalidated iterator, e.g.,
global xs = vector(1);
global it = begin(xs);
print *it;
xs.pop_back();
print *it;
- Can you dereference a collection's end iterator?
- What happens at runtime if you dereference an unset optional?
Variables
Variables in Spicy can either be declared at local or module (global) scope.
Local variables live in bodies of functions. They are
declared with the local storage qualifier and always mutable.
function hello(name: string) {
local message = "Hello, %s" % name;
print message;
}
Global variables live at module scope. If declared with global they are
mutable, or immutable if declared with const.
module foo;
global N = 0;
N += 1;
const VERSION = "0.1.0";
Conditionals and loops
Conditionals
if/else
Spicy has if statements which can optionally contain else branches.
global x: uint64 = 4711;
if (x > 100) {
print "%d > 100" % x;
} else if (x > 10) {
print "%d > 10" % x;
} else if (x > 1) {
print "%d > 1" % x;
} else {
print x;
}
switch
To match a value against a list of possible options the switch statement can be used.
type Flag = enum {
OFF = 0,
ON = 1,
};
global flag = Flag::ON;
switch (flag) {
case Flag::ON: print "on";
case Flag::OFF: print "off";
default: print "???";
}
In contrast to its behavior in e.g., C, in Spicy
- there is no fall-through in switch, i.e., there is an implicit break after each case,
- switch cases are not restricted to literal integer values; they can contain any expression,
- if no matching case or default is found, a runtime error is raised.
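To illustrate these switch semantics, here is a minimal sketch with a case matching against a non-literal expression and without any break statements. The function `classify` and its values are hypothetical; listing several expressions in one case follows the syntax given in the Spicy reference and is best verified against your Spicy version.

```spicy
module switch_demo;

# Hypothetical helper mapping a number to a label.
function classify(n: uint64): string {
    local web_alt: uint64 = 8080;   # A non-literal expression used in a case.
    local name = "";

    switch (n) {
        case 80, web_alt: name = "http";   # No fall-through into the next case.
        case 443: name = "https";
        default: name = "other";           # Without this, unmatched values raise a runtime error.
    }

    return name;
}

assert classify(80) == "http";
assert classify(8080) == "http";
assert classify(443) == "https";
assert classify(22) == "other";
```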
Loops
Spicy offers two loop constructs:
global xs = vector("a", "b", "c");
for (x in xs)
print x;
global i = 0;
while (i < 3) {
print i;
++i;
}
Functions
Functions in Spicy look like this:
function make_string(x: uint8): string {
return "%d" % x;
}
Functions without return value can either be written without return type, or
returning void.
function nothing1() {}
function nothing2(): void {}
By default function arguments are passed as read-only references. To instead
pass a mutable value declare the argument inout.
function barify(inout x: string) {
x = "%s bar" % x;
}
global s = "foo";
assert s == "foo";
barify(s);
assert s == "foo bar";
While this should work for user-defined types, this still is broken for some builtin types, e.g., it works for passing string values, but is broken for integers.
If support is broken, you need to return a modified copy (use a tuple if you already return a value).
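The workaround of returning a modified copy could look roughly like the following sketch. The functions `bump` and `bump_checked` are invented for illustration and are not part of Spicy.

```spicy
module inout_workaround;

# Instead of declaring `inout x: uint64`, return the new value ...
function bump(x: uint64): uint64 {
    return x + 1;
}

# ... and if the function already returns something, bundle both results in a tuple.
function bump_checked(x: uint64, limit: uint64): tuple<uint64, bool> {
    local bumped = x + 1;
    return (bumped, bumped <= limit);
}

global n: uint64 = 1;
n = bump(n);                    # Caller assigns the returned copy.
assert n == 2;

global res = bump_checked(n, 2);
n = res[0];                     # First element: the modified value.
assert n == 3;
assert ! res[1];                # Second element: the original return value.
```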
Exercises
- Write a function computing values of the Fibonacci sequence, i.e., a function
function fib(n: uint64): uint64 { ... }
  - if n < 2 return n
  - else return fib(n - 1) + fib(n - 2)
For testing you can assert fib(8) == 21;.
- Add memoization to your fib function. For that change its signature to
function fib(n: uint64, inout cache: map<uint64, uint64>): uint64 { ... }
This can then be called like so:
global m_fib: map<uint64, uint64>;
fib(64, m_fib);
For testing you can assert fib(64, m_fib) == 10610209857723;.
- Try modifying your fib functions so users do not have to provide the cache themselves.
Modules revisited
Every Spicy file specifies the module it declares.
module foo;
Other modules can be imported with the import keyword.
Typically, to refer to a type, function or variable in another module, it needs to be declared public.
# file: foo.spicy
module foo;
public global A = 47;
public const B = 11;
const C = 42;
# file: bar.spicy
module bar;
import foo;
print foo::A, foo::B;
# Rejected: 'foo::C' has not been declared public
# print foo::C;
Declaring something public makes it part of the external API of a module.
This makes certain optimizations inapplicable (e.g., dead code removal).
Only declare something public if you intend it to be used by other modules.
Parsing
Parsing in Spicy is centered around the unit type which in many ways looks similar to a struct type.
A unit declares an ordered list of fields which are parsed from the input.
If a unit is public it can serve as a top-level entry point for parsing.
module foo;
public type Foo = unit {
version: uint32;
on %done { print "The version is %s." % self.version; }
};
- The parser for Foo consists of a single parser which extracts a uint32 with the default network byte order.
- The extracted uint32 is bound to a named field to store its value in the unit.
- We added a unit hook which runs when the parser is done.
We can run that parser by using a driver which feeds it input (potentially incrementally).
$ printf '\x00\x00\x00\xFF' | spicy-driver -d hello.spicy
The version is 255.
We use spicy-driver as driver. It reads input from its stdin, feeds it to the parser, and executes hooks.
Another driver is spicy-dump which prints the unit after parsing. Zeek includes its own dedicated driver for Spicy parsers.
The major differences to struct are:
- unit fields need to have a parsable type,
- by default all unit fields are &optional, i.e., a unit value can have any or all fields unset.
Structure of a parser
A parser contains a potentially empty ordered list of subparsers which are invoked in order.
type Version = unit {
major: uint32;
minor: uint32;
patch: uint32;
};
# 4 bytes 4 bytes 4 bytes
# -----------------------------
# | Major | Minor | Patch |
# -----------------------------
#
# Figure 47-11: Version packet
Attributes
The behavior of individual subparsers or units can be controlled with attributes.
type Version = unit {
major: bytes &until=b".";
minor: bytes &until=b".";
patch: bytes &eod;
} &convert="v%s.%s.%s" % (self.major, self.minor, self.patch);
There is a wide range of both generic and type-specific attributes, e.g.,
- the &size and &max-size attributes to control how much data should be parsed,
- attributes &parse-from and &parse-at to change where data is parsed from,
- &convert to transform the value and/or type of parsed data, or
- &requires to enforce postconditions.
Type-specific attributes are documented together with their type.
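As an illustration of combining such attributes, the following sketch parses a length-prefixed record: &requires rejects an empty record and &size limits how much data the payload field consumes. The unit `Record` and its field names are assumptions made for this example.

```spicy
module attrs_demo;

public type Record = unit {
    # Postcondition on the parsed length; a zero length is a parse error.
    len: uint8 &requires=($$ > 0);

    # Consume exactly `len` bytes for the payload.
    payload: bytes &size=self.len;

    on %done { print self; }
};
```

With the file saved as attrs_demo.spicy, one way to try it is

$ printf '\x03abc' | spicy-driver -d attrs_demo.spicy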
Extracting data without storing it
If one needs to extract some data but does not need to store it, one can declare an
anonymous field (without name). With >=spicy-1.9.0 (>=zeek-6.1.0) one
can additionally explicitly skip over input data.
# Parser for a series of digits. When done parsing yields the extracted number.
type Number = unit {
n: /[[:digit:]]+/;
} &convert=self.n;
public type Version = unit {
major: Number;
: b".";
minor: Number;
: skip b".";
patch: Number;
};
Hooks
We can hook into parsing via unit or field hooks.
In hooks we can refer to the current unit via self, and the current field via
$$. We can declare multiple hooks for the same field/unit, even in multiple
files.
public type X = unit {
x: uint8 { print "x=%d" % self.x; }
on %done { print "X=%s" % self; }
};
on X::x {
print "Done parsing x=%d" % $$;
}
Conditional parsing
During parsing we often want to decide at runtime what to parse next, e.g., certain fields might only be set if a previous field has a certain value, or the type for the next field might be known dynamically from a previous field.
We can specify that a field should only be parsed if a condition is met.
type Integer = unit {
width: uint8 &requires=($$ != 0 && $$ < 8);
u8 : uint8 if (self.width == 1);
u16: uint16 if (self.width == 2);
u32: uint32 if (self.width == 3);
u64: uint64 if (self.width == 4);
};
Alternatively we can express this with a unit switch statement.
type Integer = unit {
width: uint8 &requires=($$ != 0 && $$ < 8);
switch (self.width) {
1 -> u8: uint8;
2 -> u16: uint16;
3 -> u32: uint32;
4 -> u64: uint64;
};
};
In both cases the unit will include all fields, both set and unset. One can query whether a field has been set with ?., e.g.,
on Integer::%done {
if (self?.u8) { print "u8 was extracted"; }
}
Often parsing requires examining input and dynamically choosing a matching parser from the input. Spicy models this with lookahead parsing which is explained in a separate section.
Controlling byte order
The used byte order can be controlled on the module, unit, or field level.
# The 'ByteOrder' type is defined in the built-in Spicy module.
import spicy;
# Switch default from network byte order to little-endian for this module.
%byte-order=spicy::ByteOrder::Little;
# This unit uses big byte order.
type X = unit {
# Use default byte order (big).
a: uint8;
# Use little-endian byte order for this field.
b: uint8 &byte-order=spicy::ByteOrder::Little;
} &byte-order=spicy::ByteOrder::Big;
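To see the effect of the byte order on a parsed value, the same two input bytes can be parsed once with the default network byte order and once as little-endian. This is a minimal sketch; the unit `Both` and its field names are invented for illustration.

```spicy
module byteorder_demo;

import spicy;

public type Both = unit {
    # Default: network byte order (big-endian).
    be: uint16;

    # Same width, but interpreted as little-endian.
    le: uint16 &byte-order=spicy::ByteOrder::Little;

    on %done { print "be=%d le=%d" % (self.be, self.le); }
};
```

Feeding the bytes 00 01 twice yields 1 for the big-endian field and 256 for the little-endian one:

$ printf '\x00\x01\x00\x01' | spicy-driver -d byteorder_demo.spicy
be=1 le=256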
Parsing types
Spicy parsers are built up from smaller parsers, at the lowest level from basic types present in the input.
Currently Spicy supports parsing for the following basic types:
Fields not extracting any data can be marked void. They can still have hooks attached.
Since they are pervasive we give a brief overview for vectors here.
Parsing vectors
A common requirement is to parse a vector of elements of the same type, possibly of dynamic length.
To parse a vector of three integers we would write:
type X = unit {
xs: uint16[3];
};
If the number of elements is not known we can parse until the end of the input data. This will trigger a parse error if the input does not contain enough data to parse all elements.
type X = unit {
xs: uint16[] &eod;
};
If the vector is followed by e.g., a literal we can dynamically detect with lookahead parsing where the vector ends. The literal does not need to be a field, but could also be in another parser following the vector.
type X = unit {
xs: uint16[];
: b"\x00"; # Vector is terminated with null byte.
};
If the terminator is in the domain of the vector elements we can also use the
&until attribute.
type X = unit {
# Vector is terminated by a null value.
xs: uint8[] &until=$$==0;
};
If the vector elements require attributes themselves, we can pass them by grouping them with the element type.
type X = unit {
# Parse a vector of 8-byte integers less than 1024 until we find a null.
xs: (uint64 &requires=$$<1024)[] &until=$$==0;
};
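Another common layout is a vector whose length is transmitted in an earlier field. A sketch of this, reusing the fixed-size syntax from above with a dynamic expression (the unit `Counted` is a made-up example):

```spicy
module vector_demo;

public type Counted = unit {
    # Number of elements which follow.
    n: uint8;

    # Parse exactly `n` 16-bit integers.
    xs: uint16[self.n];

    on %done { print self.xs; }
};
```

For an input announcing two elements this prints the parsed vector:

$ printf '\x02\x00\x01\x00\x02' | spicy-driver -d vector_demo.spicy
[1, 2]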
Exercises: A naive CSV parser
Assuming the following simplified CSV format:
- rows are separated by newlines b"\n"
- individual columns are separated by b","
- there are no separators anywhere else (e.g., no , in quoted column values)
A sample input would be
1,a,ABC
2,b,DEF
3,c,GHI
For testing you can use the -f flag to spicy-dump or spicy-driver to read
input from a file instead of stdin, e.g.,
spicy-driver csv_naive.spicy -f input.csv
- Write a parser which extracts the bytes on each row into a list.
Hint 1
Your top-level parser should contain a list of rows which has unspecified length.
Hint 2
Define a new parser for a row which parses bytes until it finds a newline and consumes it.
Solution
module csv_naive;
public type CSV = unit {
    rows: Row[];
};
type Row = unit {
    data: bytes &until=b"\n";
};
- Extend your parser so it also extracts individual columns (as bytes) from each row.
Hint
The &convert attribute allows changing the value and/or type of a field after it has been extracted. This allows you to split the row data into columns.
Is there a builtin function which splits your row data at a separator (consuming the iterator)? Functions on bytes are documented here. You can access the currently extracted data via $$.
Solution
module csv_naive;
public type CSV = unit {
    rows: Row[];
};
type Row = unit {
    cols: bytes &until=b"\n" &convert=$$.split(b",");
};
- Without changing the actual parsing, can you change your grammar so the following output is produced? This can be done without explicit loops.
$ spicy-driver csv_naive.spicy -f input.csv
[[b"1", b"a", b"ABC"], [b"2", b"b", b"DEF"], [b"3", b"c", b"GHI"]]
Hint 1
You could add a unit hook for your top-level unit which prints the rows.
on CSV::%done { print self.rows; }
Since rows is a list of units you still need to massage its data though ...
Hint 2
You can use a unit &convert attribute on your row type to transform it to its row data.
Solution
module csv_naive;
public type CSV = unit {
    rows: Row[];
};
type Row = unit {
    data: bytes &until=b"\n" &convert=$$.split(b",");
} &convert=self.data;
on CSV::%done { print self.rows; }
Adding additional parser state
We might want to add additional state to parsers, e.g.,
- share or modify data outside of our parser, or
- to locally aggregate data while parsing.
Sharing state across multiple units in the same Zeek connection with unit contexts will be discussed separately in a later section.
Passing outside state into units
We might want to pass additional state into a unit, e.g., to parameterize the unit's behavior, or to give the unit access to external state. This can be accomplished with unit parameters.
type X = unit(init: uint64 = 64) {
var sum: uint64;
on %init { self.sum = init; }
: uint8 { self.sum += $$; }
: uint8 { self.sum += $$; }
: uint8 { self.sum += $$; }
on %done { print self.sum; }
};
A few things to note here:
- Unit parameters look a lot like function parameters.
- Unit parameters can have default values which are used if the parameter was not passed.
- We refer to unit parameters by directly using their name; self is not used.
Unit parameters can also be used to give a unit access to its parent units and their state.
public type X = unit {
var sum: uint8;
: (Y(self))[];
};
type Y = unit(inout outer: X) {
: uint8 { outer.sum += $$; }
};
Unit variables
Unit variables allow adding additional data to units. Their data can be accessed like other unit fields.
type X = unit {
var sum: uint8;
: uint8 { self.sum += $$; }
: uint8 { self.sum += $$; }
: uint8 { self.sum += $$; }
on %done { print self.sum; }
};
By default unit variables are initialized with the default value of the type,
e.g., for a uint8 with 0.
If you want to capture whether a unit variable (or any other variable) was set,
use a variable of optional type instead of a dummy value.
To start with a different value, assign the variable in the unit's %init hook,
e.g.,
on %init { self.sum = 100; }
Lookahead parsing
Lookahead parsing is a core Spicy concept. Leveraging lookahead makes it possible to build concise grammars which remain comprehensible and maintainable as the grammar grows.
Deep dive: Parsing of lists of unknown size
We have already seen how we can use lookahead parsing to dynamically detect the length of a list.
type X = unit {
: (b"A")[]; # Extract unknown number of literal 'A' bytes.
x: uint8;
};
We can view the generated parser by requesting grammar debug output from Spicy's
spicyc compiler.
$ spicyc -D grammar x.spicy -o /dev/null -p
#        ~~~~~~~~~~ ~~~~~~~ ~~~~~~~~~~~~ ~~
#        |          |       |            |
#        |          |       |            - emit generated IR
#        |          |       |
#        |          |       - redirect output of generated code to /dev/null
#        |          |
#        |          - compile file 'x.spicy'
#        |
#        - emit 'grammar' debug stream to stderr
[debug/grammar] === Grammar foo::X
[debug/grammar] Epsilon: <epsilon> -> ()
[debug/grammar] While: anon -> while(<look-ahead-found>): anon_2 [field: anon (*)] [item-type: vector<bytes>] [parse-type: vector<bytes>]
[debug/grammar] Ctor: anon_2 -> b"A" (bytes) (container 'anon') [field: anon_2 (*)] [item-type: bytes] [parse-type: bytes]
[debug/grammar] LookAhead: anon_l1 -> {uint<8> (not a literal)}: <epsilon> | {b"A" (bytes) (id 1)}: anon_l2
[debug/grammar] Sequence: anon_l2 -> anon_2 anon_l1
[debug/grammar] (*) Unit: foo_X -> anon x
[debug/grammar] Variable: x -> uint<8> [field: x (*)] [item-type: uint<8>] [parse-type: uint<8>]
[debug/grammar]
[debug/grammar] -- Epsilon:
[debug/grammar] anon = true
[debug/grammar] anon_l1 = true
[debug/grammar] anon_l2 = false
[debug/grammar] foo_X = false
[debug/grammar]
[debug/grammar] -- First_1:
[debug/grammar] anon = { anon_2 }
[debug/grammar] anon_l1 = { anon_2 }
[debug/grammar] anon_l2 = { anon_2 }
[debug/grammar] foo_X = { anon_2, x }
[debug/grammar]
[debug/grammar] -- Follow:
[debug/grammar] anon = { x }
[debug/grammar] anon_l1 = { x }
[debug/grammar] anon_l2 = { x }
[debug/grammar] foo_X = { }
[debug/grammar]
In the above debug output the entry point of the grammar is marked (*).
- parsing a unit consists of parsing the anon field (corresponding to the anonymous list), and x
- to parse the list lookahead is used.
- lookahead inspects a uint8 (as epsilon) or literal b"A"
Types for lookahead
In addition to literals, lookahead also works with units which start with a literal. Spicy transparently detects such units and will use them for lookahead if possible.
Confirm this yourself by wrapping the literal in above unit in its own unit, and
validating by parsing an input like AAAAA\x01. Are there any major differences
in the generated grammar?
Using lookahead for conditional parsing
We have seen previously how we can use unit switch for conditional parsing.
Another instance of conditional parsing occurs when a protocol message
holds one of multiple possible sub-messages (a union). The sub-messages often
contain a tag to denote what kind of sub-message is transmitted.
With a unit switch statement we could model this like so.
public type X = unit {
tag: uint8;
switch (self.tag) {
1 -> m1: Message1;
2 -> m2: Message2;
* -> : skip bytes &eod; # For unknown message types simply consume all data.
};
};
type Message1 = unit {
payload: bytes &eod;
};
type Message2 = unit {
payload: bytes &eod;
};
The unit switch statement has a form without control variable which instead
uses lookahead. With this we can push parsing of the tag variable into the
units concerned with the particular messages so we keep all pieces related to a
particular message together.
public type X = unit {
switch {
-> m1: Message1;
-> m2: Message2;
-> : skip bytes &eod; # For unknown message types, simply consume all data.
};
};
type Message1 = unit {
: skip uint8(1);
payload: bytes &eod;
};
type Message2 = unit {
: skip uint8(2);
payload: bytes &eod;
};
Error recovery
Even with a grammar perfectly modelling a specification, parsing of real data can fail due to e.g.,
- endpoints not conforming to spec, or
- gaps in the input data due to capture loss.
Instead of aborting parsing altogether we would like to recover gracefully from parse errors, i.e., when the parser encounters a parse error we would like to skip input until it can parse again.
Spicy includes support for expressing such recovery with the following model:
- To resynchronize the input potential synchronization points are annotated, e.g., to synchronize input at the sequence b"POST" the grammar might contain a field
: b"POST" &synchronize;
All constructs supporting lookahead parsing can be synchronization points, e.g., literals or fields with unit type with a literal at a fixed offset.
- On a parse error the unit enters a synchronization trial mode. Once the input could be synchronized a %synced hook is invoked. The implementation of the hook can examine the data up to the &synchronize field, and either confirm it to leave trial mode and continue normal parsing, or reject it to look for a later synchronization point.
Exercises
Let's assume we are parsing a protocol where valid messages are always the
sequence AB, i.e., the byte sequence b"AB". We will use the following
contrived grammar:
module foo;
public type Messages = unit {
: Message[];
};
type Message = unit {
a: b"A";
b: b"B";
};
on Message::%done { print self; }
- Validate that this grammar can parse the input ABABAB
$ printf ABABAB | spicy-driver %
[$a=b"A", $b=b"B"]
[$a=b"A", $b=b"B"]
[$a=b"A", $b=b"B"]
- What do you see if you pass misspelled input, like with the second A changed to 1, i.e., the input AB1BAB? Why is this particular source range shown as error location?
Solution
[$a=b"A", $b=b"B"]
[error] terminating with uncaught exception of type spicy::rt::ParseError: no expected look-ahead token found (foo.spicy:3:30-4:17)
We first see the result of parsing for the first Message from AB, and encounter an error for the second element.
The error corresponds to parsing the vector inside Messages. The grammar expects either A to start a new Message, or end of data to signal the end of the input; 1 matches neither so lookahead parsing fails.
- What are the potential synchronization points in this grammar we could use so we can extract the remaining data?
Solution
In this case parsing failed at the first field of Message, Message::a. We could
a. synchronize on Message::b by changing it to
b: b"B" &synchronize;
b. synchronize on Message::a in the next message, i.e., abandon parsing the remaining fields in Message and start over. For that we would synchronize on the vector elements in Messages,
: (Message &synchronize)[];
A slight modification of this grammar seems to fail to synchronize and run into an edge case, https://github.com/zeek/spicy/issues/1594.
- If you had to choose one, which one would you pick? What are the trade-offs?
Solution
  - If we synchronize on Message::b it would seem that we should be able to recover at its data. This however does not work since the vector uses lookahead parsing, so we would fail already in Messages before we could recover in Message.
  - We need to synchronize on the next vector element. In larger units synchronizing high up (e.g., on a vector in the top-level unit) allows recovering from more general errors at the cost of not extracting some data, e.g., we would be able to also handle misspelled Bs in this example.
- Add a single &synchronize attribute to the grammar so you can handle all possible misspellings. Also add a %synced hook to confirm the synchronization result (on which unit?). Can you parse inputs like these?
ABABAB
AB1BAB
A11BAB
You can enable the spicy-verbose debug stream to show parsing progress.
printf AB1BAB | HILTI_DEBUG=spicy-verbose spicy-driver -d foo.spicy
Solution
module foo;
public type Messages = unit {
    : (Message &synchronize)[];
};
type Message = unit {
    a: b"A";
    b: b"B";
};
on Message::%done { print self; }
on Messages::%synced { confirm; }
Zeek integration
Zeek supports writing packet, protocol or file analyzers with Spicy. In addition to allowing inclusion of unmodified Spicy grammars, additional features include:
- automatic generation of Zeek analyzers from Spicy parsers via interface definition (EVT) files,
- ability to trigger Zeek events from Spicy unit hooks,
- (automatic) exporting of types defined in Spicy as Zeek record types,
- a Spicy module to control Zeek from Spicy code.
Getting started
The recommended approach to integrate a Spicy parser with Zeek is to use the default Zeek package template.
We can create Zeek packet, protocol or file analyzers by selecting the appropriate template feature. E.g., to create a new Zeek package for a protocol analyzer and interactively provide required user variables,
zkg create --packagedir my_analyzer --features spicy-protocol-analyzer
zkg uses Git to track package information. When running in a VM, this can
cause issues if the package repository is in a mounted directory. If you run
into this, try creating the package in a directory which is not mounted from the
host.
Use the template to create a Spicy protocol analyzer for analyzing TCP traffic now to follow along with later examples.
This will create a protocol analyzer from the template. Items which need to be
updated are marked TODO. It will generate e.g.,
- zkg.meta: package metadata describing the package and setting up building and testing
- analyzer/
  - *.evt: interface definition for exposing Spicy parser as Zeek analyzer
  - *.spicy: Spicy grammar of the parser
  - zeek_*.spicy: Zeek-specific Spicy code
- scripts/
  - main.zeek: Zeek code for interacting with the analyzer
  - dpd.sig: Signatures for dynamic protocol detection (DPD)
- testing/tests: BTest test cases
You can use zkg to install the package into your Zeek installation.
zkg install <package_dir>
To run its tests, e.g., during development:
zkg test <package_dir>
The generated project uses CMake for building and BTest for testing. You can
build manually, e.g., during development. The test scaffolding assumes that the
CMake build directory is named build.
# Building.
mkdir build
(cd build && cmake .. && make)
# Testing.
(cd testing && btest)
We can show available template features with zkg template info.
$ zkg template info
API version: 1.0.0
features: github-ci, license, plugin, spicy-file-analyzer, spicy-packet-analyzer, spicy-protocol-analyzer
origin: https://github.com/zeek/package-template
provides package: true
user vars:
name: the name of the package, e.g. "FooBar" or "spicy-http", no default, used by package, spicy-protocol-analyzer, spicy-file-analyzer, spicy-packet-analyzer
namespace: a namespace for the package, e.g. "MyOrg", no default, used by plugin
analyzer: name of the Spicy analyzer, which typically corresponds to the protocol/format being parsed (e.g. "HTTP", "PNG"), no default, used by spicy-protocol-analyzer, spicy-file-analyzer, spicy-packet-analyzer
protocol: transport protocol for the analyzer to use: TCP or UDP, no default, used by spicy-protocol-analyzer
unit: name of the top-level Spicy parsing unit for the file/packet format (e.g. "File" or "Packet"), no default, used by spicy-file-analyzer, spicy-packet-analyzer
unit_orig: name of the top-level Spicy parsing unit for the originator side of the connection (e.g. "Request"), no default, used by spicy-protocol-analyzer
unit_resp: name of the top-level Spicy parsing unit for the responder side of the connection (e.g. "Reply"); may be the same as originator side, no default, used by spicy-protocol-analyzer
author: your name and email address, Benjamin Bannier <benjamin.bannier@corelight.com>, used by license
license: one of apache, bsd-2, bsd-3, mit, mpl-2, no default, used by license
versions: v0.99.0, v1.0.0, v2.0.0, v3.0.0, v3.0.1, v3.0.2
Protocol analyzers
For a TCP protocol analyzer the template generated the following declaration in
analyzer/*.evt:
protocol analyzer Foo over TCP:
parse originator with foo::Request,
parse responder with foo::Response;
Here we declare a Zeek protocol analyzer Foo which uses two different parsers
for the originator (client) and responder (server) side of the connection,
Request and Response. To use the same parser for both sides we would declare
parse with foo::Messages;
Message and connection semantics: UDP vs. TCP
The parsers have these stub implementations:
module foo;

public type Request = unit {
    payload: bytes &eod;
};

public type Response = unit {
    payload: bytes &eod;
};
We have used &eod to denote that we want to extract all data. The semantics
of all data differ between TCP and UDP parsers:
- UDP: UDP has no connection concept so Zeek synthesizes UDP "connections" from flows by grouping UDP messages with the same 5-tuple in a time window. UDP has no reassembly, so a new parser instance is created for each UDP packet; &eod means until the end of the current packet.
- TCP: TCP supports connections and packet reassembly, so both sides of a connection are modelled as streams with reassembled data; &eod means until the end of the stream. The stream is unbounded.
For this reason one usually wants to model parsing of a TCP connection as a list of protocol messages, e.g.,
public type Requests = unit {
    : Request[];
};

type Request = unit {
    # TODO: Parse protocol message.
};
- the length of the list of messages is unspecified so it is detected dynamically
- to avoid storing an unbounded list of messages we use an anonymous field for the list
- parsing of the protocol messages is responsible for detecting when a message ends
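As a sketch of the last point, a message can carry its own length so that the parser knows where it ends. The wire format below (a 2-byte length prefix followed by that many payload bytes) is a hypothetical example and not part of the template; the field names len and payload are assumptions made for illustration.

```spicy
module foo;

# Top-level unit for one side of a TCP connection: a stream of messages.
public type Requests = unit {
    # Anonymous field, so parsed messages are not accumulated in the unit.
    : Request[];
};

# Hypothetical message layout: 2-byte length prefix followed by the payload.
type Request = unit {
    len: uint16;

    # The message ends after exactly `len` payload bytes.
    payload: bytes &size=self.len;

    on %done { print self; }
};
```

To experiment with it outside of Zeek, one could feed two such messages with, e.g., printf '\000\002hi\000\003bye' | spicy-driver foo.spicy.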
Analyzer lifecycle
In Zeek's model all eligible analyzers see the traffic.
- If analyzers detect traffic not matching their protocol, they should signal an analyzer violation to Zeek so they stop receiving data. This is not an error during protocol detection.
- To signal matching traffic, analyzers should signal an analyzer confirmation to Zeek. This leads, e.g., to associating the protocol/service with the connection.
flowchart TD
N((fa:fa-cloud)) -->|data| Z(Zeek)
Z -->|looks up| Reg[Analyzers registered for port]
Z --> |forwards for matching| dpd[Analyzers with matching signatures]
Reg -->|data| A1
Reg --> |data|A2
dpd -->|data| B1
dpd --> |data|B2
dpd --> |data|B3
AC(fa:fa-bell analyzer_confirmation)
style AC fill:lightgreen
AV(fa:fa-bell analyzer_violation)
style AV fill:red
B1 -->|triggers| AV
B2 -->|triggers| AV
B3 -->|triggers| AC
A1 -->|triggers| AV
A2 -->|triggers| AV
To integrate the parser into this the template generated the following stub implementations in analyzer/zeek_*.spicy:
# TODO: Protocol analyzers should confirm once they are reasonably sure that
# they are indeed parsing the right protocol. Pick a unit that's a little bit
# into the parsing process here.
#
# on Foo::SUITABLE_UNIT::%done {
#     zeek::confirm_protocol();
# }

# Any error bubbling up to the top unit will trigger a protocol rejection.
on Foo::Request::%error {
    zeek::reject_protocol("error while parsing Foo request");
}

on Foo::Response::%error {
    zeek::reject_protocol("error while parsing Foo reply");
}
We can use zeek::confirm_protocol and zeek::reject_protocol to signal Zeek.
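One possible way to fill in the TODO is sketched below. It assumes that successfully parsing one complete Request is sufficient evidence for the protocol; which unit is suitable depends on the protocol at hand. The module name Zeek_Foo is an assumption, and the import must match the module name declared in the grammar (foo in the stub grammar shown earlier).

```spicy
module Zeek_Foo;

import foo;
import zeek;

# Confirm the protocol once a complete request was parsed successfully.
on foo::Request::%done {
    zeek::confirm_protocol();
}

# A parse error in the request rejects the protocol.
on foo::Request::%error {
    zeek::reject_protocol("error while parsing Foo request");
}
```

Confirming in %done of a whole message is a conservative choice: the hook only runs if every field of the unit parsed successfully.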
Passing data to Zeek
Ultimately we want to make the parsed data available to Zeek for analysis and logging.
The handling of events is declared in the EVT file analyzer/*.evt.
# TODO: Connect Spicy-side events with Zeek-side events. The example just
# defines simple example events that forwards the raw data (which in practice
# you don't want to do!).
on Foo::Request -> event Foo::request($conn, $is_orig, self.payload);
on Foo::Response -> event Foo::reply($conn, $is_orig, self.payload);
- the LHS specifies a Spicy hook in Spicy syntax
- the RHS specifies a (possibly generated) Zeek event in Zeek syntax
- we can reference Spicy data via self on the RHS
- data for builtin Spicy types are converted automatically to equivalent Zeek types
- we can automatically generate Zeek record types from Spicy types
- information about the generated analyzer is accessible via magic variables $conn, $file, $packet, $is_orig
The event is handled on the Zeek side in scripts/main.zeek, e.g.,
# Example event defined in foo.evt.
event Foo::request(c: connection, is_orig: bool, payload: string)
    {
    hook set_session(c);

    local info = c$foo;
    info$request = payload;
    }
Passing data to other Zeek analyzers (e.g., for analyzing subprotocols and files) is handled in a later section.
Forwarding to other analyzers
One often wants to forward an extracted payload to other analyzers, e.g.,
- HTTP messages with files
- compressed files containing PE files
- protocols using other sub-protocols
Inside Spicy we can forward data from one parser to another one with sink
values,
but in a Zeek context we can also forward data to other analyzers (Spicy or
not).
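For comparison, forwarding inside Spicy with a sink could look like the following minimal sketch. The units Outer and Inner and the framing (a 1-byte length followed by data) are hypothetical and only serve to illustrate connecting a parser to a sink and writing data into it.

```spicy
module foo;

# Outer framing: 1-byte length followed by data meant for the inner parser.
public type Outer = unit {
    sink inner;

    # Attach a new instance of the inner parser to the sink.
    on %init { self.inner.connect(new Inner); }

    length: uint8;

    # Forward the extracted bytes into the sink.
    : bytes &size=self.length { self.inner.write($$); }
};

# Inner parser receiving the forwarded data.
public type Inner = unit {
    data: bytes &eod;

    on %done { print "inner", self; }
};
```

Since this module has two public units, spicy-driver needs to be told where to start, e.g., printf '\005hello' | spicy-driver -p foo::Outer foo.spicy.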
Forwarding to file analyzers
Let's assume we are parsing protocol messages which contain bytes
corresponding to a file. We want to feed the file data into Zeek's file
analysis.
type Message = unit {
: bytes &chunked &size=512;
};
By using the &chunked attribute on the bytes field, its field hook is invoked as soon as a chunk of data
arrives, even if the full data is not yet available.
The caveat is that only the final chunk will be stored once parsing is done. This is
fine since we usually do not store the data.
The protocol for passing data is:
- open a handle for a new Zeek file with zeek::file_begin, optionally specifying a MIME type
- pass information to Zeek, e.g., feed data or gaps, or notify Zeek about the expected size
- close the handle with zeek::file_end
E.g.,
import zeek;

public type File = unit {
    var h: string;

    on %init { self.h = zeek::file_begin(); }

    : bytes &chunked &eod {
        zeek::file_data_in($$, self.h);
    }

    on %done { zeek::file_end(self.h); }
};
File handles need to be closed explicitly.
Not closing them would leak them for the duration of the connection.
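The optional steps of the protocol (specifying a MIME type and announcing the expected size) could be sketched as follows. The message layout (a 4-byte length followed by the file data) and the MIME type are assumptions made for illustration; zeek::file_set_size is used here as documented for the zeek Spicy module and should be checked against the Zeek version in use.

```spicy
module foo;

import zeek;

# Hypothetical message: 4-byte length followed by that many bytes of file data.
public type Message = unit {
    var h: string;

    len: uint32 {
        # Open the file handle with an assumed MIME type and announce its size.
        self.h = zeek::file_begin("application/octet-stream");
        zeek::file_set_size(self.len, self.h);
    }

    # Forward each chunk as it arrives; the data itself is not stored.
    : bytes &chunked &size=self.len {
        zeek::file_data_in($$, self.h);
    }

    # Close the handle explicitly once the message was parsed.
    on %done { zeek::file_end(self.h); }
};
```

Note that %done only runs on success, so a parser which can fail after opening the handle would additionally need to close it during error handling.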
Forwarding to protocol analyzers
Forwarding to protocol analyzers follows a similar protocol of opening a handle, interacting with it, and closing it.
For opening a handle, two APIs are supported:
function zeek::protocol_begin(analyzer: optional<string> = Null);
function zeek::protocol_handle_get_or_create(analyzer: string) : ProtocolHandle;
When using zeek::protocol_begin without argument all forwarded data will be
passed to Zeek's dynamic protocol detection (DPD).
Otherwise use the Zeek name of the analyzer, e.g.,
local h = zeek::protocol_handle_get_or_create("SSL");
You can inspect the output of zeek -NN for available analyzer names, e.g.,
$ zeek -NN | grep ANALYZER | grep SSL
[Analyzer] SSL (ANALYZER_SSL, enabled)
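Putting the three steps together for the DPD case could look like the sketch below. The unit Tunnel is hypothetical; zeek::protocol_data_in, zeek::is_orig and zeek::protocol_end are used as documented for the zeek Spicy module, and their exact signatures should be checked against the Zeek version in use.

```spicy
module foo;

import zeek;

# Hypothetical unit whose whole payload belongs to an encapsulated protocol.
public type Tunnel = unit {
    # No analyzer name given, so Zeek's DPD decides which analyzer applies.
    on %init { zeek::protocol_begin(); }

    # Forward each chunk for the side of the connection currently being parsed.
    : bytes &chunked &eod {
        zeek::protocol_data_in(zeek::is_orig(), $$);
    }

    # Stop forwarding once this unit is done.
    on %done { zeek::protocol_end(); }
};
```

To forward to a specific analyzer instead, the name reported by zeek -NN (e.g., "SSL") would be passed to zeek::protocol_begin.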
Sharing data across the same connection
We sometimes want to correlate information from the originator and responder side of a connection, and need to share data across the same connection.
Often we can do that in Zeek script land, e.g.,
# Example: Mapping of connections to their request method.
#
# NOTE: FOR DEMONSTRATION ONLY. WE MIGHT E.G., WANT TO ALLOW MULTIPLE REQUESTS
# PER CONNECTION.
global methods: table[conn_id] of string &create_expire=10sec;
event http_request(c: connection, method: string, original_URI: string, unescaped_URI: string, version: string)
    {
    # Store method for correlation. The connection's conn_id is available as
    # c$id; c$conn is the optional Conn log record and might not be set yet.
    methods[c$id] = method;
    }

event http_reply(c: connection, version: string, code: count, reason: string)
    {
    local id = c$id;

    if ( id in methods )
        {
        local method = methods[id];
        print fmt("Saw reply %s to %s request on %s", code, method, id);
        }
    else
        {
        print fmt("Saw reply to unseen request on %s", id);
        return;
        }
    }
This assumes that we always see requests before replies. Depending on how we collect and process data this might not always hold.
If we need this information during parsing this is too late. Spicy allows
sharing information across both sides with unit
contexts.
When declaring a Spicy analyzer Zeek automatically sets things up so that originator and
responder of a connection share a context.
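type Context = tuple<method: string>;

type Request = unit {
    %context = Context;

    # Regular expression fields yield bytes, so decode before storing a string.
    method: /[^ \t\r\n]+/ { self.context().method = $$.decode(); }

    # ...
};

type Reply = unit {
    %context = Context;

    # ... (the elided fields are assumed to include `code`)

    on %done { print "Saw reply %s to %s request" % (self.code, self.context().method); }
};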
Exercise
Starting from the default protocol analyzer template we want to (redundantly) pass the number of
bytes for Request to Zeek as well.
- In the EVT file pass the number of bytes in self.payload.
Solution
on Foo::Request -> event Foo::request($conn, $is_orig, self.payload, |self.payload|);
- Manually build your changed analyzer:
mkdir build
cd build/
cmake ..
make
- Execute the test suite. This runs tests against an included PCAP file. What do you see?
cd testing/
btest -dv
Solution
Test tests.trace fails. Its sources are in testing/tests/trace.zeek.
... analyzer error in <..>/foo/analyzer/foo.evt, line 16: Event parameter mismatch, more parameters given than the 3 that the Zeek event expects
- Fix the signatures of the handlers for Foo::request so tests pass. What type do you need to use on the Zeek side to pass the length (uint64 in Spicy)?
Hint
The type mappings are documented here.
Solution
In both testing/tests/trace.zeek and scripts/main.zeek change the signatures to
event Foo::request(c: connection, is_orig: bool, payload: string, len: count) {}
- Modify testing/tests/trace.zeek to include the length in the baseline, i.e., change the test case for Foo::request to
print fmt("Testing Foo: [request] %s %s %d", c$id, payload, len);
Rerun tests and update the test baseline with
cd testing/
btest -u
Make sure all tests pass with these changes.
Stage and commit all changes in the package repository.
git add -u
git commit -v -m "Pass payload length to Zeek"
Validate that the package also tests fine with zkg. This will require no uncommitted changes or untracked files in the repository.
# Make progress more verbose.
zkg -vvv test .
- Optional: Also add the length to the Zeek log generated from the code in scripts/main.zeek.
Hint
This requires adding a count &optional &log field to the Info record.
Set the field from the event handler for Foo::request.
Update test baselines as needed.