Encoding Techniques in Verified Serializers

March 28, 2018 20 minute read

Written by Ethan Shea, edited by James Wilcox.

Serialization converts in-memory data to an external representation, typically a list or stream of bytes, which is then ready to be stored on disk or sent over the network.

This post describes Cheerios, a verified library for serialization in Coq. Cheerios uses typeclasses to make it easy to create new serializers by composing existing serializers, such that the correctness proofs also compose. We first give an overview of the core definitions of Cheerios and show how to build simple serializers for booleans, natural numbers, and pairs. Then, we describe two generic strategies for serializing recursive “container-like” types, such as lists and trees, and discuss the tradeoffs in proof effort between the strategies. A recurring theme is the challenge of expressing decoders via structural recursion.

This post is generated from a literate Coq file, which we encourage you to step through.

Defining Serialization

In order to define serialization, three things are needed, types for the serialization and deserialization functions, and a correctness specification. The correctness spec should roughly show that serialization and deserialization are inverses.This enables the proof that any object can be serialized then deserialized into the same object. We’ll start with serialization because it conceptually comes first in the process.

In order to serialize something, all of it’s information must be mapped into bits. It makes sense then to define a serializer for some type A as A -> list bool. Take the following type for example, representing olympic medals:

Inductive medal := Gold | Silver | Bronze.

A serialization function should map each case to a symbol of bits. There are many ways this could be done, each with different trade offs that will be explored later. For now, we just pick one.

Definition medal_serialize (m: medal) : list bool :=
  match m with
  | Gold => [true; true]
  | Silver => [true; false]
  | Bronze => [false]
  end.

As it turns out, this first attempt at a type will be exactly what is needed.

Now a type for the deserializer can be determined. We want something that acts as an inverse to the serialization function we picked. At first thought, list bool -> A seems like a good option. This would allow the correctness spec to be deserialize (serialize a) = a. However, this runs into problems pretty quickly.

Fail Definition medal_deserialize (bools: list bool) : medal :=
  match bools with
  | [true; true] => Gold
  | [true; false] => Silver
  | [false] => Bronze
  end.

Coq catches the mistake and points out that the bools is not exhaustively matched on. What if it’s empty? Fundamentally, this problem is encountered because not every sequence of booleans decodes into a medal. Even non-empty sequences such as [false; true] pose issues. Since these sequences are not produced by the serializer, they can be considered erroneous. In cheerios we handle this case by returning the option constructor None to indicate an error.

This makes the spec become deserialize (serialize a) = Some a. In English: deserialization on any serialized stream always succeeds and returns the correct value.

Definition medal_deserialize1 (bools: list bool):option medal :=
  match bools with
  | [true; true] => Some Gold
  | [true; false] => Some Silver
  | [false] => Some Bronze
  | _ => None
  end.

This works for a single medal being encoded in the bitstream, but there are problems when the work from above is reused to a type which requires composition, like a pair of medals. Serialization works just fine, but deserialization is problematic.

Definition medal_serialize_pair (medals: medal * medal) :=
  medal_serialize (fst medals) ++ medal_serialize (snd medals).

Fail Definition medal_deserialize_pair (bools: list bool)
    : option (medal * medal) :=
(medal_deserialize1 bools, medal_deserialize1 hmmm).

When deserializing the first medal, the entire list is consumed. There is nothing to pass into the second call to medal_deserialize1 because it is not known how much of the list has been deserialized. The definition of deserialize needs a way to communicate how much of the stream is remaining back to the caller. In Cheerios, this is represented with the type medal * list bool where the deserialized medal and remaining portion of the stream are returned. This is wrapped in an option to allow the entire deserialization operation to fail. Failure happens at this level because once an error is encountered, it is impossible in general to resume serialization of the remaining content.

Definition medal_deserialize (bools: list bool)
	: option (medal * list bool) :=
  match bools with
  | true :: true :: bools => Some (Gold, bools)
  | true :: false :: bools => Some (Silver, bools)
  | false :: bools => Some (Bronze, bools)
  | _ => None
  end.

As we will see shortly, this type is sufficient to support both composition and malformed inputs. Let’s take a moment to generalize these definitions before continuing so we can arrive at a definition for the spec.

Definition serializer (A: Type) := A -> list bool.

Definition deserializer (A: Type) :=
  list bool -> option (A * list bool).

How does this alter the correctness specification? We can start by taking what we had last time and making it typecheck:

deser (ser a) = Some (a, [])

However this still doesn’t address the problem with the remaining bools. How do we reason about deserialization with any other input following? Another attempt leads us to something like this:

deser (ser a ++ ser b) = Some (a, ser b)

This works, but now exactly two objects must be encoded in the stream. We can’t easily reason about deserializing multiple objects or a single object this way. Generalizing again for what comes after gives:

deser (ser a ++ bools) = Some (a, bools)

Now the dependence on a second object is removed and as a side effect the spec becomes stronger, allowing any data to follow rather than just data produced by some serializer.

Note that the spec only needs to worry about encodings which the serializer produces. This eliminates our need to reason about the error cases that were necessary in the deserializer. However, in doing this, nothing is said about how malformed bitstrings are parsed, or that every deserialized value can be generated by exactly one bit string. These may be useful properties to prove, but cheerios does not handle deserialization from unknown and unverified sources since this minimum spec is enough.

Definition ser_deser_spec A
           (ser : serializer A)
           (deser : deserializer A) :=
  forall (a : A) (bools: list bool),
      (deser (ser a ++ bools)) = Some (a, bools).

Wrapping this up in a class gives us the following definition which includes the following three things: a serializer, a deserializer, and a proof of correctness.

Class Serializer (A : Type) : Type := {
    serialize : A -> list bool;
    deserialize : list bool -> option (A * list bool);
    ser_deser_identity : ser_deser_spec A serialize deserialize
}.

In general, the correctness proofs tend to be straightforward and repetitive, but this first one is included here to show the structure. Concretely this becomes:

Theorem medal_ser_deser_identity :
  ser_deser_spec medal medal_serialize medal_deserialize.
Proof.
  unfold ser_deser_spec.
  unfold medal_deserialize.
  unfold medal_serialize.
  intros m.
  destruct m; reflexivity.
Qed.

Instance MedalSerializer : Serializer medal.
Proof.
exact {| serialize := medal_serialize;
         deserialize := medal_deserialize;
         ser_deser_identity := medal_ser_deser_identity;
       |}.
Defined.

Generalizing this pair deserailizer for arbitrary types A and B comes naturally now that there are better type signatures for serialization and deserialization. Wrapping all three components in a section avoids some boilerplate. Note that the type system requires a serializer for A and B in order for the A * B serializer to function.

Section PairSerializer.
Variable A : Type.
Variable B : Type.
Variable serA : Serializer A.
Variable serB : Serializer B.

Definition pair_serialize (p : A * B) : list bool :=
  serialize (fst p) ++ serialize (snd p).

Definition pair_deserialize bools 
    : option ((A * B) * list bool) :=
  match deserialize bools with
  | Some (a, bools) => 
    match deserialize bools with
    | Some (b, bools) => Some ((a, b), bools)
    | None => None
    end
  | None => None
  end.

Theorem pair_ser_deser_identity : 
  ser_deser_spec (A * B) pair_serialize pair_deserialize.
Proof.
  unfold ser_deser_spec.
  intros.
  unfold pair_serialize.
  rewrite app_ass.
  unfold pair_deserialize.
  rewrite ser_deser_identity, ser_deser_identity.
  rewrite <- surjective_pairing.
  reflexivity.
Qed.

Instance PairSerializer : Serializer (A * B).
Proof.
exact {| serialize := pair_serialize;
         deserialize := pair_deserialize;
         ser_deser_identity := pair_ser_deser_identity;
       |}.
Defined.

End PairSerializer.

Note that the variable bools is shadowed several times in this definition. Normally this can complicate code, but in this case it improves clarity because bools always refers to “what’s left to parse”.

Now, we will build a simple (inefficient¹) serializer/deserializer for a more useful datatype, nats. The encoding is essentially the unary representation of the natural number.

Fixpoint nat_serialize (n : nat) : list bool :=
  match n with
  | O => [false]
  | S n => [true] ++ (nat_serialize n)
  end.

Fixpoint nat_deserialize bools : option (nat * list bool) :=
  match bools with
  | true :: bools => 
    match nat_deserialize bools with
    | None => None
    | Some (n, bools) => Some (S n, bools)
    end
  | false :: bools => Some (O, bools)
  | [] => None (* Deserializing an empty stream *)
  end.

Theorem nat_ser_deser_identity :
  ser_deser_spec nat nat_serialize nat_deserialize.
Proof.
  unfold ser_deser_spec.
  intros n; induction n; intros.
  - simpl. reflexivity.
  - simpl.
    rewrite IHn.
    reflexivity.
Qed.

Instance NatSerializer : Serializer nat.
Proof.
exact {| serialize := nat_serialize;
         deserialize := nat_deserialize;
         ser_deser_identity := nat_ser_deser_identity;
       |}.
Defined.

Notice that the information about when to stop deserialization of each element must be encoded into the stream itself. For example with the following definition of nat_serialize, deserialization of nat * nat would become problematic.

Fixpoint nat_serialize_broken (n : nat) : list bool :=
  match n with
  | O => []
  | S n => [true] ++ (nat_serialize n)
  end.

Under this definition, it’s unclear what deserializing [true, true true] as a pair of nats should return. It could be (0,3), (1,2), (2,1) or (3,0). To remove this ambiguity, the information about when to stop must be encoded in the stream itself in one form or another rather than implicitly by using the end of the stream as a token. Consider the serialized pair of nats [true, false, true, true, false], serialized using the not-broken serializer. It is unambiguously (1, 2). When deserializing it is known precisely when each nat finishes (when false is reached), and when the pair finishes (when the second nat finishes). This information about the structure of the encoded data plays a crucial part in showing ser_deser_identity.

List Serialization

When serializing lists (or any variable sized collection) there must be some information about the structure in the serialized stream. Imagine this is not done, and a pair of lists is serialized into the byte stream. This would produce an encoding which looks like the figure below. It’s impossible to tell where one list stops and the next begins just by looking at the stream.

This serializer is broken for the same reason as the broken nat serializer, the information in a serialized object must be entirely contained within the bitstream. Note that we don’t run into this problem with any collection of fixed size, like a pair or vector. It is clear when to stop deserializing a Vec 5 because 5 elements have been deserialized. In this case, the information about the shape of the data in this case is encoded in the type. Since the type is known to the serializer and the deserializer, it does not need to be encoded in the bitstream.

Let’s start with solving this problem by including a “continue” bit before every element. If it is true an element follows, and if it is false, the end of the list has been reached. This appears as follows:

Let’s see what this looks like in code.

Fixpoint list_serialize_inter (l : list A) : list bool :=
  match l with
  | [] => [false]
  | h :: t => [true] ++ serialize h ++ list_serialize_inter t
  end.

With this scheme, deserialization again proves to be difficult. In the definition below, because bools_after_elem is not a syntactic subterm of bools, the termination checker refuses to accept this definition. The fact that bools_after_elem is returned from a function hides the subterm property from the typechecker. When executed, the definition does terminate, since bools_after_elem is a strict suffix of bools, but the type system does not see this. An attempted definition is given below:

Fail Fixpoint list_deserialize_inter
  (bools: list bool) : option (list A * list bool) :=
  match bools with
  | [] => None
  | false :: bools => Some ([], bools)
  | true :: bools =>
    match deserialize bools with
    | None => None
    | Some (a, bools_after_elem) =>
      match list_deserialize_em bools_after_elem with
      | None => None
      | Some (tail, bools_after_list) =>
          Some (a :: tail, bools_after_list)
      end
    end
  end.

It is intuitively impossible to define this deserialization function without using general recursion. To solve this recursion problem, the same information encoded in the continuation bits can be moved to the front of the list’s encoding in the form of a size. Then the rest of the deserializer can recurse on the number of elements remaining.

Programmatically,

Fixpoint list_serialize_elts (l : list A) : list bool :=
  match l with
  | [] => []
  | h :: t => serialize h ++ list_serialize_elts t
  end.

Definition list_serialize (l : list A) : list bool :=
  nat_serialize (length l) ++ list_serialize_elts l.

Fixpoint list_deserialize_elts (size : nat) (bools : list bool)
      : option (list A * list bool) :=
  match size with
  | O => Some ([], bools)
  | S size => 
    match deserialize bools with
    | None => None
    | Some (n, bools) =>
      match list_deserialize_elts size bools with
      | None => None
      | Some (tail, bools) => Some (n :: tail, bools)
      end
    end
  end.

Definition list_deserialize bools :=
  match deserialize bools with
  | None => None
  | Some (size, bools) => list_deserialize_elts size bools
  end.

This gives a definition which can be defined using only structural recursion, just by moving the information around. It’s worth noting that because the size information is grouped together instead of spread apart, it would be much easier to make the encoding format more efficient by swapping in a more efficient nat serializer. The only property lost with this encoding is that it is now impossible to reason about any tail of the list in isolation, the concept of a size must also be considered.

Binary Trees

To continue exploring this idea of serializing shape, we need to look at a more complicated data structure such as a binary tree. Our definition of a binary tree is straightforward:

Inductive tree: Type := 
| leaf : tree
| node : A -> tree -> tree -> tree.

Just as with lists, there are two general approaches to serializing trees: interleaved and up-front.

Interleaved Tree Serializer

For the interleaved shape tree serializer, the concept of a “path” is needed. A path is simply the list of directions taken from the root to reach some node. We’ll use true to represent left and false to represent right. These directions are stored with the head at the top of the tree. Below is the path [true, false].

Using the concept of a path, the position and data of any node can be serialized. When this is done for all nodes in the tree, all information captured by the original data structure has been encoded.²

Even though an interleaved structure is impossible to deserialize without general recursion, using an interleaved structure is still possible if there is just enough information up front to recurse on. The number of nodes in the tree provides a nice metric. Our serializer will not be truely interleaved since we require this header, but information about the shape will still be interleaved in the encoding.

The encoding using an interleaved structure looks like this:

Serialization is performed as follows:

Fixpoint tree_size (t : tree A) : nat :=
  match t with
  | leaf => 0
  | node _ l r => 1 + tree_size l + tree_size r
  end.

Fixpoint tree_serialize_subtree_inter 
    (t: tree A) (path: list bool) :=
  match t with
    | leaf => []
    | node a l r => serialize path ++ serialize a
      ++ tree_serialize_subtree_inter l (path ++ [true])
      ++ tree_serialize_subtree_inter r (path ++ [false])
  end.

Definition tree_serialize_inter (t: tree A) : list bool :=
  nat_serialize (tree_size t) ++ 
  tree_serialize_subtree_inter t [].

Deserialization is more complicated. As elements are parsed, they are inserted into the tree structure parsed already. The insertion function used is not particularly robust, however during deserialization as long as any given node is preceded by all of its parents no issues arise. This is the case with a preorder traversal, and other traversals like BFS, so it meets our needs.

Fixpoint tree_insert (into t: tree A)(path: list bool): tree A :=
  match into with
  | leaf => t
  | node a l r =>
      match path with
      | [] => t (* not supported *)
      | true :: path => node a (tree_insert l t path) r
      | false :: path => node a l (tree_insert r t path)
      end
  end.

Fixpoint tree_deserialize_inter_impl
         (remaining : nat) (root : tree A) (bools : list bool)
         : option (tree A * list bool) :=
  match remaining with
  | S n =>
    match deserialize bools with
    | None => None
    | Some (path, bools) =>
      match deserialize bools with
      | None => None
      | Some (a, bools) =>
        tree_deserialize_inter_impl
          n
          (tree_insert root (node a leaf leaf) path)
          bools
      end
    end
  | O => Some (root, bools)
  end.

Definition tree_deserialize_inter bools :=
  match nat_deserialize bools with 
  | Some (size, bools) => 
		tree_deserialize_inter_impl size leaf bools
  | None => None
  end.

Because of this concept of a path, which is a global address of any particular node, reasoning about a tree becomes much more difficult. In particular, we must now prove that every insertion is made on a leaf of the tree so it does not overwrite data or fall off the end.

Fixpoint leaf_insertable (into: tree A)(path: list bool): Prop :=
  match into with
  | leaf => 
    (* Only if the path and tree run out at the same time
       should we be able to insert *)
      match path with
      | [] => True
      | _ => False
      end
  | node a l r =>
      match path with
      | [] => False
      | true :: path => (leaf_insertable l path)
      | false :: path => (leaf_insertable r path)
      end
  end.

The proof for this serializer is quite large (about 150 lines) and uninteresting, so it has been omitted. It can be found here.

Up-front Tree Serializer

Alternatively, the structure may be recorded at the beginning and then filled in as the tree is parsed. To do this, a tree’s shape can be reasoned about as the type tree unit, and it’s elements as the type list A.

This technique requires serialization and deserialization to be a two step process, which has the advantage of better mapping to the information stored in the tree (shape and element data), but the disadvantage of being more complicated.

The shape is encoded similarly to HTML with three symbols:

[true; true]: The beginning of a node
[true; false]: The end of a node
[false]: A leaf node

Each node requires exactly two subtrees between its start and end marker. Storing the shape as tree unit works because unit contains no information, so tree unit only contains the information that the tree portion of tree A describes, which is the shape. Since the shape is recorded in a preorder traversal, the elements are also encoded in the same order, which makes it easy to marry the two together.

A visual representation of this encoding:

And in code:

Fixpoint tree_serialize_shape (t : tree A) : list bool :=
  match t with
  | leaf => [false]
  | node _ l r => [true; true] ++ tree_serialize_shape l ++
                  tree_serialize_shape r ++ [true; false]
  end.

Fixpoint tree_serialize_data_preorder (t : tree A) : list bool :=
  match t with
  | leaf => [] (* No data contained within leaf nodes *)
  | node a l r => serialize a ++
				  tree_serialize_data_preorder l ++
				  tree_serialize_data_preorder r
  end.

Definition tree_serialize_front (t: tree A) : list bool :=
  tree_serialize_shape t ++ tree_serialize_data_preorder t.

Fixpoint tree_deserialize_shape 
	(bools: list bool) (progress: list (list (tree unit)))
	: option (tree unit * list bool) :=
  match bools with
  | false :: bools => 
    match progress with
    | [] => Some (leaf, bools)
    | level :: progress =>
		tree_deserialize_shape
		  bools
		  ((leaf :: level) :: progress)
    end
  | true :: true :: bools =>
		tree_deserialize_shape bools ([] :: progress)
  | true :: false :: bools =>
    match progress with
    | [] => None (* end without a beginning *)
    | level :: [] => 
      match level with
      | [r; l] => Some (node tt l r, bools)
      | _ => None
      end
    | level :: parent :: progress =>
      match level with
      | [r; l] =>
		tree_deserialize_shape
		  bools
		  ((node tt l r :: parent) :: progress)
      | _ => None
      end
    end
  | _ => None
  end.

Fixpoint tree_deserialize_front_elts
	(shape : tree unit) (bools : list bool) 
	: option (tree A * list bool) :=
  match shape with
  | leaf => Some (leaf, bools)
  | node _ l r =>
    match deserialize bools with
    | None => None
    | Some (a, bools) =>
      match tree_deserialize_front_elts l bools with
      | None => None
      | Some (l, bools) => 
        match tree_deserialize_front_elts r bools with
        | None => None
        | Some (r, bools) => Some (node a l r, bools)
        end
      end
    end
  end.

Definition tree_deserialize_front (bools : list bool)
	: option (tree A * list bool) :=
  match tree_deserialize_shape bools [] with
  | None => None
  | Some (shape, bools) =>
		tree_deserialize_front_elts shape bools
  end.

Because of the more recursive nature of the encoding, reasoning is significantly easier. We can consider any portion of the shape in isolation from all others because there are no ties to any global state.

Again, the proof for this serializer is large (about 70 lines) and uninteresting, so it has been omitted. It can be found here.

Conclusion

It’s worth noting that possible encodings for a given type are restricted by information dependencies within that type. Imagine a list is encoded as follows:

Since the size of the list is at the end, rather than at the beginning, information about how to deserialize the structure isn’t known until its too late. Similarly, the size can’t be put anywhere in the middle (say after the first element), because of the possibility of an empty list. Before deserializing each element, it must be known that it actually is an element of the list, and not some other data coming after the list.

This is why the interleaved list serializer is able to work. Right before each element is deserialized, we mark that the list continues with the continue bit.

This is also why the tree serializers are able to encode the shape at the front or the end. In both cases, the size is known so deserializing additional elements is justified. The question of how to arrange these elements can be reasoned about independently of the elements themselves, therefore the shape of the tree can be encoded without regard to where the element data is located.

One might expect to be able to speculatively parse elements of the bitstream and stop when an invalid element is reached. But this requires that we don’t accidentally interpret whatever came after in the bit stream as an element. If the encoding of different types are guaranteed to not overlap, then this would be possible. But in our model, serializers can choose arbitrary encodings, so this is not possible.

Beyond practical necessity, serialization can be used as a forcing function to understand the information contained within data structures. By requiring a well defined format, the information contained in that structure may be deduced and formalized. For example, a list needs to have a length, and a tree needs to have a shape. From there, the encoding of this information is flexible, although some encodings are easier to work with than others.

A linked list of booleans is not computationally efficient, and could be replaced with another more sensible structure such as a stream of bytes. ↩
It’s worth noting that this representation could be made more efficient by recording locations relative to the previous node instead of absolute ones. However, this fact does not significantly change how hard it is to reason about the tree. Recording relative locations would allow us to reason about subtrees instead of parts of some tree, but we still must reason about insertions. ↩