Exploring Wyrd as a Merkle-like CRDT

It turns out that this article has been sitting mostly finished in my drafts for months now. The idea was to finish exploring how Wyrd benefits from Vessel’s DAG, and what it still needs to do.

#Recap

After this long a wait, here’s a quick recap:

These were relatively theoretical discussions. This article focuses more on the interplay between Wyrd and Vessel in practice, and also offers an effective solution to that last problem.

#(Time) Synchronization

In an offhand remark in the last article, I wrote that using time synchronization was a bad idea, therefore logical clocks are the only way to go. This statement should be explained a little.

  1. The typical use of CRDTs are distributed systems without a central arbitration. But they have been found useful for offline-first use cases, where synchronization only occurs eventually.
  2. Use cases that are different from the Offline First1 are delay- and disruption-tolerant networking (DTN) uses.

It’s the latter that is more of a focus of Interpeer – though in technical practice, the differences have very little to do with the use of CRDTs!

From a CRDT point of view, the reason why synchronizations do not happen constantly is of little importance. In offline-first, the reasons may be arbitrary, and the motivation for the movement is more one of convenience for mobile use cases.

The DTN point of view is that disruptions are to be expected at any given time, because the architectural constraints of the network are such that this is unavoidable. This may be because network connectivity is not constant in a location, or power supply is not, etc.

Where DTN differs significantly from the offline-first movement is that it also takes into account arbitrarily high delay even when communications are possible.

This is because DTN is borne out of the needs of (deep) space communications. To put that into context: a radio round-trip to Mars and back can take up to 48 minutes.

For DTN, then, it’s not just about storing and forwarding operations. The real problem is that round trips suck. The more you can avoid those, the better off you are.

I recently had a chance to present the Interpeer Architecture at the DTN Working Group in the IETF2, making that point. The experience in the group is that this is not something many engineers fully understand.

So while CRDTs solve data synchronization in general, time synchronization in particular does take several round trips using common protocols. This is to avoid clock skew induced by transit delays, for the most part.

It does not make sense to use CRDTs on the one hand, and then require the use of something else, potentially worse.

That is why logical clocks make more sense.

#System Clock Changes

Another, often overlooked aspect to using timestamps in communications protocols is that they do not take into account well if each node’s local clock has been updated. I might record an event at 08:00 am. At the point in time that the clock shows 08:01 am, time synchronization occurs (to use a common case), and corrects the clock to 07:58 am. I then write a second event at 07:59 am according to the updated time.

Based solely on timestamps, the second event now occurred before the first! That can’t be true!

A solution to this are monotonic clocks. These are usually offered by hardware and operating systems as clocks that have a rather arbitrary starting point at boot time, and then monotonically increase. If time synchronization occurs, it will affect the regular system clock, but not the monotonic clock.

Monotonic clocks only make statements about elapsed time. They cannot be used to accurately determine time points, except in relation to each other on the same clock.

Since they usually reset at system boot, they are also only useful for the duration of a running system! One of the most obvious use cases are timeouts. If you want an event to occur 10s from now, you typically do not care about the time stamp of that event – only that it is as close to 10s from now as the system can manage.

#Vessel’s Extents

The other offhand remark that needs addressing is that Wyrd operations are lumped together into Vessel extents, and the order they appear in within the extent determines the overall event order.

The problem with this is that Vessel’s DAG has a much coarser resolution than a Merkle Clock. In Merkle Clock, each operation gets its own entry in the Merkle Tree.

Vessel operates on extents, chunks of data with a size typically measured in kilobytes. That is not an amount of memory one wants to waste when recording the increment of an integer by one – so compacting multiple CRDT operations into one extent is absolutely the way to go! And the logical ordering is still perfectly deterministic!

However, the coarser resolution means that we now have a new issue. Suppose the following happens:

  1. Node A publishes an extent Ex1 with events E1, E2.
  2. Node B receives it, and adds an extent Ex2 with events E3, E4.
  3. In the meantime, node A adds an extent Ex3 with the events E5, E6.
  4. Nodes A and B synchronize.

Vessel’s DAG provides a tie-breaker between ordering Ex2 and Ex3, so on both nodes the extent sequence will be Ex1, Ex2, Ex3, so the event order will be E1, E2, E3, E4, E5, E6. That’s what we want, right?

Well…

Let’s say – and yes, here I look to wall clocks again, to illustrate the problem – let’s say that E2 is recorded at 08:00 am. E3 is recorded at 09:00 am, and E4 is recorded at 10:00 am. So far, so good.

But Ex3 gets authored in parallel to Ex2. Nothing is preventing node B from recording E5 at 09:30 am, and E6 at 09:45 am. In fact, that is even likely if communications are intermittent!

So now we have a logical ordering that is absolute deterministic, but fails to take into account when events actually occur.

Sadly, Merkle CRDTs do a better job at this, because Merkle Clocks record a logical time point for every single event! They are more precise, but they use more overhead for this.


Now this blog post has become quite long already, so I’ll do the mean thing and end on this cliffhanger. But do not fret: not only is there a solution, the solution is not overly complicated, and does not invalidate the use of Vessel’s DAG. It’s just involved enough that it’s easier to digest in its own blog post.

Stay tuned!


  1. Offline First
    https://offlinefirst.org/
    See also: References ↩︎

  2. Delay and Disruption Tolerant Networking (DTN)
    https://ietf.org/wg/dtn/about/
    See also: References ↩︎


Published on November 20, 2024