
# Chunks

The RTA API streams data to the client in Chunks.

Each chunk represents one[^1] Channel, and covers a contiguous span of records.

A chunk is the minimum unit of data returned by a request.
Given the Constraints listed below, clients must expect that a response will over-fetch beyond requested time-bounds.

## Column-Oriented Data

Chunks are column-oriented: each chunk carries data for a single channel, rather than a row of values per timestamp.

Some databases and file formats work this way natively, such as Apache Parquet.
Many stores are row-oriented — including most relational databases.

Example

Imagine a database table with a time column and three data columns:

| Time     | alpha | beta | gamma |
|----------|-------|------|-------|
| 12:10:05 | A01   | B01  | C01   |
| 12:10:15 | A02   | B02  | C02   |
| 12:10:25 | A03   | B03  | C03   |
| 12:10:35 | A04   | B04  | C04   |
| 12:10:45 | A05   | B05  | C05   |
| 12:10:55 | A06   | B06  | C06   |
| 12:11:05 | A07   | B07  | C07   |
| 12:11:15 | A08   | B08  | C08   |
| 12:11:25 | A09   | B09  | C09   |
| 12:11:35 | A10   | B10  | C10   |
| 12:11:45 | A11   | B11  | C11   |
| 12:11:55 | A12   | B12  | C12   |
| 12:12:05 | A13   | B13  | C13   |
| 12:12:15 | A14   | B14  | C14   |

If we started collating records together into Periodic Data, a single result might look like this:

| Channel | Start Time | Samples | Result Buffer |
|---------|------------|---------|---------------|
| alpha   | 12:10:05   | 3       | [A01 A02 A03] |

A chunk is based on list types (e.g. PeriodicDataList), so each chunk contains multiple results.

If each chunk covered one minute, requesting the alpha channel could look like this:

| Channel | Start Time | End Time | Results                     |
|---------|------------|----------|-----------------------------|
| alpha   | 12:10:05   | 12:10:55 | [A01 A02 A03] [A04 A05 A06] |
| alpha   | 12:11:05   | 12:11:55 | [A07 A08 A09] [A10 A11 A12] |
| alpha   | 12:12:05   | 12:12:15 | [A13 A14]                   |

To turn most row-oriented data into chunks:

  1. Collate samples into PeriodicData or TimestampedData results;
  2. Collate results into lists (e.g. PeriodicDataList) and encode them to chunks as shown below.
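The two collation steps can be sketched in plain Python (an illustration of the grouping logic only, not the NuGet API), using the alpha channel from the table above with three samples per result and two results per chunk:

```python
def collate(samples, samples_per_result=3, results_per_chunk=2):
    """Group one channel's time-ordered (time, value) samples into results,
    then group results into chunks with data-derived start/end times."""
    # Step 1: collate samples into fixed-size results.
    results = [samples[i:i + samples_per_result]
               for i in range(0, len(samples), samples_per_result)]
    # Step 2: collate results into chunks; chunk bounds come from the data.
    chunks = []
    for i in range(0, len(results), results_per_chunk):
        group = results[i:i + results_per_chunk]
        start = group[0][0][0]    # time of the first sample in the chunk
        end = group[-1][-1][0]    # time of the last sample in the chunk
        chunks.append((start, end, [[v for _, v in r] for r in group]))
    return chunks

times = ["12:10:05", "12:10:15", "12:10:25", "12:10:35", "12:10:45",
         "12:10:55", "12:11:05", "12:11:15", "12:11:25", "12:11:35",
         "12:11:45", "12:11:55", "12:12:05", "12:12:15"]
rows = [(t, f"A{n:02d}") for n, t in enumerate(times, start=1)]

chunks = collate(rows)
# chunks[0] == ("12:10:05", "12:10:55",
#               [["A01", "A02", "A03"], ["A04", "A05", "A06"]])
```

This reproduces the three one-minute chunks shown in the table above.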

## Using the API

The MAT.OCS.RTA.Model NuGet Package has a Chunk class that holds the data in serialized and compressed form.

For example:

```csharp
var data = new PeriodicDataList
{
    PeriodicData =
    {
        // ... add results here
    }
};

var chunkData = ChunkData.EncodePooled(ChunkDataMemoryPool.Shared, data, new[] {channelId});
var chunk = new Chunk(dataStartTime, dataEndTime, chunkData);
```

This applies fast LZ4 compression by default and uses a Memory Pool for low-overhead buffer reuse.

Compression

Protobuf is already a compact binary serialization.

Some parts of the data schema — such as TimestampedData — use delta-compression so that protobuf variable-length encoding produces a more compact output with more repeating byte sequences. These repeating sequences are then further compressed with LZ4, which is a very fast algorithm that typically achieves 2:1 compression on this data with no noticeable loss of throughput.
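The effect of delta-compression can be demonstrated in a few lines of Python. This is a sketch of the principle only: zlib stands in for LZ4, and fixed-width packing stands in for protobuf varints.

```python
import struct
import zlib

# Regular 10 ms timestamps in nanoseconds.
timestamps = [1_700_000_000_000_000_000 + i * 10_000_000 for i in range(1000)]

# Absolute encoding: every timestamp is a distinct 8-byte value.
raw = b"".join(struct.pack("<q", t) for t in timestamps)

# Delta encoding: after the first value, every delta is the same small
# number, so the buffer is a highly repetitive byte sequence.
deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
delta = b"".join(struct.pack("<q", d) for d in deltas)

print(len(zlib.compress(raw)), len(zlib.compress(delta)))
# The delta-encoded buffer compresses far more effectively.
```

The same repetition is what makes LZ4 effective on the delta-compressed protobuf output.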

Make sure you respect the Constraints listed below.

The MAT.OCS.RTA.Services.AspNetCore NuGet Package provides formatters to send a ChunkedResult (wrapping a Chunk stream) back to the client as the application/vnd.mat.protobuf+chunked wire-format.

## Constraints

This section describes some important constraints, using RFC 2119 language (MUST, SHOULD, etc).

These are important if you are implementing a Data Service.

### Size Limits

To ensure performance and reliability in all components:

  • Each result (e.g. PeriodicData) SHOULD contain up to 128 samples and MUST NOT exceed 10,000 samples.
  • Each list SHOULD serialize to no more than 64 KiB before compression and MUST NOT exceed 4 MiB.
  • Each list SHOULD cover 1-100 seconds and MUST NOT exceed 1000 seconds.
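These limits could be enforced with a simple validator when producing lists. The following Python sketch is illustrative (the constant and function names are hypothetical, not part of the RTA packages): MUST NOT violations fail, SHOULD violations only warn.

```python
# Hypothetical limits, taken from the constraints above.
SAMPLES_SHOULD, SAMPLES_MUST = 128, 10_000
BYTES_SHOULD, BYTES_MUST = 64 * 1024, 4 * 1024 * 1024
SECONDS_SHOULD, SECONDS_MUST = 100, 1000

def check_list(sample_counts, serialized_bytes, duration_seconds):
    """Return (ok, warnings) for one list and its results."""
    warnings = []
    # MUST NOT violations: reject outright.
    if (max(sample_counts) > SAMPLES_MUST
            or serialized_bytes > BYTES_MUST
            or duration_seconds > SECONDS_MUST):
        return False, warnings
    # SHOULD violations: accept, but flag for attention.
    if max(sample_counts) > SAMPLES_SHOULD:
        warnings.append("result exceeds 128 samples")
    if serialized_bytes > BYTES_SHOULD:
        warnings.append("list exceeds 64 KiB before compression")
    if not 1 <= duration_seconds <= SECONDS_SHOULD:
        warnings.append("list duration outside 1-100 seconds")
    return True, warnings

assert check_list([128], 1024, 60) == (True, [])
assert check_list([20_000], 1024, 60)[0] is False
```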

### Time Ordering

Data MUST be returned in time order:

  • Samples MUST be ordered within results
  • Results MUST be ordered within chunks
  • Chunks MUST be ordered within channels

The ordering of channels with respect to each other does not matter.
For example, if returning data for alpha and beta in the example above, all of these orderings are valid:

| Channel | Start Time | End Time | Results                     |
|---------|------------|----------|-----------------------------|
| alpha   | 12:10:05   | 12:10:55 | [A01 A02 A03] [A04 A05 A06] |
| alpha   | 12:11:05   | 12:11:55 | [A07 A08 A09] [A10 A11 A12] |
| alpha   | 12:12:05   | 12:12:15 | [A13 A14]                   |
| beta    | 12:10:05   | 12:10:55 | [B01 B02 B03] [B04 B05 B06] |
| beta    | 12:11:05   | 12:11:55 | [B07 B08 B09] [B10 B11 B12] |
| beta    | 12:12:05   | 12:12:15 | [B13 B14]                   |

| Channel | Start Time | End Time | Results                     |
|---------|------------|----------|-----------------------------|
| alpha   | 12:10:05   | 12:10:55 | [A01 A02 A03] [A04 A05 A06] |
| beta    | 12:10:05   | 12:10:55 | [B01 B02 B03] [B04 B05 B06] |
| alpha   | 12:11:05   | 12:11:55 | [A07 A08 A09] [A10 A11 A12] |
| beta    | 12:11:05   | 12:11:55 | [B07 B08 B09] [B10 B11 B12] |
| alpha   | 12:12:05   | 12:12:15 | [A13 A14]                   |
| beta    | 12:12:05   | 12:12:15 | [B13 B14]                   |

| Channel | Start Time | End Time | Results                     |
|---------|------------|----------|-----------------------------|
| beta    | 12:10:05   | 12:10:55 | [B01 B02 B03] [B04 B05 B06] |
| beta    | 12:11:05   | 12:11:55 | [B07 B08 B09] [B10 B11 B12] |
| alpha   | 12:10:05   | 12:10:55 | [A01 A02 A03] [A04 A05 A06] |
| beta    | 12:12:05   | 12:12:15 | [B13 B14]                   |
| alpha   | 12:11:05   | 12:11:55 | [A07 A08 A09] [A10 A11 A12] |
| alpha   | 12:12:05   | 12:12:15 | [A13 A14]                   |

This constraint enables a client to resume an interrupted download, since it can track progress through a request.
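A resuming client only needs to remember the end time of the last chunk it received per channel. A minimal Python sketch of that bookkeeping (illustrative, not a client API):

```python
def resume_point(received_chunks):
    """Given (channel, start, end) chunks received so far, in stream order,
    return the time from which to re-request each channel."""
    last_end = {}
    for channel, start, end in received_chunks:
        # Time Ordering guarantees chunks arrive in order within a channel,
        # so the latest end time seen per channel is a safe resume point.
        last_end[channel] = end
    return last_end

progress = resume_point([
    ("alpha", "12:10:05", "12:10:55"),
    ("beta", "12:10:05", "12:10:55"),
    ("alpha", "12:11:05", "12:11:55"),
])
# progress == {"alpha": "12:11:55", "beta": "12:10:55"}
```

Note that this works regardless of how channels are interleaved, because ordering is only guaranteed within each channel.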

### No Overlaps

When a session has reached a closed state:

  • Each result (e.g. PeriodicData) MUST NOT overlap any other result.
  • Each chunk MUST NOT overlap any other chunk.

These constraints do not apply while a session is still open, since it may be difficult to maintain consistency between the REST API and Live Streaming Data, and new data may force re-chunking to maintain the Size Limits constraint.
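For a closed session, the no-overlap rule is straightforward to verify once chunks are sorted by start time. A hedged Python sketch (assuming inclusive chunk bounds, so a shared boundary time counts as an overlap):

```python
def has_overlaps(chunks):
    """chunks: list of (start, end) for one channel, sorted by start time.
    Returns True if any chunk overlaps the next one (inclusive bounds)."""
    return any(a_end >= b_start
               for (_, a_end), (b_start, _) in zip(chunks, chunks[1:]))

closed = [("12:10:05", "12:10:55"),
          ("12:11:05", "12:11:55"),
          ("12:12:05", "12:12:15")]
assert not has_overlaps(closed)

# Two chunks sharing the sample at 12:11:05 would violate the constraint.
assert has_overlaps([("12:10:05", "12:11:05"), ("12:11:05", "12:11:55")])
```

The fixed-width HH:MM:SS strings compare correctly lexicographically; a real implementation would compare numeric timestamps.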

### Stable Chunks

Chunks MUST have stable boundaries once the session has closed.

This means that if a sample is included in a request, it always appears in the same chunk, with the same start and end time.

Using the example above:

| Channel | Start Time | End Time | Results                     |
|---------|------------|----------|-----------------------------|
| alpha   | 12:10:05   | 12:10:55 | [A01 A02 A03] [A04 A05 A06] |
| alpha   | 12:11:05   | 12:11:55 | [A07 A08 A09] [A10 A11 A12] |
| alpha   | 12:12:05   | 12:12:15 | [A13 A14]                   |

If the request bounds are start: 12:10:10.00, end: 12:11:05.00, the response is:

| Channel | Start Time | End Time | Results                     |
|---------|------------|----------|-----------------------------|
| alpha   | 12:10:05   | 12:10:55 | [A01 A02 A03] [A04 A05 A06] |
| alpha   | 12:11:05   | 12:11:55 | [A07 A08 A09] [A10 A11 A12] |

This response includes samples outside the request bounds:

  • A01 (12:10:05) is included because its chunk starts at 12:10:05, as established by an unbounded request
  • A08-A12 (12:11:15 - 12:11:55) are included because A07 falls inside the (inclusive) request bounds, so the whole chunk is returned
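In other words, a chunk is returned whole whenever its data span intersects the inclusive request bounds. A Python sketch of that selection rule (illustrative, not the service implementation):

```python
def select_chunks(chunks, req_start, req_end):
    """Return every (start, end, results) chunk whose data span intersects
    the inclusive request bounds; chunks are returned whole, never trimmed."""
    return [(s, e, r) for (s, e, r) in chunks
            if e >= req_start and s <= req_end]

chunks = [
    ("12:10:05", "12:10:55", "[A01 A02 A03] [A04 A05 A06]"),
    ("12:11:05", "12:11:55", "[A07 A08 A09] [A10 A11 A12]"),
    ("12:12:05", "12:12:15", "[A13 A14]"),
]

# A request from 12:10:10 to 12:11:05 intersects the first two chunks only.
selected = select_chunks(chunks, "12:10:10", "12:11:05")
```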

There are two main approaches to implement this:

time-based chunking:

Round the request time-bounds (if any) outwards — generally just by dropping precision.
The MAT.OCS.RTA.Services NuGet Package provides a ChunkTime utility to help with this.

This is easy to implement but requires some knowledge of the expected data rate so the chunks are neither too small nor exceed the Size Limits. Data rates could be communicated by convention, configuration or the Schema Mapping Service.
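The ChunkTime API itself is not shown here, but outward rounding by dropping precision can be sketched as follows (an assumption-laden illustration, with one-minute chunks and nanosecond timestamps):

```python
def round_bounds_outward(req_start_ns, req_end_ns, chunk_ns=60_000_000_000):
    """Round request bounds outward to chunk boundaries by dropping
    precision: floor the start, ceil the end. chunk_ns = 1 minute here."""
    start = (req_start_ns // chunk_ns) * chunk_ns
    end = ((req_end_ns + chunk_ns - 1) // chunk_ns) * chunk_ns
    return start, end

# A request from 90 s to 130 s expands to the enclosing minutes [60 s, 180 s].
assert round_bounds_outward(90_000_000_000, 130_000_000_000) == \
    (60_000_000_000, 180_000_000_000)
```

The chunk granularity (here one minute) is where knowledge of the expected data rate comes in: it must be chosen so chunks stay within the Size Limits.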

Important

The chunk start and end time must still reflect the data contained inside the chunk, rather than the expanded time bounds.

This is important so that clients can accurately assess what data they have, particularly at the leading edge of a live streaming session.

Consecutive chunks should appear to have gaps between them if placed on a timeline.

storage-based chunking:

Split the request based on storage pages: for example, pages within a memory-mapped file.

This can be more flexible when data rates vary significantly, but typically needs the data to be stored by time so the Time Ordering constraint is not broken.

Info

This constraint is annoying to fulfil, and may be dropped in future if it can be shown that it does not impact caching.

For now, assume that this requirement is mandatory.


[^1]: Except for Row Data, which represents a set of channels encoded into a packed buffer.