Working with .metadata files

Author: Adarsh Pyarelal

In this writeup, we will describe the .metadata files produced by the TA3 testbed and how to work with them.

A .metadata file contains all the messages that were published on the MQTT message bus (if you are not familiar with the concept of a message bus, see here for a brief introduction) during a particular trial.

JSON key reference convention

In the rest of this writeup, we use the following convention to refer to values in JSON objects. Consider the following example JSON object:

{
    "key1": "value1",
    "key2": {
        "key3": "value2"
    }
}

In the above, the value at .key1 is value1. We extend this pattern for nested objects, so the value at .key2.key3 is value2.
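The same convention can be mirrored in Python when you want to look up a value by its dotted path. The following is a minimal sketch; the helper name `get_path` is hypothetical and is not part of any testbed tooling.

```python
import json
from functools import reduce


def get_path(obj, path):
    """Return the value at a dotted path such as '.key2.key3'.

    Hypothetical helper for illustration. It assumes that no key
    contains a '.' character itself.
    """
    keys = [key for key in path.split(".") if key]
    return reduce(lambda node, key: node[key], keys, obj)


example = json.loads('{"key1": "value1", "key2": {"key3": "value2"}}')

print(get_path(example, ".key1"))       # value1
print(get_path(example, ".key2.key3"))  # value2
```

A missing key raises `KeyError`, which is usually what you want when checking that a message has the structure you expect.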

`.metadata` file contents and format

The first line of a .metadata file contains metadata about the trial itself, and is generated by the TA3-developed mechanism that exports the data from the Elasticsearch database in which the messages are stored. This metadata is also used by the TA3 application that can be used to import .metadata files into an Elastic database. For now, ToMCAT team members can ignore this line.

The subsequent lines of the file are the messages published on the bus during a trial. Each line of the file corresponds to a single message. The messages themselves are in JSON format, with the following objects: .header, .msg, and .data. We describe them and their subcomponents below.

.header: A common header for all messages on the testbed.
- .header.timestamp: Timestamp of when the message was published.
- .header.message_type: The type of message.
- .header.version: A version number that tells downstream applications how to parse the .msg portion of the message.
.msg: This component contains metadata about the experiment, trial, and the message itself.
- .msg.experiment_id: The experiment ID.
- .msg.trial_id: The trial ID.
- .msg.timestamp: Timestamp of when the message was generated (this is usually the same or almost the same as the .header.timestamp).
- .msg.source: The ID of the testbed component that published the message.
- .msg.sub_type: The sub-type of the message (for components that publish more than one type of message).
- .msg.version: A version number that tells downstream applications how to parse the .data portion of the message.
.data: This component contains the actual data published by the testbed component. The contents of this object depend on the functionality of the testbed component in question.

In addition to these objects, the export process for the .metadata files adds two additional keys that were not part of the original message:

.topic: The topic that the message was originally published on.
.@timestamp: The timestamp marking when the message was ingested by the Elastic stack.
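To make the layout concrete, here is a minimal Python sketch that reads messages line by line, skips the first (trial metadata) line, and pulls out a few of the fields described above. The function names and the sample values are made up for illustration; real messages will carry different topics, types, and payloads.

```python
import json


def read_messages(lines):
    """Yield one dict per message, skipping the first (trial metadata) line."""
    for index, line in enumerate(lines):
        line = line.strip()
        if index == 0 or not line:
            continue
        yield json.loads(line)


def describe(message):
    """Collect a few commonly inspected fields from a single message."""
    header = message.get("header", {})
    msg = message.get("msg", {})
    return {
        "topic": message.get("topic"),
        "message_type": header.get("message_type"),
        "sub_type": msg.get("sub_type"),
        "source": msg.get("source"),
        "ingested_at": message.get("@timestamp"),
    }


# Made-up sample lines. The first one stands in for the trial metadata line.
sample_lines = [
    '{"placeholder": "trial metadata line"}',
    json.dumps({
        "header": {"timestamp": "2021-01-01T00:00:00.000Z",
                   "message_type": "example_type", "version": "0.1"},
        "msg": {"experiment_id": "exp", "trial_id": "trial",
                "timestamp": "2021-01-01T00:00:00.000Z",
                "source": "example_component", "sub_type": "example_sub_type",
                "version": "0.1"},
        "data": {},
        "topic": "topic_name",
        "@timestamp": "2021-01-01T00:00:00.100Z",
    }),
]

for message in read_messages(sample_lines):
    print(describe(message))
```

Because `read_messages` accepts any iterable of lines, you can pass it an open file object (for example, `open("input.metadata")`) instead of the sample list.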

Quick exploration with `jq`

For quick exploration of .metadata files, you can use jq, which is kind of like a JSON-aware grep. If you are working on the kraken compute VM, jq is already installed on it. If not, you can install it using a package manager (e.g., apt, MacPorts, etc.).

The output of jq can be piped into further invocations of jq or to other command-line tools in order to compose pipelines. Here are some potentially useful jq recipes for working with .metadata files.

Pretty-printing: Pretty-print all the messages in a .metadata file on a particular topic:

jq 'select(.topic=="topic_name")' < input.metadata

Selecting fields: Print all the .header.timestamp values for messages on a particular topic:

jq 'select(.topic=="topic_name")' < input.metadata | jq '.header.timestamp'

Filtering: Create a new .metadata file containing only the messages on a certain topic (the -c flag below disables pretty-printing):

jq -c 'select(.topic=="topic_name")' < input.metadata > output.metadata

Unix tool composition: Count the number of messages on a given topic:

jq -c 'select(.topic=="topic_name")' < input.metadata | wc -l

Offline analyses with .metadata files

While you can use jq for quick exploration at the command line, for more detailed offline analysis you should write scripts in Python or programs in C++. We use the term offline to refer to an analysis that does not require a message broker to be running - that is, your programs operate directly on the files themselves rather than consuming the messages from a message bus.
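As a starting point for an offline analysis, the sketch below counts the number of messages on each topic using only the Python standard library. The function name `count_topics` and the sample topics are made up for illustration.

```python
import json
from collections import Counter
from itertools import islice


def count_topics(lines):
    """Count messages per topic, skipping the first (trial metadata) line."""
    counts = Counter()
    for line in islice(lines, 1, None):
        if line.strip():
            counts[json.loads(line).get("topic")] += 1
    return counts


# Made-up sample lines. The first one stands in for the trial metadata line.
sample_lines = [
    '{"placeholder": "trial metadata line"}',
    '{"topic": "topic_a", "data": {}}',
    '{"topic": "topic_b", "data": {}}',
    '{"topic": "topic_a", "data": {}}',
]

for topic, count in count_topics(sample_lines).most_common():
    print(topic, count)
```

To run this on a real file, pass an open file object instead of the sample list, for example `with open("input.metadata") as f: counts = count_topics(f)`. Reading line by line keeps memory use low for large trials.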

Online analyses with .metadata files

In contrast to offline analysis, you can also run online analyses with .metadata files using a replayer program that reads in messages from a .metadata file and re-publishes them to the message bus. The elkless_replayer is an example of such a program. Replays are essential for developing and testing testbed components that work with the message bus.

Here are some of my tips for working with the elkless_replayer.

Clone the ml4ai/tomcat repo in order to be able to pull updates to the replayer whenever necessary.
Add the path to the directory containing the script to your PATH environment variable so that you can invoke it from any directory.
Create a virtual environment in which you can install necessary Python packages (I have one named tomcat that I use for all my tomcat Python programming tasks). For the elkless_replayer you can install the prerequisites by running the following in your virtual environment:
```
pip install paho-mqtt tqdm python-dateutil
```
To see the help message and the command line options, run:
```
elkless_replayer -h
```
By default, the elkless_replayer replays messages in the order in which they appear in the .metadata file. This should work for the majority of .metadata files. However, for some files that have been run through the TA3 replay process, the timestamps have been adjusted, and the correction process can leave the messages in the .metadata file out of order with respect to their .@timestamp key. In this scenario (and perhaps in others if you wish), you can have the replayer sort the messages by their .header.timestamp value prior to publishing them.
The default behavior of the replayer is to publish messages as fast as possible. This is good if your testbed component can handle it, or if you want to do some quick testing without worrying too much about the timing between messages. However, for a replay that is more faithful to the original experimental trial, you can use the -r (for ‘real-time’) flag, which tells the replayer to insert delays between publishing messages that approximate the delays between messages in the original trial.
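To illustrate the last two tips, here is a minimal sketch of sorting messages by .header.timestamp and optionally sleeping between them to approximate the original timing. This is not the elkless_replayer's actual implementation. All function names are hypothetical, the timestamps are assumed to be ISO 8601 strings, and `publish` is a placeholder callback that in practice could wrap an MQTT client's publish call.

```python
import time
from datetime import datetime


def parse_timestamp(text):
    """Parse a timestamp, assuming ISO 8601 with an optional trailing 'Z'."""
    # Replacing 'Z' keeps this working on Python versions before 3.11.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def sorted_by_header_timestamp(messages):
    """Return the messages ordered by their .header.timestamp value."""
    return sorted(messages, key=lambda m: parse_timestamp(m["header"]["timestamp"]))


def delays(messages):
    """Seconds to wait before each message; the first message has no delay."""
    stamps = [parse_timestamp(m["header"]["timestamp"]) for m in messages]
    return [0.0] + [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


def replay(messages, publish, real_time=False, sleep=time.sleep):
    """Publish messages in timestamp order, optionally pausing between them."""
    ordered = sorted_by_header_timestamp(messages)
    for message, delay in zip(ordered, delays(ordered)):
        if real_time and delay > 0:
            sleep(delay)
        publish(message)


# Made-up messages that are deliberately out of order.
sample = [
    {"header": {"timestamp": "2021-01-01T00:00:02.000Z"}, "topic": "c"},
    {"header": {"timestamp": "2021-01-01T00:00:00.000Z"}, "topic": "a"},
    {"header": {"timestamp": "2021-01-01T00:00:00.500Z"}, "topic": "b"},
]

# Print instead of publishing, and run as fast as possible.
replay(sample, publish=lambda m: print(m["topic"]))
```

Passing `sleep` as a parameter makes the timing logic easy to test without actually waiting, since a test can substitute a function that just records the requested delays.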