# Working with `.metadata` files
Author: Adarsh Pyarelal
In this writeup, we will describe the `.metadata` files produced by the TA3 testbed and how to work with them.
A `.metadata` file contains all the messages that were published on the MQTT message bus during a particular trial (if you are not familiar with the concept of a message bus, see here for a brief introduction).
## JSON key reference convention
In the rest of this writeup, we use the following convention to refer to values in JSON objects. Consider the following example JSON object:
```json
{
  "key1": "value1",
  "key2": {
    "key3": "value2"
  }
}
```
In the above, the value at `.key1` is `value1`. We extend this pattern for nested objects, so the value at `.key2.key3` is `value2`.
## `.metadata` file contents and format
The first line of a `.metadata` file contains metadata about the trial itself, and is generated by the TA3-developed mechanism that exports the data from the Elasticsearch database in which the messages are stored. This metadata is also used by the TA3 application that can be used to import `.metadata` files back into an Elasticsearch database. For now, ToMCAT team members can ignore this line.
The subsequent lines of the file are the messages published on the bus during a trial. Each line of the file corresponds to a single message. The messages themselves are in JSON format, with the following objects: `.header`, `.msg`, and `.data`. We describe them and their subcomponents below.
- `.header`: A common header for all messages on the testbed.
  - `.header.timestamp`: Timestamp of when the message was published.
  - `.header.message_type`: The type of the message.
  - `.header.version`: A version number that tells downstream applications how to parse the `.msg` portion of the message.
- `.msg`: This component contains metadata about the experiment, trial, and the message itself.
  - `.msg.experiment_id`: The experiment ID.
  - `.msg.trial_id`: The trial ID.
  - `.msg.timestamp`: Timestamp of when the message was generated (this is usually the same or almost the same as the `.header.timestamp`).
  - `.msg.source`: The ID of the testbed component that published the message.
  - `.msg.sub_type`: The sub-type of the message (for components that publish more than one type of message).
  - `.msg.version`: A version number that tells downstream applications how to parse the `.data` portion of the message.
- `.data`: This component contains the actual data published by the testbed component. The contents of this object depend on the functionality of the testbed component in question.
In addition to these objects, the export process for the `.metadata` files adds two additional keys that were not part of the original message:

- `.topic`: The topic that the message was originally published on.
- `.@timestamp`: The timestamp marking when the message was ingested by the Elastic stack.
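Putting these pieces together, a single message line in a `.metadata` file has roughly the following shape (all values below are hypothetical placeholders, and the exact contents of `.data` vary by component):

```json
{
  "@timestamp": "2021-06-01T17:01:32.000Z",
  "topic": "some/topic_name",
  "header": {
    "timestamp": "2021-06-01T17:01:31.512Z",
    "message_type": "event",
    "version": "1.1"
  },
  "msg": {
    "experiment_id": "hypothetical-experiment-id",
    "trial_id": "hypothetical-trial-id",
    "timestamp": "2021-06-01T17:01:31.512Z",
    "source": "some_component",
    "sub_type": "some_sub_type",
    "version": "1.0"
  },
  "data": {}
}
```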
## Quick exploration with `jq`
For quick exploration of `.metadata` files, you can use jq, which is kind of like a JSON-aware `grep`. If you are working on the `kraken` compute VM, `jq` is already installed. If not, you can install it using a package manager (e.g., `apt`, MacPorts, etc.).
The output of `jq` can be piped into further invocations of `jq` or into other command-line tools in order to compose pipelines. Here are some potentially useful `jq` recipes for working with `.metadata` files.
**Pretty-printing:** Pretty-print all the messages in a `.metadata` file on a particular topic:

```sh
jq 'select(.topic=="topic_name")' < input.metadata
```
**Selecting fields:** Print all the `.header.timestamp` values for messages on a particular topic:

```sh
jq 'select(.topic=="topic_name")' < input.metadata | jq '.header.timestamp'
```
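(Equivalently, you can compose the two filters with a pipe inside a single `jq` program: `jq 'select(.topic=="topic_name") | .header.timestamp' < input.metadata`.)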
**Filtering:** Create a new `.metadata` file containing only the messages on a certain topic (the `-c` flag below disables pretty-printing):

```sh
jq -c 'select(.topic=="topic_name")' < input.metadata > output.metadata
```
**Unix tool composition:** Count the number of messages on a given topic:

```sh
jq -c 'select(.topic=="topic_name")' < input.metadata | wc -l
```
## Offline analyses with `.metadata` files
While you can use `jq` for quick exploration at the command line, for more detailed offline analysis you should write scripts in Python or programs in C++. We use the term *offline* to refer to an analysis that does not require a message broker to be running; that is, your programs operate directly on the files themselves rather than consuming the messages from a message bus.
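As a concrete starting point, here is a minimal Python sketch of such an offline analysis: it counts the messages on each topic in a `.metadata` file (the filename is a placeholder, and the analysis itself is just an example):

```python
import json
from collections import Counter

def count_topics(filepath):
    """Count the number of messages on each topic in a .metadata file."""
    counts = Counter()
    with open(filepath) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            message = json.loads(line)
            # Messages from the bus carry a .topic key added by the export
            # process; the first line (trial metadata) may not, so we guard
            # against its absence.
            topic = message.get("topic")
            if topic is not None:
                counts[topic] += 1
    return counts

if __name__ == "__main__":
    for topic, n in count_topics("input.metadata").most_common():
        print(f"{n}\t{topic}")
```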
## Online analyses with `.metadata` files
In contrast to offline analyses, you can also run online analyses with `.metadata` files using a replayer program that reads in messages from a `.metadata` file and re-publishes them to the message bus. The elkless_replayer is an example of such a program. Replays are essential for developing and testing testbed components that work with the message bus.
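To make the idea concrete, here is a minimal sketch of the core of a replayer, written against the paho-mqtt 1.x client API (the broker location and input filename are assumptions, and the actual `elkless_replayer` does considerably more):

```python
import json

import paho.mqtt.client as mqtt

# Assumed broker location; adjust for your setup.
client = mqtt.Client()
client.connect("localhost", 1883)

with open("input.metadata") as f:
    next(f)  # Skip the first line, which holds trial-level export metadata.
    for line in f:
        message = json.loads(line)
        # Strip the two keys added by the export process so that the
        # republished message matches the original.
        payload = {k: v for k, v in message.items()
                   if k not in ("topic", "@timestamp")}
        client.publish(message["topic"], json.dumps(payload))

client.disconnect()
```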
Here are some of my tips for working with the `elkless_replayer`:
- Clone the `ml4ai/tomcat` repo in order to be able to pull updates to the replayer whenever necessary.
- Add the path to the directory containing the script to your `PATH` environment variable so that you can invoke it from any directory.
- Create a virtual environment in which you can install the necessary Python packages (I have one named `tomcat` that I use for all my `tomcat` Python programming tasks). For the `elkless_replayer`, you can install the prerequisites by running the following in your virtual environment:

  ```sh
  pip install paho-mqtt tqdm python-dateutil
  ```
- To see the help message and the command line options, run:

  ```sh
  elkless_replayer -h
  ```
- By default, the `elkless_replayer` replays messages in the order they appear in the `.metadata` file. This should work for the majority of `.metadata` files. However, for some files that have been run through the TA3 replay process, the timestamps have been adjusted, and the correction process might leave the messages in the `.metadata` file incorrectly sorted by their `@timestamp` key. In this scenario (and perhaps in others, if you wish), you can have the replayer sort the messages by their `.header.timestamp` value prior to publishing them.
- The default behavior of the replayer is to publish messages as fast as possible. This is good if your testbed component can handle it, or if you want to do some quick testing without worrying too much about the timing between messages. However, for a replay that is more faithful to the original experimental trial, you can use the `-r` (for 'real-time') flag, which tells the replayer to insert delays between publishing messages that approximate the delays between messages in the original trial.
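If you ever need the same behavior in your own scripts, here is a rough sketch of both techniques, sorting by `.header.timestamp` and approximating the original inter-message delays, using python-dateutil for timestamp parsing (this illustrates the idea and is not the `elkless_replayer`'s actual code):

```python
import json
import time

from dateutil import parser as dateparser  # provided by python-dateutil

with open("input.metadata") as f:
    next(f)  # Skip the trial-level export metadata line.
    messages = [json.loads(line) for line in f if line.strip()]

# Sort by the .header.timestamp value rather than by file order.
messages.sort(key=lambda m: dateparser.isoparse(m["header"]["timestamp"]))

previous_ts = None
for message in messages:
    ts = dateparser.isoparse(message["header"]["timestamp"])
    if previous_ts is not None:
        # Approximate the delay between messages in the original trial.
        time.sleep(max(0.0, (ts - previous_ts).total_seconds()))
    previous_ts = ts
    print(message["topic"])  # Stand-in for publishing, as in the sketch above.
```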