Working with .metadata files
============================

*Author: Adarsh Pyarelal*

.. toctree::
   :maxdepth: 1
   :caption: Contents:

In this writeup, we will describe the ``.metadata`` files produced by
the TA3 testbed and how to work with them. A ``.metadata`` file
contains all the messages that were published on the MQTT message bus
during a particular trial (if you are not familiar with the concept of
a message bus, see `here`_ for a brief introduction).

JSON key reference convention
-----------------------------

In the rest of this writeup, we use the following convention to refer
to values in JSON objects. Consider the following example JSON object:

.. code:: json

    {
        "key1": "value1",
        "key2": {
            "key3": "value2"
        }
    }

In the above, the value at ``.key1`` is ``value1``. We extend this
pattern for nested objects, so the value at ``.key2.key3`` is
``value2``.

``.metadata`` file contents and format
--------------------------------------

The first line of a ``.metadata`` file contains metadata about the
trial itself, and is generated by the TA3-developed mechanism that
exports the data from the Elasticsearch database in which the messages
are stored. This metadata is also used by the TA3 application that can
be used to import ``.metadata`` files into an Elasticsearch database.
For now, ToMCAT team members can ignore this line.

The subsequent lines of the file are the messages published on the bus
during the trial. Each line of the file corresponds to a single
message. The messages themselves are in JSON format, with the
following objects: ``.header``, ``.msg``, and ``.data``. We describe
them and their subcomponents below.

- ``.header``: A common header for all messages on the testbed.

  - ``.header.timestamp``: Timestamp of when the message was published.
  - ``.header.message_type``: The type of the message.
  - ``.header.version``: A version number that tells downstream
    applications how to parse the ``.msg`` portion of the message.
- ``.msg``: This component contains metadata about the experiment,
  trial, and the message itself.

  - ``.msg.experiment_id``: The experiment ID.
  - ``.msg.trial_id``: The trial ID.
  - ``.msg.timestamp``: Timestamp of when the message was generated
    (this is usually the same or almost the same as
    ``.header.timestamp``).
  - ``.msg.source``: The ID of the testbed component that published
    the message.
  - ``.msg.sub_type``: The sub-type of the message (for components
    that publish more than one type of message).
  - ``.msg.version``: A version number that tells downstream
    applications how to parse the ``.data`` portion of the message.

- ``.data``: This component contains the actual data published by the
  testbed component. The contents of this object depend on the
  functionality of the testbed component in question.

In addition to these objects, the export process for the ``.metadata``
files adds two keys that were not part of the original message:

- ``.topic``: The topic that the message was originally published on.
- ``.@timestamp``: The timestamp marking when the message was ingested
  by the Elastic stack.

Quick exploration with ``jq``
-----------------------------

For quick exploration of ``.metadata`` files, you can use jq_, which
is kind of like a JSON-aware ``grep``. If you are working on the
``kraken`` compute VM, ``jq`` is already installed. If not, you can
install it using a package manager (e.g., ``apt``, MacPorts, etc.).
The output of ``jq`` can be piped into further invocations of ``jq``
or into other command-line tools in order to compose pipelines.

Here are some potentially useful ``jq`` recipes for working with
``.metadata`` files.

**Pretty-printing**: Pretty-print all the messages in a ``.metadata``
file on a particular topic:

.. code::

    jq 'select(.topic=="topic_name")' < input.metadata

**Selecting fields**: Print all the ``.header.timestamp`` values for
messages on a particular topic:
.. code::

    jq 'select(.topic=="topic_name")' < input.metadata | jq '.header.timestamp'

**Filtering**: Create a new ``.metadata`` file containing only the
messages on a certain topic (the ``-c`` flag below disables
pretty-printing):

.. code::

    jq -c 'select(.topic=="topic_name")' < input.metadata > output.metadata

**Unix tool composition**: Count the number of messages on a given
topic:

.. code::

    jq -c 'select(.topic=="topic_name")' < input.metadata | wc -l

Offline analyses with .metadata files
-------------------------------------

While you can use ``jq`` for quick exploration at the command line,
for more detailed offline analysis you should write scripts in Python
or programs in C++. We use the term *offline* to refer to an analysis
that does not require a message broker to be running - that is, your
programs operate directly on the files themselves rather than
consuming the messages from a message bus.

Online analyses with .metadata files
------------------------------------

In contrast to offline analyses, you can also run *online* analyses
with ``.metadata`` files using a *replayer* program that reads in
messages from a ``.metadata`` file and re-publishes them to the
message bus. The `elkless_replayer`_ is an example of such a program.
Replays are essential for developing and testing testbed components
that work with the message bus.

Here are some of my tips for working with the ``elkless_replayer``.

- Clone the ``ml4ai/tomcat`` repo in order to be able to pull updates
  to the replayer whenever necessary.
- Add the path to the directory containing the script to your ``PATH``
  environment variable so that you can invoke it from any directory.
- Create a virtual environment in which you can install the necessary
  Python packages (I have one named ``tomcat`` that I use for all my
  ``tomcat`` Python programming tasks). For the ``elkless_replayer``,
  you can install the prerequisites by running the following in your
  virtual environment:
  .. code::

      pip install paho-mqtt tqdm python-dateutil

- To see the help message and the command line options, run:

  .. code::

      elkless_replayer -h

- By default, the ``elkless_replayer`` replays messages in the order
  in which they appear in the ``.metadata`` file. This should work for
  the majority of ``.metadata`` files. However, for some files that
  have been run through the TA3 replay process, the timestamps have
  been adjusted, and the correction process might leave the messages
  in the ``.metadata`` file not sorted correctly by their
  ``.@timestamp`` key. In this scenario (and perhaps in others, if you
  wish), you can have the replayer sort the messages by their
  ``.header.timestamp`` value prior to publishing them.
- The default behavior of the replayer is to publish messages as fast
  as possible. This is good if your testbed component can handle it,
  or if you want to do some quick testing without worrying too much
  about the timing between messages. However, for a replay that is
  more faithful to the original experimental trial, you can use the
  ``-r`` (for 'real-time') flag, which tells the replayer to insert
  delays between publishing messages that approximate the delays
  between messages in the original trial.

.. _here: message_bus
.. _jq: https://stedolan.github.io/jq/
.. _elkless_replayer: https://github.com/ml4ai/tomcat/blob/master/tools/elkless_replayer
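As a closing illustration, here is a minimal offline-analysis sketch
in Python that combines two operations described above: selecting
messages by their ``.topic`` key (the Python counterpart of the ``jq``
``select`` recipes) and sorting them by ``.header.timestamp`` (as the
replayer can do). The message lines and topic name below are made up
for the example, and this is a sketch, not the ``elkless_replayer``'s
actual implementation:

.. code:: python

    import json

    # Illustrative message lines (field values are made up). A real
    # .metadata file has one such JSON object per line, after the
    # initial trial-metadata line.
    lines = [
        '{"header": {"timestamp": "2021-04-01T12:00:02.000Z"}, "topic": "observations/state"}',
        '{"header": {"timestamp": "2021-04-01T12:00:00.000Z"}, "topic": "trial"}',
        '{"header": {"timestamp": "2021-04-01T12:00:01.000Z"}, "topic": "observations/state"}',
    ]

    messages = [json.loads(line) for line in lines]

    # Keep only the messages on one topic, like
    # jq 'select(.topic=="observations/state")'.
    selected = [m for m in messages if m["topic"] == "observations/state"]

    # ISO-8601 timestamps in a uniform format sort correctly as plain
    # strings, so sorting on .header.timestamp restores the order in
    # which the messages were published.
    selected.sort(key=lambda m: m["header"]["timestamp"])

    print([m["header"]["timestamp"] for m in selected])
    # ['2021-04-01T12:00:01.000Z', '2021-04-01T12:00:02.000Z']

The same skeleton extends naturally to real files: replace the
in-memory ``lines`` list with a loop over the lines of an open
``.metadata`` file, skipping the first (trial metadata) line.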