Grounded Function Network (GrFN) JSON Specification v0.1

Introduction

GrFN, pronounced “Griffin”, is the specification format for the central representation that integrates the extracted Function Network representation of source code (the result of program analysis) and associated extracted comments, links to natural language text (the result of natural language processing), and links to equations (the result of equation extraction).

This document describes the current state of the GrFN-JSON schema, specifying the JSON format that is to be generated by program analysis and consumed by Delphi.

Here we adopt a simplified Backus-Naur Form (BNF)-inspired grammar convention combined with a convention for intuitively defining specific JSON attribute-value lists. The schema definitions and instance GrFN examples are shown in monospaced font, and interspersed with comments/discussion.

Following BNF convention, elements in <...> denote non-terminals, with ::= indicating a definition of how a non-terminal is expanded. Many of the definitions below will specify JSON attribute-value lists; when this is the case, we will decorate the element definition by adding [attrval], as follows::

<element_name>[attrval] ::=

We will then specify the structure of the JSON attribute-value list attributes (quoted strings) and their value types using a mixture of JSON and BNF. In a few places, we note in the comments anticipated extensions that may be needed using the tag 'FUTURE'.

From Source Code to Dynamic System Representation

The challenge of this project is to define a map from the semantics of program (computation) specification (as asserted in source code) to the semantics of a (discretized) dynamic system model. We must take care to define technical terms (an ongoing effort!) and to highlight which concept domain (general computation versus dynamic system model) we are dealing with.

We assume here that the source code is intended to describe the states of a dynamical system and how they evolve over time. The system is decomposed into a set of individual states (represented as random variables), where the values of the states at any given time are functions of the values of zero or more other states at a previous time point. As we are considering the evolution of the system over time, in general, every variable has an index. The functional relationships may be instantaneous (based on the variables indexed at the same point in time) or across time (a function of states of variables at different time indices).

Naming Conventions

Both variables and functions will be uniquely named strings that do not rely on implicit position within the source code to be identified (one reason for requiring unique names is that we are moving from the semantics of program variables as pointers to storage to a representation of variables as denoting the evolving state of a system). An important observation: the same variable name in source code could itself be used in two separate variable declarations, and so would constitute two different variable identities; source code name alone is not sufficient to identify variable identity. For this reason, we must adopt a set of conventions for capturing any such source code context. These conventions will be assumed to be associated with the following <variable_name> and <function_name> definitions. In both cases, the names should also correspond to legal Python variable and function names that could appear in Python code -- so they must begin with a letter or underscore followed by letters, numbers or underscores.
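The legality rule stated above can be checked mechanically. The following is a minimal sketch, not part of the specification: the helper name `is_legal_grfn_name` is hypothetical, and the restriction to ASCII letters and the exclusion of Python keywords are assumptions that go slightly beyond the rule as written.

```python
import keyword
import re

# Pattern taken from the rule stated above: a letter or underscore,
# followed by letters, digits or underscores (ASCII only, an assumption).
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_legal_grfn_name(name: str) -> bool:
    """Return True if `name` could appear as a Python variable or function name."""
    # Keywords such as "lambda" match the pattern but cannot be used as names.
    return bool(_NAME_PATTERN.fullmatch(name)) and not keyword.iskeyword(name)
```

Names such as "ABSORPTION_1" pass this check, while a name beginning with a digit or containing a hyphen does not.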

Variable naming convention

NOTE: For now (as of 2018-07-09), we will NOT be using the <enclosing_context> aspect of the Variable naming convention; the Function naming convention, however, will be used. This means that the examples of variable names below should also be read ignoring the <enclosing_context>:

<variable_name> ::= <string>

A top level source variable named ABSORPTION would then simply have the <source_variable_name> as the string making up the <variable_name>::

"ABSORPTION"

If there are two separate instances of new variable declarations in the same context using the same name, then we’ll add an underscore and number to distinguish them. For example, if ABSORPTION is defined twice, then the first (in order in the source code) is named:

"ABSORPTION_1"

And the second:

"ABSORPTION_2"

Variable Reference

<variable_reference>[attrval] ::= 
    "variable" : <variable_name>
    "index" : <integer>

In addition to capturing source code variable environment context in variable declarations, we also need a mechanism to disambiguate specific instances of the same variable within the same context to accurately capture the logical order of variable value updates. In this case, we consider this as a repeated reference to the same variable. The semantics of repeated reference is captured by the variable "index" attribute of a <variable_reference>. The index integer serves to disambiguate the execution order of variable state references, as determined during program analysis.
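
As a small illustration, a <variable_reference> can be built as a Python dictionary and serialized to JSON. The helper name `make_variable_reference` and the variable RAIN are hypothetical, and the sketch assumes that a lower index denotes an earlier reference; the actual assignment of indices is up to program analysis.

```python
import json


def make_variable_reference(variable: str, index: int) -> dict:
    """Build a <variable_reference> attribute-value list as a Python dict."""
    if not isinstance(index, int) or isinstance(index, bool):
        raise TypeError("index must be an integer")
    return {"variable": variable, "index": index}


# Hypothetical example: RAIN is referenced twice in the same context.
references = [make_variable_reference("RAIN", 1), make_variable_reference("RAIN", 0)]

# Assumption: sorting by index recovers the order of the references.
ordered = sorted(references, key=lambda ref: ref["index"])
print(json.dumps(ordered))
```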

Function naming conventions

Function names, like variable names, are ultimately strings, and will also follow a conventional structure used to capture context information. Also like variable names, they should be legal Python function names that could show up in working Python code (as will be the case when used in Lambda function references; see below). The general string format is::

<enclosing_context>__<function_type>[___<var_affected>]

Similar to variable naming, we need to use the function name string to uniquely identify the function, and as some functions extracted by program analysis will be expressions defined within other functions (loops and conditions), we need to capture the "context" in which the function is defined. The <enclosing_context> represents the source code environment context within which the function is defined. When a function is declared at the "top level" (as will often be the case for container functions), the <enclosing_context> is just the "top level" and so is empty. Assigns, conditions and loop_plates, however, will often be declared within another source code function. In those cases, the <enclosing_context> captures the enclosing function name. For example, for an assignment within the UPDATE_EST function, the <enclosing_context> is "UPDATE_EST".

Next, the <function_type> is the string representing which of the four types the function belongs to (the types are described in more detail, below): "assign", "condition", "container", "loop_plate". In the case of a loop_plate, we will name the specific loop using the generic name "loop", and include a number if there is more than one loop.

Finally, <var_affected> will only be relevant for assign and condition function types, and the name of the variable affected will be added after the <function_type> and 3 underscores. For example, a condition occurring within the function UPDATE_EST and setting the (inferred) boolean variable IF_1 would have the name: "UPDATE_EST__condition___IF_1".

Here are example function names for each function type:

Assign: An assignment of the variable UPDATE_EST__YIELD_EST in the context of function UPDATE_EST::
```
UPDATE_EST__assign___UPDATE_EST__YIELD_EST
```
Condition: A condition within the function UPDATE_EST assigning the (inferred) boolean variable IF_1::
```
UPDATE_EST__condition___IF_1
```
Container: A container function called CROP_YIELD:
```
CROP_YIELD__container
```
Loop_plate:
- A single loop within the function CROP_YIELD::
```
CROP_YIELD__loop
```
- The third of three loops within the function CROP_YIELD::
```
CROP_YIELD__loop_3
```
- A loop nested in the context of another loop in CROP_YIELD::
```
CROP_YIELD__loop_1__loop_2
```
- An assignment within a single loop in CROP_YIELD::
```
CROP_YIELD__loop__assign___CROP_YIELD__RAIN
```

NOTE: There is some redundancy in the above examples between the <enclosing_context> of the name of the function and the <enclosing_context> of the name of the variable; however, we think that both are ultimately needed.
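
The string format above can be composed and taken apart with a few lines of Python. This is a sketch under assumptions: the helper names `build_function_name` and `split_function_name` are hypothetical, and the split only works if the <enclosing_context> and <function_type> never contain three consecutive underscores and the context does not end in an underscore.

```python
def build_function_name(enclosing_context, function_type, var_affected=None):
    """Compose <enclosing_context>__<function_type>[___<var_affected>]."""
    name = f"{enclosing_context}__{function_type}"
    if var_affected is not None:
        # Three underscores separate the function type from the affected variable.
        name += f"___{var_affected}"
    return name


def split_function_name(name):
    """Inverse of build_function_name; returns (context, function_type, var_affected)."""
    # The first run of three underscores marks the start of <var_affected>, if any.
    head, _, var_affected = name.partition("___")
    # The last pair of underscores in the head separates context from type.
    context, _, function_type = head.rpartition("__")
    return context, function_type, var_affected or None
```

For instance, `build_function_name("UPDATE_EST", "condition", "IF_1")` yields "UPDATE_EST__condition___IF_1", and splitting "CROP_YIELD__loop_1__loop_2" treats "CROP_YIELD__loop_1" as the enclosing context of "loop_2".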

Top-level GrFN specification

The top-level structure of the GrFN is the <grfn_spec> and is itself a JSON attribute-value list, with the following schema definition::

<grfn_spec>[attrval] ::=
    "start": <string>
    "name" : <string>
    "dateCreated" : <string>
    "functions" : list of <function_spec>

The "start" attribute holds the name of the entry point of the (Fortran) source code i.e. the PROGRAM module. In the absence of this module, this string will remain empty. The "name" attribute is used to denote the (Fortran) source code that has been analyzed. The "dateCreated" attribute is a string representing the date+time that the current GrFN was generated (to represent versioning).

FUTURE:

We may need to extend "name" value to accommodate multiple source files.
It may also be desirable to add an attribute to represent the program analysis code version used to generate the GrFN (as presumably the program analysis code could evolve and have different properties) -- although "dateCreated" may be sufficient.

A (partial) example instance of a JSON attribute-value list generated following the <grfn_spec>:

{
    "start": "MAIN",
    "name": "crop_yield.py",
    "dateCreated": "20180623",
    "functions": [... function_specs go here...]
}
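
A consumer of GrFN JSON might start by checking the top-level attributes. The following is a minimal sketch, not the actual Delphi loader: the function name `load_grfn` is hypothetical, the checks cover only the four attributes listed in <grfn_spec>, and the example instance uses an empty "functions" list for brevity.

```python
import json

REQUIRED_GRFN_KEYS = {"start", "name", "dateCreated", "functions"}


def load_grfn(text: str) -> dict:
    """Parse GrFN JSON text and check the top-level attributes listed in <grfn_spec>."""
    grfn = json.loads(text)
    missing = REQUIRED_GRFN_KEYS - grfn.keys()
    if missing:
        raise ValueError(f"missing top-level attributes: {sorted(missing)}")
    if not isinstance(grfn["functions"], list):
        raise ValueError('"functions" must be a list')
    return grfn


example = """
{
    "start": "MAIN",
    "name": "crop_yield.py",
    "dateCreated": "20180623",
    "functions": []
}
"""
grfn = load_grfn(example)
```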

Variable specification

<variable_spec>[attrval] ::=
    "name" : <variable_name>
    "domain" : <variable_domain_type>

The purpose of the list of <variable_spec>'s in the <grfn_spec> "variables" attribute value is to list all of the variables defined within the code we are analyzing, and associate each with their domain type. This list should include all variables whose values get updated by computation, and will be derived from variables that are explicitly asserted in source code, such as those used for explicit value assignment or used as loop indices, and other variables that program analysis may introduce (infer) as part of analyzing conditionals.

Variable value domain

<variable_domain_type> ::= <string>

The "domain" attribute of a <variable_spec> specifies what values can be assigned to the variable. To start, we will keep things simple and restrict ourselves to four types that can be specified as strings:

"real" (i.e. a floating-point number)
"integer"
"boolean"
"string"

(The idea of the variable domain is intended to be close to the idea of the "support" of a random variable, although should also correspond to standard data types.)

TODO: Need to extend to accommodate arrays.

FUTURE:

May also need to accommodate other structures (How far can this go? Unions, composite data structures, classes?).
We see augmenting the domain specification to also allow representing whether there are bounds on the values (e.g., positive integers, or real values in (0,10], etc.). When we move to doing this, the value of "domain" will itself become a new JSON attrval type.

Python is a strongly-typed language, but is also a dynamically typed language. However, that's not to say that there is no type specification in Python. Python 3 now provides nascent support for explicit typing via type hints.

TODO: Explore whether/how this gets represented in the AST.

For our purposes in the near term, we do want to capture what type and value-domain information is available; there are two main sources of this information:

Fortran: Does statically specify types. If we also want to capture this in program-analysis-generated code, then there is a question of how to communicate this in the Python source representation; possibly through the new typing mentioned above; possibly as docstrings in program-analysis-generated code.
Docstrings: Possibly types and value ranges can be inferred from what is specified in a docstring.

<variable_spec> examples

Here are three examples of <variable_spec> objects:

Example of a "standard" variable MAX_RAIN within the CROP_YIELD function:

{
    "name": "CROP_YIELD__MAX_RAIN",
    "domain": "real"
}

Example of loop index variable DAY in the context of the second instance of a loop in the function CROP_YIELD:
```
{
    "name": "CROP_YIELD__LOOP_2__DAY",
    "domain": "integer"
}
```
Example of variable introduced (inferred) when analyzing a conditional statement that is within the named function UPDATE_EST:
```
{
    "name": "IF_1",
    "domain": "boolean"
}
```

Note that we do not include the <enclosing_context> of the UPDATE_EST function in this case, as this is an inferred conditional boolean variable (per our naming convention, described above).
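
A <variable_spec> object can be checked against the four domain strings with a short helper. This is a sketch only: the name `check_variable_spec` is hypothetical, and rejecting any attribute other than "name" and "domain" is an assumption that would need to be relaxed once the FUTURE extensions to "domain" are adopted.

```python
VARIABLE_DOMAIN_TYPES = ("real", "integer", "boolean", "string")


def check_variable_spec(spec: dict) -> dict:
    """Return `spec` unchanged if it has exactly a "name" and a known "domain"."""
    if set(spec) != {"name", "domain"}:
        raise ValueError(f"unexpected attributes: {sorted(spec)}")
    if spec["domain"] not in VARIABLE_DOMAIN_TYPES:
        raise ValueError(f"unknown domain: {spec['domain']!r}")
    return spec


# The three example objects from this section.
examples = [
    {"name": "CROP_YIELD__MAX_RAIN", "domain": "real"},
    {"name": "CROP_YIELD__LOOP_2__DAY", "domain": "integer"},
    {"name": "IF_1", "domain": "boolean"},
]
checked = [check_variable_spec(spec) for spec in examples]
```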

Function specification

Next we have the <function_spec>. There are four types of functions; two types can be expressed using the same attributes in their JSON attribute-value list (<function_assign_spec>), while the others (<function_container_spec>, <function_loop_plate>) require different attributes. So this means there are three specializations of the <function_spec>, one of which (<function_assign_spec>) will be used for two function types.:

<function_spec> ::=
    <function_assign_spec>     # either type "assign" or "condition"
| <function_container_spec> # type "container"
| <function_loop_plate>     # type "loop_plate"

All three specs will have a "type" attribute that will unambiguously identify which type of function is being specified. The four possible types are:

"assign"
"condition" (a special case of "assign")
"container"
"loop_plate"

All <function_spec>s will also have a name attribute with a unique string value (across <function_spec>s), as described above under the Function naming convention section; as described in that section, the function name will include the function type, but having the explicit type attribute makes parsing easier.
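
The explicit "type" attribute lets a parser choose the right specialization with a simple lookup. A minimal sketch, assuming each <function_spec> has been loaded as a Python dict; the names `SPEC_KIND_BY_TYPE` and `spec_kind` are hypothetical.

```python
# Maps each of the four "type" values to the grammar rule that describes it.
SPEC_KIND_BY_TYPE = {
    "assign": "function_assign_spec",
    "condition": "function_assign_spec",  # a special case of "assign"
    "container": "function_container_spec",
    "loop_plate": "function_loop_plate",
}


def spec_kind(function_spec: dict) -> str:
    """Return the name of the grammar rule that applies to `function_spec`."""
    try:
        return SPEC_KIND_BY_TYPE[function_spec["type"]]
    except KeyError as err:
        # Raised both when "type" is absent and when its value is not recognized.
        raise ValueError(f"unknown or missing function type: {err}") from None
```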

Function Assign Specification

A <function_assign_spec> denotes the setting of the value of a variable. The value is assigned to the "target" variable (denoted by a <variable_reference> or <variable_name>) and is determined by the "body" of the assignment, which itself may either be a literal value (specified by <function_assign_body_literal_spec>) or a lambda function (specified by <function_assign_body_lambda_spec>).:

<function_assign_spec>[attrval] ::=
    "name" : <function_name>
    "type" : "assign" | "condition" # note that either is a literal/terminal value 
                                    # of the grammar
    "sources" : list of [ <function_source_reference> | <variable_name> ]
    "target" : <function_source_reference> | <variable_name>
    "body" : <function_assign_body_literal_spec> 
        | <function_assign_body_lambda_spec>

In the general case of variable assignment/setting, the attribute type should be "assign". In the special case where we are representing the assignment of a boolean value as the result of a condition (if-statement), then program analysis will infer a new boolean target variable, and the computation of the condition itself will be represented by the assignment function; in this case, we will use the more specific "condition" value for the "type" attribute of the <function_assign_spec>. Semantically, this is nothing more than an assignment of a boolean variable, but conceptually it will be useful to distinguish assignments used for conditions from other assignments.

For "sources" and "target": when there is no need to refer to the variable by its relative index, then <variable_name> is sufficient, and index will be assumed to be 0 (if at all relevant). In other cases, the variables will be referenced using the <function_source_reference>. There may also be cases where the sources can be a function, either built-in or user-defined. These two will be referenced using <function_source_reference> defined as:

<function_source_reference> ::=
   "name" : [ <variable_name> | <function_name> ]
   "type" : "variable" | "function"

Function assign body Literal

The <function_assign_body_literal_spec> asserts the assignment of a <literal_value> to the target variable. The <literal_value> has a data type (corresponding to one of our four domain types), and the value itself will be represented generically in a string (the string will be parsed to extract the actual value according to its data type).:

<function_assign_body_literal_spec>[attrval] ::=
    "type" : "literal"
    "value" : <literal_value>

<literal_value>[attrval] ::=
    "dtype" : "real" | "integer" | "boolean" | "string"
    "value" : <string>

Function assign body Lambda

When more computation is done to determine the value that is being assigned to the variable in the <function_assign_spec>, then <function_assign_body_lambda_spec> is used.:

<function_assign_body_lambda_spec>[attrval] ::=
    "type" : "lambda"
    "name" : <function_name>
    "reference" : <source_code_reference>

Eventually, we can expand this part of the grammar to accommodate a restricted set of arithmetic operations involved in computing the final value (this is now of interest in the World Modelers program and we're interested in supporting this in Delphi). But for now, we will start by having the lambda function reference the source code that does the computation, in the translated Python generated by program analysis. Any variables that are involved in the computation must be listed in the "sources" list of variables (<variable_name> references) in the <function_assign_spec>.:

<source_code_reference> ::= <string>

To start, the <source_code_reference> string could just be the line number or a tuple denoting the range of line numbers over which the Python source code for the corresponding operations is defined.

Function Decision Specification

Handles representation of simple binary condition block::

If condition_variable:
    Condition1 variable_reference
Else

Function Container Specification

A <function_container_spec> is the generic, "top level" way to specify how a set of variables that are related by functions are "wired up" by those functions. (I previously referred to this as the "top", but here I'm renaming it a "container" as that's more descriptive of how it functions.):

<function_container_spec>[attrval] ::=
    "name" : <function_name>
    "type" : "container"
    "DOCS" : <STRING>
    "input" : list of [ <variable_reference> | <variable_name> ]
    "variables" : list of <variable_spec>
    "output" : list of <variable_reference> | <variable_name>
    "body" : list of <function_reference_spec>

Case 1: subroutine

def foo1_subroutine(x,y):
    x = y

def foo2_subroutine():
    Integer z, y, w
    y = 5
    foo1(z,y)
    foo1(w,y)

now z = 5 and w = 5

Case 2: fortran function with simple return

def foo():
    x <-
    return x

def foo2():
    y = foo()

Case 3: fortran function with return expression

def foo():
    return x+1

becomes...

def foo():
    foo_return1 = x+1
    return foo_return1

Case 4: conditional return statements

def foo(): #fortran function
    if(x):
        return x
    else:
        return y

There will be a container function for each source code function. For this reason, we need an "input" variable list (of 0 or more variables) as well as an "output" variable. In Python, a function only returns a value if there is an explicit return expression. Otherwise it returns None.

TODO: Can there be nested functions in Fortran?

Function Reference Specification

<function_reference_spec>[attrval] ::=
    "function" : <function_name>
    "input" : list of [ <variable_reference> | <variable_name> ]
    "output" : <variable_reference> | <variable_name>

The <function_reference_spec> defines the "wiring" between functions and their input and output variable(s).
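
To make the "wiring" concrete, the sketch below turns a "body" list of <function_reference_spec> objects into directed edges between variables and functions. Everything here is illustrative: the helper names, the "NAME@index" label format for indexed references, and the example body (including the variable UPDATE_EST__RAIN) are hypothetical and not part of the specification.

```python
def _var_label(var):
    """Label a <variable_name> string or a <variable_reference> dict."""
    if isinstance(var, str):
        return var
    # Hypothetical label format for indexed references.
    return f'{var["variable"]}@{var["index"]}'


def wiring_edges(body):
    """Return (source, destination) pairs for a list of <function_reference_spec>."""
    edges = []
    for ref in body:
        for var in ref["input"]:
            edges.append((_var_label(var), ref["function"]))
        edges.append((ref["function"], _var_label(ref["output"])))
    return edges


# Hypothetical body with a single assignment function.
body = [
    {
        "function": "UPDATE_EST__assign___UPDATE_EST__YIELD_EST",
        "input": ["UPDATE_EST__RAIN", {"variable": "UPDATE_EST__YIELD_EST", "index": 0}],
        "output": {"variable": "UPDATE_EST__YIELD_EST", "index": 1},
    }
]
edges = wiring_edges(body)
```

Each input variable gets an edge into the function, and the function gets one edge to its output variable, which matches the single "output" allowed by the grammar above.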

Function Loop Plate Specification

<function_loop_plate>[attrval] ::=
    "name" : <function_name>
    "type" : "loop_plate"
    "input" : list of <variable_name>
    "index_variable" : <variable_name>
    "index_iteration_range" : <index_range>
    "body" : list of <function_reference_spec>

The "input" list of <variable_name> objects should list all variables that are set in the scope outside of the loop_plate.

The "index_variable" is the named variable that stores the iteration state of the loop; the naming convention of this variable is described above, in the Variable naming convention section. The only new element introduced is the <index_range>::

<index_range>[attrval] ::=
    "start" : <integer> | <variable_reference> | <variable_name>
    "end" : <integer> | <variable_reference> | <variable_name>

This definition permits loop iteration bounds to be specified either as literal integers, or as the values of variables.
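
Resolving the bounds of an <index_range> could be sketched as follows. The helper names and the `values` mapping (variable name to current integer value) are hypothetical, as is the variable CROP_YIELD__MAX_DAY; the sketch ignores the "index" of a <variable_reference> and makes no assumption about whether "end" is inclusive.

```python
def resolve_bound(bound, values):
    """Resolve one bound: a literal integer, a <variable_name>, or a <variable_reference>."""
    if isinstance(bound, bool):
        raise TypeError("a boolean is not a valid loop bound")
    if isinstance(bound, int):
        return bound
    if isinstance(bound, str):
        return values[bound]
    # <variable_reference>: look the variable up by name (index ignored here).
    return values[bound["variable"]]


def resolve_index_range(index_range, values):
    """Return the (start, end) integers described by an <index_range>."""
    return (
        resolve_bound(index_range["start"], values),
        resolve_bound(index_range["end"], values),
    )


# Hypothetical usage: a literal start and a variable-valued end.
bounds = resolve_index_range(
    {"start": 1, "end": "CROP_YIELD__MAX_DAY"},
    {"CROP_YIELD__MAX_DAY": 31},
)
```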

TODO: we think Fortran is restricted to integer values for iteration variables, which would include iteration over indexes into arrays. Need to double check this.

AutoMATES