Grounded Function Network (GrFN) JSON Specification

Version 0.1.m3

Changes from previous version:

Introduction

GrFN, pronounced “Griffin”, is the specification format for the central representation that integrates the extracted Function Network representation of source code (the result of program analysis) and associated extracted comments, links to natural language text (the result of natural language processing), and links to equations (the result of equation extraction).

Specification Conventions

This document describes the GrFN JSON schema, specifying the JSON format that is to be generated by program analysis and consumed by Delphi.

In this document we adopt a simplified Backus-Naur Form (BNF)-inspired grammar convention combined with a convention for intuitively defining specific JSON attribute-value lists. The schema definitions and instance GrFN examples are shown in monospaced font, and interspersed with comments/discussion.

Following BNF convention, elements in <...> denote nonterminals, with ::= indicating a definition of how a nonterminal is expanded. We will use some common nonterminals with standard expected interpretations, such as <string> for strings, <integer> for integers, etc. Many of the definitions below will specify JSON attribute-value lists; when this is the case, we will decorate the nonterminal element definition by adding [attrval], as follows:

<element_name>[attrval] ::= 

We will then specify the structure of the JSON attribute-value list attributes (quoted strings) and their value types using a mixture of JSON and BNF.

We also use the following conventions in the discussion below:

From source code to dynamic system representation

The goal of GrFN is to provide the end-point target for a translation from the semantics of program (computation) specification (as asserted in source code) to the semantics of a (discretized) dynamic system model.

A key assumption is that the program source code we are analyzing is intended to model aspects of some target physical domain, and that this target physical domain is a dynamical system that evolves over time.

The system is decomposed into a set of individual states (represented as random variables), where the values of the states at any given time are a function of the values of zero or more other states at the current and/or previous time point(s). Because we are considering the evolution of the system over time, in general every variable has an index. The functional relationships may be instantaneous (based on the variables indexed at the same point in time) or a function of states of variables at different time indices.

Identifiers: grounding, scopes, namespaces and gensyms

An identifier is a symbol used to uniquely identify a program element in code, where a program element is a

More than one identifier can be used to denote the same program element, but an identifier can only be associated with one program element at a time.

Grounding

Identifiers play a key role in connecting the model as implemented in source code to the target domain that it models. Grounding is the task of inferring what aspect of the target domain a program element may correspond to. Identifiers, by their (base) name(s), their declaration and use (i.e., where they occur in code, through their scope and namespace), and the doc and comment strings that occur around them, provide clues to what program elements are intended to represent in the target domain. For this reason, we need to associate with identifiers several pieces of information. This information will be collected during program analysis and associated with the identifier declaration:

Base Name

The <base_name> is intended to correspond to the identifier token name as it appears in the source language (e.g., Fortran). The <base_name> is itself a string

<base_name> ::= <string>

but follows Python's identifier naming rules (which also cover Fortran naming syntax).
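As a minimal sketch of this rule, the check below uses Python's built-in str.isidentifier() to test whether a candidate <base_name> is syntactically a legal Python identifier. The helper names is_valid_base_name and is_python_keyword are hypothetical and not part of the specification. Note that str.isidentifier() accepts Python keywords, so a Fortran name such as "lambda" passes the check; how program analysis should treat such names is not stated here.

```python
import keyword


def is_valid_base_name(name: str) -> bool:
    """Return True if `name` obeys Python's identifier syntax."""
    return name.isidentifier()


def is_python_keyword(name: str) -> bool:
    """Return True if `name` is reserved by Python (still passes isidentifier)."""
    return keyword.iskeyword(name)


# Hypothetical candidates: the first is accepted, the other two are rejected.
candidates = ["ABSORPTION_1", "2ND_PASS", "container$foo"]
accepted = [name for name in candidates if is_valid_base_name(name)]
```

The "$" used in function base names (described later) is not legal in a Python identifier, which is one reason the generated Python relies on gensyms rather than on these names directly.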

FUTURE: may extend this as more source languages are supported.

Scope and Namespace Paths

Identifiers may have the same <base_name> (as it appears in source code) but be distinguished by either (or both) the "scope" and "namespace" within which they are defined in the source code.

Each source language has its own rules for specifying scope and namespace, and it will be the responsibility of each program analysis module (e.g., Fortran for2py) to identify the hierarchical structure of the context that uniquely identifies the specific scope and/or namespace within which an identifier <base_name> is defined. In general, however, scopes and namespaces are defined hierarchically, such that the names of the levels of the hierarchy, taken together, uniquely define the context. A "path" of names therefore appears sufficient to represent the hierarchical context for either a specific scope or namespace. Names in a path are listed in order from general (highest level in the hierarchy) to specific.

Examples:

Path Strings

It will be convenient to be able to express <scope_path>s and <namespace_path>s using single strings within GrFN (particularly when building an identifier string). For this we introduce a special string notation in which the string names that make up a path are expressed in order but separated by periods. These representations will be referred to as the <scope_path_string> and <namespace_path_string>, respectively. The string representations of the <scope_path> and <namespace_path> examples above would be:

Identifier String

Identifiers are uniquely defined by their <base_name>, <scope_path>, and <namespace_path>. It will be convenient to refer unambiguously to any identifier using a single string, outside of the identifier specification declaration (defined below). We define an <identifier_string> by combining the namespace, scope and base_name (in that order) within a single string by separating the <namespace_path_string>, <scope_path_string> and <base_name> by double-colons:

<identifier_string> ::= "<namespace_path_string>::<scope_path_string>::<base_name>"

<identifier_string>s will be used to denote identifiers as they are used in variable and function specifications (described below).
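The construction of path strings and identifier strings described above can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that individual names contain neither "." nor "::". The example combines the CROP_YIELD namespace, UPDATE_EST scope and ABSORPTION base name that appear elsewhere in this document; that particular combination is illustrative only.

```python
from typing import List, Tuple

PATH_SEPARATOR = "."         # separates the names within one path string
IDENTIFIER_SEPARATOR = "::"  # separates namespace, scope and base name


def path_to_string(path: List[str]) -> str:
    """Join path names, ordered general to specific, with periods."""
    return PATH_SEPARATOR.join(path)


def string_to_path(path_string: str) -> List[str]:
    """Invert path_to_string; an empty string maps to an empty path."""
    return path_string.split(PATH_SEPARATOR) if path_string else []


def make_identifier_string(namespace_path: List[str],
                           scope_path: List[str],
                           base_name: str) -> str:
    """Combine namespace, scope and base name (in that order)."""
    return IDENTIFIER_SEPARATOR.join(
        [path_to_string(namespace_path), path_to_string(scope_path), base_name]
    )


def split_identifier_string(identifier_string: str) -> Tuple[List[str], List[str], str]:
    """Recover (namespace_path, scope_path, base_name) from an identifier string."""
    namespace_string, scope_string, base_name = identifier_string.split(
        IDENTIFIER_SEPARATOR
    )
    return string_to_path(namespace_string), string_to_path(scope_string), base_name


identifier = make_identifier_string(["CROP_YIELD"], ["UPDATE_EST"], "ABSORPTION")
# identifier == "CROP_YIELD::UPDATE_EST::ABSORPTION"
```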

Identifier Gensym

One of the outputs of program analysis is a functionally equivalent version of the original source code and lambda functions (described below), both expressed in Python (as the intermediate target language). All identifiers in the output Python must match identifiers in GrFN. Since capturing the semantics (particularly the namespace and scope context) results in a representation that does not appear to be consistently expressible in legal Python symbol names, we will use <gensym>s that can be represented (generally more compactly) as legal Python names and associated uniquely with identifiers.

FUTURE: Create a hashing function that can translate uniquely back and forth between <gensym>s and identifier strings.

FOR NOW: Generate <gensym>s as Python names that start with a letter followed by a unique integer. The letter could be 'g' for a generic gensym, or 'v' to indicate a variable identifier and 'f' to indicate a function identifier.

Each identifier will be associated one-to-one with a unique <gensym>.
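A minimal sketch of the FOR NOW scheme is given below: a table hands out a letter plus a unique integer and keeps the mapping one-to-one in both directions. The class name GensymTable, the shared counter across letters, and the choice to start counting at 1 are all assumptions made for illustration.

```python
import itertools
from typing import Dict


class GensymTable:
    """Hypothetical one-to-one table between identifier strings and gensyms."""

    ALLOWED_KINDS = ("g", "v", "f")  # generic, variable, function

    def __init__(self) -> None:
        self._counter = itertools.count(1)  # assumption: integers start at 1
        self._by_identifier: Dict[str, str] = {}
        self._by_gensym: Dict[str, str] = {}

    def gensym_for(self, identifier_string: str, kind: str = "g") -> str:
        """Return the gensym for an identifier, creating it on first use."""
        if kind not in self.ALLOWED_KINDS:
            raise ValueError(f"unknown gensym kind: {kind!r}")
        if identifier_string in self._by_identifier:
            return self._by_identifier[identifier_string]
        gensym = f"{kind}{next(self._counter)}"
        self._by_identifier[identifier_string] = gensym
        self._by_gensym[gensym] = identifier_string
        return gensym

    def identifier_for(self, gensym: str) -> str:
        """Translate a gensym back to its identifier string."""
        return self._by_gensym[gensym]


table = GensymTable()
first = table.gensym_for("CROP_YIELD::UPDATE_EST::ABSORPTION", kind="v")
# first == "v1", and asking again for the same identifier returns "v1" again
```

Because each gensym is a letter followed by digits, it is always a legal Python name, which is the property the generated Python needs.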

Identifier Specification

Each identifier within a GrFN specification will have a single <identifier_spec> declaration. An identifier will be declared in the GrFN spec JSON by the following attribute-value list:

<identifier_spec>[attrval] ::=
    "base_name" : <base_name>
    "scope" : <scope_path>
    "namespace" : <namespace_path>
    "aliases" : list of <string>
    "source_references" : list of <source_code_reference>
    "gensym" : <gensym>

Variable and Function Identifiers and References

Variable Naming Convention

A variable name will be an identifier:

<variable_name> ::= <identifier_string>

A top level source variable named ABSORPTION would then simply have the <base_name> of “ABSORPTION” plus the relevant <namespace_path_string> and <scope_path_string>.

If there are two (or more) separate instances of new variable declarations in the same context (same namespace and scope) using the same name, then we’ll add an underscore and number to the <base_name> to distinguish them. For example, if ABSORPTION is defined twice in the same namespace and scope, then the <base_name> of the first (in order in the source code) is named:

"ABSORPTION_1"

And the second:

"ABSORPTION_2"

Finally, in some cases (described below), program analysis will introduce variables (e.g., when analyzing conditionals). The naming conventions for the <base_name> of such introduced variables are described below.
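The numbering rule for repeated declarations in the same namespace and scope can be sketched as follows. The function name is hypothetical, and the sketch assumes (consistent with the ABSORPTION example) that a name declared only once keeps its <base_name> unchanged.

```python
from collections import Counter
from typing import List


def disambiguate_base_names(declared_names: List[str]) -> List[str]:
    """Suffix repeated names with _1, _2, ... in source order.

    `declared_names` lists the declarations found in ONE namespace and scope,
    in the order they occur in the source code.
    """
    totals = Counter(declared_names)  # how often each name is declared
    seen: Counter = Counter()         # how many of each we have emitted so far
    result = []
    for name in declared_names:
        if totals[name] > 1:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            result.append(name)
    return result


names = disambiguate_base_names(["ABSORPTION", "YIELD", "ABSORPTION"])
# names == ["ABSORPTION_1", "YIELD", "ABSORPTION_2"]
```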

Variable Reference

<variable_reference>[attrval] ::= 
    "variable" : <variable_name>
    "index" : <integer>

In addition to capturing source code variable environment context in variable declarations, we also need a mechanism to disambiguate specific instances of use of the same variable within the same context to accurately capture the logical order of variable value updates. In this case, we consider this as a repeated reference to the same variable. The semantics of repeated reference is captured by the variable "index" attribute of a <variable_reference>. The index integer serves to disambiguate the execution order of the variable state references, as determined during program analysis.
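As an illustration of how the "index" attribute might be assigned, the sketch below walks a sequence of variable updates in execution order and emits one <variable_reference> per update. It assumes that indices start at 0 and increase by one with each repeated update; the specification only requires that the index disambiguate execution order. The function name and the variable names RAIN and YIELD_EST are hypothetical.

```python
from typing import Dict, List


def index_updates(context: str, updated_variables: List[str]) -> List[dict]:
    """Build <variable_reference> dicts for updates listed in execution order.

    `context` is the "<namespace_path_string>::<scope_path_string>" prefix
    shared by all the variables.
    """
    next_index: Dict[str, int] = {}
    references = []
    for base_name in updated_variables:
        index = next_index.get(base_name, 0)  # assumption: first index is 0
        next_index[base_name] = index + 1
        references.append(
            {"variable": f"{context}::{base_name}", "index": index}
        )
    return references


refs = index_updates("CROP_YIELD::UPDATE_EST", ["RAIN", "YIELD_EST", "RAIN"])
# The second update of RAIN receives index 1; the other two receive index 0.
```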

Function Naming Conventions

Function names, like variable names, are ultimately identifiers (and therefore include their <namespace_path> and <scope_path>), but there are additional rules for determining the <base_name> of the function. Because of this particular set of rules, the <base_name> of the function name will be referred to as a <function_base_name>.

The general string format for a <function_base_name> is:

<function_base_name> ::= <function_type>[$[<var_affected>|<code_given_name>]]

The <function_type> is the string representing which of the five types the function belongs to (the types are described in more detail below): "assign", "condition", "decision", "container", "loop_plate". In the case of a loop_plate, we will name the specific loop using the generic name "loop" along with an integer (starting with value 1) uniquely distinguishing loops within the same namespace and scope.

The optional <code_given_name> is used when the function identified by program analysis has also been given a name within source code. For example, in this Python example:

def foo():
    ...

the function foo is a type of “container” and its <code_given_name> is "foo", making the <function_base_name> be

"container$foo"

When a <code_given_name> is available, it follows the <function_type> and the $ separator, as in the example above.

The optional <var_affected> is only relevant for the assign, condition and decision function types; the name of the affected variable is added after the <function_type> and $. For example, a condition function that sets the (inferred) boolean variable IF_1 would have the <function_base_name>:

"condition$IF_1"

Here are example function names for each function type. In each example, we assume the function is defined in the scope of the function UPDATE_EST and the namespace CROP_YIELD.
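The naming rule itself can be sketched in Python as follows. This is an illustration under stated assumptions rather than the names produced by any particular program analysis module: the function name make_function_base_name is hypothetical, and loop_plate names are not constructed because the exact placement of the "loop" name and integer is not fixed by the grammar above.

```python
from typing import Optional

FUNCTION_TYPES = ("assign", "condition", "decision", "container", "loop_plate")
VAR_AFFECTED_TYPES = ("assign", "condition", "decision")


def make_function_base_name(function_type: str,
                            var_affected: Optional[str] = None,
                            code_given_name: Optional[str] = None) -> str:
    """Build <function_type>[$[<var_affected>|<code_given_name>]]."""
    if function_type not in FUNCTION_TYPES:
        raise ValueError(f"unknown function type: {function_type!r}")
    if var_affected is not None and code_given_name is not None:
        raise ValueError("give var_affected or code_given_name, not both")
    if var_affected is not None and function_type not in VAR_AFFECTED_TYPES:
        raise ValueError(f"{function_type!r} functions have no affected variable")
    suffix = var_affected if var_affected is not None else code_given_name
    return function_type if suffix is None else f"{function_type}${suffix}"


# Full identifier strings for the assumed namespace CROP_YIELD and scope UPDATE_EST.
prefix = "CROP_YIELD::UPDATE_EST::"
condition_name = prefix + make_function_base_name("condition", var_affected="IF_1")
container_name = prefix + make_function_base_name("container", code_given_name="foo")
# condition_name == "CROP_YIELD::UPDATE_EST::condition$IF_1"
# container_name == "CROP_YIELD::UPDATE_EST::container$foo"
```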

Top-level GrFN Specification

The top-level structure of the GrFN specification is the <grfn_spec> and is itself a JSON attribute-value list, with the following schema definition:

<grfn_spec>[attrval] ::=
    "date_created" : <string>
    "source" : list of <source_code_file_path>
    "start": list of <string>
    "identifiers" : list of <identifier_spec>
    "functions" : list of <function_spec>

The "date_created" attribute is a string representing the date+time that the current GrFN was generated (this helps resolve what version of the program analysis code (e.g., for2py) was used).

There may be a single GrFN spec file for multiple source code files.

CHOICE: The issue is that there are some source files that define many identifiers and program elements that may be used in many system program unit components. If we have a single GrFN spec for each "Program", then we will be redundantly reproducing many identifiers and other program element declarations (variables, functions). The alternatives are: (1) Have a single GrFN spec for a given program unit and accept the redundancies. (2) Have a single GrFN spec for each source program file, and develop a method for importing/using GrFN specs that are used by other GrFN specs. (3) Develop an alternative mapping of source code to GrFN representation that allows for single GrFN specs for reused components that could be imported/used by other GrFN files, while still grouping source files by program.

FOR NOW: Go with Option (1): The main target of a GrFN spec file is all of the source code files involved in defining a program.

FUTURE: Add ability for GrFN specs to "import" and/or "use" other GrFN specs of other modules.

FOR NOW: the "source" attribute is a list of one or more <source_code_file_path>s. The <source_code_file_path> identifying a source file is represented the same way as a <namespace_path> (as described above), except that the final name (the name of the file itself) will include the file extension.

It is also the case that there may be multiple "start" points (or none at all) for a given program. For this reason, the "start" attribute is a list of zero or more names of the entry point(s) of the (Fortran) source code (for example, the PROGRAM module). These will be function <identifier_string>s. In the absence of any entry point, this value will be an empty list: [].

The "identifiers" attribute contains a list of <identifier_spec>s, as defined above in the section on Identifiers.

FUTURE: It may also be desirable to add an attribute to represent the program analysis code version used to generate the GrFN (as presumably the program analysis code could evolve and have different properties) -- although "date_created" may be sufficient.

NOTE: variables are not declared at the top-level <grfn_spec>, but will be defined in <function_spec>s, described below.

A (partial) example instance of the JSON generated for a <grfn_spec> of an analyzed file in the path 'crop_system/yield/crop_yield.py' is:

{
    "date_created": "20190127",
    "source": [["crop_system", "yield", "crop_yield.py"]],
    "start": ["MAIN"],
    "identifiers": [... identifier_specs go here...],
    "functions": [... function_specs go here...]
}
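For illustration, a top-level <grfn_spec> could be assembled and checked in Python as sketched below. The helper names are hypothetical; the attribute names follow the <grfn_spec> schema definition given above (in particular "date_created"), and the identifier and function lists are left empty because their contents are specified in other sections.

```python
import json
from typing import List, Sequence

REQUIRED_KEYS = ("date_created", "source", "start", "identifiers", "functions")


def make_grfn_spec(date_created: str,
                   source: Sequence[List[str]],
                   start: Sequence[str] = (),
                   identifiers: Sequence[dict] = (),
                   functions: Sequence[dict] = ()) -> dict:
    """Assemble the top-level attribute-value list as a plain dict."""
    return {
        "date_created": date_created,
        "source": [list(path) for path in source],
        "start": list(start),            # empty list when there is no entry point
        "identifiers": list(identifiers),
        "functions": list(functions),
    }


def missing_top_level_keys(spec: dict) -> List[str]:
    """List the schema attributes that are absent from `spec`."""
    return [key for key in REQUIRED_KEYS if key not in spec]


spec = make_grfn_spec(
    "20190127",
    source=[["crop_system", "yield", "crop_yield.py"]],
    start=["MAIN"],
)
serialized = json.dumps(spec, indent=4)  # JSON text ready to be written to file
```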

Variable Specification

<variable_spec>[attrval] ::=
    "name" : <variable_name>
    "domain" : <variable_domain_type>

Variable specifications will be associated with the functions whose scope contains the variable declarations in the source code. The list of <variable_spec>s should include all variables whose values get updated by computation within the function. These are derived from variables that are explicitly asserted in source code, such as those used for explicit value assignment or used as loop indices, and from other variables that program analysis may introduce (infer) as part of analyzing conditionals. As defined above, the <variable_name> is itself an <identifier_string>.

Some languages (including Fortran and Python) provide mechanisms for making variable declarations private (for example, Python’s name mangling of class attributes whose names begin with two underscores).

FOR NOW: Our hypothesis is that simply prepending another underscore (following Python name mangling) will make the "private" variable <base_name> unique from other variable names. Also, program analysis will "capture" the semantics of privacy by ensuring there are no outer-scope references to a private variable, and this will carry through in explicit references (or not, in this case) captured in the GrFN spec.

Variable Value Domain

<variable_domain_type> ::= <string>

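The "domain" attribute of a <variable_spec> specifies what values can be assigned to the variable. To start, we will keep things simple and restrict ourselves to four types that can be specified as strings: "real", "integer", "boolean" and "string" (the same four strings used for the "dtype" of a <literal_value>, defined below).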

(The idea of the variable domain is intended to be close to the idea of the "support" of a random variable, although should also correspond to standard data types.)

FUTURE: Need to extend to accommodate arrays.

FUTURE:

  • May also need to accommodate other structures (How far can this go? Unions, composite data structures, classes?).
  • We see augmenting the domain specification to also allow representing whether there are bounds on the values (e.g., positive integers, or real values in (0,10], etc.). When we move to doing this, the value of "domain" will itself become a new JSON attrval type.

Python is strongly typed but also dynamically typed. That is not to say that there is no type specification in Python: Python 3 now provides nascent support for explicit typing via type hints.

FUTURE: Explore whether/how type hints get represented in the AST. This will matter when we get to adding more explicit variable domain semantics in later model analysis.
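As a starting point for that exploration, the sketch below uses Python's standard ast module to show where type hints appear in a parsed tree: argument hints are stored in the annotation field of each ast.arg, the return hint in the returns field of the ast.FunctionDef, and annotated assignments become ast.AnnAssign nodes. The example function update is hypothetical, and reading .id assumes the hints are simple names such as float or int.

```python
import ast

SOURCE = '''
def update(x: float, n: int) -> float:
    total: float = x * n
    return total
'''

tree = ast.parse(SOURCE)
func = tree.body[0]  # the ast.FunctionDef node for update

# Argument hints: each ast.arg carries its hint in .annotation (an ast.Name here).
arg_types = {arg.arg: arg.annotation.id for arg in func.args.args}

# Return hint: stored in FunctionDef.returns.
return_type = func.returns.id

# Annotated assignments inside the body are ast.AnnAssign nodes.
annotated_vars = {
    node.target.id: node.annotation.id
    for node in ast.walk(func)
    if isinstance(node, ast.AnnAssign)
}
# arg_types == {"x": "float", "n": "int"}; annotated_vars == {"total": "float"}
```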

For our purposes in the near term, we do want to capture what type and value-domain information is available; there are two main sources of this information:

  1. Fortran: Types are specified statically. If we also want to capture this in program-analysis-generated code, then there is a question of how to communicate this in the Python source representation; possibly through the new typing mentioned above; possibly as docstrings in program-analysis-generated code.
  2. Docstrings: Possibly types and value ranges can be inferred from what is specified in a docstring.

<variable_spec> examples

Here are three examples of <variable_spec> objects:

Function Specification

Next we have the <function_spec>. There are five types of functions; three types can be expressed using the same attributes in their JSON attribute-value list (<function_assign_spec>), while the others (<function_container_spec>, <function_loop_plate>) require different attributes. This means there are three specializations of the <function_spec>, one of which (<function_assign_spec>) will be used for three function types.

<function_spec> ::=
    <function_assign_spec>       # type "assign", "condition" or "decision"
    | <function_container_spec>  # type "container"
    | <function_loop_plate>      # type "loop_plate"

All three specs will have a "type" attribute that will unambiguously identify which type of function is being specified. The five possible types are:

All <function_spec>s will also have a "name" attribute with a unique <identifier_string> (across <function_spec>s), as described above under the Function naming convention section; as described in that section, the function name will include the function type, but having the explicit type attribute makes JSON parsing easier.

Function Assign Specification (assign, condition, decision)

A <function_assign_spec> denotes the setting of the value of a variable. The values are assigned to the "target" variable (denoted by a <variable_reference> or <variable_name>) and the value is determined by the "body" of the assignment, which itself may either be a literal value (specified by <function_assign_body_literal_spec>) or a lambda function (specified by <function_assign_body_lambda_spec>).

<function_assign_spec>[attrval] ::=
    "name" : <function_name>
    "type" : "assign" | "condition" | "decision"
        # note that the value of the "type" is a literal/terminal 
        # value of the grammar
    "sources" : list of [ <function_source_reference> | <variable_name> ]
    "target" : <function_source_reference> | <variable_name>
    "body" : <function_assign_body_literal_spec> 
             | <function_assign_body_lambda_spec>

There are three types of assign functions, distinguished by the value of the attribute "type".

The identifier conventions for assign, condition and decision functions are described above in the section on Function naming conventions.

For "sources" and "target": When there is no need to refer to the variable by its relative index, then <variable_name> (itself an <identifier_string>) is sufficient, and index will be assumed to be 0 (if at all relevant). In other cases, the variables will be referenced using the <function_source_reference>, to indicate the return value of the function. There may also be cases where the sources can be a function, either built-in or user-defined. These two will be referenced using <function_source_reference> defined as:

<function_source_reference>[attrval] ::=
   "name" : [ <variable_name> | <function_name> ]
   "type" : "variable" | "function"

Function assign body Literal

The <function_assign_body_literal_spec> asserts the assignment of a <literal_value> to the target variable. The <literal_value> has a data type (corresponding to one of our four domain types), and the value itself will be represented generically in a string (the string will be parsed to extract the actual value according to its data type).
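The string-to-value parsing mentioned here could look like the sketch below, which operates on the "dtype" and "value" attributes of the <literal_value> attribute-value list defined in this section. The function names are hypothetical. The accepted spellings of booleans ("true"/"false", case-insensitive) and the use of Python's float() and int() for numeric strings are assumptions; the specification does not state how each data type is written in the string.

```python
def parse_literal_value(literal: dict):
    """Convert a <literal_value> dict to a Python value according to its dtype."""
    dtype = literal["dtype"]
    text = literal["value"]
    if dtype == "real":
        return float(text)
    if dtype == "integer":
        return int(text)
    if dtype == "boolean":
        lowered = text.strip().lower()
        if lowered in ("true", "false"):  # assumption: these two spellings only
            return lowered == "true"
        raise ValueError(f"not a boolean literal: {text!r}")
    if dtype == "string":
        return text
    raise ValueError(f"unknown dtype: {dtype!r}")


def parse_literal_body(body: dict):
    """Evaluate a <function_assign_body_literal_spec> dict."""
    if body.get("type") != "literal":
        raise ValueError("not a literal body")
    return parse_literal_value(body["value"])


body = {"type": "literal", "value": {"dtype": "integer", "value": "5"}}
value = parse_literal_body(body)
# value == 5
```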

<function_assign_body_literal_spec>[attrval] ::=
    "type" : "literal"
    "value" : <literal_value>

<literal_value>[attrval] ::=
    "dtype" : "real" | "integer" | "boolean" | "string"
    "value" : <string>

Function assign body Lambda

When more computation is done to determine the value that is being assigned to the variable in the <function_assign_spec>, then <function_assign_body_lambda_spec> is used.

<function_assign_body_lambda_spec>[attrval] ::=
    "type" : "lambda"
    "name" : <function_name>
    "reference" : <source_code_reference>

FUTURE: Eventually, we can expand this part of the grammar to accommodate a restricted set of arithmetic operations involved in computing the final value (this is now of interest in the World Modelers program and we're interested in supporting this in Delphi).

FOR NOW: Have the lambda function reference the source code that does the computation, in the translated Python generated by program analysis. Any variables that are involved in the computation must be listed in the "sources" list of variables (<variable_name> references) in the <function_assign_spec>.

As noted above, due to the more semantically rich identifier specification and <identifier_string> representation, it is not straightforward to use the <identifier_string> as the Python symbol in the translated Python generated by program analysis. Instead, function and variable identifiers will be represented in the generated Python using their gensym. For debugging and visualization purposes, the generated Python code may be displayed with <identifier_string> (or some version that is closer to legal Python naming, although in general it does not appear to be possible to create "safe" Python names directly from <identifier_string>s).

Function Container Specification

A <function_container_spec> represents the grouping of a set of variables and how they are updated by functions. Generally the "function container" corresponds to functions (or subroutines) defined in source code. A "function container" is also defined for the "top" level of a source code file.

<function_container_spec>[attrval] ::=
    "name" : <function_name>
    "type" : "container"
    "DOCS" : <string>
    "input" : list of [ <variable_reference> | <variable_name> ]
    "variables" : list of <variable_spec>
    "output" : list of <variable_reference> | <variable_name>
    "body" : list of <function_reference_spec>

There will be a container function for each source code function. For this reason, we need an "input" variable list (of 0 or more variables) as well as an "output" variable. In Python, a function only returns a value if there is an explicit return expression. Otherwise it returns None.

Case 1: subroutine

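def foo1_subroutine(x,y):
    x = y

def foo2_subroutine():
    Integer z, y, w
    y = 5
    foo1_subroutine(z,y)
    foo1_subroutine(w,y)

After the two calls, z = 5 and w = 5, because a Fortran subroutine can modify its arguments in the caller.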

Case 2: fortran function with simple return

def foo():
    x <-
    return x

def foo2():
    y = foo()

Case 3: fortran function with return expression

def foo():
    return x+1

becomes...

def foo():
    foo_return1 = x+1
    return foo_return1

Case 4: conditional return statements

def foo(): #fortran function
    if(x):
        return x
    else:
        return y

Function Reference Specification

<function_reference_spec>[attrval] ::=
    "function" : <function_name>
    "input" : list of [ <variable_reference> | <variable_name> ]
    "output" : <variable_reference> | <variable_name>

The <function_reference_spec> defines the "wiring" between functions and their input and output variable(s).
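To make the idea of "wiring" concrete, the sketch below evaluates a container "body" by calling each referenced function on its input variables and storing the result in its output variable. This is an illustration only: the registry of Python callables, the choice to key variable states by (name, index), the default index of 0 for bare <variable_name>s, and evaluation in list order are assumptions, and the variable names RAIN and YIELD_EST are hypothetical.

```python
from typing import Callable, Dict, List, Tuple, Union

VariableRef = Union[str, dict]  # <variable_name> or <variable_reference>


def variable_key(ref: VariableRef) -> Tuple[str, int]:
    """Map a variable name or reference to a (name, index) state key."""
    if isinstance(ref, str):
        return (ref, 0)  # assumption: bare names refer to index 0
    return (ref["variable"], ref["index"])


def run_container_body(body: List[dict],
                       functions: Dict[str, Callable],
                       environment: Dict[Tuple[str, int], object]) -> dict:
    """Apply each <function_reference_spec> in order; return the new state."""
    state = dict(environment)  # leave the caller's environment untouched
    for reference in body:
        func = functions[reference["function"]]
        args = [state[variable_key(v)] for v in reference["input"]]
        state[variable_key(reference["output"])] = func(*args)
    return state


ctx = "CROP_YIELD::UPDATE_EST::"
body = [
    {"function": ctx + "assign$RAIN", "input": [], "output": ctx + "RAIN"},
    {"function": ctx + "assign$YIELD_EST",
     "input": [ctx + "RAIN"],
     "output": {"variable": ctx + "YIELD_EST", "index": 1}},
]
functions = {
    ctx + "assign$RAIN": lambda: 2.0,                    # literal assignment
    ctx + "assign$YIELD_EST": lambda rain: 10.0 * rain,  # lambda assignment
}
final_state = run_container_body(body, functions, {})
# final_state[(ctx + "YIELD_EST", 1)] == 20.0
```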

Function Loop Plate Specification

<function_loop_plate>[attrval] ::=
    "name" : <function_name>
    "type" : "loop_plate"
    "input" : list of <variable_name>
    "index_variable" : <variable_name>
    "index_iteration_range" : <index_range>
    "condition" : <loop_condition>
    "body" : list of <function_reference_spec>

The "input" list of <variable_name> objects should list all variables that are set in the scope outside of the loop_plate.

The current loop_plate specification is aimed at handling for-loops (it assumes that "index_variable" and "index_iteration_range" are specified).

FUTURE: Generalize to do-while loop by just relying on the "condition" <loop_condition> to determine when loop completes. We can then remove "index_variable" and "index_iteration_range". There will still need to be a mechanism for identifying index_variable(s).

The "index_variable" is the named variable that stores the iteration state of the loop; the naming convention of this variable is described above, in the Variable naming convention section. The only new element introduced is the <index_range>:

<index_range>[attrval] ::=
    "start" : <integer> | <variable_reference> | <variable_name>
    "end" : <integer> | <variable_reference> | <variable_name>

This definition permits loop iteration bounds to be specified either as literal integers, or as the values of variables.
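A minimal sketch of resolving an <index_range> is shown below: each bound is either used directly (a literal integer) or looked up in the current variable state (a <variable_name> or <variable_reference>). The function names, the (name, index) state keys, the default index of 0 for bare names, and the variable N_DAYS are assumptions for illustration. Whether "end" is inclusive is not stated in this section, so the sketch returns the two bounds without building an iteration range.

```python
from typing import Dict, Tuple, Union

Bound = Union[int, str, dict]  # <integer>, <variable_name> or <variable_reference>


def resolve_bound(bound: Bound, state: Dict[Tuple[str, int], object]) -> int:
    """Return the integer value of one loop bound."""
    if isinstance(bound, bool):
        raise TypeError("a loop bound must be an integer or a variable")
    if isinstance(bound, int):
        return bound  # literal integer bound
    if isinstance(bound, str):
        return int(state[(bound, 0)])  # assumption: bare names use index 0
    return int(state[(bound["variable"], bound["index"])])


def resolve_index_range(index_range: dict,
                        state: Dict[Tuple[str, int], object]) -> Tuple[int, int]:
    """Resolve both bounds of an <index_range> to integers."""
    return (
        resolve_bound(index_range["start"], state),
        resolve_bound(index_range["end"], state),
    )


state = {("CROP_YIELD::UPDATE_EST::N_DAYS", 0): 30}
bounds = resolve_index_range(
    {"start": 1, "end": "CROP_YIELD::UPDATE_EST::N_DAYS"}, state
)
# bounds == (1, 30)
```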