Grounded Function Network (GrFN) JSON Specification

Version 0.1.m3

Changes from previous version:

Revision of Introduction
Addition of identifiers: <identifier_spec>, <identifier_string>, and <gensym> (for identifiers in generated code)
Updates to naming conventions for variables and functions
General cleanup of discussion throughout

Introduction

GrFN, pronounced “Griffin”, is the specification format for the central representation that integrates the extracted Function Network representation of source code (the result of program analysis) and associated extracted comments, links to natural language text (the result of natural language processing), and links to equations (the result of equation extraction).

Specification Conventions

This document describes the GrFN JSON schema, specifying the JSON format that is to be generated by program analysis and consumed by Delphi.

In this document we adopt a simplified Backus-Naur Form (BNF)-inspired grammar convention combined with a convention for intuitively defining specific JSON attribute-value lists. The schema definitions and instance GrFN examples are shown in monospaced font, and interspersed with comments/discussion.

Following BNF convention, elements in <...> denote nonterminals, with ::= indicating a definition of how a nonterminal is expanded. We will use some common nonterminals with standard expected interpretations, such as <string> for strings, <integer> for integers, etc. Many of the definitions below will specify JSON attribute-value lists; when this is the case, we will decorate the nonterminal element definition by adding [attrval], as follows:

<element_name>[attrval] ::=

We will then specify the structure of the JSON attribute-value list attributes (quoted strings) and their value types using a mixture of JSON and BNF.

We also use the following conventions in the discussion below:

'FUTURE': Tags anticipated extensions that may be needed but are not yet supported.
'CHOICE': Captures discussion of a choice that does not yet have a clear resolution.
'FOR NOW': Tags the approach currently being taken, either in response to a FUTURE or a CHOICE.

From source code to dynamic system representation

The goal of GrFN is to provide the end-point target for a translation from the semantics of program (computation) specification (as asserted in source code) to the semantics of a (discretized) dynamic system model.

A key assumption is that the program source code we are analyzing is intended to model aspects of some target physical domain, and that this target physical domain is a dynamical system that evolves over time.

The system is decomposed into a set of individual states (represented as random variables), where the values of the states at any given time are a function of the values of zero or more other states at the current and/or previous time point(s). Because we are considering the evolution of the system over time, in general every variable has an index. The functional relationships may be instantaneous (based on the variables indexed at the same point in time) or a function of states of variables at different time indices.

Identifiers: grounding, scopes, namespaces and gensyms

An identifier is a symbol used to uniquely identify a program element in code, where a program element is a

variable (or constant)
function
type (class)

More than one identifier can be used to denote the same program element, but an identifier can only be associated with one program element at a time.

Grounding

Identifiers play a key role in connecting the model as implemented in source code to the target domain that it models. Grounding is the task of inferring what aspect of the target domain a program element may correspond to. Identifiers, by their (base) name(s), their declaration and use (i.e., where they occur in code, through their scope and namespace), and the doc and comment strings that occur around them, provide clues to what program elements are intended to represent in the target domain. For this reason, we need to associate with identifiers several pieces of information. This information will be collected during program analysis and associated with the identifier declaration:

“aliases”: It is possible for multiple identifiers to be used to denote the same program element. How this is done differs across languages, according to scoping rules and assignment. Program analysis modules for each language (e.g., the Fortran for2py analyzer) will determine how aliases are used. One general way to assign more than one identifier to the same program element is through a simple equality assignment, e.g., y = x means that a new identifier, y, denotes the same program element that x does. A simple equality assignment only equates one identifier with another, with no other operations applied; if other operations are applied (e.g., y = x + 1), then y is a new identifier rather than an alias, as it does not represent the original value of x but a modification of it.

CHOICE: Do we declare each identifier separately, or combine them at program analysis time to treat them as a single identifier with aliases?

FOR NOW: We will only keep track of a single identifier (the first one encountered by program analysis) and record the names of any additional identifiers introduced in code as its "aliases".

FUTURE: Note that once we consider pointers (e.g., DSSAT has some), it can become impossible in general to determine all aliases statically.
“source_references”: To facilitate later grounding inference, we will store a reference to the location within the source code where an identifier is declared, using a <source_code_reference>:
```
  <source_code_reference> ::= <string>
```
The string contains information to identify the location of the identifier, which is a single line number if the declaration occurs on a single line, otherwise two line numbers to indicate the span of line numbers containing the declaration. (<source_code_reference>s will also be used to represent the location of other program elements, such as functions, below.)

Because an identifier may have associated aliases, the “source_references” will be a list of one or more <source_code_reference>, with one <source_code_reference> representing the location in source code of the initial identifier declaration, and then an additional <source_code_reference> for each time a new alias is initially declared (e.g., through an assignment). The order of these <source_code_reference>s will correspond to the order of the introduction/declaration of the aliases in the source code (from the perspective of program analysis). (The reason for this is that the initial introduction of the alias is more likely to have associated relevant comments that might provide information about the identifier’s grounding.)

Base Name

The <base_name> is intended to correspond to the identifier token name as it appears in the source language (e.g., Fortran). The <base_name> is itself a string

<base_name> ::= <string>

but follows Python's rules for identifier names (which also cover Fortran naming syntax).

FUTURE: may extend this as more source languages are supported.

Scope and Namespace Paths

Identifiers may have the same <base_name> (as it appears in source code) but be distinguished by either (or both) the "scope" and "namespace" within which they are defined in the source code.

Each source language has its own rules for specifying scope and namespace, and it will be the responsibility of each program analysis module (e.g., Fortran for2py) to identify the hierarchical structure of the context that uniquely identifies the specific scope and/or namespace within which an identifier <base_name> is defined. However, scopes and namespaces are generally defined hierarchically, such that the names for each level of the hierarchy, taken together, uniquely define the context. A "path" of names appears to be sufficient to represent the hierarchical context for either a specific scope or namespace. In general, names for a path are listed in order from general (highest level in the hierarchy) to specific.

Examples:

<scope_path>: As will be described below, program analysis will assign unique names for scopes (see discussion below under conditional, container and loop_plate functions). Given these names, the scope of the inner loop within the function foo in this example,
```
  def foo():
      for i in range(10):      # assigned name 'loop$1'
          for j in range(10):  # assigned name 'loop$1' (in the scope of the outer loop$1)
              x = i*j
```
would be uniquely specified by the following path:
```
  ["foo", "loop$1", "loop$1"]
```
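As a rough illustration of how such a path could be derived, the sketch below walks a Python AST and names each loop "loop$N", numbering loops separately within each enclosing scope. The function name `scope_paths` and the use of Python's `ast` module are assumptions made for this example; the actual program analysis (e.g., for2py) may proceed differently.

```python
import ast

SOURCE = """
def foo():
    for i in range(10):
        for j in range(10):
            x = i*j
"""

def scope_paths(tree):
    """Return the scope path of every function and loop found in the tree."""
    paths = []

    def visit(node, path, counter):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.FunctionDef):
                child_path = path + [child.name]
            elif isinstance(child, (ast.For, ast.While)):
                counter[0] += 1  # loops are numbered per enclosing scope, from 1
                child_path = path + ["loop$%d" % counter[0]]
            else:
                visit(child, path, counter)  # same scope, same loop counter
                continue
            paths.append(child_path)
            visit(child, child_path, [0])  # new scope, fresh loop counter

    visit(tree, [], [0])
    return paths

for path in scope_paths(ast.parse(SOURCE)):
    print(path)
```

The last path printed corresponds to the inner loop of the example above.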
In general, it is not necessary within GrFN to independently declare scopes. Instead, we simply specify the <scope_path> as a list of strings under the “scope” attribute in the identifier declaration (below).
```
  <scope_path> ::= list of <string>
```
The "top" level of the file (i.e., not enclosed within another program block context) will be assigned the default scope name of "_TOP". All other scopes are either explicitly named (such as a named function), or are assigned a unique name by program analysis according to the rules of the type of scope (such as container, loop, conditional, etc), defined below. In such cases other than top, there is no need to include the "_TOP" in the path – it will be assumed that those named scopes are all within the default top-level scope.
<namespace_path>: Different languages have different conventions for defining namespaces, but in general they are either (1) explicitly defined within source code by namespace declarations (such as Fortran “modules” or C++ “namespace”s), or (2) implicitly defined by the project directory structure within which a file is located (as in Python). In the case of namespaces defined by project directory structure, two files in different locations in the project directory tree may have the same name. To distinguish these, program analysis will capture the path of the directory tree from the root to the file. The final name in the path, which is the name of the source file, will drop the file extension. For example, the namespace for file baz.py within the following directory tree
```
  foo/
      bar/
          baz.py
```
would be uniquely specified by the following path:
```
  ["foo", "bar", "baz"]
```
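A minimal sketch of this directory-based rule, assuming the file path is given relative to the project root and uses POSIX separators (the helper name `namespace_path` is hypothetical):

```python
from pathlib import PurePosixPath

def namespace_path(relative_file_path):
    """Directory names from the project root, then the file name without its extension."""
    path = PurePosixPath(relative_file_path)
    return list(path.parent.parts) + [path.stem]

print(namespace_path("foo/bar/baz.py"))  # ['foo', 'bar', 'baz']
```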
In the case of declared namespaces, the namespace declaration will determine the path (which may only consist of one string name).

Again, it is not necessary within GrFN to independently declare a namespace; like the <scope_path>, we specify the <namespace_path> as a list of strings under the “namespace” attribute in the identifier declaration:
```
  <namespace_path> ::= list of <string>
```
Like the <scope_path>, the string names of the path uniquely defining the namespace are in order from general to specific, with the last string name either being the implicit namespace defined by the source code file, or the user-defined name of the namespace.

Path Strings

It will be convenient to be able to express <scope_path>s and <namespace_path>s using single strings within GrFN (particularly when building an identifier string). For this we introduce a special string notation in which the string names that make up a path are expressed in order but separated by periods. These representations will be referred to as the <scope_path_string> and <namespace_path_string>, respectively. The string representations of the <scope_path> and <namespace_path> examples above would be:

Example <scope_path_string>:
```
  "foo.loop$1.loop$1"
```
Example <namespace_path_string>:
```
  "foo.bar.baz"
```

Identifier String

Identifiers are uniquely defined by their <base_name>, <scope_path>, and <namespace_path>. It will be convenient to refer unambiguously to any identifier using a single string, outside of the identifier specification declaration (defined below). We define an <identifier_string> by combining the namespace, scope and base_name (in that order) within a single string by separating the <namespace_path_string>, <scope_path_string> and <base_name> by double-colons:

<identifier_string> ::= "<namespace_path_string>::<scope_path_string>::<base_name>"

<identifier_string>s will be used to denote identifiers as they are used in variable and function specifications (described below).
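The construction can be sketched in a few lines of Python. The helper names `path_string` and `identifier_string` are hypothetical, and the example assumes (purely for illustration) that the function foo from the scope example lives in the file baz.py from the namespace example.

```python
def path_string(path):
    """Join the names of a <scope_path> or <namespace_path> with periods."""
    return ".".join(path)

def identifier_string(namespace_path, scope_path, base_name):
    """Combine namespace, scope and base name (in that order) with '::'."""
    return "::".join(
        [path_string(namespace_path), path_string(scope_path), base_name]
    )

ident = identifier_string(
    ["foo", "bar", "baz"],          # <namespace_path>
    ["foo", "loop$1", "loop$1"],    # <scope_path>
    "x",                            # <base_name>
)
print(ident)  # foo.bar.baz::foo.loop$1.loop$1::x
```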

Identifier Gensym

One of the outputs of program analysis is a functionally equivalent version of the original source code and lambda functions (described below), both expressed in Python (as the intermediate target language). All identifiers in the output Python must match identifiers in GrFN. Since capturing the semantics (particularly the namespace and scope context) results in a representation that does not appear to be consistently expressible in legal Python symbol names, we will use <gensym>s that can be represented (generally more compactly) as legal Python names and associated uniquely with identifiers.

FUTURE: Create a hashing function that can translate uniquely back and forth between <gensym>s and identifier strings.

FOR NOW: Generate <gensym>s as Python names that start with a letter followed by a unique integer. The letter could be 'g' for a generic gensym, or 'v' to indicate a variable identifier and 'f' to indicate a function identifier.

Each identifier will be associated one-to-one with a unique <gensym>.
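One possible way to realize the FOR NOW scheme is sketched below. The class name `GensymGenerator`, the shared counter starting at 1, and the lookup table keyed by <identifier_string> are all assumptions of this example, not part of the specification.

```python
import itertools

class GensymGenerator:
    """Hands out one unique, legal Python name per identifier string."""

    def __init__(self):
        self._counter = itertools.count(1)  # assumption: integers start at 1
        self._by_identifier = {}

    def gensym(self, identifier_string, prefix="g"):
        """Return the gensym for an identifier, creating it on first use.

        prefix is 'g' (generic), 'v' (variable) or 'f' (function).
        """
        if identifier_string not in self._by_identifier:
            name = "%s%d" % (prefix, next(self._counter))
            self._by_identifier[identifier_string] = name
        return self._by_identifier[identifier_string]

gen = GensymGenerator()
print(gen.gensym("CROP_YIELD::UPDATE_EST::YIELD_EST", "v"))
print(gen.gensym("CROP_YIELD::_TOP::container$UPDATE_EST", "f"))
```

Because the counter is shared across prefixes, the integer alone is already unique; the letter only serves as a readability hint.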

Identifier Specification

Each identifier within a GrFN specification will have a single <identifier_spec> declaration. An identifier will be declared in the GrFN spec JSON by the following attribute-value list:

<identifier_spec>[attrval] ::=
    "base_name" : <base_name>
    "scope" : <scope_path>
    "namespace" : <namespace_path>
    "aliases" : list of <string>
    "source_references" : list of <source_code_reference>
    "gensym" : <gensym>
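For concreteness, here is what one <identifier_spec> instance might look like when built as a Python dictionary and serialized to JSON. All values are hypothetical: the identifier is borrowed from the YIELD_EST examples in this document, and the line number and gensym are made up for illustration.

```python
import json

identifier_spec = {
    "base_name": "YIELD_EST",
    "scope": ["UPDATE_EST"],
    "namespace": ["CROP_YIELD"],
    "aliases": [],                 # no additional names introduced in code
    "source_references": ["12"],   # hypothetical: declared on line 12
    "gensym": "v1",                # hypothetical gensym
}

print(json.dumps(identifier_spec, indent=4))
```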

Variable and Function Identifiers and References

Variable Naming Convention

A variable name will be an identifier:

<variable_name> ::= <identifier_string>

A top level source variable named ABSORPTION would then simply have the <base_name> of “ABSORPTION” plus the relevant <namespace_path_string> and <scope_path_string>.

If there are two (or more) separate instances of new variable declarations in the same context (same namespace and scope) using the same name, then we’ll add an underscore and number to the <base_name> to distinguish them. For example, if ABSORPTION is defined twice in the same namespace and scope, then the <base_name> of the first (in order in the source code) is named:

"ABSORPTION_1"

And the second:

"ABSORPTION_2"
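A small sketch of this renaming rule, assuming the base names of one context (same namespace and scope) are supplied in source order and that names declared only once are left unchanged (the function name `disambiguate` is hypothetical):

```python
from collections import Counter

def disambiguate(base_names):
    """Append _1, _2, ... to names declared more than once in one context."""
    totals = Counter(base_names)
    seen = Counter()
    result = []
    for name in base_names:
        if totals[name] > 1:
            seen[name] += 1
            result.append("%s_%d" % (name, seen[name]))
        else:
            result.append(name)
    return result

print(disambiguate(["ABSORPTION", "RAIN", "ABSORPTION"]))
# ['ABSORPTION_1', 'RAIN', 'ABSORPTION_2']
```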

Finally, in some cases (described below), program analysis will introduce variables (e.g., when analyzing conditionals). The naming conventions for the <base_name> of such introduced variables are described below.

Variable Reference

<variable_reference>[attrval] ::= 
    "variable" : <variable_name>
    "index" : <integer>

In addition to capturing source code variable environment context in variable declarations, we also need a mechanism to disambiguate specific instances of use of the same variable within the same context to accurately capture the logical order of variable value updates. In this case, we consider this as a repeated reference to the same variable. The semantics of repeated reference is captured by the variable "index" attribute of a <variable_reference>. The index integer serves to disambiguate the execution order of the variable state references, as determined during program analysis.

Function Naming Conventions

Function names, like variable names, are ultimately identifiers (and therefore include their <namespace_path> and <scope_path>), but there are additional rules for determining the <base_name> of the function. Because of this particular set of rules, the <base_name> of the function name will be referred to as a <function_base_name>.

The general string format for a <function_base_name> is:

<function_base_name> ::= <function_type>[$[<var_affected>|<code_given_name>]]

The <function_type> is the string representing which of the five types the function belongs to (the types are described in more detail below): "assign", "condition", "decision", "container", "loop_plate". In the case of a loop_plate, we will name the specific loop using the generic name "loop" along with an integer (starting with value 1) uniquely distinguishing loops within the same namespace and scope.

The optional <code_given_name> is used when the function identified by program analysis has also been given a name within source code. For example, in this python example:

def foo():
    ...

the function foo is a type of “container” and its <code_given_name> is "foo", making the <function_base_name> be

"container$foo"

When a <code_given_name> is available, it is added after the <function_type> and $.

The optional <var_affected> will only be relevant for assign, condition and decision function types, and the name of the variable affected will be added after the <function_type> and $. For example, a condition function setting the (inferred) boolean variable IF_1 would have the <function_base_name>:

"condition$IF_1"
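The naming rule can be summarized with a short helper. This is only a sketch: the function name and signature are hypothetical, and it assumes the caller passes the loop number for a loop_plate and the <var_affected> or <code_given_name> for the other types.

```python
FUNCTION_TYPES = ("assign", "condition", "decision", "container", "loop_plate")

def function_base_name(function_type, suffix=None):
    """Build a <function_base_name> from a function type and optional suffix."""
    if function_type not in FUNCTION_TYPES:
        raise ValueError("unknown function type: %r" % (function_type,))
    if function_type == "loop_plate":
        # Loops use the generic name "loop" plus a distinguishing integer.
        return "loop$%d" % suffix
    if suffix is None:
        return function_type
    return "%s$%s" % (function_type, suffix)

print(function_base_name("container", "foo"))   # container$foo
print(function_base_name("condition", "IF_1"))  # condition$IF_1
print(function_base_name("loop_plate", 1))      # loop$1
```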

Here are example function names for each function type. In each example, we assume the function is defined in the scope of the function UPDATE_EST and the namespace CROP_YIELD.

Assign: An assignment of the variable with the <identifier_string> “CROP_YIELD::UPDATE_EST::YIELD_EST” (which denotes the identifier with <base_name> "YIELD_EST" in the scope of the function UPDATE_EST declared in the namespace CROP_YIELD) has the <function_base_name>:
```
"assign$CROP_YIELD::UPDATE_EST::YIELD_EST"
```
If this assignment takes place in the function UPDATE_EST and the namespace CROP_YIELD, then the full <identifier_string> of the function identifier would be:
```
"CROP_YIELD::UPDATE_EST::assign$CROP_YIELD::UPDATE_EST::YIELD_EST"
```
Condition: A condition assigning the (inferred) boolean variable IF_1 in the scope of the function UPDATE_EST of the namespace CROP_YIELD would have the <identifier_string>:
```
"CROP_YIELD::UPDATE_EST::condition$IF_1"
```
Decision: A decision function assigns a variable a value based on the (outcome) state of a condition variable. If the variable "YIELD_EST" (from the namespace "CROP_YIELD" and scope "UPDATE_EST") is being updated as a result of a conditional outcome in the namespace "CROP_YIELD" and scope "DERIVE_YIELD", then the <identifier_string> would be:
```
"CROP_YIELD::DERIVE_YIELD::decision$CROP_YIELD::UPDATE_EST::YIELD_EST"
```
Container: A container function declared in source code to have the name CROP_YIELD would then have the <code_given_name> of CROP_YIELD, and if this was declared at the top level of a file (defining the namespace) called CROP_YIELD would have the <function_base_name>:
```
"container$CROP_YIELD"
```
and the full <identifier_string> of:
```
"CROP_YIELD::NULL::container$CROP_YIELD"
```
(Note that the first occurrence of "CROP_YIELD" in the string is for the namespace, the NULL is because it’s defined at the top level, and then the second occurrence of "CROP_YIELD" is the <code_given_name> of CROP_YIELD.)
Loop_plate: Loops themselves are not assigned identifiers within source code, so identifiers will be assigned during program analysis. As described above, the <function_base_name> of the loop_plate function type is “loop” followed by a '$' and an integer starting from 1 that distinguishes the loop from any other loops occurring in the same namespace and scope.
- A single loop within the function CROP_YIELD of the namespace CROP_YIELD has the <identifier_string>:
```
"CROP_YIELD::CROP_YIELD::loop$1"
```
- The third of three loops within the function CROP_YIELD of namespace CROP_YIELD:
```
"CROP_YIELD::CROP_YIELD::loop$3"
```
- A loop nested in the context of the second loop, "loop$2", in the CROP_YIELD function within the CROP_YIELD namespace:
```
"CROP_YIELD::CROP_YIELD.loop$2::loop$1"
```
- An assignment of the variable "CROP_YIELD::_TOP::RAIN" (i.e., the variable "RAIN" was defined in the default "top" level scope, within the namespace CROP_YIELD) within a single loop in the CROP_YIELD function in the CROP_YIELD namespace:
```
"CROP_YIELD::CROP_YIELD.loop$1::assign$CROP_YIELD::_TOP::RAIN"
```
  (Note that the above string is still unambiguous to parse to recover the component pieces of the name: the first two names separated by '::' are the <namespace_path_string> followed by the <scope_path_string>, with the rest being the <function_base_name> of the function, which itself is an "assign" of a variable that itself is a complete <identifier_string>.)
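The parsing argument in the note above can be checked with a short sketch: splitting on only the first two '::' separators leaves the function base name intact, even when it embeds a complete variable identifier string. The function name and the keys of the returned dictionary are hypothetical.

```python
def parse_function_identifier(identifier_string):
    """Split a function identifier string into its component pieces."""
    # Only the first two '::' separate namespace, scope and base name.
    namespace, scope, base_name = identifier_string.split("::", 2)
    # The text before the first '$' is the name prefix (e.g. "assign", "loop").
    prefix, _, rest = base_name.partition("$")
    return {
        "namespace": namespace.split("."),
        "scope": scope.split("."),
        "prefix": prefix,
        "rest": rest,
    }

parsed = parse_function_identifier(
    "CROP_YIELD::CROP_YIELD.loop$1::assign$CROP_YIELD::_TOP::RAIN"
)
print(parsed["scope"])   # ['CROP_YIELD', 'loop$1']
print(parsed["prefix"])  # assign
print(parsed["rest"])    # CROP_YIELD::_TOP::RAIN
```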

Top-level GrFN Specification

The top-level structure of the GrFN specification is the <grfn_spec> and is itself a JSON attribute-value list, with the following schema definition:

<grfn_spec>[attrval] ::=
    "date_created" : <string>
    "source" : list of <source_code_file_path>
    "start": list of <string>
    "identifiers" : list of <identifier_spec>
    "functions" : list of <function_spec>

The "date_created" attribute is a string representing the date+time that the current GrFN was generated (this helps resolve what version of the program analysis code (e.g., for2py) was used).

There may be a single GrFN spec file for multiple source code files.

CHOICE: The issue is that there are some source files that define many identifiers and program elements that may be used in many system program unit components. If we have a single GrFN spec for each "Program", then we will be redundantly reproducing many identifiers and other program element declarations (variables, functions). The alternatives are: (1) Have a single GrFN spec for a given program unit and accept the redundancies. (2) Have a single GrFN spec for each source program file, and develop a method for importing/using GrFN specs that are used by other GrFN specs. (3) Develop an alternative mapping of source code to GrFN representation that allows for single GrFN specs for reused components that could be imported/used by other GrFN files, but still grouping source files by program.

FOR NOW: Go with Option (1): The main target of a GrFN spec file is all of the source code files involved in defining a program.

FUTURE: Add ability for GrFN specs to "import" and/or "use" other GrFN specs of other modules.

FOR NOW: the "source" attribute is a list of one or more <source_code_file_path>s. The <source_code_file_path> identifying a source file is represented the same way as a <namespace_path> (as described above), except that the final name (the name of the file itself) will include the file extension.

It is also the case that there may be multiple "start" points (or none at all) for a given program. For this reason, the "start" attribute is a list of zero or more names of the entry point(s) of the (Fortran) source code (for example, the PROGRAM module). These will be function <identifier_string>s. In the absence of any entry point, this value will be an empty list: [].

The "identifiers" attribute contains a list of <identifier_spec>s, as defined above in the section on Identifiers.

FUTURE: It may also be desirable to add an attribute to represent the program analysis code version used to generate the GrFN (as presumably the program analysis code could evolve and have different properties) -- although "date_created" may be sufficient.

NOTE: variables are not declared at the top-level <grfn_spec>, but will be defined in <function_spec>s, described below.

A (partial) example instance of the JSON generated for a <grfn_spec> of an analyzed file in the path 'crop_system/yield/crop_yield.py' is:

{
    "date_created": "20190127",
    "source": [["crop_system", "yield", "crop_yield.py"]],
    "start": ["MAIN"],
    "identifiers": [... identifier_specs go here...],
    "functions": [... function_specs go here...]
}
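A sketch of how a program analysis tool might assemble and serialize this top-level structure, following the attribute names of the <grfn_spec> schema definition. The helper name `make_grfn_spec` is hypothetical, the date format simply mirrors the "20190127" value in the example, and the empty lists stand in for real <identifier_spec>s and <function_spec>s.

```python
import json
from datetime import datetime

def make_grfn_spec(source_paths, start, identifiers, functions):
    """Assemble the top-level <grfn_spec> attribute-value list as a dict."""
    return {
        "date_created": datetime.now().strftime("%Y%m%d"),
        "source": source_paths,      # list of <source_code_file_path>
        "start": start,              # zero or more entry point names
        "identifiers": identifiers,  # list of <identifier_spec>
        "functions": functions,      # list of <function_spec>
    }

spec = make_grfn_spec(
    source_paths=[["crop_system", "yield", "crop_yield.py"]],
    start=["MAIN"],
    identifiers=[],
    functions=[],
)
print(json.dumps(spec, indent=4))
```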

Variable Specification

<variable_spec>[attrval] ::=
    "name" : <variable_name>
    "domain" : <variable_domain_type>

Variable specifications will be associated with the functions whose scopes contain the variable declarations in the source code. The list of <variable_spec>s should include all variables whose values get updated by computation within the function, and will be derived from variables that are explicitly asserted in source code, such as those used for explicit value assignment or used as loop indices, and other variables that program analysis may introduce (infer) as part of analyzing conditionals. As defined above, the <variable_name> is itself an <identifier_string>.

Some languages (including Fortran and Python) provide mechanisms for making variable declarations private (such as Python’s convention of prepending an underscore to a variable name, and its name mangling of class attributes whose names begin with two underscores).

FOR NOW: Our hypothesis is that simply prepending another underscore (following Python name mangling) will make the "private" variable <base_name> unique from other variable names. Also, program analysis will "capture" the semantics of privacy by ensuring there are no outer-scope references to a private variable, and this will carry through in explicit references (or not, in this case) captured in the GrFN spec.

Variable Value Domain

<variable_domain_type> ::= <string>

The "domain" attribute of a <variable_spec> specifies what values can be assigned to the variable. To start, we will keep things simple and restrict ourselves to four types that can be specified as strings:

"real" (i.e. a floating-point number)
"integer"
"boolean"
"string"

(The idea of the variable domain is intended to be close to the idea of the "support" of a random variable, although should also correspond to standard data types.)

FUTURE: Need to extend to accommodate arrays.

FUTURE:

May also need to accommodate other structures (How far can this go? Unions, composite data structures, classes?).

We see augmenting the domain specification to also allow representing whether there are bounds on the values (e.g., positive integers, or real values in (0,10], etc.). When we move to doing this, the value of "domain" will itself become a new JSON attrval type.

Python is a strongly-typed language, but is also a dynamically typed language. However, that's not to say that there is no type specification in Python. Python 3 now provides nascent support for explicit typing via type hints.

FUTURE: Explore whether/how type hints get represented in the AST. This will matter when we get to adding more explicit variable domain semantics in later model analysis.
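As a starting point for that exploration, the sketch below shows where Python's standard `ast` module exposes type hints: annotated assignments appear as `ast.AnnAssign` nodes, function arguments carry an `annotation` field, and function definitions carry a `returns` field. The sample source and the helper name `annotations` are hypothetical, and only simple name-valued hints (such as `float`) are handled.

```python
import ast

SOURCE = """
max_rain: float = 10.0

def update_est(rain: float, day: int) -> bool:
    return rain > 0.0
"""

def annotations(tree):
    """Collect simple (name-valued) type hints found in the AST."""
    found = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.AnnAssign) and isinstance(node.target, ast.Name):
            if isinstance(node.annotation, ast.Name):
                found[node.target.id] = node.annotation.id
        elif isinstance(node, ast.FunctionDef):
            for arg in node.args.args:
                if isinstance(arg.annotation, ast.Name):
                    found[arg.arg] = arg.annotation.id
            if isinstance(node.returns, ast.Name):
                found[node.name + " (return)"] = node.returns.id
    return found

print(annotations(ast.parse(SOURCE)))
```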

For our purposes in the near term, we do want to capture what type and value-domain information is available; there are two main sources of this information:

Fortran: Does statically specify types. If we also want to capture this in program-analysis-generated code, then there is a question of how to communicate this in the Python source representation; possibly through the new typing mentioned above; possibly as docstrings in program-analysis-generated code.
Docstrings: Possibly types and value ranges can be inferred from what is specified in a docstring.

<variable_spec> examples

Here are three examples of <variable_spec> objects:

Example of a "standard" variable MAX_RAIN within the CROP_YIELD function of the CROP namespace:
```
{
    "name": "CROP::CROP_YIELD::MAX_RAIN",
    "domain": "real"
}
```
Example of loop index variable DAY in the context of the second instance of a loop in the function CROP_YIELD (in the CROP namespace):
```
{
    "name": "CROP::CROP_YIELD.loop$2::DAY",
    "domain": "integer"
}
```
Example of variable introduced (inferred) when analyzing a conditional statement that is within the named function UPDATE_EST of the CROP namespace:
```
{
    "name": "CROP::UPDATE_EST::IF_1",
    "domain": "boolean"
}
```
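Objects like the ones above could be produced by a small constructor that rejects domains outside the four currently supported strings. The helper name `make_variable_spec` is hypothetical; this is a sketch, not a prescribed API.

```python
import json

VARIABLE_DOMAINS = ("real", "integer", "boolean", "string")

def make_variable_spec(name, domain):
    """Build a <variable_spec>, checking the domain against the supported types."""
    if domain not in VARIABLE_DOMAINS:
        raise ValueError("unsupported variable domain: %r" % (domain,))
    return {"name": name, "domain": domain}

print(json.dumps(make_variable_spec("CROP::UPDATE_EST::IF_1", "boolean"), indent=4))
```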

Function Specification

Next we have the <function_spec>. There are five types of functions; three types can be expressed using the same attributes in their JSON attribute-value list (<function_assign_spec>), while the others (<function_container_spec>, <function_loop_plate>) require different attributes. This means there are three specializations of the <function_spec>, one of which (<function_assign_spec>) will be used for three function types.

<function_spec> ::=
    <function_assign_spec>       # type "assign", "condition" or "decision"
    | <function_container_spec>  # type "container"
    | <function_loop_plate>      # type "loop_plate"

All three specs will have a "type" attribute that will unambiguously identify which type of function is being specified. The five possible types are:

"assign"
- "condition" (a special case of "assign")
- "decision" (special case of "assign")
"container"
"loop_plate"

All <function_spec>s will also have a "name" attribute with a unique <identifier_string> (across <function_spec>s), as described above under the Function Naming Conventions section; as described in that section, the function name will include the function type, but having the explicit type attribute makes JSON parsing easier.
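To illustrate why the explicit "type" attribute simplifies parsing, a consumer can select the applicable specialization with a single dictionary lookup instead of parsing the function name. The table and function names below are hypothetical.

```python
SPEC_KIND_BY_TYPE = {
    "assign": "function_assign_spec",
    "condition": "function_assign_spec",
    "decision": "function_assign_spec",
    "container": "function_container_spec",
    "loop_plate": "function_loop_plate",
}

def spec_kind(function_spec):
    """Return which <function_spec> specialization a parsed JSON object uses."""
    function_type = function_spec.get("type")
    if function_type not in SPEC_KIND_BY_TYPE:
        raise ValueError("unknown function type: %r" % (function_type,))
    return SPEC_KIND_BY_TYPE[function_type]

example = {"name": "CROP_YIELD::UPDATE_EST::condition$IF_1", "type": "condition"}
print(spec_kind(example))  # function_assign_spec
```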

Function Assign Specification (assign, condition, decision)

A <function_assign_spec> denotes the setting of the value of a variable. The values are assigned to the "target" variable (denoted by a <variable_reference> or <variable_name>) and the value is determined by the "body" of the assignment, which itself may either be a literal value (specified by <function_assign_body_literal_spec>) or a lambda function (specified by <function_assign_body_lambda_spec>).

<function_assign_spec>[attrval] ::=
    "name" : <function_name>
    "type" : "assign" | "condition" | "decision"
        # note that the value of the "type" is a literal/terminal 
        # value of the grammar
    "sources" : list of [ <function_source_reference> | <variable_name> ]
    "target" : <function_source_reference> | <variable_name>
    "body" : <function_assign_body_literal_spec> 
             | <function_assign_body_lambda_spec>

There are three types of assign functions, distinguished by the value of the attribute "type".

"assign": This represents the general case of assignment of a variable to some value.
"condition": In the special case where program analysis is analyzing a conditional (i.e., "if") statement, then program analysis will infer a new boolean target variable, and the computation of the condition itself will be represented by the assignment function. Semantically, this is nothing more than an assignment of a boolean variable, but conceptually it will be useful to distinguish assignments used for conditions from other assignments.
"decision": Also as part of analyzing a conditional, any variables whose values are updated as a result of the condition outcome must have their values updated. These will be updated by "decision" assignment functions, whose target is the variable being updated, and the computations will involve the state of the conditional variable, the previous state of the variable being updated, and possibly other variable values. Again, semantically this is nothing more than an assignment, but is useful to distinguish from other assignments.

The identifier conventions for assign, condition and decision functions are described above in the section on Function Naming Conventions.

For "sources" and "target": When there is no need to refer to the variable by its relative index, then <variable_name> (itself an <identifier_string>) is sufficient, and index will be assumed to be 0 (if at all relevant). In other cases, the variables will be referenced using the <function_source_reference>, to indicate the return value of the function. There may also be cases where a source is a function, either built-in or user-defined. Both variables and functions will then be referenced using a <function_source_reference>, defined as:

<function_source_reference> ::=
   "name" : [ <variable_name> | <function_name> ]
   "type" : "variable" | "function"

Function assign body Literal

The <function_assign_body_literal_spec> asserts the assignment of a <literal_value> to the target variable. The <literal_value> has a data type (corresponding to one of our four domain types), and the value itself will be represented generically in a string (the string will be parsed to extract the actual value according to its data type).

<function_assign_body_literal_spec>[attrval] ::=
    "type" : "literal"
    "value" : <literal_value>

<literal_value>[attrval] ::=
    "dtype" : "real" | "integer" | "boolean" | "string"
    "value" : <string>

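The parsing step described above can be sketched as follows. This is an illustration only: the helper name parse_literal is hypothetical, and the accepted boolean spellings ("true" and "false", in any letter case) are an assumption, since the specification does not state how boolean values are written.

```python
def parse_literal(literal):
    """Convert a <literal_value> attribute-value list to a Python value."""
    dtype = literal["dtype"]
    raw = literal["value"]
    if dtype == "real":
        return float(raw)
    if dtype == "integer":
        return int(raw)
    if dtype == "boolean":
        # Assumption: booleans are spelled "true" or "false" (any case).
        lowered = raw.strip().lower()
        if lowered not in ("true", "false"):
            raise ValueError(f"not a boolean literal: {raw!r}")
        return lowered == "true"
    if dtype == "string":
        return raw
    raise ValueError(f"unknown dtype: {dtype!r}")

# A <function_assign_body_literal_spec> wraps the <literal_value>.
body = {"type": "literal", "value": {"dtype": "integer", "value": "5"}}
parsed = parse_literal(body["value"])
```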
Function assign body Lambda

When more computation is done to determine the value that is being assigned to the variable in the <function_assign_spec>, then <function_assign_body_lambda_spec> is used.

<function_assign_body_lambda_spec>[attrval] ::=
    "type" : "lambda"
    "name" : <function_name>
    "reference" : <source_code_reference>

FUTURE: Eventually, we can expand this part of the grammar to accommodate a restricted set of arithmetic operations involved in computing the final value (this is now of interest in the World Modelers program and we're interested in supporting this in Delphi).

FOR NOW: The lambda function references the source code that performs the computation, in the translated Python generated by program analysis. Any variables that are involved in the computation must be listed in the "sources" list of variables (<variable_name> references) in the <function_assign_spec>.

As noted above, due to the more semantically rich identifier specification and <identifier_string> representation, it is not straightforward to use the <identifier_string> as the Python symbol in the translated Python generated by program analysis. Instead, function and variable identifiers will be represented in the generated Python using their gensym. For debugging and visualization purposes, the generated Python code may be displayed with <identifier_string> (or some version that is closer to legal Python naming, although in general it does not appear to be possible to create "safe" Python names directly from <identifier_string>s).

Function Container Specification

A <function_container_spec> represents the grouping of a set of variables and how they are updated by functions. Generally the "function container" corresponds to functions (or subroutines) defined in source code. A "function container" is also defined for the "top" level of a source code file.

<function_container_spec>[attrval] ::=
    "name" : <function_name>
    "type" : "container"
    "DOCS" : <string>
    "input" : list of [ <variable_reference> | <variable_name> ]
    "variables" : list of <variable_spec>
    "output" : list of <variable_reference> | <variable_name>
    "body" : list of <function_reference_spec>

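There will be a container function for each source code function. For this reason, we need an "input" variable list (of 0 or more variables) as well as an "output" variable. In Python, a function returns a value only if it executes an explicit return statement with an expression; otherwise it returns None. The following cases, written in Python-like pseudocode, illustrate the Fortran subroutine and function return patterns to be represented.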

Case 1: subroutine

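def foo1_subroutine(x,y):
    x = y

def foo2_subroutine():
    Integer z, y, w
    y = 5
    foo1_subroutine(z,y)
    foo1_subroutine(w,y)

now z = 5 and w = 5 (Fortran passes arguments by reference, so the assignment to x inside foo1_subroutine updates the caller's variable)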

Case 2: Fortran function with simple return

def foo():
    x <-
    return x

def foo2():
    y = foo()

Case 3: Fortran function with return expression

def foo():
    return x+1

becomes...

def foo():
    foo_return1 = x+1
    return foo_return1
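The Case 3 rewrite can be sketched with Python's standard ast module (ast.unparse requires Python 3.9 or later). This is an illustration of the transformation only, not the procedure used by program analysis; the class name ReturnLifter, the helper lift_returns, and the counter-based suffix are assumptions modeled on the foo_return1 name shown above.

```python
import ast

class ReturnLifter(ast.NodeTransformer):
    """Rewrite `return <expr>` as `<name> = <expr>` then `return <name>`."""

    def __init__(self, function_name):
        self.function_name = function_name
        self.counter = 0

    def visit_Return(self, node):
        # Bare returns and returns of a plain variable are left unchanged.
        if node.value is None or isinstance(node.value, ast.Name):
            return node
        self.counter += 1
        name = f"{self.function_name}_return{self.counter}"
        assign = ast.Assign(
            targets=[ast.Name(id=name, ctx=ast.Store())],
            value=node.value,
        )
        new_return = ast.Return(value=ast.Name(id=name, ctx=ast.Load()))
        # Returning a list replaces the single statement with two.
        return [assign, new_return]

def lift_returns(source, function_name):
    """Parse source, lift return expressions, and return new source text."""
    tree = ast.parse(source)
    tree = ReturnLifter(function_name).visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

rewritten = lift_returns("def foo():\n    return x + 1\n", "foo")
```

The sketch leaves a return of a plain variable (Case 2) untouched, since it already names the value being returned.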

Case 4: conditional return statements

def foo(): # Fortran function
    if x:
        return x
    else:
        return y

Function Reference Specification

<function_reference_spec>[attrval] ::=
    "function" : <function_name>
    "input" : list of [ <variable_reference> | <variable_name> ]
    "output" : <variable_reference> | <variable_name>

The <function_reference_spec> defines the "wiring" between functions and their input and output variable(s).
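A minimal sketch of this wiring, assuming every input and output is a plain <variable_name> string (no <variable_reference> objects) and that the body is evaluated in list order. The function names "double" and "add", the variable names, and the helper run_body are all hypothetical.

```python
def run_body(body, functions, values):
    """Apply each function reference in order, reading its inputs from
    and writing its output to the shared `values` mapping."""
    for ref in body:
        fn = functions[ref["function"]]
        args = [values[name] for name in ref["input"]]
        values[ref["output"]] = fn(*args)
    return values

# Hypothetical function table standing in for assign functions.
functions = {
    "double": lambda y: y * 2,
    "add": lambda a, b: a + b,
}

# A "body": list of <function_reference_spec> entries.
body = [
    {"function": "double", "input": ["y"], "output": "t"},
    {"function": "add", "input": ["t", "y"], "output": "x"},
]

result = run_body(body, functions, {"y": 5})
```

Each entry names the function to apply and the variables it reads and writes, so the list alone is enough to recover the data flow between functions.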

Function Loop Plate Specification

<function_loop_plate>[attrval] ::=
    "name" : <function_name>
    "type" : "loop_plate"
    "input" : list of <variable_name>
    "index_variable" : <variable_name>
    "index_iteration_range" : <index_range>
    "condition" : <loop_condition>
    "body" : list of <function_reference_spec>

The "input" list of <variable_name> objects should list all variables that are set in the scope outside of the loop_plate.

The current loop_plate specification is aimed at handling for-loops; it assumes that "index_variable" and "index_iteration_range" are specified.

FUTURE: Generalize to do-while loops by relying only on the "condition" <loop_condition> to determine when the loop completes. We could then remove "index_variable" and "index_iteration_range", although there will still need to be a mechanism for identifying index variable(s).

The "index_variable" is the named variable that stores the iteration state of the loop; the naming convention of this variable is described above, in the Variable naming convention section. The only new element introduced is the <index_range>:

<index_range>[attrval] ::=
    "start" : <integer> | <variable_reference> | <variable_name>
    "end" : <integer> | <variable_reference> | <variable_name>

This definition permits loop iteration bounds to be specified either as literal integers, or as the values of variables.
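As a sketch of how the two kinds of bound might be resolved, the helpers below accept either a literal integer or a <variable_name> string looked up in a mapping of current values. Two assumptions are made that the specification does not state: the "end" bound is treated as exclusive (as in Python's range), and <variable_reference> objects are not handled. The helper names are hypothetical.

```python
def resolve_bound(bound, values):
    """Return the integer for a bound given as a literal or variable name."""
    if isinstance(bound, int) and not isinstance(bound, bool):
        return bound
    if isinstance(bound, str):
        return values[bound]
    raise TypeError(f"unsupported bound: {bound!r}")

def iteration_indices(index_range, values):
    """Return the index values described by an <index_range>."""
    start = resolve_bound(index_range["start"], values)
    end = resolve_bound(index_range["end"], values)
    # Assumption: "end" is exclusive; the specification does not say
    # whether the upper bound is inclusive.
    return range(start, end)

# "start" is a literal integer, "end" is the value of variable "n".
indices = list(iteration_indices({"start": 0, "end": "n"}, {"n": 3}))
```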
