GrFN, pronounced “Griffin”, is the specification format for the central representation that integrates the extracted Function Network representation of source code (the result of program analysis) and associated extracted comments, links to natural language text (the result of natural language processing), and links to equations (the result of equation extraction).
This document describes the current state of the GrFN-JSON schema, specifying the JSON format that is to be generated by program analysis and consumed by Delphi.
Here we adopt a simplified Backus-Naur Form (BNF)-inspired grammar
convention combined with a convention for intuitively defining
specific JSON attribute-value lists. The schema definitions and instance
GrFN examples are shown in monospaced font, and interspersed with
comments/discussion.
Following BNF convention, elements in <...> denote non-terminals, with
::= indicating a definition of how a non-terminal is expanded. Many of
the definitions below will specify JSON attribute-value lists; when this
is the case, we will decorate the element definition by adding
[attrval], as follows::
<element_name>[attrval] ::=
We will then specify the structure of the JSON attribute-value list attributes (quoted strings) and their value types using a mixture of JSON and BNF. In a few places, we note in the comments anticipated extensions that may be needed using the tag 'FUTURE'.
The challenge of this project is to define a map from the semantics of program (computation) specification (as asserted in source code) to the semantics of a (discretized) dynamic system model. We must take care to define technical terms (an ongoing effort!) and to highlight which concept domain (general computation versus dynamic system model) we are dealing with.
We assume here that the source code is intended to describe the states of a dynamical system and how they evolve over time. The system is decomposed into a set of individual states (represented as random variables), where the values of the states at any given time are functions of the values of zero or more other states at a previous time point. As we are considering the evolution of the system over time, in general, every variable has an index. The functional relationships may be instantaneous (based on the variables indexed at the same point in time) or across time (a function of states of variables at different time indices).
Both variables and functions will be uniquely named strings that do not
rely on implicit position within the source code to be identified (one
reason for requiring unique names is that we are moving from the
semantics of program variables as pointers to storage to a
representation of variables as denoting the evolving state of a system).
An important observation: the same variable name in source code could
itself be used in two separate variable declarations, and so would
constitute two different variable identities; source code name alone is
not sufficient to identify variable identity. For this reason, we must
adopt a set of conventions for capturing any such source code context.
These conventions will be assumed to be associated with the following
<variable_name> and <function_name> definitions. In both cases, the
names should also correspond to legal Python variable and function names
that could appear in Python code -- so they must begin with a letter or
underscore followed by letters, numbers or underscores.
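The legality rule above can be checked mechanically. The following is a minimal sketch, not part of the specification: the helper name `is_legal_name` is hypothetical, the pattern is restricted to ASCII, and rejecting Python keywords is an extra assumption that the text does not state explicitly.

```python
import keyword
import re

# A letter or underscore, followed by letters, digits, or underscores.
# Restricting to ASCII is an assumption of this sketch.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_legal_name(name: str) -> bool:
    """Return True if `name` could serve as a Python variable or function name."""
    # Rejecting keywords such as "lambda" is an assumption added here.
    return bool(_NAME_PATTERN.fullmatch(name)) and not keyword.iskeyword(name)
```

For instance, `is_legal_name("UPDATE_EST__condition___IF_1")` holds, while a name containing a hyphen or starting with a digit is rejected.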
NOTE: For now (as of 2018-07-09), we will NOT be using the <enclosing_context> aspect of the Variable naming convention; the Function naming convention, however, will be used. This means that my examples of variable names below should also be read ignoring the <enclosing_context>:
<variable_name> ::= <string>
A top level source variable named ABSORPTION would then simply have the
<source_variable_name> as the string making up the <variable_name>::
"ABSORPTION"
If there are two separate instances of new variable declarations in the same context using the same name, then we’ll add an underscore and number to distinguish them. For example, if ABSORPTION is defined twice, then the first (in order in the source code) is named:
"ABSORPTION_1"
And the second:
"ABSORPTION_2"
<variable_reference>[attrval] ::=
"variable" : <variable_name>
"index" : <integer>
In addition to capturing source code variable environment context in variable
declarations, we also need a mechanism to disambiguate specific instances of the
same variable within the same context to accurately capture the logical order of
variable value updates. In this case, we consider this as a repeated reference
to the same variable. The semantics of repeated reference is captured by the
variable "index" attribute of a <variable_reference>. The index integer
serves to disambiguate the execution order of variable state references, as
determined during program analysis.
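As an illustration of how indices might be assigned, the sketch below walks a sequence of variable updates in execution order and emits one <variable_reference> per update. The helper name `index_updates` is hypothetical, and starting at index 0 and incrementing by one per update is an assumption of this sketch rather than something the schema fixes.

```python
from collections import defaultdict


def index_updates(updated_names):
    """Build a <variable_reference> dict for each update, in execution order."""
    next_index = defaultdict(int)  # next unused index per variable name
    references = []
    for name in updated_names:
        references.append({"variable": name, "index": next_index[name]})
        next_index[name] += 1
    return references
```

Two successive updates of the same variable then produce references that differ only in their "index" value.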
Function names, like variable names, are ultimately strings, and will also follow a conventional structure used to capture context information. Also like variable names, they should be legal Python function names that could show up in working Python code (as will be the case when used in Lambda function references; see below). The general string format is::
<enclosing_context>__<function_type>[___<var_affected>]
Similar to variable naming, we need to use the function name string to
uniquely identify the function, and as some functions extracted by
program analysis will be expressions defined within other functions
(loops and conditions), we need to capture the "context" in which the
function is defined. The <enclosing_context> represents the source
code environment context within which the function is defined. When a
function is declared at the "top level" (as will often be the case for
container functions), the <enclosing_context> is just the "top
level" and so is empty. Assigns, conditions and loop_plates, however,
will often be declared within another source code function. In those
cases, the <enclosing_context> captures the enclosing function name.
For example, for an assignment within the UPDATE_EST function, the
<enclosing_context> is "UPDATE_EST".
Next, the <function_type> is the string representing which of the four
types the function belongs to (the types are described in more detail
below): "assign", "condition", "container", "loop_plate". In
the case of a loop_plate, we will name the specific loop using the
generic name "loop", and include a number if there is more than one
loop.
Finally, <var_affected> will only be relevant for assign and condition
function types, and the name of the variable affected will be added
after the <function_type> and 3 underscores. For example, a condition
occurring within the function UPDATE_EST and setting the (inferred)
boolean variable IF_1 would have the name
"UPDATE_EST__condition___IF_1".
Here are example function names for each function type:
Assign: An assignment of the variable UPDATE_EST__YIELD_EST in the context of function UPDATE_EST::
UPDATE_EST__assign___UPDATE_EST__YIELD_EST
Condition: A condition within the function UPDATE_EST assigning the (inferred) boolean variable IF_1::
UPDATE_EST__condition___IF_1
Container: A container function called CROP_YIELD:
CROP_YIELD__container
Loop_plate:
A single loop within the function CROP_YIELD::
CROP_YIELD__loop
The third of three loops within the function CROP_YIELD::
CROP_YIELD__loop_3
A loop nested in the context of another loop in CROP_YIELD::
CROP_YIELD__loop_1__loop_2
An assignment within a single loop in CROP_YIELD::
CROP_YIELD__loop__assign___CROP_YIELD__RAIN
NOTE: There is some redundancy in the above examples between the
<enclosing_context> of the name of the function and the
<enclosing_context> of the name of the variable; however, we think that
both are ultimately needed.
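The naming convention can be summarized as a small name builder. This is a sketch under stated assumptions: the helper `make_function_name` and its `loop_number` parameter are hypothetical, and the caller is assumed to pass a non-empty <enclosing_context> (for a nested loop, the context is the name of the enclosing loop).

```python
def make_function_name(enclosing_context, function_type,
                       var_affected=None, loop_number=None):
    """Compose a function name string following the convention above."""
    if function_type in ("assign", "condition"):
        if var_affected is None:
            raise ValueError("assign/condition names need the affected variable")
        # Two underscores before the type, three before the variable.
        return f"{enclosing_context}__{function_type}___{var_affected}"
    if function_type == "container":
        return f"{enclosing_context}__container"
    if function_type == "loop_plate":
        # Loops use the generic name "loop", numbered when there are several.
        loop = "loop" if loop_number is None else f"loop_{loop_number}"
        return f"{enclosing_context}__{loop}"
    raise ValueError(f"unknown function type: {function_type!r}")
```

For example, `make_function_name("CROP_YIELD__loop_1", "loop_plate", loop_number=2)` reproduces the nested-loop name `CROP_YIELD__loop_1__loop_2` shown above.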
The top-level structure of the GrFN is the <grfn_spec> and is itself a
JSON attribute-value list, with the following schema definition::
<grfn_spec>[attrval] ::=
"start": <string>
"name" : <string>
"dateCreated" : <string>
"functions" : list of <function_spec>
The "start" attribute holds the name of the entry point of the (Fortran) source code i.e. the PROGRAM module. In the absence of this module, this string will remain empty. The "name" attribute is used to denote the (Fortran) source code that has been analyzed. The "dateCreated" attribute is a string representing the date+time that the current GrFN was generated (to represent versioning).
FUTURE:
A (partial) example instance of a JSON attribute-value list generated
following the <grfn_spec>:
{
"start": "MAIN",
"name": "crop_yield.py",
"dateCreated": "20180623",
"functions": [... function_specs go here...]
}
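A <grfn_spec> instance like the one above could be assembled and serialized with the standard `json` module. The sketch below is illustrative only: `make_grfn_spec` is a hypothetical helper, and the `%Y%m%d` date format is an assumption inferred from the example value "20180623".

```python
import json
from datetime import datetime


def make_grfn_spec(name, functions, start="", created=None):
    """Assemble the top-level <grfn_spec> attribute-value list as a dict."""
    created = created or datetime.now()
    return {
        "start": start,  # empty string when there is no PROGRAM module
        "name": name,
        "dateCreated": created.strftime("%Y%m%d"),  # assumed format
        "functions": list(functions),
    }


spec = make_grfn_spec("crop_yield.py", [], start="MAIN",
                      created=datetime(2018, 6, 23))
text = json.dumps(spec, indent=2)
```

Serializing through `json.dumps` also guards against hand-written mistakes such as a missing comma between attributes.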
<variable_spec>[attrval] ::=
"name" : <variable_name>
"domain" : <variable_domain_type>
The purpose of the list of <variable_spec>'s in the <grfn_spec>
"variables" attribute value is to list all of the variables defined
within the code we are analyzing, and associate each with their domain
type. This list should include all variables whose values get updated by
computation, and will be derived from variables that are explicitly
asserted in source code, such as those used for explicit value
assignment or used as loop indices, and other variables that program
analysis may introduce (infer) as part of analyzing conditionals.
<variable_domain_type> ::= <string>
The "domain" attribute of a <variable_spec> specifies what values
the variable can be assigned to. To start, we will keep things simple
and restrict ourselves to four types that can be specified as strings:
"real", "integer", "boolean" and "string".
(The idea of the variable domain is intended to be close to the idea of the "support" of a random variable, although it should also correspond to standard data types.)
TODO: Need to extend to accommodate arrays.
FUTURE:
Python is a strongly-typed language, but is also a dynamically typed language. However, that's not to say that there is no type specification in Python. Python 3 now provides nascent support for explicit typing via type hints.
TODO: Explore whether/how this gets represented in the AST.
For our purposes in the near term, we do want to capture what type and value-domain information is available; there are two main sources of this information:
Here are three examples of <variable_spec> objects:
Example of a "standard" variable MAX_RAIN within the CROP_YIELD function:
{
"name": "CROP_YIELD__MAX_RAIN",
"domain": "real"
}
Example of loop index variable DAY in the context of the second instance of a loop in the function CROP_YIELD:
{
"name": "CROP_YIELD__LOOP_2__DAY",
"domain": "integer"
}
Example of variable introduced (inferred) when analyzing a conditional statement that is within the named function UPDATE_EST:
{
"name": "IF_1",
"domain": "boolean"
}
Note that we do not include the <enclosing_context> of the UPDATE_EST
function in this case, as this is an inferred conditional boolean
variable (per our naming convention, described above).
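A consumer of the GrFN might sanity-check each <variable_spec> before use. The following is a minimal sketch: `check_variable_spec` is a hypothetical helper, and the set of allowed domain strings is assumed to match the four "dtype" values listed for <literal_value>.

```python
# Assumed to be the four domain types referred to in the text.
ALLOWED_DOMAINS = {"real", "integer", "boolean", "string"}


def check_variable_spec(spec):
    """Return a list of problems found in a <variable_spec> dict."""
    problems = []
    for key in ("name", "domain"):
        if key not in spec:
            problems.append(f"missing attribute: {key}")
    if "domain" in spec and spec["domain"] not in ALLOWED_DOMAINS:
        problems.append(f"unknown domain: {spec['domain']}")
    return problems
```

An empty result means the spec carries both attributes and a recognized domain; the three examples above all pass this check.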
Next we have the <function_spec>. There are four types of functions;
two types can be expressed using the same attributes in their JSON
attribute-value list (<function_assign_spec>), while the others
(<function_container_spec>, <function_loop_plate>) require different
attributes. So this means there are three specializations of the
<function_spec>, one of which (<function_assign_spec>) will be used
for two function types::
<function_spec> ::=
<function_assign_spec> # either type "assign" or "condition"
| <function_container_spec> # type "container"
| <function_loop_plate> # type "loop_plate"
All three specs will have a "type" attribute that will unambiguously identify which type of function is being specified. The four possible types are "assign", "condition", "container" and "loop_plate".
All <function_spec>s will also have a name attribute with a unique string value (across <function_spec>s), as described above under the Function naming convention section; as described in that section, the function name will include the function type, but having the explicit type attribute makes parsing easier.
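The explicit "type" attribute lets a parser select the right specialization without inspecting the name string. A minimal sketch of that dispatch follows; the helper `spec_kind` and the lookup table are hypothetical, not part of the schema.

```python
# Which <function_spec> specialization each "type" value selects.
SPEC_FOR_TYPE = {
    "assign": "function_assign_spec",
    "condition": "function_assign_spec",
    "container": "function_container_spec",
    "loop_plate": "function_loop_plate",
}


def spec_kind(function_spec):
    """Return the name of the specialization a <function_spec> dict uses."""
    kind = SPEC_FOR_TYPE.get(function_spec.get("type"))
    if kind is None:
        raise ValueError(f"unknown or missing type: {function_spec.get('type')!r}")
    return kind
```

Both "assign" and "condition" map to the same specialization, mirroring the grammar above.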
A <function_assign_spec> denotes the setting of the value of a
variable. The values are assigned to the "target" variable (denoted by
a <variable_reference> or <variable_name>) and the value is
determined by the "body" of the assignment, which itself may either be
a literal value (specified by <function_assign_body_literal_spec>) or
a lambda function (specified by <function_assign_body_lambda_spec>)::
<function_assign_spec>[attrval] ::=
"name" : <function_name>
"type" : "assign" | "condition" # note that either is a literal/terminal value
# of the grammar
"sources" : list of [ <function_source_reference> | <variable_name> ]
"target" : <function_source_reference> | <variable_name>
"body" : <function_assign_body_literal_spec>
| <function_assign_body_lambda_spec>
In the general case of variable assignment/setting, the attribute type
should be "assign". In the special case where we are representing the
assignment of a boolean value as the result of a condition
(if-statement), then program analysis will infer a new boolean target
variable, and the computation of the condition itself will be
represented by the assignment function; in this case, we will use the
more specific "condition" value for the "type" attribute of the
<function_assign_spec>. Semantically, this is nothing more than an
assignment of a boolean variable, but conceptually it will be useful to
distinguish assignments used for conditions from other assignments.
For "sources" and "target": when there is no need to refer to the
variable by its relative index, then <variable_name> is sufficient,
and index will be assumed to be 0 (if at all relevant). In other cases,
the variables will be referenced using the
<function_source_reference>. There may also be cases where the sources
can be a function, either built-in or user-defined. These two will be
referenced using <function_source_reference>, defined as::
<function_source_reference> ::=
"name" : [ <variable_name> | <function_name> ]
"type" : "variable" | "function"
The <function_assign_body_literal_spec> asserts the assignment of a
<literal_value> to the target variable. The <literal_value> has a
data type (corresponding to one of our four domain types), and the value
itself will be represented generically in a string (the string will be
parsed to extract the actual value according to its data type)::
<function_assign_body_literal_spec>[attrval] ::=
"type" : "literal"
"value" : <literal_value>
<literal_value>[attrval] ::=
"dtype" : "real" | "integer" | "boolean" | "string"
"value" : <string>
When more computation is done to determine the value that is being
assigned to the variable in the <function_assign_spec>, then
<function_assign_body_lambda_spec> is used::
<function_assign_body_lambda_spec>[attrval] ::=
"type" : "lambda"
"name" : <function_name>
"reference" : <source_code_reference>
Eventually, we can expand this part of the grammar to accommodate a restricted set of arithmetic operations involved in computing the final value (this is now of interest in the World Modelers program and we're interested in supporting this in Delphi). But for now, we will start by having the lambda function reference the source code that does the computation, in the translated Python generated by program analysis. Any variables that are involved in the computation must be listed in the "sources" list of variables (<variable_name> references) in the <function_assign_spec>::
<source_code_reference> ::= <string>
To start, the <source_code_reference> string could just be the line
number or a tuple denoting the range of line numbers over which the
Python source code for the corresponding operations is defined.
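To make the pieces of a <function_assign_spec> concrete, here is a sketch of one instance with a lambda body, together with a small structural check. Everything in the instance is illustrative: the source variable `UPDATE_EST__TOTAL_RAIN`, the lambda name, and the line-number reference "12" are hypothetical values, and `check_assign_spec` is a hypothetical helper.

```python
lambda_assign = {
    "name": "UPDATE_EST__assign___UPDATE_EST__YIELD_EST",
    "type": "assign",
    "sources": ["UPDATE_EST__TOTAL_RAIN"],  # hypothetical source variable
    "target": "UPDATE_EST__YIELD_EST",
    "body": {
        "type": "lambda",
        "name": "UPDATE_EST__lambda___UPDATE_EST__YIELD_EST",  # hypothetical
        "reference": "12",  # hypothetical line number in the translated Python
    },
}


def check_assign_spec(spec):
    """Return a list of structural problems in a <function_assign_spec> dict."""
    problems = []
    if spec.get("type") not in ("assign", "condition"):
        problems.append("type must be 'assign' or 'condition'")
    for key in ("name", "sources", "target", "body"):
        if key not in spec:
            problems.append(f"missing attribute: {key}")
    body = spec.get("body", {})
    required = {"literal": ("value",), "lambda": ("name", "reference")}
    if body.get("type") not in required:
        problems.append("body type must be 'literal' or 'lambda'")
    else:
        for key in required[body["type"]]:
            if key not in body:
                problems.append(f"body is missing attribute: {key}")
    return problems
```

The check only looks at which attributes are present; it does not verify that the referenced source lines exist.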
Handles representation of simple binary condition block::
If condition_variable:
Condition1 variable_reference
Else
A <function_container_spec> is the generic, "top level" way to
specify how a set of variables that are related by functions are "wired
up" by those functions. (I previously referred to this as the "top",
but here I'm renaming it a "container" as that's more descriptive of
how it functions.)::
<function_container_spec>[attrval] ::=
"name" : <function_name>
"type" : "container"
"DOCS" : <STRING>
"input" : list of [ <variable_reference> | <variable_name> ]
"variables" : list of <variable_spec>
"output" : list of <variable_reference> | <variable_name>
"body" : list of <function_reference_spec>
Case 1: subroutine
def foo1_subroutine(x, y):
    x = y
def foo2_subroutine():
    Integer z, y, w
    y = 5
    foo1_subroutine(z, y)
    foo1_subroutine(w, y)
    # now z = 5 and w = 5
Case 2: fortran function with simple return
def foo():
    x <-
    return x
def foo2():
    y = foo()
Case 3: fortran function with return expression
def foo():
    return x+1
becomes...
def foo():
    foo_return1 = x+1
    return foo_return1
Case 4: conditional return statements
def foo(): # fortran function
    if(x):
        return x
    else:
        return y
There will be a container function for each source code function. For this reason, we need an "input" variable list (of 0 or more variables) as well as an "output" variable. In Python, a function only returns a value if there is an explicit return expression. Otherwise it returns None.
TODO: Can there be nested functions in Fortran?
<function_reference_spec>[attrval] ::=
"function" : <function_name>
"input" : list of [ <variable_reference> | <variable_name> ]
"output" : <variable_reference> | <variable_name>
The <function_reference_spec> defines the "wiring" between functions
and their input and output variable(s).
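One way to read the "wiring" is as a list of directed edges, from each input variable to the function and from the function to its output variable. The sketch below extracts those edges from a "body" list; the helpers `variable_of` and `wiring_edges` are hypothetical, and the example names are illustrative.

```python
def variable_of(ref):
    """Return the variable name from a <variable_name> or <variable_reference>."""
    return ref if isinstance(ref, str) else ref["variable"]


def wiring_edges(body):
    """List (source, destination) edges implied by <function_reference_spec>s."""
    edges = []
    for function_ref in body:
        function = function_ref["function"]
        for source in function_ref["input"]:
            edges.append((variable_of(source), function))
        edges.append((function, variable_of(function_ref["output"])))
    return edges
```

This sketch drops the "index" of each reference; a fuller version would likely keep it so that successive states of the same variable become distinct nodes.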
<function_loop_plate>[attrval] ::=
"name" : <function_name>
"type" : "loop_plate"
"input" : list of <variable_name>
"index_variable" : <variable_name>
"index_iteration_range" : <index_range>
"body" : list of <function_reference_spec>
The "input" list of <variable_name> objects should list all
variables that are set in the scope outside of the loop_plate.
The "index_variable" is the named variable that stores the iteration
state of the loop; the naming convention of this variable is described
above, in the Variable naming convention section. The only new element
introduced is the <index_range>::
<index_range>[attrval] ::=
"start" : <integer> | <variable_reference> | <variable_name>
"end" : <integer> | <variable_reference> | <variable_name>
This definition permits loop iteration bounds to be specified either as literal integers, or as the values of variables.
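Resolving an <index_range> therefore requires looking up variable values when a bound is not a literal. The sketch below assumes the current variable values are available in a plain dict keyed by variable name; the helpers `resolve_bound` and `resolve_index_range` and the variable `CROP_YIELD__MAX_DAY` are hypothetical.

```python
def resolve_bound(bound, values):
    """Resolve one loop bound to an integer.

    `bound` is a literal int, a <variable_name> string, or a
    <variable_reference> dict; `values` maps variable names to values.
    """
    if isinstance(bound, int):
        return bound
    name = bound if isinstance(bound, str) else bound["variable"]
    return int(values[name])


def resolve_index_range(index_range, values):
    """Return the (start, end) integers of an <index_range> dict."""
    return (resolve_bound(index_range["start"], values),
            resolve_bound(index_range["end"], values))
```

Whether "end" is inclusive is not stated in the schema, so this sketch only returns the two bounds and leaves the iteration semantics to the consumer.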
TODO: we think Fortran is restricted to integer values for iteration variables, which would include iteration over indexes into arrays. Need to double check this.