Code generation in DRAKON Editor

Note: starting from version 1.15 DRAKON Editor uses a new code generation algorithm that is not based on search.

DRAKON Editor generates source code from diagrams for a number of programming languages. As of version 1.11, the following languages are supported: C, C++, Java, C#, Python, Tcl, Javascript and Erlang. Since these languages have a lot in common, most of the generation is done by the same algorithm. The language-specific routines mostly take care of the differences in syntax. Code generation has four steps:

Build the graph of the algorithm from the items on the diagram.
Process the graph.
Generate the tree for the output source file.
Print the tree using the syntax of the selected programming language.

Building the Graph of the Algorithm

A diagram in DRAKON Editor is composed of geometrical shapes of two types: icons and line segments. The shapes are not connected to each other and the first task of the code generator is to glue them into a network:

The icons become the nodes of the graph and the line segments become its edges. Some additional nodes are created at line intersections and joints and also at angles.
Every pair of line segments that fully or partially overlap is turned into one big segment.
Some errors are detected at this time, for example dangling line segments (those that have at least one free end) and overlapping icons. The graph is also checked for presence of isolated subgraphs.

Once the graph of geometric shapes is constructed, an abstract graph of the algorithm is extracted from it. The graph of the algorithm is directed while the graph of geometrical shapes is not.

First, we check that the rules of DRAKON language are duly followed: the graph is repeatedly scanned and every configuration of icons and lines which is not allowed gets reported as an error.
After that, consecutive "Action" icons are merged.
Branch cycles and "for" cycles are transformed into normal cycles (those with an arrow pointing up).
Each "Switch" icon is replaced with a series of "If" icons.
Patterns of logical expression are found and simplified.

Simplifying a logical expression:

At this point, the graph has only two kinds of nodes: "If" nodes and "Action" nodes. Each node can have many incoming edges. Each "Action" node has one outgoing edge, each "If" node has two outgoing edges.

What happens next depends on the programming language which we are emitting code for. C and C++ have the famous goto statement which makes code generation pretty straightforward. The languages that don't have goto require more effort.

Generating Code for C and C++

Code generation for a language that has the goto construct is easy: each node of the algorithm graph receives a label and each edge turns into a goto. Some optimizations are made to reduce the number of gotos:

The graph is sorted in the order of execution.
gotos are not inserted between an "If" icon and its immediate "Action" children if there are no other incoming edges into the children.

A special action is taken to ensure that the first icon in the diagram becomes the first line of the generated code. This is necessary for placing the variable declarations at the beginning of the function body. Although C++ does not strictly require that all the variables are declared in the beginning of the function, putting some of the variables there can sometimes be necessary.

Generating Code for the Languages That Don't Have goto

DRAKON Editor employs search to generate source code for the languages that do not have goto. Generation is done in two steps. First, the graph is prepared for the search procedure. Then the search procedure attempts to find the code tree that implements the algorithm specified by the diagram. If the search fails, a fallback routine builds the code tree by emulating goto.

Preparing the Algorithm Graph for Search

The graph is scanned for cycles. If an edge points to a node that has already been visited on the way to this edge, this edge is considered "backward". The edges that "go back" receive a special mark. After this stage the graph does not have cycles if the edges with the "backward" mark are not counted.

Then technical nodes are added to the graph:

The start node inserted before the beginning of the graph.
The end node inserted after the end of the graph.
A special "loop" node is inserted before the first node of each cycle. The first node of a cycle is a node that at least one "backward" edge points to.

Afterwards, branching in the algorithm is normalized by bringing the "If" icons and their outcomes to the standard "If-Then-Else" form.

Normalizing “If” branching:
Normalizing If branching

If necessary, nodes and edges are copied. At this stage, the algorithm could be directly mapped to a nested tree of "If-Then-Else" constructs if the "backward" edges are ignored.

Search

If there were no loops we would not need to do search. But they are there!

The overall goal of the search is transforming loops into this form:

while true: 
    do stuff 
    if must exit: 
        break

Each loop should turn into an infinite "while" construct that has one or more "break" statements as the exits. The "while" construct can also have "continue" statements.

The search transforms the algorithm graph into the code tree. The code tree is built of two kinds of constructs: "if" and "while". It is easy to determine where to insert "if", "while true" and "continue" statements:

The "If" nodes become "if" constructs.
The "loop" nodes become "while true" constructs.
The "backward" edges become "continue" constructs.

Therefore, the specific goal of the search is to find the places where to put "break" statements.

The search has the following properties:

The search space is the set of all possible code trees.
The starting state of the search is an empty tree.
During the search, nodes are removed from the algorithm graph and added to the code tree.
The search has completed when all the nodes have been removed from the graph and added to the tree.
The search has completed successfully when the result tree does not have unfinished "if" and "while" nodes. An "if" or "while" node is considered "finished" when all paths under this node lead to another single node.

The fundamental question that is asked for each node during the search is: "is it possible to put a break here?".

The branching of the search (finding the next states that are reachable from the current one) is done by:

selecting the next node from the graph in different order;
adding or not adding a "break" statement.

Depth-first search proved to fit this task well while A* has performed poorly.

This algorithm fails to find a solution for diagrams that contain certain structures, for example:

An exit from a nested "for" loop.
A loop that has more than one entry into it. By the way, such loops are not allowed by the rules of DRAKON language.
A loop that contains a jump back into the middle of the loop body.
It is practical to have a limit on the number of iterations in this algorithm, because it blows up when the iteration count reaches a certain value. The tests have shown that if the algorithm cannot find the solution after 500 steps, it will not find it after 5000. Therefore, this limit is rarely reached for diagrams that do not have forbidden structures.

goto Emulation

The search fails once in a while and that is why we must have an emergency exit. There is a pretty straightforward way to emulate goto in a programming language that does not have this construct:

An integer variable called "nextItem" is declared early in the body of the procedure.
The whole algorithm is wrapped in one big "while" statement that loops forever.
This infinite loop contains a list of "if" statements.
Each node of the algorithm graph is assigned a number and placed in one of the "if" statements. The condition in the "if" statement compares the "nextItem" variable to the number of the node.
The initial value of the "nextItem" variable is the number of the first node in the algorithm.
The jump to the next node is performed by assigning the number of the next node to the "nextItem" variable. If the code block corresponding to the next node is located before the current code block, a "continue" statement is inserted.

Here is what the Fibonacci algorithm looks like with goto emulation:

def fibonacci(n):
    _next_item_ = 153
    while True: 
        if _next_item_ == 153: 
            _sw_153 = n 
            if _sw_153 == 0: 
                result = [0] 
                _next_item_ = 168
            elif _sw_153 == 1: 
                result = [0, 1] 
                _next_item_ = 168
            else:
                result = [0, 1] 
                i = 2 
                _next_item_ = 1630002
        if _next_item_ == 1630002: 
            if i <= n: 
                f2 = result[i - 2] 
                f1 = result[i - 1] 
                fib = f1 + f2 
                result.append(fib) 
                i += 1 
                _next_item_ = 1630002
                continue
            else: 
                _next_item_ = 168
        if _next_item_ == 168: 
            return result

The goto emulation has a performance cost: the code becomes approximately two times slower.

This goto emulation algorithm was borrowed from somewhere on the internet but unfortunately the exact URL is lost.

Conclusions

The absence of goto in programming languages makes code generation for them really hard. But at least we know that the intentions of language designers were good (for the most part).
Having a single routine for code generation for several programming languages (including a functional one!) makes us jump to a conclusion that all popular languages are conceptually the same. The differences between them are mostly about the types of trouble they cause to the programmer.
The most difficult thing in code generation is not the generation itself but testing. How much must we test the generator before we believe that the generated code reliably makes sense?