AST rewriting

Can we compare abstract syntax trees to verify refactor correctness?

In the previous part of this series, we introduced the following 3 examples which have equivalent behaviour, and found that type checking alone is insufficient to verify the correctness of rewrites.

def guarded_handler(event: dict[str, str]) -> int | None:
    if (payload := event.get("payload")) is None:
        return None
    return len(payload)


def ternary_handler(event: dict) -> int | None:
    return None if (payload := event.get("payload")) is None else len(payload)


def handler(event: dict) -> int | None:
    payload = event.get("payload")
    if payload is None:
        output = None
    else:
        output = len(payload)
    return output

The 2nd and 3rd (ternary_handler and handler) are more alike than the 1st, as they both have a single return value, so it makes sense for us to try to verify them first.

import ast
from enum import Enum
from pathlib import Path

class Handlers(Enum):
    guarded = "guarded_handler.py"
    ternary = "ternary_handler.py"
    linear = "handler.py"

all_handler_programs = [ast.parse(Path(h.value).read_text()) for h in Handlers]
all_handler_functions = [tree.body[1] for tree in all_handler_programs]
assert all("handler" in f.name for f in all_handler_functions)

Comparing two Abstract Syntax Trees

I found some code to dict-ify the AST on StackOverflow (ironically using early returns itself), which I updated for Python 3, to pretty-print the ASTs for review:

def simplify(node):
    """
    Modified version of AST pprinter https://stackoverflow.com/a/19598419/
    """
    ignored_node_attributes = (
        "lineno end_lineno col_offset end_col_offset ctx type_comment keywords"
    )
    if isinstance(node, ast.AST):
        res = vars(node).copy()
        for k in ignored_node_attributes.split():
            res.pop(k, None)
        for k, v in res.items():
            res[k] = simplify(v)
        res["__type__"] = type(node).__name__
        return res
    elif isinstance(node, list):
        return list(map(simplify, node))
    else:
        return node

For the ternary handler it gives:

[{'value': {'test': {'left': {'target': {'id': 'payload', '__type__': 'Name'},
                              'value': {'func': {'value': {'id': 'event',
                                                           '__type__': 'Name'},
                                                 'attr': 'get',
                                                 '__type__': 'Attribute'},
                                        'args': [{'value': 'payload',
                                                  'kind': None,
                                                  '__type__': 'Constant'}],
                                        '__type__': 'Call'},
                              '__type__': 'NamedExpr'},
                     'ops': [{'__type__': 'Is'}],
                     'comparators': [{'value': None,
                                      'kind': None,
                                      '__type__': 'Constant'}],
                     '__type__': 'Compare'},
            'body': {'value': None, 'kind': None, '__type__': 'Constant'},
            'orelse': {'func': {'id': 'len', '__type__': 'Name'},
                       'args': [{'id': 'payload', '__type__': 'Name'}],
                       '__type__': 'Call'},
            '__type__': 'IfExp'},
  '__type__': 'Return'}]

And for the linear handler it gives:

[{'targets': [{'id': 'payload', '__type__': 'Name'}],
  'value': {'func': {'value': {'id': 'event', '__type__': 'Name'},
                     'attr': 'get',
                     '__type__': 'Attribute'},
            'args': [{'value': 'payload',
                      'kind': None,
                      '__type__': 'Constant'}],
            '__type__': 'Call'},
  '__type__': 'Assign'},
 {'test': {'left': {'id': 'payload', '__type__': 'Name'},
           'ops': [{'__type__': 'Is'}],
           'comparators': [{'value': None,
                            'kind': None,
                            '__type__': 'Constant'}],
           '__type__': 'Compare'},
  'body': [{'targets': [{'id': 'output', '__type__': 'Name'}],
            'value': {'value': None, 'kind': None, '__type__': 'Constant'},
            '__type__': 'Assign'}],
  'orelse': [{'targets': [{'id': 'output', '__type__': 'Name'}],
              'value': {'func': {'id': 'len', '__type__': 'Name'},
                        'args': [{'id': 'payload', '__type__': 'Name'}],
                        '__type__': 'Call'},
              '__type__': 'Assign'}],
  '__type__': 'If'},
 {'value': {'id': 'output', '__type__': 'Name'}, '__type__': 'Return'}]

From glancing between these trees, and thinking about what they represent, it seems simple enough to construct an equivalence from three simplifications:

  1. The assignment with := should be considered as a preceding assignment expression. This is represented in the AST by a NamedExpr node.

  2. The ternary value assignment should be considered as equivalent to if/else clauses. Both are represented in the AST by a test node.

  3. The unnamed return value (with the ternary expression) should be considered as equivalent to the named value. That is, consider a 'direct' value in the return statement as equivalent to a reference.

That seems to break the problem down sufficiently to tackle it.

  1. The NamedExpr node in the ternary handler can be pulled out of the left side of the IfExp and be considered for purpose of comparison as a preceding Assign node, for which it should match the same node in the linear handler. Where the value was used, it can be considered equivalent to any named variable with the same value for args (i.e. the variable is not renamed).

  2. The test node in the ternary handler (which says simply payload is None):

{'target': {'id': 'payload', '__type__': 'Name'} ... 
'ops': [{'__type__': 'Is'}], 'comparators': [{'value': None

matches the test node in the linear handler:

{'test': {'left': {'id': 'payload', '__type__': 'Name'},
'ops': [{'__type__': 'Is'}], 'comparators': [{'value': None,
  1. The direct return value of the ternary handler
'body': {'value': None, 'kind': None, '__type__': 'Constant'},
'orelse': {'func': {'id': 'len', '__type__': 'Name'},
           'args': [{'id': 'payload', '__type__': 'Name'}], '__type__': 'Call'},

is the same as the return value from the variable output in the linear handler:

{'value': {'id': 'output', '__type__': 'Name'}, '__type__': 'Return'}]

which comes from one of two Assign nodes in the body and orelse of the If node.

'body': ... {'value': None, 'kind': None, '__type__': 'Constant'},
'orelse': ... {'func': {'id': 'len', '__type__': 'Name'},
               'args': [{'id': 'payload', '__type__': 'Name'}], '__type__': 'Call'},

So far so easy to make rules for: but what about early returns?


Revisiting the guarded handler

Now we can return to the guarded handler:

[{'test': {'left': {'target': {'id': 'payload', '__type__': 'Name'},
                    'value': {'func': {'value': {'id': 'event',
                                                 '__type__': 'Name'},
                                       'attr': 'get',
                                       '__type__': 'Attribute'},
                              'args': [{'value': 'payload',
                                        'kind': None,
                                        '__type__': 'Constant'}],
                              '__type__': 'Call'},
                    '__type__': 'NamedExpr'},
           'ops': [{'__type__': 'Is'}],
           'comparators': [{'value': None,
                            'kind': None,
                            '__type__': 'Constant'}],
           '__type__': 'Compare'},
  'body': [{'value': {'value': None, 'kind': None, '__type__': 'Constant'},
            '__type__': 'Return'}],
  'orelse': [],
  '__type__': 'If'},
 {'value': {'func': {'id': 'len', '__type__': 'Name'},
            'args': [{'id': 'payload', '__type__': 'Name'}],
            '__type__': 'Call'},
  '__type__': 'Return'}]

To get to the linear handler from this will require all of the 3 rules used for the ternary handler, plus:

  1. An if statement with a return statement inside entails an implicit else clause for everything that follows.

This lets it be 'translated' into a program with an if/else clause that returns once with a variable assigned in the condition blocks rather than returning early.

The only other rule needed to 'rewrite' the AST into the linear handler's AST is the 3rd one we already defined (consider anonymous and named return values as equivalent).


Specifying the precise ast.AST node conditions

To actually use these rules, we need to be more specific. We can rewrite them with reference to the AST nodes themselves as:

  1. An ast.NamedExpr walrus assignment ⇒ a preceding ast.Assign.
  2. An ast.Assignment from an ast.IfExp ternary value ⇒ an ast.If if/else statement with 2 ast.Assign statements (in body and orelse).
  3. Unnamed ast.Return value + name ⇒ ast.Return by id (named value)
  4. An ast.Compare if block with an ast.Return in the body ⇒ an ast.orelse clause for everything that follows.

This post is the 3rd of a series on Refactor verification, investigating how to verify the correctness of refactors (or automating the human error away). In the next section, we'll get into writing some real refactor rules, then come up with a schema to validate our rules without running them (so we can store and use them confidently).