TDD for software that runs on code

How do you neatly test code that runs on code?

When it comes to software that runs on code, there are two main ways you'd supply test cases:

1. Point the tool at real source files on disk.
2. Hard-code the source under test as strings within the tests themselves.


The popular code formatter black takes the first approach a step further by using its own source files as the test cases (listing them as a constant named SOURCES).


The style guide/quality checker flake8 takes a middle ground between the two approaches. Unlike black, it doesn't simply need to confirm that a batch of files passes the check; it must give particular recommendations (e.g. output diffs), so it can't just take a simple list of files to check. Instead, hard-coded strings are written out as temporary files using pytest's tmpdir fixture (which doesn't need to be imported; pytest detects fixture arguments automatically).


The import-reordering tool isort is another diff-producing linter tested with pytest, but it also relies heavily on the property-based testing library Hypothesis. The tests are all specified within decorators, coupling the config and the code in a way that has been noticeable among data science tools lately too.
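Hypothesis wraps a test function in a @given decorator that feeds it generated inputs; the underlying idea can be sketched without the library. Here a toy sort_imports (an assumption, not isort's API) is checked for the kind of properties such tests assert, over randomly generated cases:

```python
import random

def sort_imports(lines: list[str]) -> list[str]:
    return sorted(lines)  # toy stand-in for isort's reordering

# In the spirit of Hypothesis's @given: generate many inputs, assert properties
# that must hold for all of them rather than checking one fixed example.
random.seed(0)
modules = ["ast", "io", "json", "os", "sys"]
for _ in range(100):
    case = [f"import {m}" for m in random.sample(modules, k=3)]
    once = sort_imports(case)
    assert sort_imports(once) == once                  # idempotent
    assert sort_imports(list(reversed(case))) == once  # order-insensitive
```

The library version does the generation, shrinking, and reporting for you, but the shape of the assertion is the same.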


The 'dead code' finding tool Vulture has a few dozen test modules assessing various aspects of the library, including checks of the formatted report output, again in pytest. I find it unusual that it calls its own module via subprocess with python -m vulture.
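Invoking a module that way from a test is a short subprocess call; here stdlib's json.tool stands in for vulture so the sketch is self-contained:

```python
import subprocess
import sys

# The `python -m <module>` pattern from vulture's tests, demonstrated with
# stdlib's json.tool in place of vulture so the snippet runs anywhere.
result = subprocess.run(
    [sys.executable, "-m", "json.tool"],
    input='{"a": 1}',
    capture_output=True,
    text=True,
)
assert result.returncode == 0
assert '"a": 1' in result.stdout
```

The upside is that the test exercises the real command-line entry point; the downside is a process spawn per test and stringly-typed output assertions.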

I imagine the Vulture tests are self-explanatory to their author, but some would benefit from module docstrings documenting what each module specifically tests (e.g. test_scavenging). Vulture does not use parametrised fixtures, and hard-codes the input program strings directly as arguments to the function calls (e.g. in test_unreachable). Despite this I think its tests are straightforward enough.


I hadn't heard of pydocstyle before ("docstring style checker"), but it has a nice approach to testing. This approach is what I initially expected most of the linters to use. Unusually its tests live in src/tests alongside the package directory (rather than under the top level).

All of the test functions have docstrings (naturally for a docstring linting library) and rather than just using plain strings to represent code, it wraps them all in a class CodeSnippet, which then uses the textwrap library's dedent function to remove the indenting you get from writing multiline strings within an indented code block [within the test function]. This is a really neat idea I've never seen before! The CodeSnippet class also wraps the snippet as a file-like object (allowing it to be treated as if it were loaded from a file), which avoids having to write anything to a Pytest tmpdir.
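A rough reconstruction of the idea (not pydocstyle's actual class) takes only a few lines: dedent the snippet, then subclass io.StringIO so the result can be read as if it were an open file:

```python
import io
import textwrap

class CodeSnippet(io.StringIO):
    """File-like wrapper over an indented code string.

    A sketch of the idea, not pydocstyle's actual implementation.
    """

    def __init__(self, code: str) -> None:
        # dedent strips the indentation picked up from writing the
        # triple-quoted string inside an indented test function.
        super().__init__(textwrap.dedent(code))

snippet = CodeSnippet("""
    def f():
        pass
""")
assert snippet.read() == "\ndef f():\n    pass\n"
```

Anything that expects an open file (a read() method, iteration over lines) can consume the snippet directly, with no temporary directory involved.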

More complex cases are stored in standalone files within src/tests/test_cases, with some peculiar decorators for positional-argument handling that are placed on the very functions being checked.

I'm not sure whether I like this second part in practice (because the decorators change the code being checked...), even if I like the idea in theory (keeping the expected results closely coupled to the source being checked). The machinery that runs the test_cases subdirectory's modules is also a neat approach (the test case module names are provided as a pytest.mark.parametrize list), stored in test_definitions.


My personal preference after reviewing the options would be pydocstyle's approach: string literals wrapped in dedent and an io.FileIO file-like object wrapper. I would go one step further and parametrise the creation of the strings involved, potentially refactoring these into classes that eliminate as much of the repetition across tests as possible.

For example, here we're considering a library that does imports, so rather than write out these imports by hand we could use ast.unparse to generate them for us.

import ast

def make_import_string(imports: dict[str, dict[str, str]]) -> str:
    # One `from module import name as alias` statement per entry
    # (an empty alias string means no `as` clause).
    tree = ast.Module(type_ignores=[], body=[
        ast.ImportFrom(module=mod, level=0,
                       names=[ast.alias(name=name, asname=alias or None)
                              for name, alias in names.items()])
        for mod, names in imports.items()])
    return ast.unparse(tree)

This post is the third in Mapping inter-module imports with importopoi, a series covering how I developed a library to depict the graph of module imports for any Python package (or set of scripts), giving at-a-glance clarity about how a package's computation is structured, and making it easier to jump in and edit or review a particular part.