Why Tree-Sitter Is Inadequate for Program Analysis

A great thing about Cubix is that it integrates well with many third-party parsers. As of the latest update, if you have a Tree-sitter grammar for your language, you can get Cubix support for it relatively quickly.

So you might see this and ask "I already have a Tree-sitter parser, and I only care about one language; why can't I just use Tree-sitter directly?"

Unless you're building a syntax highlighter: terrible, terrible choice.

Here's why building Cubix's Tree-sitter integration was many times harder than expected, and a taste of all the pain you're avoiding if you choose to use Cubix with a Tree-sitter parser instead of Tree-sitter directly.

You Can't Tell Addition From Multiplication

Consider this Sui Move code:

let a = x + y;
let b = x * y;

Parse both lines with Tree-sitter. Look at the AST for the right-hand sides. They are identical:

(binary_expression
  left: (identifier)
  right: (identifier))

The + and * tokens are gone. Tree-sitter classified them as "anonymous nodes" and every tool in the ecosystem silently discards them. You cannot write a refactoring tool, a static analyzer, or a formula extractor that distinguishes addition from multiplication — the most basic semantic distinction in arithmetic — because Tree-sitter doesn't preserve it.
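To make the collapse concrete, here is a minimal sketch built on a hypothetical in-memory model of a parse tree. These are plain JavaScript objects, not Tree-sitter's actual API; the `named` flag, the `binary` constructor, and both printing helpers are assumptions made for illustration. The first helper visits named children only, which is the behaviour described above; the second shows what a consumer would have to keep to tell the two expressions apart.

```javascript
// Hypothetical model of a parse tree: each node has a type, a `named` flag,
// an optional field name, and children. This is NOT Tree-sitter's API.
function binary(op) {
  return {
    type: "binary_expression",
    named: true,
    children: [
      { field: "left", type: "identifier", named: true, children: [] },
      { field: "operator", type: op, named: false, children: [] }, // anonymous token
      { field: "right", type: "identifier", named: true, children: [] },
    ],
  };
}

// Print a node as an S-expression, visiting named children only.
function sexp(node) {
  const parts = node.children
    .filter((child) => child.named)
    .map((child) => `${child.field}: ${sexp(child)}`);
  return parts.length === 0 ? `(${node.type})` : `(${node.type} ${parts.join(" ")})`;
}

// Print a node while keeping anonymous tokens as quoted strings.
function sexpWithTokens(node) {
  if (!node.named) return JSON.stringify(node.type);
  const parts = node.children.map((child) => `${child.field}: ${sexpWithTokens(child)}`);
  return parts.length === 0 ? `(${node.type})` : `(${node.type} ${parts.join(" ")})`;
}

const add = binary("+");
const mul = binary("*");
console.log(sexp(add) === sexp(mul)); // true: the operator is gone
console.log(sexpWithTokens(add) === sexpWithTokens(mul)); // false: the token was kept
```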

This is not a bug in a particular grammar. Tree-sitter is full of constructs that throw away information, and grammars throughout the ecosystem use them. After all, if you just want a syntax highlighter or go-to-definition, then this is wasted information. But any kind of deeper analysis needs it.

Want more examples? You also can't tell move x from copy x. You can't tell public from public(friend) from public(package). You can't tell which of the four abilities — copy, drop, store, key — a type declares. You can't even tell true from false. All of these depend on tokens that Tree-sitter discards.

Background

Tree-sitter is a popular incremental parsing library designed for syntax highlighting and editor features. It produces a Concrete Syntax Tree (CST) optimized for the needs of text editors: fast incremental re-parsing, fault tolerance on incomplete input, and enough structure to colorize tokens. It was never designed for semantic analysis, program transformation, or roundtrip source code manipulation.

The problems fall into three categories: major issues where Tree-sitter actively destroys information you need, structural issues where a CST is the wrong representation for the job, and minor issues that add friction.

This article is going to be full of references to the Sui Move grammar, as that is the first language supported via our Tree-sitter integration. But the problems here show up across languages.

Major Issues

Anonymous nodes are silently discarded

Tree-sitter distinguishes between named nodes (like binary_expression and function_definition) and anonymous nodes: operators such as +, *, and ||; punctuation such as (, ), and the comma; and keywords such as let and if. All mainstream Tree-sitter libraries, including GitHub's semantic and the newer hs-tree-sitter, only traverse named nodes. Anonymous tokens are thrown away.

This is catastrophic for any tool that needs to understand what code does:

Operators vanish. x + y and x * y become the same tree. a == b and a != b become the same tree. You cannot build a symbolic executor, a formula extractor, or even a linter that checks operator usage.
Punctuation needed for roundtripping vanishes. Parentheses, brackets, commas, semicolons — all gone. You cannot reconstruct source code from the AST.
Keywords that distinguish constructs vanish. In Sui Move, modifier is a named node, but the keywords inside it — public, package, friend, entry, native — are all anonymous. A tool walking named children sees a modifier node with zero children for all five visibility levels. The same pattern recurs throughout the grammar: ability wraps copy/drop/store/key with no named children; primitive_type wraps nine different types (u8 through u256, bool, address, signer) with no named children; move_or_copy_expression makes move and copy semantics indistinguishable; even bool_literal makes true and false identical.

The Sui Move grammar makes the operator problem especially severe. Binary expressions are defined in grammar.js using JavaScript spread syntax that bakes each operator into a separate alternative:

...table.map(([operator, precedence, associativity]) =>
  prec[associativity](precedence, seq(
    field('left', $._expression),
    field('operator', operator),   // Anonymous token -- discarded!
    field('right', $._expression)
  ))
)

Twenty different binary operators, all producing structurally identical AST nodes. The only way to distinguish them is through the anonymous token that every library drops.
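A rough sketch of that expansion follows. The `seq`, `field`, and `prec` functions here are stand-in stubs that build plain objects, not Tree-sitter's real DSL functions, and the three-row table is invented for illustration (its operators and precedence values are not taken from the Sui Move grammar). It shows that once the operator string is removed, every alternative has the same shape.

```javascript
// Stand-in combinators that build plain objects. They mimic the shape of the
// grammar DSL but are hypothetical stubs, not Tree-sitter's implementation.
const seq = (...members) => ({ type: "SEQ", members });
const field = (name, content) => ({ type: "FIELD", name, content });
const prec = {
  left: (value, content) => ({ type: "PREC_LEFT", value, content }),
  right: (value, content) => ({ type: "PREC_RIGHT", value, content }),
};
const expression = { type: "SYMBOL", name: "_expression" };

// Illustrative three-row table; operators and precedences are made up.
const table = [
  ["+", 8, "left"],
  ["*", 9, "left"],
  ["==", 5, "left"],
];

const alternatives = table.map(([operator, precedence, associativity]) =>
  prec[associativity](precedence, seq(
    field("left", expression),
    field("operator", operator),
    field("right", expression)
  ))
);

// Serialize an alternative while omitting the precedence value and the
// operator string, so that only the structural shape remains.
const shapeOf = (alt) =>
  JSON.stringify(alt, (key, value) =>
    key === "value" || (key === "content" && typeof value === "string") ? undefined : value
  );

const shapes = new Set(alternatives.map(shapeOf));
console.log(alternatives.length, shapes.size); // 3 alternatives, 1 distinct shape
```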

No pretty-printer — no roundtripping

Tree-sitter is a one-way street. It parses source code into a tree. It provides zero mechanism for going back — from a tree to source code.

The roundtrip property parse(pretty(parse(text))) = parse(text) is a fundamental requirement for any program transformation tool. If you can't render your modified tree back to valid source, your tool is useless. Tree-sitter provides no support for the pretty half of this equation.

A custom pretty-printer must be written from scratch for every language, working from the same grammar definition but in the opposite direction. Tree-sitter offers no help.
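As a small illustration of what the roundtrip property asks for, here is a sketch over a toy expression language with identifiers, +, *, and parentheses. The language, the `parse` and `pretty` functions, and the `roundtrips` check are all hypothetical and unrelated to any real grammar; the point is only that the printer has to be written by hand and then checked against the parser.

```javascript
// Toy expression language: identifiers, +, *, parentheses.
function tokenize(text) {
  return text.match(/[A-Za-z_]\w*|[+*()]/g) || [];
}

// Recursive-descent parser producing an AST (parentheses are not stored).
function parse(text) {
  const tokens = tokenize(text);
  let pos = 0;
  function atom() {
    const tok = tokens[pos++];
    if (tok === "(") {
      const inner = sum();
      pos++; // skip ")"
      return inner;
    }
    return { kind: "var", name: tok };
  }
  function product() {
    let left = atom();
    while (tokens[pos] === "*") {
      pos++;
      left = { kind: "bin", op: "*", left, right: atom() };
    }
    return left;
  }
  function sum() {
    let left = product();
    while (tokens[pos] === "+") {
      pos++;
      left = { kind: "bin", op: "+", left, right: product() };
    }
    return left;
  }
  return sum();
}

const PRECEDENCE = { "+": 1, "*": 2 };

// Render a tree back to text, adding parentheses only where precedence needs them.
function pretty(node, minPrec = 0) {
  if (node.kind === "var") return node.name;
  const p = PRECEDENCE[node.op];
  // Both operators are left-associative, so the right operand needs p + 1.
  const text = `${pretty(node.left, p)} ${node.op} ${pretty(node.right, p + 1)}`;
  return p < minPrec ? `(${text})` : text;
}

// The roundtrip property: parse(pretty(parse(text))) equals parse(text).
function roundtrips(text) {
  const tree = parse(text);
  return JSON.stringify(parse(pretty(tree))) === JSON.stringify(tree);
}

console.log(pretty(parse("((a)) * b"))); // a * b
console.log(roundtrips("(a + b) * c"), roundtrips("a + (b + c)")); // true true
```

Note that the property compares trees, not text: redundant parentheses in the input are allowed to disappear as long as re-parsing the printed output yields the same tree.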

Evil aliases

For some reason, Tree-sitter supports an alias() rule that lets one grammar rule appear under a different name in the CST. We never figured out why this is useful, but enough grammars have it that it must have some use.

For example, the Sui Move grammar uses aliases to give contextual names to shared rules — a generic _variable_identifier is aliased to bind_var in binding position, giving downstream tools a way to know that this identifier is being used as a binder rather than a reference.

But aliases cannot be straightforwardly processed by downstream tools because they introduce a layer of indirection that isn't cleanly reflected in grammar.json.

We wound up having to use a short jq script that preprocesses grammars to remove aliases — the only place where we had to change a Tree-sitter grammar.

This means that every contextual name the grammar author assigned — every attempt to say "this identifier is a bind_var, not just an identifier" — is erased before processing begins. The semantic distinctions that the grammar author carefully encoded through aliasing are destroyed.
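For illustration, the same kind of preprocessing can be sketched in JavaScript. This is not the jq script mentioned above; it is a minimal stand-in that assumes grammar.json represents an alias as an object with `type: "ALIAS"`, a `content` rule, and a `value` holding the alias name. The `letRule` fragment is invented for the example.

```javascript
// Recursively replace every ALIAS node in a grammar.json-style rule with the
// rule it wraps. Assumes aliases look like
// { type: "ALIAS", content: <rule>, named: <bool>, value: <alias name> }.
function stripAliases(rule) {
  if (Array.isArray(rule)) return rule.map(stripAliases);
  if (rule === null || typeof rule !== "object") return rule;
  if (rule.type === "ALIAS") return stripAliases(rule.content);
  const result = {};
  for (const [key, value] of Object.entries(rule)) {
    result[key] = stripAliases(value);
  }
  return result;
}

// Made-up rule in the style of a let binding.
const letRule = {
  type: "SEQ",
  members: [
    { type: "STRING", value: "let" },
    {
      type: "ALIAS",
      content: { type: "SYMBOL", name: "_variable_identifier" },
      named: true,
      value: "bind_var",
    },
  ],
};

const stripped = stripAliases(letRule);
console.log(JSON.stringify(stripped.members[1]));
// {"type":"SYMBOL","name":"_variable_identifier"}
```

After the rewrite, the name bind_var no longer appears anywhere in the rule, which is exactly the loss described above.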

FFI memory safety hazards

So we were pretty far along working with the Haskell bindings for Tree-sitter when we slammed into a segfault. Uh oh. How?

You see, Tree-sitter is a C library. Its TSNode struct contains raw pointers back to the TSTree and TSLanguage objects that created it. When accessed from a garbage-collected language, these pointers create a hidden dependency: if the runtime garbage-collects the tree while node references still exist, you get non-deterministic segfaults.

The standard Haskell bindings use ForeignPtr with a finalizer that calls ts_tree_delete. The GC doesn't see the dependency between nodes and trees, so it can free the tree while nodes still hold dangling pointers. The resulting crashes are intermittent, appear to correlate with unrelated code changes, and took weeks to diagnose.

Any language with automatic memory management faces a version of this problem when integrating with Tree-sitter at a low level.

You still want an AST, not a CST

Compilers, static analyzers, and programming tools of all stripes have historically relied on ASTs (abstract syntax trees). Abstract syntax trees condense a program into its core meaningful syntax. Non-semantic differences such as extra parentheses, or the difference between 0xFF and 255, get stripped away, so that tools work with something simpler. They can also perform other normalizations, such as removing the difference between an if-statement with no else block and an if-statement with an empty else block.

But Tree-sitter does not provide an AST. It instead produces CSTs (concrete syntax trees). Before the introduction of Tree-sitter, CSTs were virtually unknown outside of researchers in language engineering. CSTs, as generated by tools such as Rascal and SDF, are very useful for applications that require reconstructing the original source code. They can also be generated directly from a grammar, reducing the need for additional information about what parts of a syntax to ignore. Unlike ASTs, they can also preserve comments.

Tree-sitter, in introducing concrete syntax trees to a larger audience, has made a number of interesting choices that make it very effective for building syntax highlighters, while reducing its usefulness for most other applications. Unlike traditional CSTs, Tree-sitter trees are very lossy (as explained above), which reduces memory consumption but destroys their utility for analysis and transformation. They also lose a lot of the information in the grammar, which allows a simplified API and further reduces memory consumption, at the price of making analysis extra difficult.

My mentor Ira Baxter, who has been building program transformation tools for about 40 years, wrote "Having a parser [and getting an AST] is like climbing the foothills of the Himalayas when the problem is climbing Everest." But today, thanks to Tree-sitter, many tool builders do not even get that far.

Here are some more issues with Tree-sitter that stem from its lack of AST production.

Children are just a flat list

Tree-sitter's grammar definitions encode rich structure — seq, repeat, optional, choice — that describes precisely how children are grouped and ordered. But the CST throws all of this away. Every node's children are just a flat, untyped list.

Consider how the Sui Move grammar defines a block:

block: $ => seq(
  '{',
  repeat($.use_declaration),
  repeat($.block_item),
  optional($._expression),
  '}'
)

The grammar says: first come use declarations, then block items (statements ending with ;), then optionally a trailing expression (the block's return value, without ;). But Tree-sitter's node-types.json describes the block node as having "fields": {} and children that can be any of 40+ types in any order. The repeat/optional/seq structure is completely erased.

Or consider function signatures, which allow up to three optional modifiers:

_function_signature: $ => seq(
  optional($.modifier),
  optional($.modifier),
  optional($.modifier),
  'fun',
  ...
)

The grammar defines three distinct modifier slots. The CST gives you 0–3 modifier children with no positional information about which slot each came from.

This means a tool consuming the CST must re-derive the grammar's grouping logic. Given a block node, you must figure out on your own where the statements end and the trailing return expression begins. Given a function, you must figure out which modifiers are present by inspecting their content rather than their position.
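A sketch of what that re-derivation looks like for the block rule is given below. It works on a hypothetical flat child list reduced to type names, and it assumes anonymous tokens such as the braces are already gone; `splitBlock` and the sample children are illustrative, not part of any library.

```javascript
// Re-derive the grouping of the block rule from a flat list of children:
// use declarations first, then block items, then an optional trailing expression.
function splitBlock(children) {
  let i = 0;
  const uses = [];
  while (i < children.length && children[i].type === "use_declaration") {
    uses.push(children[i++]);
  }
  const items = [];
  while (i < children.length && children[i].type === "block_item") {
    items.push(children[i++]);
  }
  // Whatever remains must be the single optional trailing expression.
  const rest = children.slice(i);
  if (rest.length > 1) {
    throw new Error(`unexpected child after trailing expression: ${rest[1].type}`);
  }
  return { uses, items, trailing: rest.length === 1 ? rest[0] : null };
}

const children = [
  { type: "use_declaration" },
  { type: "block_item" },
  { type: "block_item" },
  { type: "call_expression" }, // trailing expression, no semicolon
];
const block = splitBlock(children);
console.log(block.uses.length, block.items.length, block.trailing.type);
// 1 2 call_expression
```

Every consumer of the tree has to carry some version of this logic, and keep it in sync with the grammar by hand. A real consumer would also have to cope with extras such as comments, which can appear between any two children.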

A proper AST (and, really, a proper CST too) makes the structure explicit in the type:

data Block e l where
  Block
    :: e [UseDeclarationL]
    -> e [BlockItemL]
    -> e (Maybe HiddenExpressionL)
    -> Block e BlockL

Statements and the return expression are in separate fields. Pattern matching enforces the distinction. There is no ambiguity to resolve at runtime.

The grammar is richer than the parse output

The flat-children problem is a symptom of a deeper issue: Tree-sitter's grammar definition encodes far more structure than its CST preserves.

The grammar uses choice() to define alternatives:

block_item: $ => seq(
  choice(
    $._expression,
    $.let_statement,
  ),
  ';'
)

This says a block item is either an expression or a let statement, followed by a semicolon. But the CST has no wrapper node for the choice(). You see the concrete child directly — a let_statement or a call_expression — with no indication that these were alternatives in a two-way choice.

A proper AST extracts this into a named sum type:

data BlockItemInner e l where
  BlockItemExpression   :: e ExpressionL   -> BlockItemInner e BlockItemInnerL
  BlockItemLetStatement :: e LetStatementL -> BlockItemInner e BlockItemInnerL

Tree-sitter's grammar also uses hidden rules (prefixed with _) like _expression, _type, _bind. These define important categorical groupings — "an expression is one of: call, binary, if, while, ..." — but Tree-sitter actively suppresses these nodes in the CST. Where the grammar says _expression, the CST just shows the concrete child (call_expression, binary_expression, etc.) with no wrapper.

This means the grammar author's intent — "these 16 node types are all expressions" — is lost. A tool must reconstruct these categories by maintaining its own tables of which concrete node types belong to which abstract categories.
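One way to rebuild such a table is to read it back out of the grammar itself. The sketch below assumes the rules object of grammar.json uses `CHOICE` nodes with a `members` array and `SYMBOL` nodes with a `name`; the `membersOf` helper and the two-rule fragment (including its node names) are hypothetical.

```javascript
// Collect the concrete node types reachable from a hidden rule by following
// CHOICE members and nested hidden rules (names starting with "_").
function membersOf(rules, name, seen = new Set()) {
  if (seen.has(name)) return [];
  seen.add(name);
  const result = [];
  const visit = (rule) => {
    if (!rule) return; // rule not defined in this fragment
    if (rule.type === "CHOICE") {
      rule.members.forEach(visit);
    } else if (rule.type === "SYMBOL") {
      if (rule.name.startsWith("_")) {
        result.push(...membersOf(rules, rule.name, seen));
      } else {
        result.push(rule.name);
      }
    }
    // Other rule types (SEQ, STRING, ...) are ignored in this sketch.
  };
  visit(rules[name]);
  return result;
}

// Made-up fragment in the shape of grammar.json's "rules" object.
const rules = {
  _expression: {
    type: "CHOICE",
    members: [
      { type: "SYMBOL", name: "call_expression" },
      { type: "SYMBOL", name: "binary_expression" },
      { type: "SYMBOL", name: "_unary_expression" },
    ],
  },
  _unary_expression: {
    type: "CHOICE",
    members: [
      { type: "SYMBOL", name: "borrow_expression" },
      { type: "SYMBOL", name: "dereference_expression" },
    ],
  },
};

console.log(membersOf(rules, "_expression"));
```

The resulting list is the category that the hidden rule named in the grammar and that the parse output no longer carries.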

Primitive and literal information is structurally invisible

Tree-sitter represents leaf nodes like number literals as opaque pattern nodes — text matching a regex. The number 255, the hex literal 0xFF, and the separator-formatted 1_000_000 are all just a pattern node. There is no structural distinction.

For syntax highlighting, this is fine — they're all numbers, color them blue. For program analysis, it's a problem. A tool that needs to normalize numeric representations, verify literal formats, or preserve the programmer's formatting intent cannot get this information from the Tree-sitter CST without falling back to raw text matching. An AST can represent these as distinct constructors with parsed values.
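The raw-text fallback can be sketched as follows. The `parseNumLiteral` helper is hypothetical and assumes only decimal and 0x-prefixed hexadecimal digits with optional underscore separators; type suffixes and any other literal forms are out of scope here.

```javascript
// Parse the text of a numeric literal into a structured value.
// Assumes decimal or 0x-prefixed hex digits with optional "_" separators.
function parseNumLiteral(text) {
  const match = /^(0x)?([0-9a-fA-F_]+)$/.exec(text);
  if (!match) throw new Error(`not a numeric literal: ${text}`);
  const isHex = match[1] !== undefined;
  const digits = match[2].replace(/_/g, "");
  if (digits === "" || (!isHex && !/^[0-9]+$/.test(digits))) {
    throw new Error(`not a numeric literal: ${text}`);
  }
  return {
    base: isHex ? 16 : 10,
    hasSeparators: match[2].includes("_"),
    // BigInt because 256-bit values do not fit in a JavaScript Number.
    value: BigInt(isHex ? `0x${digits}` : digits),
  };
}

const hex = parseNumLiteral("0xFF");
const dec = parseNumLiteral("255");
const sep = parseNumLiteral("1_000_000");
console.log(hex.value === dec.value, hex.base, dec.base); // true 16 10
console.log(sep.hasSeparators, sep.value === 1000000n); // true true
```

The value is normalized while the base and separator usage are kept alongside it, so a tool can compare literals semantically and still print them the way the programmer wrote them.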

Minor Issues

`sepBy` doesn't mean `sepBy`

The sep() / sep1() helper combinators used in Tree-sitter grammars, which are supposed to represent comma-separated lists and similar patterns, actually implement sepEndBy semantics: they allow trailing separators. The documentation doesn't call this out.

This seems minor but causes real parsing failures when building tools that assume standard semantics. A parameter list (a, b, c) and (a, b, c,) parse identically, and a tool that consumes the Tree-sitter output and applies standard sepBy logic will fail on trailing commas.
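The difference between the two semantics can be shown on a plain token list. Both functions below are hypothetical helpers written for this illustration; they are not taken from Tree-sitter or from any parser-combinator library.

```javascript
// Strict sepBy: item (sep item)*, with no trailing separator allowed.
function sepBy(tokens, sep) {
  const items = [];
  for (let i = 0; i < tokens.length; i += 2) {
    if (tokens[i] === sep) throw new Error(`expected item at index ${i}`);
    items.push(tokens[i]);
    const next = i + 1;
    if (next < tokens.length) {
      if (tokens[next] !== sep) throw new Error(`expected separator at index ${next}`);
      if (next === tokens.length - 1) throw new Error("trailing separator");
    }
  }
  return items;
}

// sepEndBy: like sepBy, but one trailing separator after the last item is accepted.
function sepEndBy(tokens, sep) {
  const hasTrailing = tokens.length > 1 && tokens[tokens.length - 1] === sep;
  return sepBy(hasTrailing ? tokens.slice(0, -1) : tokens, sep);
}

const plain = ["a", ",", "b", ",", "c"];
const trailing = ["a", ",", "b", ",", "c", ","];

console.log(sepBy(plain, ","));       // [ 'a', 'b', 'c' ]
console.log(sepEndBy(trailing, ",")); // [ 'a', 'b', 'c' ]
try {
  sepBy(trailing, ",");
} catch (err) {
  console.log(err.message);           // trailing separator
}
```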

The grammar source is JavaScript, not data

Tree-sitter grammars are defined in JavaScript files that use the full power of the language — higher-order functions, spread operators, computed tables. The grammar.json that Tree-sitter produces from these is a flattened, desugared version that has lost the high-level structure.

The Sui Move grammar's binary expression definition uses table.map() with spread — clear and readable in JavaScript, but the resulting grammar.json contains 20 nearly-identical alternatives with no trace of the table structure. Any tool consuming the grammar must reverse-engineer patterns that were obvious in the source.

The Bottom Line

Tree-sitter was designed to make editors fast and responsive. It excels at that. But for program analysis and transformation, it is actively hostile:

| What you need | What Tree-sitter gives you |
| --- | --- |
| Semantic distinctions between operators | Identical nodes for `+`, `*`, `==`, `!=` |
| Roundtrip parse/print | One-way parsing only |
| Structured, typed children | Flat list with grouping information erased |
| Contextual node names via aliases | Aliases that must be stripped to process the grammar |
| Stable memory management | Dangling pointers across GC boundaries |
| Rich structural representations | Hidden rules suppressed, inline choices flattened |

We found that Tree-sitter is useful as a tokenizer — a fast, reliable way to break source code into a token stream. But every layer above that — structural parsing, type-safe representation, pretty-printing, roundtripping, transformation — must be built from scratch on top of it. The gap between what Tree-sitter provides and what program analysis requires is far larger than it appears.

Cubix Framework Copyright 2016-present James Koppel. All rights reserved.
