Airbnb recently completed our first large-scale, LLM-driven code migration, updating nearly 3.5K React component test files from Enzyme to use React Testing Library (RTL) instead. We'd originally estimated this would take 1.5 years of engineering time to do by hand, but, using a combination of frontier models and robust automation, we finished the entire migration in just 6 weeks.
In this blog post, we'll highlight the unique challenges we faced migrating from Enzyme to RTL, how LLMs excel at solving this particular type of challenge, and how we structured our migration tooling to run an LLM-driven migration at scale.
In 2020, Airbnb adopted React Testing Library (RTL) for all new React component test development, marking our first steps away from Enzyme. Although Enzyme had served us well since 2015, it was designed for earlier versions of React, and the framework's deep access to component internals no longer aligned with modern React testing practices.
However, because of the fundamental differences between these frameworks, we couldn't simply swap one out for the other (read more about the differences here). We also couldn't just delete the Enzyme files, since analysis showed this would create significant gaps in our code coverage. To complete this migration, we needed an automated way to refactor test files from Enzyme to RTL while preserving the intent of the original tests and their code coverage.
In mid-2023, an Airbnb hackathon team demonstrated that large language models could successfully convert hundreds of Enzyme files to RTL in just a few days.
Building on this promising result, in 2024 we developed a scalable pipeline for an LLM-driven migration. We broke the migration into discrete, per-file steps that we could parallelize, added configurable retry loops, and significantly expanded our prompts with additional context. Finally, we performed breadth-first prompt tuning for the long tail of complex files.
We started by breaking down the migration into a series of automated validation and refactor steps. Think of it like a production pipeline: each file moves through stages of validation, and when a check fails, we bring in the LLM to fix it.
We modeled this flow as a state machine, moving the file to the next state only after validation for the previous state passed.
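A minimal sketch of what such a per-file state machine could look like in TypeScript is below; the state names, types, and validator stub are illustrative assumptions, not our actual pipeline code.

// Hypothetical states a single test file moves through; the real step names
// and validation commands are simplified for illustration.
const STATES = ['refactor-to-rtl', 'jest-passes', 'lint-passes', 'tsc-passes', 'done'] as const;
type MigrationState = (typeof STATES)[number];

interface FileMigration {
  filePath: string;
  state: MigrationState;
}

// A validator returns true when the file satisfies the current state's check
// (e.g. Jest is green, ESLint is clean).
type Validator = (file: FileMigration) => Promise<boolean>;

// Move a file to the next state only once validation for its current state passes;
// otherwise it stays put so the LLM can attempt a fix.
async function advance(file: FileMigration, validate: Validator): Promise<FileMigration> {
  const index = STATES.indexOf(file.state);
  if (file.state !== 'done' && (await validate(file))) {
    return { ...file, state: STATES[index + 1] };
  }
  return file;
}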
This step-based approach provided a solid foundation for our automation pipeline. It enabled us to track progress, improve failure rates for specific steps, and rerun files or steps when needed. It also made it simple to run migrations on hundreds of files concurrently, which was critical both for quickly migrating simple files and for chipping away at the long tail of files later in the migration.
Early in the migration, we experimented with different prompt engineering strategies to improve our per-file migration success rate. However, building on the stepped approach, we found the most effective route to better outcomes was simply brute force: retry steps multiple times until they passed or we reached a limit. We updated our steps to use dynamic prompts on each retry, feeding the validation errors and the latest version of the file back to the LLM, and built a loop runner that ran each step up to a configurable number of attempts.
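As a rough sketch (the function names and shapes here are assumptions, not our real tooling), the loop runner behaved something like this:

// Each step is attempted up to a configurable limit; every retry rebuilds the
// prompt from the latest file contents plus the most recent validation errors.
interface StepResult {
  ok: boolean;
  errors: string;     // validation output (Jest/lint/type errors) fed back to the LLM
  fileSource: string; // latest version of the test file
}

async function runStepWithRetries(
  runStep: (fileSource: string, errors: string) => Promise<StepResult>,
  initialSource: string,
  maxAttempts = 10,
): Promise<StepResult> {
  let result: StepResult = { ok: false, errors: '', fileSource: initialSource };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Each attempt sends a dynamic prompt: the current file plus the last round of errors.
    result = await runStep(result.fileSource, result.errors);
    if (result.ok) break;
  }
  return result;
}

Feeding each attempt's validation errors back into the next prompt is what made plain retries effective: the model gets a fresh, concrete signal about what is still failing.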
With this simple retry loop, we found we could successfully migrate a large portion of our simple-to-medium complexity test files, with some finishing successfully after a few retries, and most within 10 attempts.
For test files up to a certain complexity, simply increasing our retry attempts worked well. However, to handle files with intricate test state setups or excessive indirection, we found the best approach was to push as much relevant context as possible into our prompts.
By the end of the migration, our prompts had expanded to anywhere from 40,000 to 100,000 tokens, pulling in as many as 50 related files, a whole host of manually written few-shot examples, as well as examples of existing, well-written, passing test files from within the same project.
Each prompt included:
- The source code of the component under test
- The test file we were migrating
- Validation failures for the step
- Related tests from the same directory (to maintain team-specific patterns)
- General migration guidelines and common solutions
Here's how that looked in practice (significantly trimmed down for readability):
// Code example shows a trimmed-down version of a prompt,
// including the raw source code from related files, imports,
// examples, the component source itself, and the test file to migrate.
const prompt = [
  'Convert this Enzyme test to React Testing Library:',
  `SIBLING TESTS:\n${siblingTestFilesSourceCode}`,
  `RTL EXAMPLES:\n${reactTestingLibraryExamples}`,
  `IMPORTS:\n${nearestImportSourceCode}`,
  `COMPONENT SOURCE:\n${componentFileSourceCode}`,
  `TEST TO MIGRATE:\n${testFileSourceCode}`,
].join('\n\n');
This rich context approach proved highly effective for these more complex files: the LLM could better understand team-specific patterns, common testing approaches, and the overall architecture of the codebase.
We should note that, although we did some prompt engineering at this step, the main driver of success we saw was choosing the right related files (finding nearby files, good example files from the same project, filtering the dependencies down to files relevant to the component, etc.), rather than getting the prompt engineering exactly right.
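For illustration, gathering that related context could look something like the following sketch; the helper names, file extensions, and filtering heuristic are hypothetical, not our actual selection logic.

// Collect sibling test files from the same directory and keep only the
// imported files the component actually references, so the prompt stays
// within the token budget.
import * as fs from 'fs';
import * as path from 'path';

function findSiblingTests(testFilePath: string): string[] {
  const dir = path.dirname(testFilePath);
  return fs
    .readdirSync(dir)
    .filter((f) => f.endsWith('.test.tsx') && path.join(dir, f) !== testFilePath)
    .map((f) => fs.readFileSync(path.join(dir, f), 'utf8'));
}

function filterRelevantImports(componentSource: string, importPaths: string[]): string[] {
  // Keep an import only if its module name appears in the component source.
  return importPaths.filter((p) =>
    componentSource.includes(path.basename(p, path.extname(p))),
  );
}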