Part 12 of 13

Research Notes and Bibliography

Academic papers, GitHub issues, and empirical findings that informed every decision in this methodology. Each entry includes the finding, how it was validated, and how it shaped the system.

R1: LLM-Generated Tests Have Shallow Assertions

Finding

LLM-generated tests systematically capture actual program behavior rather than expected behavior. Without structural guardrails, mutation testing scores for AI-generated tests average around 40%, compared to 80%+ for well-written human suites.
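
To make the distinction concrete, here is a minimal sketch of how a shallow assertion lets a mutant survive. The function, the mutant, and both checks are hypothetical and written only for illustration; they are not taken from any of the studies cited below.

```typescript
// Hypothetical function under test: orders of 100 or more get 10 off.
function applyDiscount(total: number): number {
  return total >= 100 ? total - 10 : total;
}

// A mutant of the kind mutation tools generate: `>=` flipped to `>`.
function applyDiscountMutant(total: number): number {
  return total > 100 ? total - 10 : total;
}

type PriceFn = (total: number) => number;

// Shallow assertion: only checks that some number comes back.
// It records what the code does, not what the spec expects.
function shallowTestPasses(fn: PriceFn): boolean {
  return typeof fn(100) === "number";
}

// Specific assertion: pins the expected value at the boundary.
function specificTestPasses(fn: PriceFn): boolean {
  return fn(100) === 90;
}
```

The shallow check passes for both the original and the mutant, so the mutant survives. The specific check passes only for the original, which is what killing a mutant means.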

Key Sources

MuTAP (Dakhel et al., 2024): Introduced mutation-guided test augmentation for LLMs. By feeding surviving mutants back into the prompt, test effectiveness improved to 93.57% mutation score on synthetic bugs. However, generated tests required post-processing to fix syntax and functional errors. Published in Information and Software Technology, vol. 171.

Meta ACH (Foster et al., FSE 2025): Meta deployed Automated Compliance Hardening (ACH), which combines LLM-generated mutants with LLM-generated tests. In a trial across Facebook, Instagram, WhatsApp, and Meta wearables from October to December 2024, privacy engineers accepted 73% of generated tests. With simple static-analysis preprocessing, the LLM-based equivalent-mutant detector achieved 0.95 precision and 0.96 recall.

Wukong (comprehensive LLM mutation study, 2024-2025): Evaluated six LLMs across 851 real-world Java bugs. GPT-4o-generated mutants achieved a 93.4% fault detection rate, compared to 51.3% for PIT and 74.4% for Major (traditional tools). Found that few-shot learning with real-world bug examples produced the most effective mutations.

How This Shaped the System

The system uses a three-layer defense against shallow assertions: prevention (spec self-validation bans shallow matchers), detection (ESLint rules in ex-test), and verification (StrykerJS mutation testing in ex-verify with a 70% threshold). The mutation-guided feedback approach from MuTAP is built into the workflow: ex-verify analyzes surviving mutants and deepens assertions before signing off.
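
The verification layer can be pictured as a small gate over mutation results. The sketch below is an assumption about how such a gate might look, not the actual ex-verify implementation and not the StrykerJS report format: the `MutantResult` shape, the four statuses, and the `evaluateMutationGate` name are all hypothetical, and the score formula is simplified to detected mutants over all mutants.

```typescript
// Simplified mutant statuses (assumption; real reports contain more states).
type MutantStatus = "Killed" | "Survived" | "Timeout" | "NoCoverage";

interface MutantResult {
  id: string;
  location: string; // e.g. "src/cart.ts:14"
  status: MutantStatus;
}

interface GateResult {
  score: number; // percentage, 0 to 100
  passed: boolean;
  survivors: MutantResult[]; // mutants whose assertions need deepening
}

function evaluateMutationGate(results: MutantResult[], threshold = 70): GateResult {
  if (results.length === 0) {
    throw new Error("No mutants were generated; the gate cannot be evaluated.");
  }
  // Killed and timed-out mutants both count as detected.
  const detected = results.filter(
    (r) => r.status === "Killed" || r.status === "Timeout",
  ).length;
  const score = (detected * 100) / results.length;
  // Everything undetected is fed back so assertions can be strengthened.
  const survivors = results.filter(
    (r) => r.status === "Survived" || r.status === "NoCoverage",
  );
  return { score, passed: score >= threshold, survivors };
}
```

The `survivors` list is the part that mirrors the MuTAP idea: undetected mutants are the input for the next round of assertion deepening rather than just a number on a report.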

R2: Row-Level Security Testing Gaps

Finding

RLS policies provide database-level enforcement that application bugs cannot bypass. However, RLS policies themselves can have subtle bugs: a wrong column reference, a missing policy on a new table, or a policy that uses the wrong session variable. Standard unit tests do not catch these because tests typically run as the database superuser, which bypasses RLS entirely.

How This Shaped the System

Every spec that creates a new table must include an explicit tenant isolation test: insert data for Tenant A, switch to Tenant B's database context (using the app-level connection, not the admin connection), assert zero results. This test pattern is enforced in the spec template checklist. Integration tests use a dedicated app_user role that has RLS applied, separate from the neon_superuser role used for schema operations.
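
The reason the role matters can be shown with a toy in-memory model. This is a simulation for illustration only, not PostgreSQL and not the project's test code: `ToyTable`, `RowPolicy`, and the tenant names are hypothetical, and the only borrowed details are the two role names mentioned above.

```typescript
interface TenantRow {
  tenantId: string;
  payload: string;
}

interface DbSession {
  role: "app_user" | "neon_superuser";
  tenantId: string; // stands in for the session variable a policy would read
}

type RowPolicy = (row: TenantRow, session: DbSession) => boolean;

class ToyTable {
  private rows: TenantRow[] = [];
  private policy: RowPolicy;

  constructor(policy: RowPolicy) {
    this.policy = policy;
  }

  insert(row: TenantRow): void {
    this.rows.push(row);
  }

  select(session: DbSession): TenantRow[] {
    // The superuser bypasses the policy entirely, as with real RLS.
    if (session.role === "neon_superuser") return [...this.rows];
    return this.rows.filter((row) => this.policy(row, session));
  }
}

const correctPolicy: RowPolicy = (row, session) => row.tenantId === session.tenantId;
const missingPolicy: RowPolicy = () => true; // models a table with no policy

// Insert as Tenant A, read as Tenant B, and count what B can see.
function rowsVisibleToOtherTenant(policy: RowPolicy, role: DbSession["role"]): number {
  const table = new ToyTable(policy);
  table.insert({ tenantId: "tenant-a", payload: "secret" });
  return table.select({ role, tenantId: "tenant-b" }).length;
}
```

Under `app_user` the count is 0 with the correct policy and 1 with the missing one, so the isolation test distinguishes them. Under `neon_superuser` the count is the same in both cases because the policy is never evaluated, which is why a superuser-run test says nothing about the policy.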

R3: React 19 useOptimistic Incompatibility with TanStack Query

Finding

React 19's useOptimistic hook is incompatible with libraries that use useSyncExternalStore, including TanStack Query. Optimistic state rebases incorrectly when a query refetch finishes during an action: the optimistic update is applied on top of the new server state instead of replacing it, causing a brief incorrect value before reverting.

Key Source

TanStack Query GitHub Issue #9742 (October 2025). The issue includes a detailed reproduction: the user clicks a button and the optimistic count shows 2 (correct); a refetch completes with server state 2; React rebases the optimistic update on top of the new state, showing 2 + 1 = 3 (incorrect); the action finishes and the count reverts to 2 (correct). The bug lies in how React's transition system interacts with external store synchronization.
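
The arithmetic of that sequence can be replayed with a toy model in which the displayed value is the base state with all pending optimistic updates applied on top. This is only a sketch of the rebasing behavior as described in the issue, not React's implementation; the `RebaseModel` class is hypothetical, and the starting count of 1 is an assumption implied by the optimistic value of 2.

```typescript
type OptimisticUpdate = (value: number) => number;

class RebaseModel {
  private base: number;
  private pending: OptimisticUpdate[] = [];

  constructor(initial: number) {
    this.base = initial;
  }

  addOptimistic(update: OptimisticUpdate): void {
    this.pending.push(update);
  }

  // A refetch delivering fresh server state replaces the base.
  setBase(value: number): void {
    this.base = value;
  }

  // The action finished: pending optimistic updates are dropped.
  settle(): void {
    this.pending = [];
  }

  // Displayed value = pending updates replayed on the current base.
  displayed(): number {
    return this.pending.reduce((value, update) => update(value), this.base);
  }
}

function replayIssueSequence(): number[] {
  const model = new RebaseModel(1); // assumed starting count
  const seen: number[] = [];
  model.addOptimistic((n) => n + 1);
  seen.push(model.displayed()); // click: optimistic value
  model.setBase(2);
  seen.push(model.displayed()); // refetch lands mid-action: update rebased
  model.settle();
  seen.push(model.displayed()); // action finishes: back to server state
  return seen;
}
```

In this model `replayIssueSequence()` yields 2, then 3, then 2, matching the correct, incorrect, correct sequence from the reproduction.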

How This Shaped the System

The methodology explicitly bans useOptimistic in any project using TanStack Query. Instead, three optimistic patterns are defined (Via UI, Via Cache, None) using TanStack Query's own mutation callbacks (onMutate, onError, onSettled). Every mutation in every phase plan must specify which pattern to use. The spec verification command (verify-spec) checks for useOptimistic usage and flags it as a critical violation.
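
A rough sketch of the Via Cache pattern follows. To stay self-contained it uses a hypothetical `MiniCache` in place of a real query cache, and the callback signatures are simplified; only the callback names (onMutate, onError, onSettled) come from TanStack Query. In a real project the snapshot, optimistic write, and refetch would presumably go through the query client rather than this stand-in.

```typescript
// Hypothetical stand-in for a query cache; not the TanStack Query API.
class MiniCache<T> {
  private data: T;

  constructor(initial: T) {
    this.data = initial;
  }

  get(): T {
    return this.data;
  }

  set(next: T): void {
    this.data = next;
  }
}

// Simplified callback shapes (assumption: the real ones take more arguments).
interface MutationCallbacks<TVars, TContext> {
  onMutate: (vars: TVars) => TContext;
  onError: (error: Error, vars: TVars, context: TContext) => void;
  onSettled: () => void;
}

function viaCachePattern(
  cache: MiniCache<number>,
  refetch: () => void,
): MutationCallbacks<number, { previous: number }> {
  return {
    onMutate: (delta) => {
      const previous = cache.get(); // snapshot for rollback
      cache.set(previous + delta); // optimistic write into the cache
      return { previous };
    },
    onError: (_error, _delta, context) => {
      cache.set(context.previous); // roll back to the snapshot
    },
    onSettled: () => {
      refetch(); // resync with the server whether it succeeded or failed
    },
  };
}
```

Because the optimistic value lives in the cache itself, there is no second layer of optimistic state for React to rebase, which is the point of avoiding useOptimistic here.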

R4: AI Agent Context Window Degradation

Finding

AI coding agent sessions degrade in quality after extended use. Two mechanisms: (1) Memory leaks cause process growth from 200MB to 12-23GB, triggering system memory pressure and swapping. (2) After 3-4 auto-compactions of the context window, the agent loses nuanced understanding of earlier decisions, leading to inconsistent code, reimplemented solutions, and contradictory patterns.

Validation

Empirically observed across multiple Claude Code sessions lasting 4+ hours. Quality metrics (mutation score, fix-retest cycles, deviation count) consistently worsen after the third compaction event. No published academic study quantifies this specific degradation, but the pattern is widely reported in developer communities.

How This Shaped the System

The three-command model (ex-test, ex-impl, ex-verify) ensures each execution phase starts with a fresh session and clean context. Proactive compaction at 60% (not 95%) preserves quality during each session. The spec header includes a context budget estimate so specs that would exceed 50% of usable context are split before execution begins.
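
The two thresholds reduce to simple ratio checks. The sketch below assumes a token-based budget; the `ContextBudget` shape, the function names, and the 200,000-token default are hypothetical placeholders, while the 50% and 60% figures come from the text above.

```typescript
interface ContextBudget {
  usableTokens: number; // assumption: usable context size for one session
  splitThreshold: number; // specs above this share are split before execution
  compactThreshold: number; // compaction is triggered at this share
}

const DEFAULT_BUDGET: ContextBudget = {
  usableTokens: 200_000, // placeholder value, not a measured figure
  splitThreshold: 0.5,
  compactThreshold: 0.6,
};

// True when the spec's estimated footprint exceeds 50% of usable context.
function shouldSplitSpec(
  estimatedTokens: number,
  budget: ContextBudget = DEFAULT_BUDGET,
): boolean {
  return estimatedTokens / budget.usableTokens > budget.splitThreshold;
}

// True once the session has consumed 60% of usable context.
function shouldCompact(
  usedTokens: number,
  budget: ContextBudget = DEFAULT_BUDGET,
): boolean {
  return usedTokens / budget.usableTokens >= budget.compactThreshold;
}
```

The split check runs once, before execution, using the estimate in the spec header; the compaction check would run repeatedly during a session.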

R5: Playwright 1.57+ Chrome for Testing Memory Regression

Finding

Playwright 1.57 switched from open-source Chromium to Chrome for Testing builds. On macOS with Apple Silicon, a single Chrome for Testing instance uses far more RAM than the previous Chromium builds. With 3 workers, each instance grows to approximately 20GB, pushing the system load average above 27 and triggering severe memory pressure.

Key Source

Playwright GitHub Issue #38489 (December 2025). The reporter's environment was macOS 26.1 on an M1 Pro with 16GB RAM. Setting browserName: 'chromium' was expected to launch open-source Chromium but launched Chrome for Testing instead. The Playwright team confirmed the switch was intentional ("Playwright now runs on Chrome for Testing builds"), but the memory impact was not anticipated.

How This Shaped the System

Playwright is pinned below version 1.57 (~1.56.0) in all projects. The machine setup script calculates Playwright worker counts based on available memory after accounting for AI agent overhead. Headed browsers (used by ex-verify for exploratory testing) get fewer workers than headless (used by ex-impl for automated tests). Browser launch args include memory optimization flags (--disable-dev-shm-usage, --disable-gpu, --disable-extensions).
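
The worker calculation might look roughly like the following. This is a sketch under stated assumptions, not the actual setup script: the `MemoryPlan` shape, the `planWorkerCount` name, and any per-worker memory figures passed in are hypothetical. Only the three launch flags are taken from the text above.

```typescript
// Memory-related launch flags listed in the methodology.
const LAUNCH_ARGS: string[] = [
  "--disable-dev-shm-usage",
  "--disable-gpu",
  "--disable-extensions",
];

interface MemoryPlan {
  availableGb: number; // memory available on the machine
  agentOverheadGb: number; // reserved for the AI agent process
  perWorkerGb: number; // assumed cost of one browser worker
}

// Workers that fit in what remains after the agent's share, never below 1.
function planWorkerCount(plan: MemoryPlan): number {
  const budgetGb = plan.availableGb - plan.agentOverheadGb;
  return Math.max(1, Math.floor(budgetGb / plan.perWorkerGb));
}
```

Headed runs would pass a larger `perWorkerGb` than headless runs, which is how the same function yields fewer workers for ex-verify than for ex-impl. The result is the kind of value that would be supplied as the workers setting in a Playwright config, with `LAUNCH_ARGS` going into the browser launch options.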

R6: Architectural Fitness Functions

Finding

Architectural rules that exist only as documentation drift away from the code over time. Rules encoded as executable tests (fitness functions) are enforced continuously. ArchUnit (Java) and ArchUnitTS (TypeScript) allow expressing package boundary rules, import restrictions, and structural constraints as test assertions.

Key Property

A critical safety behavior: if a fitness function rule matches zero files (due to a typo in the path or a renamed package), the test fails rather than silently passing. This prevents false confidence. A rule that validates nothing is as dangerous as no rule at all.

How This Shaped the System

Import boundaries (apps never import other apps, only packages/db uses the ORM, only packages/ai uses LLM SDKs) are encoded as ArchUnitTS tests that run in CI. These catch AI agents that accidentally bypass the architectural constraints. The consolidation phase per milestone explicitly reviews and updates fitness functions to cover new packages and patterns.
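
The shape of such a rule, including the fail-on-zero-matches behavior, can be sketched without any library. The code below is a hand-rolled illustration and deliberately does not use the ArchUnitTS API; `SourceFileInfo`, `BoundaryRule`, `checkBoundary`, and the `some-orm` package name are all hypothetical.

```typescript
interface SourceFileInfo {
  path: string;
  imports: string[]; // import specifiers found in the file
}

interface BoundaryRule {
  name: string;
  appliesTo: RegExp; // files the rule governs
  forbidden: RegExp; // import specifiers those files must not use
}

// Returns one message per violation; throws if the rule governs no files.
function checkBoundary(files: SourceFileInfo[], rule: BoundaryRule): string[] {
  const governed = files.filter((file) => rule.appliesTo.test(file.path));
  if (governed.length === 0) {
    // A rule that validates nothing must fail loudly, not pass silently.
    throw new Error(`Rule "${rule.name}" matched zero files; check the path pattern.`);
  }
  const violations: string[] = [];
  for (const file of governed) {
    for (const specifier of file.imports) {
      if (rule.forbidden.test(specifier)) {
        violations.push(`${file.path} imports ${specifier}`);
      }
    }
  }
  return violations;
}

// Example rule: only packages/db may import the ORM.
const ormOnlyInDb: BoundaryRule = {
  name: "only packages/db uses the ORM",
  appliesTo: /^(?!packages\/db\/)/, // every file outside packages/db
  forbidden: /^some-orm/,
};
```

A test in CI would assert that `checkBoundary` returns an empty array for each rule. The thrown error covers the renamed-package case: if a path pattern stops matching anything, the build breaks instead of reporting a clean result.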

Consolidated Bibliography

| ID | Citation | Used In |
| --- | --- | --- |
| R1a | Dakhel, A.M., Nikanjam, A., Majdinasab, V., Khomh, F., Desmarais, M.C. (2024). "Effective test generation using pre-trained Large Language Models and mutation testing." Information and Software Technology, 171, 107468. | Part 7 |
| R1b | Foster, C., Gulati, A., Harman, M., et al. (2025). "Mutation-guided LLM-based test generation at Meta." FSE 2025. arXiv:2501.12862. | Part 7 |
| R1c | Wukong mutation testing study (2024-2025). "On the Use of Large Language Models in Mutation Testing." arXiv:2406.09843v4. | Part 7 |
| R3 | TanStack Query GitHub Issue #9742 (October 2025). "Is React Query incompatible with React Actions/Transitions/useOptimistic?" | Part 3, 6 |
| R5 | Playwright GitHub Issue #38489 (December 2025). "No way to use open-source Chromium, Chrome for Testing causes high memory usage (20GB+ per instance)." | Part 8 |
| R6 | ArchUnit / ArchUnitTS documentation and community patterns for architectural fitness functions. | Part 11 |
| R7 | Wang, J., et al. (2024). "Software Testing with Large Language Models: Survey, Landscape, and Vision." Comprehensive survey of LLM testing approaches. | Background |
| R8 | Yuan, Z., et al. (2024). "Evaluating and improving ChatGPT for unit test generation." Proc. ACM Softw. Eng., 1(FSE). | Background |
| R9 | Alshahwan, N., et al. (2024). "Automated unit test improvement using Large Language Models at Meta." FSE 2024. | Background |