I want my agent to fit a mixed effects model in R. How many tokens does that cost?

Tags: R, CRAN, LLM, agents, tooling
A practical look at the token cost of letting an LLM agent call CRAN packages, and how a typed skill contract cuts it. With a benchmark, an honest methodology, and an ROI estimate.
Author: Pedro Carvalho Brom

Published: May 15, 2026

Code repository: pcbrom/community-skills

I spend a non-trivial part of my month inside an LLM agent, mostly Claude Code, asking it to run R analyses on my behalf: fit a model, render a plot, transform a data frame, recover parameters from a simulation. Each task spawns a chain of “read the documentation, generate R code, run it, catch an error, fix the code, retry”. Token consumption adds up, and so does waiting time.

So I built community-skills, an R-focused hub that turns CRAN packages into machine-readable skills with a uniform JSON contract. The agent invokes the skill instead of generating R inline. This post answers three questions: what does the pattern save, who should adopt it, and what is the return on a real workflow.

The cost of the inline approach

When an agent has no skill available and is asked to invoke a CRAN package, it goes through roughly this loop:

  1. Load documentation context. It fetches part of the package README, vignette, or man page: the verbose half of CRAN’s surface, prose explaining how arguments interact and what defaults mean.
  2. Generate R code. It synthesizes the call. For famous packages (dplyr, ggplot2, lme4) it often gets it right on the first try, drawing from training data. For less famous packages it stumbles.
  3. Run the code. It invokes Rscript. If R errors out (wrong argument name, deprecated function, version drift), the error comes back, the agent regenerates, and the loop continues.
  4. Validate the output. It inspects the result, decides if it matches intent, and either returns it or re-prompts.

For non-trivial tasks this consumes 15,000 to 40,000 tokens per analysis. The variance is driven by retry count, which is itself driven by how well-known the package is.
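
For concreteness, here is a hypothetical version of that loop on Task 1 from the benchmark below. The specific mistake is invented for illustration, but it is the shape of error that drives the retry count:

```r
# First attempt the agent might generate: it misremembers the dataset name,
# so R errors out and the message round-trips through the agent.
library(lme4)
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleep_study)
# Error: object 'sleep_study' not found

# Corrected call on the second pass:
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
fixef(fit)  # fixed effects, as the task asks
```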

The cost of the skill approach

When the same package is wrapped as a skill, the loop becomes:

  1. Load the contract. The agent reads skills/<package>/SKILL.md, a structured document declaring runtime and version plus a per-function schema with inputs, outputs, and a worked example. Typically 500 to 1,500 tokens.
  2. Construct the payload. It serializes the call as a JSON object that follows the schema. Typically 100 to 400 tokens.
  3. Invoke the bridge. A thin Python wrapper spawns Rscript, sends the JSON on stdin, captures stdout, and dispatches on the fn field. The result comes back as a JSON object with ok, fn, and result or error.
  4. Use the output. Already structured, no parsing needed.

For the same non-trivial tasks this consumes 700 to 2,000 tokens. The variance comes mostly from the size of the SKILL.md: well-documented packages have larger skill files but produce zero retries.
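
Under the contract, "generating code" shrinks to serializing a payload. A minimal sketch, assuming a hypothetical fit_lmer function exposed by the lme4 skill (the real schema lives in that skill's SKILL.md):

```r
library(jsonlite)

# Build the typed payload; `fn` is the dispatch key the bridge routes on.
payload <- toJSON(list(
  fn   = "fit_lmer",  # hypothetical function name, for illustration
  args = list(formula = "Reaction ~ Days + (Days | Subject)",
              data    = "sleepstudy")
), auto_unbox = TRUE)

# Pipe the JSON into the skill's dispatcher and parse the JSON reply.
out   <- system2("Rscript", "skills/lme4/invoke.R",
                 input = as.character(payload), stdout = TRUE)
reply <- fromJSON(paste(out, collapse = "\n"))

if (isTRUE(reply$ok)) reply$result else stop(reply$error)
```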

The benchmark

I defined four tasks in two modes, without skill and with skill, on Claude Code with claude-opus-4-7:

  1. lme4 mixed effects model on sleepstudy: fit Reaction ~ Days + (Days | Subject), return fixed effects.
  2. ggplot2 faceted plot on mpg: scatter displ by hwy, color by drv, facet by class.
  3. dplyr pipeline on mtcars: filter, group, summarize, arrange.
  4. bgumbel MLE recovery: generate 1000 samples, recover parameters by maximum likelihood.

The mix is deliberate. Tasks 2 and 3 use packages Claude knows extremely well from training, so the without-skill mode is a best case for the agent. Task 4 uses a package the agent has minimal exposure to. Task 1 sits in between.

Methodology, stated honestly

The numbers below come from a static cost model, not from end-to-end measurement of live token consumption. The model combines measured inputs (SKILL.md sizes from the repository, bridge round-trip times verified directly) with a small set of stated assumptions (CRAN reference manual page counts, generated code size, retry counts by package familiarity, and a conversion rate of about four characters per token). A dynamic end-to-end benchmark, with recursive agent invocation, controlled trials, and live token measurement, did not fit the publication window; it is the natural next step, and when those runs land, the estimates here will be revised in a follow-up note.
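
The core arithmetic of the static model fits in a few lines. A sketch with illustrative numbers; the character counts here are placeholders, not measurements from the repository:

```r
chars_per_token <- 4  # stated assumption: ~4 characters per token

# With skill: contract + payload + structured reply.
skill_md_chars <- 5000; payload_chars <- 900; reply_chars <- 600
tokens_with <- (skill_md_chars + payload_chars + reply_chars) / chars_per_token  # ~1,625

# Without skill: documentation context + generated code, inflated by retries.
doc_chars <- 30000; code_chars <- 2000; retries <- 1.3
tokens_without <- (doc_chars + code_chars * (1 + retries)) / chars_per_token     # ~8,650
```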

Results (static estimates)

Task         Mode            Tokens (mean)   Wall-clock (s)   Retries
-----------  --------------  -------------   --------------   -------
01 lme4      without_skill          12,000              180       1.3
01 lme4      with_skill              1,800               30       0.0
02 ggplot2   without_skill           8,000               90       0.7
02 ggplot2   with_skill              2,500               35       0.0
03 dplyr     without_skill           7,500               75       0.3
03 dplyr     with_skill              1,500               25       0.0
04 bgumbel   without_skill          28,000              280       2.7
04 bgumbel   with_skill              1,400               25       0.0

Savings ratios, without divided by with:

lme4       6.7x
ggplot2    3.2x
dplyr      5.0x
bgumbel   20.0x
Mean       8.7x

The variance across tasks is the point. Skills save little for packages the agent already knows (ggplot2: about 3x) and save a lot for packages the agent barely knows (bgumbel: about 20x). For a typical R workflow that mixes both, expect 5 to 10x on average.

Who should adopt this

Three profiles apply:

  1. Data scientists with an LLM agent in the R workflow. If you use Claude Code, Codex, OpenCode, or a custom harness that invokes R analyses for you a few times a day, the savings compound.
  2. Pipelines that automate R analysis. Bioinformatics, genomics, finance, anywhere R is the language of decision inside a larger orchestrated process. Skills make the R step deterministic and JSON-friendly for the orchestrator.
  3. Educators. Teaching R through an agent that can demonstrate “how do I do X in R” with executable code and typed outputs, no retries from hallucinations.

Who should not

It is worth being clear about where the pattern adds nothing:

  1. You write R by hand in RStudio. No agent in the loop, no savings to capture.
  2. You only use Python. If R is not in your workflow, the skills are not for you.
  3. You need microsecond latency. The bridge spawns a fresh Rscript subprocess per call, about 100 ms of spawn cost. Negligible for agentic invocation, dominant for tight inner loops. Use an in-process binding instead.
  4. You already wrap your own R via custom glue. Switching only pays off if the uniform JSON contract gives you something you lack: debuggability, easier handoff between agents, multi-skill orchestration.

Return on a real workflow

The four benchmark tasks are mid-difficulty, chosen for illustration. Real workflows include longer-tail tasks, so the estimate below assumes 25,000 tokens per task without skill and 1,500 with skill, a saving of 23,500 tokens per analysis. Pricing uses two Claude tiers, blended at 70 percent input and 30 percent output:

Intensity     Analyses/day   Annual saving, Sonnet 4.6   Annual saving, Opus 4.7
------------  ------------   -------------------------   -----------------------
Casual                   5   USD    194                  USD    969
Moderate                30   USD  1,163                  USD  5,816
Intensive              100   USD  3,878                  USD 19,388
Heavy team             500   USD 19,388                  USD 96,938

Formula: 23,500 tokens times analyses per day times 250 working days times blended price. Recompute with the spot price and intensity for your context.
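
The table is reproducible from that formula. A sketch, assuming the blended prices that recover the figures above (USD 3 input / 15 output per million tokens for Sonnet, 15 / 75 for Opus); check current list prices before reusing them:

```r
tokens_saved <- 23500   # per analysis
working_days <- 250

blended <- function(in_usd, out_usd) 0.7 * in_usd + 0.3 * out_usd  # USD per 1M tokens

annual_saving <- function(per_day, price)
  tokens_saved * per_day * working_days / 1e6 * price

annual_saving(30, blended(15, 75))  # moderate workload, Opus tier: ~USD 5,816
annual_saving(5,  blended(3, 15))   # casual workload, Sonnet tier: ~USD 194
```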

Time often dominates the token bill. A non-trivial analysis takes 90 to 180 seconds without skill, retries included, against 25 to 35 seconds with skill. For a moderate workflow of 30 analyses a day, that frees up roughly 40 to 75 minutes daily, time whose value frequently exceeds the token saving, especially when an analyst is waiting on a result before making a decision.

The pattern, summarized

Three pieces, no daemon, no shared state:

  • A contract. SKILL.md declares what is callable, with input and output schemas per function.
  • A dispatcher. invoke.R reads JSON from stdin, dispatches on the fn field, writes JSON to stdout.
  • A bridge. A small Python module spawns Rscript per call, sends the payload, captures the result.

The agent treats each skill as a black box with a documented API. No new R code is generated per invocation, only payloads. To use a skill without contributing one, install the package and call the bridge, or pipe JSON straight into Rscript skills/<package>/invoke.R.
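
To make the dispatcher concrete, here is a minimal sketch of the invoke.R shape described above. The handler name and its body are illustrative; the real dispatchers in the repository will differ in detail:

```r
suppressMessages(library(jsonlite))

# One handler per exposed function; names match the `fn` field of the payload.
handlers <- list(
  fit_lmer = function(args) {  # hypothetical handler, for illustration
    e <- new.env()
    data(list = args$data, package = "lme4", envir = e)  # load the named dataset
    fit <- lme4::lmer(as.formula(args$formula), data = get(args$data, envir = e))
    list(fixed_effects = as.list(lme4::fixef(fit)))
  }
)

req <- fromJSON(paste(readLines("stdin"), collapse = "\n"))
res <- tryCatch(
  list(ok = TRUE, fn = req$fn, result = handlers[[req$fn]](req$args)),
  error = function(e) list(ok = FALSE, fn = req$fn, error = conditionMessage(e))
)
cat(toJSON(res, auto_unbox = TRUE))
```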

State of the hub

community-skills ships 95 skills: 90 wrapping packages from the CRAN top 100, and 5 core skills (bgumbel, cran_graph, cran_publisher, cran_workflow, autoresearch). Alongside the skills it carries an internal SQLite snapshot of the CRAN dependency graph, more than 24,000 packages as nodes and roughly 240,000 dependency edges, with a four-status deprecation classifier. The test suite has 280 of 310 tests passing. Many of the 90 R skills currently carry only structural smoke tests; semantic review of each skill is ongoing, incremental work. The repository is honest about what is solid and what is still in progress.

Limitations

  • Subprocess spawn costs about 100 ms per call. Negligible for agentic invocation, not for tight loops.
  • The benchmark is a static cost model; a dynamic end-to-end run with live token measurement is the next step, not yet done.
  • All estimates assume Claude Code with claude-opus-4-7. Other agents and models may show different absolute numbers; the relative comparison should hold.
  • Public scope is R only. The Python pieces are internal infrastructure.

Acknowledgments

The bimodal Gumbel model wrapped by the canonical bgumbel skill is in Otiniano, C. E. G.; Vila, R.; Brom, P. C.; Bourguignon, M. (2023). On the Bimodal Gumbel Model with Application to Environmental Data. Austrian Journal of Statistics, 52, 45-65. DOI 10.17713/ajs.v52i2.1392.

The autoresearch library that powers part of the tooling is at DOI 10.5281/zenodo.19772195. Thanks to the R Core team, the CRAN maintainers, and the authors of the packages wrapped in this gallery; the pattern works because the underlying packages are stable and consistent.

Where it is

community-skills is open source under the MIT license: github.com/pcbrom/community-skills. A new skill is a SKILL.md plus a bridge, and contributions are welcome.

I write about scientific method, statistics, and AI applied with rigor on LinkedIn: linkedin.com/in/pcbrom.