I want my agent to fit a mixed effects model in R. How many tokens does that cost?
I spend a non-trivial part of my month inside an LLM agent, mostly Claude Code, asking it to run R analyses on my behalf: fit a model, render a plot, transform a data frame, recover parameters from a simulation. Each task spawns a chain of “read the documentation, generate R code, run it, catch an error, fix the code, retry”. Token consumption adds up, and so does waiting time.
So I built community-skills, an R-focused hub that turns CRAN packages into machine-readable skills with a uniform JSON contract. The agent invokes the skill instead of generating R inline. This post answers three questions: what does the pattern save, who should adopt it, and what is the return on a real workflow.
The cost of the inline approach
When an agent has no skill available and is asked to invoke a CRAN package, it goes through roughly this loop:
- Load documentation context. It fetches part of the package README, vignette, or man page: the verbose half of CRAN’s surface, prose explaining how arguments interact and what defaults mean.
- Generate R code. It synthesizes the call. For famous packages (dplyr, ggplot2, lme4) it often gets it right on the first try, drawing from training data. For less famous packages it stumbles.
- Run the code. It invokes Rscript. If R errors out (wrong argument name, deprecated function, version drift), the error comes back, the agent regenerates, and the loop continues.
- Validate the output. It inspects the result, decides if it matches intent, and either returns it or re-prompts.
For non-trivial tasks this consumes 15,000 to 40,000 tokens per analysis. The variance is driven by retry count, which is itself driven by how well-known the package is.
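For concreteness, the inline loop ultimately has to converge on plain R like the sketch below, written for the lme4 task used in the benchmark later in this post. A minimal sketch, not verbatim agent output; one wrong argument name restarts the whole loop.

```r
# What the agent must eventually synthesize from documentation context.
library(lme4)

fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
fixef(fit)  # fixed effects: intercept and per-day slope
```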
The cost of the skill approach
When the same package is wrapped as a skill, the loop becomes:
- Load the contract. The agent reads `skills/<package>/SKILL.md`, a structured document declaring runtime and version plus a per-function schema with inputs, outputs, and a worked example. Typically 500 to 1,500 tokens.
- Construct the payload. It serializes the call as a JSON object that follows the schema. Typically 100 to 400 tokens.
- Invoke the bridge. A thin Python wrapper spawns Rscript, sends the JSON on stdin, captures stdout, and dispatches on the `fn` field. The result comes back as a JSON object with `ok`, `fn`, and `result` or `error`.
- Use the output. It is already structured; no parsing needed.
For the same non-trivial tasks this consumes 700 to 2,000 tokens. The variance comes mostly from the size of the SKILL.md: well-documented packages have larger skill files but produce zero retries.
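As a sketch of the wire format, here is what a payload and round trip can look like, driven from R for illustration. The `fn`, `ok`, `result`, and `error` fields follow the contract above; the function name `fit` and its arguments are hypothetical, and the per-function schema in each SKILL.md is authoritative.

```r
library(jsonlite)

# Hypothetical payload for an lme4 skill; field names beyond fn/ok/result/error
# are illustrative, not the actual schema.
payload <- toJSON(list(
  fn      = "fit",
  data    = "sleepstudy",
  formula = "Reaction ~ Days + (Days | Subject)"
), auto_unbox = TRUE)

# Pipe the payload into the dispatcher on stdin, exactly as the bridge does.
out <- system2("Rscript", "skills/lme4/invoke.R",
               input = as.character(payload), stdout = TRUE)

res <- fromJSON(paste(out, collapse = ""))
if (isTRUE(res$ok)) res$result else res$error
```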
The benchmark
I defined four tasks in two modes, without skill and with skill, on Claude Code with claude-opus-4-7:
- lme4 mixed effects model on `sleepstudy`: fit `Reaction ~ Days + (Days | Subject)`, return fixed effects.
- ggplot2 faceted plot on `mpg`: scatter `displ` by `hwy`, color by `drv`, facet by `class`.
- dplyr pipeline on `mtcars`: filter, group, summarize, arrange.
- bgumbel MLE recovery: generate 1,000 samples, recover parameters by maximum likelihood.
The mix is deliberate. Tasks 2 and 3 use packages Claude knows extremely well from training, so the without-skill mode is a best case for the agent. Task 4 uses a package the agent has minimal exposure to. Task 1 sits in between.
Methodology, stated honestly
The numbers below come from a static cost model, not from end-to-end measurement of live token consumption. A clean dynamic benchmark needs recursive agent invocation plus controlled trials, which did not fit the publication window. The static model is built from measured inputs (SKILL.md sizes from the repository, bridge round-trip verified directly) and a small set of stated assumptions (CRAN reference manual page counts, generated code size, retry counts by package familiarity, conversion of one token to about four characters). A dynamic end-to-end benchmark, with controlled trials and live token measurement, is the natural next step; when those runs land, the estimates here get revised in a follow-up note.
Results (static estimates)
Task Mode Tokens (mean) Wall-clock (s) Retries
----------- -------------- ------------- -------------- -------
01 lme4 without_skill 12,000 180 1.3
01 lme4 with_skill 1,800 30 0.0
02 ggplot2 without_skill 8,000 90 0.7
02 ggplot2 with_skill 2,500 35 0.0
03 dplyr without_skill 7,500 75 0.3
03 dplyr with_skill 1,500 25 0.0
04 bgumbel without_skill 28,000 280 2.7
04 bgumbel with_skill 1,400 25 0.0
Savings ratios, without divided by with:
lme4 6.7x
ggplot2 3.2x
dplyr 5.0x
bgumbel 20.0x
Mean 8.7x
The variance across tasks is the point. Skills save little for packages the agent already knows (ggplot2: about 3x) and save a lot for packages the agent barely knows (bgumbel: about 20x). For a typical R workflow that mixes both, expect 5 to 10x on average.
Who should adopt this
Three profiles apply:
- Data scientists with an LLM agent in the R workflow. If you use Claude Code, Codex, OpenCode, or a custom harness that invokes R analyses for you a few times a day, the savings compound.
- Pipelines that automate R analysis. Bioinformatics, genomics, finance, anywhere R is the language of decision inside a larger orchestrated process. Skills make the R step deterministic and JSON-friendly for the orchestrator.
- Educators. Teaching R through an agent that can demonstrate “how do I do X in R” with executable code and typed outputs, no retries from hallucinations.
Who should not
It is worth being clear about where the pattern adds nothing:
- You write R by hand in RStudio. No agent in the loop, no savings to capture.
- You only use Python. R is not in your workflow, the skills are not for you.
- You need microsecond latency. The bridge spawns a fresh Rscript subprocess per call, about 100 ms of spawn cost. Negligible for agentic invocation, dominant for tight inner loops. Use an in-process binding instead.
- You already wrap your own R via custom glue. Switching only pays off if the uniform JSON contract gives you something you lack: debuggability, easier handoff between agents, multi-skill orchestration.
Return on a real workflow
The four benchmark tasks are mid-difficulty for illustration. Real workflows include longer-tail tasks, so the estimate below uses 25,000 tokens per task without skill and 1,500 with skill, a saving of 23,500 tokens per analysis. Pricing uses two Claude tiers, blended at 70 percent input and 30 percent output:
Intensity Analyses/day Annual saving, Sonnet 4.6 Annual saving, Opus 4.7
------------ ------------ ------------------------- -----------------------
Casual 5 USD 194 USD 969
Moderate 30 USD 1,163 USD 5,816
Intensive 100 USD 3,878 USD 19,388
Heavy team 500 USD 19,388 USD 96,938
Formula: 23,500 tokens times analyses per day times 250 working days times blended price. Recompute with the spot price and intensity for your context.
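In R, the recomputation is one line. The blended prices below, about USD 6.60 and USD 33.00 per million tokens for the two tiers, are back-derived from the table, not quoted rates; substitute the spot price for your model.

```r
# Annual saving = tokens saved/analysis x analyses/day x working days x price.
tokens_saved <- 23500
working_days <- 250
price_per_m  <- c(sonnet_4_6 = 6.60, opus_4_7 = 33.00)  # assumed USD per 1M tokens

annual_saving <- function(analyses_per_day, price) {
  tokens_saved * analyses_per_day * working_days * price / 1e6
}

annual_saving(30, price_per_m)  # moderate tier: ~USD 1,163 and ~USD 5,816
```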
Time often dominates the token bill. A non-trivial analysis takes 90 to 180 seconds without skill, retries included, against 25 to 35 seconds with skill. For a moderate workflow of 30 analyses a day, that recovers roughly 40 to 75 minutes daily, productive time whose value often exceeds the token saving, especially when an analyst is waiting on a result before deciding.
The pattern, summarized
Three pieces, no daemon, no shared state:
- A contract. `SKILL.md` declares what is callable, with input and output schemas per function.
- A dispatcher. `invoke.R` reads JSON from stdin, dispatches on the `fn` field, writes JSON to stdout.
- A bridge. A small Python module spawns Rscript per call, sends the payload, captures the result.
The agent treats each skill as a black box with a documented API. No new R code is generated per invocation, only payloads. To use a skill without contributing one, install the package and call the bridge, or pipe JSON straight into `Rscript skills/<package>/invoke.R`.
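A minimal sketch of what such a dispatcher can look like under those conventions; the `fit` handler and its argument names are hypothetical, and the real invoke.R files in the repository are more complete.

```r
#!/usr/bin/env Rscript
# Minimal invoke.R sketch: JSON in on stdin, dispatch on fn, JSON out on stdout.
suppressMessages(library(jsonlite))

handlers <- list(
  fit = function(req) {
    # Load the named dataset shipped with lme4 (hypothetical convention).
    data(list = req$data, package = "lme4", envir = environment())
    fit <- lme4::lmer(stats::as.formula(req$formula), data = get(req$data))
    as.list(lme4::fixef(fit))
  }
)

req <- fromJSON(paste(readLines("stdin"), collapse = ""))
res <- tryCatch(
  list(ok = TRUE, fn = req$fn, result = handlers[[req$fn]](req)),
  error = function(e) list(ok = FALSE, fn = req$fn, error = conditionMessage(e))
)
cat(toJSON(res, auto_unbox = TRUE))
```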
State of the hub
community-skills ships 95 skills: 90 wrapping packages from the CRAN top one hundred, and 5 core skills (bgumbel, cran_graph, cran_publisher, cran_workflow, autoresearch). Alongside the skills it carries an internal SQLite snapshot of the CRAN dependency graph, more than 24,000 packages as nodes and roughly 240,000 dependency edges, with a four-status deprecation classifier. The test suite passes 280 of 310 tests. Many of the 90 CRAN skills currently carry structural smoke tests; per-skill semantic review is ongoing, incremental work. The repository is honest about what is solid and what is still in progress.
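As an illustration of what the snapshot makes queryable, here is a hypothetical reverse-dependency count. The database path, table, and column names are assumptions for the sketch; check the repository for the actual schema.

```r
library(DBI)

# Hypothetical schema: an edges(pkg, dep) table; adjust to the real snapshot.
con <- dbConnect(RSQLite::SQLite(), "cran_graph.sqlite")
dbGetQuery(con, "
  SELECT dep, COUNT(*) AS reverse_deps
  FROM edges
  GROUP BY dep
  ORDER BY reverse_deps DESC
  LIMIT 10
")
dbDisconnect(con)
```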
Limitations
- Subprocess spawn costs about 100 ms per call. Negligible for agentic invocation, not for tight loops.
- The benchmark is a static cost model; a dynamic end-to-end run with live token measurement is the next step, not yet done.
- All estimates assume Claude Code with claude-opus-4-7. Other agents and models may show different absolute numbers; the relative comparison should hold.
- Public scope is R only. The Python pieces are internal infrastructure.
Acknowledgments
The bimodal Gumbel model wrapped by the canonical bgumbel skill is in Otiniano, C. E. G.; Vila, R.; Brom, P. C.; Bourguignon, M. (2023). On the Bimodal Gumbel Model with Application to Environmental Data. Austrian Journal of Statistics, 52, 45-65. DOI 10.17713/ajs.v52i2.1392.
The autoresearch library that powers part of the tooling is at DOI 10.5281/zenodo.19772195. Thanks to the R Core team, the CRAN maintainers, and the authors of the packages wrapped in this gallery; the pattern works because the underlying packages are stable and consistent.
Where it is
community-skills is open source under the MIT license: github.com/pcbrom/community-skills. A new skill is a SKILL.md plus a bridge, and contributions are welcome.
I write about scientific method, statistics, and AI applied with rigor on LinkedIn: linkedin.com/in/pcbrom.