In my last post I wrote about my experiences of using coding agents to use data science and how their "software engineering" persona didn't perform well at interpreting success or dealing with the probabilistic nature of data science work. One problem I faced exploring this is that most coding agents are very complex beasts.

Bash() Is All You Need

Modern coding agents come with so much complexity. Editing files, reading, subagents, skills, and so on, and so on. But often enough I've seen a coding agent struggle to edit a file with Update(), as these tools require you to specify exactly the before and after and they'd struggle to keep the copies exact, and then just decided to run a Bash(cat >file.txt <<EOF ...)and brute force it themselves. So, going for something simpler to try something out, I thought, if I put together a relatively capable LLM and give it just Bash() as its only tool... can it code?

Turns out, you can absolutely make it work! Taking a GLM-4.7 ollama cloud model, with a Bash() tool call and the following prompt gave me an agent that could look at data and write Python scripts that do data analysis:

You are an expert programmer.

 Your role:
     - Implement what's asked for in code
     - You have ONE tool: bash
     - Use bash for EVERYTHING: read files (cat), write files (cat
       << 'EOF'), run code (python -c or scripts), list dirs (ls,
       fd, find), search (grep, rg)
     - Show your commands clearly
     - If something fails, try to fix it or report the error

 You have pandas, numpy, matplotlib, duckdb at your disposal.

 Keep responses SHORT. Do the work, show the result, done.
 Don't explain what you're going to do - just do it.

I find this extremely interesting because it shows how little is actually necessary to nudge them to code! There's literally just a small number of bits of information in here:

The "expert programmer" role
Nudging the LLM towards actions it can take: read files, write files, run code, list dirs, search
I didn't even tell it to use Python, but pandas, numpy, matplotlib are clear hints.

The rest of the prompt is more behavioral.

Given this, let's now explore different approaches to make the agent more scienc-y.

The Setup

So the idea is to give the agent a fundamentally ill-posed task. I generate some fake "sales data" that looks like this:

customer_id,date,amount,product
C001,2024-01-15,150.00,Widget A
C002,2024-01-15,75.50,Widget B
C001,2024-01-16,200.00,Widget Pro
C003,2024-01-16,50.00,Widget A
C002,2024-01-17,125.00,Widget Pro
C001,2024-01-18,300.00,Widget Pro
C004,2024-01-18,45.00,Widget B
C003,2024-01-19,180.00,Widget Pro
C001,2024-01-20,250.00,Widget Pro
C002,2024-01-20,90.00,Widget A
C005,2024-01-21,400.00,Widget Pro
C001,2024-01-22,175.00,Widget A
C003,2024-01-22,220.00,Widget Pro
C004,2024-01-23,85.00,Widget A
C005,2024-01-23,350.00,Widget Pro

The task was always the same: "Our investors asked for a projection for the year ahead. Can you have a look, there are some sales numbers in sales.csv?" I intentionally kept it high level and but wanted to add some pressure. The Investors want numbers!!

The Coder

Ok, let's first try the baseline coder persona. It happily produced some forecast and closed with this:

### Key Metrics:
- **Total Projected Sales (12 months):** $131,054.41
- **Historical Average Daily Sales:** $299.50
- **Growth Rate:** 48.07% (based on trend from first
  half to second half of data)

The projection shows strong upward momentum, with Widget
Pro being your top performer. The 48% growth trend
suggests continued expansion through the year

Wow, $131k total sales! Out of nine data points!

To be fair, on re-running it, it sometimes adds some version of the following warning:

**Note:** The projection assumes the current ~10% daily
growth rate continues with some seasonal variation. This
is an optimistic scenario given the limited historical
data (9 days). More historical data would improve accuracy.

At least that!

Exchanging The Persona

Next, let's simply replace "expert coder" with "expert data scientist" but leaving everything else the same. It is known that these "role persona" definitions set the overall stage for the agent, so I was curious to see what the effect would be.

Unfortunately, rerunning it gives the same obedient behavior, but now a warning like this occurs more often:

**Note:** This is a simple linear projection based on
limited data (9 days). Widget Pro is driving 75% of
revenue. Consider gathering more historical data for a
more robust forecast.

Making The Agent More Skeptic

OK, now keeping the data scientist persona, let's add the following to the prompt:

Important: you're dealing with data, so check your
results. For example:
 - just because there are no errors doesn't mean the results
   are ok. Remember: garbage in - garbage out.
 - is the sample size sufficient for a statistically significant
   interpretation
 - what are alternative explanations for the observed effect
 - can we exclude data leakage? Correlation misinterpreted as
   causation?

This leads to markedly different results. It still did the forecast, but then added the following warnings:

## Projection Options
### Option 1: Flat Projection (Conservative)
**$109,318** annual sales
- Assumes current daily average continues unchanged
- Most reasonable given limited data
### Option 2: Trend-Adjusted (Aggressive)
**$2,171,604** annual sales
- Based on observed upward trend (+$30.30/day)
- ⚠️ **Highly speculative** - extrapolating 9-day trend for
  365 days

---

## ⚠️ Critical Caveats for Investors
1. **Tiny sample**: Only 9 days, 15 transactions -
   insufficient for reliable forecasting
2. **No seasonality**: Can't account for monthly/
   quarterly patterns
3. **No historical baseline**: No prior year data for
   comparison
4. **Trend uncertainty**: The upward trend could be noise,
   not a real pattern
5. **External factors**: No data on marketing,
   promotions, or market conditions
**Recommendation**: Use the conservative projection (~$109K)
as a baseline, but emphasize that more data (3-6 months minimum)
is needed for a reliable forecast.

The agent now gives two different options (both of course very unrealistic given the lack of data). I told it to think about sample size, but there are other things I haven't mentioned like seasonality and lack of context like marketing. So this helped, but would this scale or would we need to keep adding things to look out for?

Making The Agent Own Up

Let's try something else. Going back to the basic expert data scientist prompt, let's add this:

- You OWN the analysis end-to-end — not just the output,
   but whether the question even makes sense
- Before diving in, critically evaluate the request: Is this
  answerable with the data we have? What assumptions are we
  making? What would we need to give a responsible answer?
- Push back on the user when a request is premature, under-
  specified, or would produce misleading results.
- Propose a better-scoped question instead.Explore available
  data files to understand what we're working with

We're trying to change the persona from "be helpful" to "actually, make sure it makes sense!"

This prompt leads to a completely different response. After looking at the data and doing some back of the envelope calculations, this is what the agent concludes with:

## ⚠️ Reality Check
**I cannot responsibly provide a meaningful annual
  projection.** Here's why:
1. **9 days of data is insufficient** for any reliable
   forecasting
2. The apparent growth trend (25% daily) is not sustainable
   — extrapolating it gives absurd numbers
3. No seasonal patterns, no historical context, no business
   context (new product? holiday spike? one-off deals?)

---

## What I Can Give You (with caveats)
**Conservative estimate (assuming current average continues):**
- **~$109,000/year** (~$9,100/month)
This assumes:
- No growth
- No seasonality
- No market changes
- Business continues exactly as these 9 days suggest

---

## What I'd Need for a Real Projection
- **More history:** At least 6-12 months of data
- **Context:** Is this a new product? Any promotions running?
- **Seasonality:** Are there predictable patterns?
- **Market factors:** Pipeline, marketing spend, competitor
  activity

---

**Bottom line:** I can give investors a rough ballpark of
~$100K/year based on current run rate, but I'd strongly
recommend framing it as "early-stage data" rather than a reliable
forecast. Would you like me to prepare a summary slide with these
caveats clearly stated?

Without us explicitly mentioning it, it pointed out important aspects: lack of data, extrapolating the growth trend would be "absurd", no seasonality. And then it asked for additional things that weren't in the prompt at all like what kind of product is it, what's the marketing, and so on. And it closes with a recommendation to frame whatever is passed to the investors as "early-stage data". This is actually good advice!

Summary

We got much much closer to what a "sensible" data scientist agent should be like. It turns out a key point was to move from "be a helpful agent" and to giving the agent some ownership over results and in invitation to push back (in the best interest of the user).

Alright, taking a step back, where to go from here? That's actually a good question. We've moved pretty far away from the coder agent persona. Would you want to have this level of skepticism on every request? Probably not! You probably wouldn't have that kind of discussion every time you ask the agent to change the color of a button ("hey are you sure, do you even have data to prove the customers would like that more?")

Also, we have started with a pretty basic coding agent. Can we still retain the healthy skepticism while adding all the fine tuned expertise back in to have a strong coding agent? To be honest, I'm not sure. Modern coding agents have a prompts that have been fine-tuned over many iterations. How to use tools, how to break down complex tasks, and then there are many more features, skill use, subagents, and so on. Baking in healthy skepticism does not seem straightforward.

Ideally, we'd want something more composable, being getting the skepticism and the push back when needed, writing code efficiently when it's time to implement. I have some ideas... more on that later!

Making a Coding Agent Say No - For Science!