Coding Agents doing Science... not well
Recently I tried using a coding agent for data science work, and I was surprised how much it struggled. It didn't struggle with the actual coding. In fact, it was really good at it, and the interactive collaboration mode worked really well. It's more like pair programming, with the agent on the keyboard and me driving. "Can you show me the errors by class?" - "ah interesting, can you do that as a stacked bar plot" and so on. And it's a master at ad hoc scripts ("python -c"). It's like a Jupyter notebook on steroids - if it works!
Where it really struggled, however, was in NOT getting excited as soon as we got any kind of insight.
"Wow, look at this, 5% errors, amazing, shall we commit this?"
Now, maybe it's me having been in this position way too often. Everything looks great. You're already working on the paper abstract. It's two days till the deadline and you tell yourself you won't be working till midnight THIS TIME making last-minute fixes, because you started early enough! (OK, maybe the "started early enough" part is just me...). And then, 12h to go, you realize there was a subtle bug in your preprocessing code, and you're spilling training data into the validation data. Or maybe one feature was NaN 20% of the time and the code just ignored those data points, artificially improving the accuracy. (I blogged about this years ago and, ironically, suggested a software development practice to address some of these issues.)
So everybody who has done any amount of data work knows to double-check results, over and over again, because it's so easy to get something wrong. Some mistakes will give you an error message, like mixing up matrix dimensions, but many of the mistakes you can make are silent unless you look for them. Then there is other stuff that can go wrong. I already mentioned leaking training data into the test data. Or confounders that are the real reason for the effects you are seeing. Or seasonality in your data you didn't account for, and so on.
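To make the failure mode concrete, here's a minimal sketch of a silent preprocessing leak: normalization statistics computed on the full dataset quietly encode information about the validation split. The data and numbers are made up for illustration.

```python
import random
import statistics

random.seed(0)

# Toy dataset: 100 samples of a feature whose values drift over time,
# so the later (validation) points have a different distribution.
data = [i + random.gauss(0, 5) for i in range(100)]

train, valid = data[:80], data[80:]

# WRONG: normalization statistics computed on ALL data -- the validation
# points influence the mean the model "sees" during training.
leaky_mean = statistics.mean(data)

# RIGHT: compute normalization statistics on the training split only.
clean_mean = statistics.mean(train)

print(f"leaky mean: {leaky_mean:.2f}, train-only mean: {clean_mean:.2f}")
```

Nothing errors out in the leaky version - which is exactly why an agent trained to treat "no error message" as success will sail right past it.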
And it's not that LLMs are unaware of this; I think the fundamental problem is that they have been prompted and trained to have a software engineering mindset. They are built to code, and for code, no error messages means you were successful.
Here are a couple of other issues that I've experienced:
- Saying "that's a data issue". "Data issue" is really the new "race condition": an easy cop-out for the LLM to avoid digging deeper. I told the agent again and again that the data set in question is one of the highest-quality curated data sources we have, used in dozens of use cases across the company, so if there's an issue (like "missing data"), it's probably because we haven't understood the data well enough. And still: "5% of our rows are null - probably a data issue".
- Not considering sample size at all. I've had it get confidently excited about a 1% error rate - and then it turned out the class in question had only 5 data points. On its own it doesn't consider statistical significance; it seems to me it's really not thinking "probabilistically" at all.
- Adding timeouts to inference or training code, then ignoring them. The agent once added a 30s timeout, without me asking for it, to a simulation that then timed out in 9 of 20 cases - and that fact was completely ignored in the summary. When I asked why it did this, it really didn't know. Somewhere in the context there was a warning from me not to run SELECT COUNT(*) on a table because these are huge; maybe the "make sure you don't accidentally run long computations" warning was still lingering.
- No good defaults for organizing science code. It mostly focused on writing scripts that did exactly what they needed to do. I had to guide it towards any kind of pattern, like splitting reusable code into modules, separating inference scripts from evaluation scripts, etc. Maybe it wasn't in the training data; I'm aware there's no universally accepted way to do this.
Trying to make agents more science-y
Here are a couple of things I tried to make the coding agent more science-y:
- Adding a data validation subagent. I prompt the subagent to challenge the results, with a couple of typical pitfalls added, most importantly "is there any other potential explanation for the effects we have seen?" It produces good results, but it's not in the driver's seat, so you still need to keep an eye on what the coding agent is doing or trigger the subagent manually.
- Adding lots of instructions to AGENTS.md. This helps somewhat, but the original training seems to be just too strong. For example, a lot of the analysis is done in ad hoc Python scripts that I often cannot see because they're truncated. I have instructions tagged IMPORTANT and CRITICAL telling the agent that I cannot see these scripts, so it needs to tell me what it's doing. It still forgets 80% of the time.
- Replacing the system prompt. I just started exploring this, but the idea is to completely exchange the personality of the agent. It's not a helpful coding agent anymore; it's a skeptical data scientist who is responsible for the analysis result and needs to protect the user from making "stupid mistakes". One problem could be that sycophancy is so deeply ingrained in the training data that the agent won't ever push back. But let's see.
Why this is relevant, beyond coding agents
I think these observations are relevant beyond coding agents, to the way LLMs and "AI" have reshaped data science work. LLMs have given us technology that makes it much easier to build systems that used to require deep ML knowledge: how to train models, preprocess data, and build features that capture the information in the data in a way the learning algorithm can use. Now all it takes is "You're a senior text analyst. Your task is to measure whether the following text expresses positive or negative sentiment. Output a score from -5 to 5. Only return the score" and you've "built" a sentiment detection system. But everything we've learned about working with data and uncertainty is still true. If you want to make sure such a system works well, you still need to think about how to validate it, about data drift, post-deployment monitoring, etc.
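A minimal sketch of what "you still need to validate it" means in practice: score a small labeled sample and measure agreement. `llm_sentiment_score` is a hypothetical stand-in for the actual prompted LLM call, and the labeled examples are made up for illustration.

```python
def llm_sentiment_score(text: str) -> int:
    """Placeholder for the prompted LLM call returning a score in [-5, 5]."""
    return 3 if "great" in text.lower() else -2

# A hand-labeled evaluation sample: 1 = positive, -1 = negative.
labeled = [
    ("This product is great", 1),
    ("Terrible experience", -1),
    ("Great support, thanks!", 1),
    ("Broke after a day", -1),
]

# Compare the sign of the LLM score against the label.
correct = sum(
    1 for text, label in labeled
    if (llm_sentiment_score(text) > 0) == (label > 0)
)
print(f"accuracy: {correct}/{len(labeled)}")
```

Four examples are obviously far too few to conclude anything (see the sample-size point above) - the sketch only shows the shape of the harness you'd run on a real labeled set, and re-run as the data drifts.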
Increasingly, people who haven't been trained in ML are building such systems. I'm not saying they're like a coding agent, but I think the same observation applies: with a pure software engineering mindset, you won't make good decisions about building ML models.
But... if we can get coding agents to be better at doing science, maybe we can help non-scientists use this new technology well :)