How To Organize Data Science Teams
One question I often come across is how best to organize cross-functional data science teams in which engineers and data scientists work closely together.
Typical questions and challenges are:
- Do you do sprints for data science work? If you do, what is the deliverable at the end of the sprint?
- Are sprints or Kanban better suited for data science work?
- Data scientists say that research cannot really be estimated well. What do I do?
- How do you coordinate data science and engineering work?
I think data science projects are different from typical engineering projects because there is a higher level of uncertainty. Is the data available? Does it contain the information we need? Can we train a model that generalizes well? And so on. Therefore, the key to successful data science projects is focusing on reducing risk while keeping things timeboxed.
Dealing with uncertainty is not new to software engineering. Agile software development tries to uncover what the customer really wants. Releasing testable increments and keeping code quality high through constant refactoring are key. These ideas also work for data science, with some adaptations.
Here is what I’ve seen to work well in practice:
- Work in sprints together with engineering. Sprints provide natural timeboxing so you can review what you have done and steer the project. I know this works even for research because I have supervised several Ph.D. theses, and the absolute minimum was a weekly session where we reviewed the work and decided what to do next.
- For data science, the focus should be on insights that reduce uncertainty rather than on new features. For example, a good sprint goal could be to run an experiment that compares a certain approach to a baseline. You want to make sure you gain new information that helps you decide how to proceed. This is similar to research spikes in engineering.
- If you want to estimate, focus on steps that are clearly defined. For example, trying out a specific approach takes a certain amount of work that can often be estimated well. On the other hand, the goal of creating a method that works is relatively open-ended because you don’t know what it will take to make it work. Even for more open-ended goals, ask yourself: what concrete steps can I take next?
- Your backlog is a list of uncertainties and questions you need to resolve rather than a list of features to deliver. Let’s say you realize the biggest uncertainties are data availability and whether existing models perform well on your data. Put those into your backlog. Review and adapt your plan as you uncover more information.
- Your mileage may vary, but I think teams work best when data scientists and engineers are interested in supporting and learning about the other side. Rather than a hard handover when work moves from exploration to production, make it a fluid transition where the work gradually shifts from data scientists to engineers. In some cases, it might make sense to have engineering involved right from the start (for example, when performance is critical).
- I think it is crucial to strike the right balance so that data scientists can focus on the work only they can do (especially data analysis, modeling, and training) while engineers provide a technically sound foundation. If your scale justifies it, there are even more specialized roles, such as data engineers or ML engineers, who take care of providing the right data or deploying models to production robustly.
- Sometimes data scientists struggle with timeboxing their work. It is in the nature of data science, and of research work in general, that there is no clear end. You can always try something else or try to improve the model. It is hard to know when there is really nothing else you could try. Before you start, have clear discussions about what level of quality is good enough and how much effort is worth putting into the project. Answering these questions often requires involving the business side or the customer.
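To make the idea of a backlog of uncertainties more concrete, here is a minimal Python sketch of how such a backlog could be represented and turned into a sprint plan. Everything in it is an assumption for illustration: the `Uncertainty` class, the 1 to 5 risk score, the day-based timeboxes, the `plan_sprint` helper, and the example entries are hypothetical, not a prescribed tool or process.

```python
from dataclasses import dataclass


@dataclass
class Uncertainty:
    """One open question in the backlog (hypothetical structure)."""
    question: str         # what we need to find out
    risk: int             # subjective score: 1 (minor) to 5 (could sink the project)
    timebox_days: float   # effort we agree to spend before reviewing
    next_step: str        # a concrete, estimable action
    resolved: bool = False


def plan_sprint(backlog, capacity_days):
    """Pick open questions, highest risk first, that fit into the sprint capacity."""
    selected = []
    remaining = capacity_days
    open_items = [u for u in backlog if not u.resolved]
    for item in sorted(open_items, key=lambda u: u.risk, reverse=True):
        if item.timebox_days <= remaining:
            selected.append(item)
            remaining -= item.timebox_days
    return selected


# Example entries are made up for illustration.
backlog = [
    Uncertainty("Is the data we need available?", 5, 3,
                "Audit the available data sources"),
    Uncertainty("Do existing models perform well on our data?", 4, 5,
                "Compare an existing model to a simple baseline"),
    Uncertainty("Is performance critical for this use case?", 2, 4,
                "Review requirements together with engineering"),
]

sprint = plan_sprint(backlog, capacity_days=10)
for item in sprint:
    print(f"{item.question} -> {item.next_step} ({item.timebox_days} days)")
```

With a capacity of 10 days, the sketch selects the two riskiest questions and leaves the third for a later sprint. The point is not the code itself but the shape of the plan: each entry pairs an open question with a timebox and a concrete next step, and the list is re-sorted as new information changes the risk scores.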
What are your experiences and what have you found to work best?