Software and software engineering are relatively well-understood fields. Although technologies change, the general frameworks and ways of working have remained pretty consistent for decades.
On the surface, a data team looks like a software engineering team, so you’d expect things to translate. In practice, this isn’t currently the case.
Because of the speed at which things are changing in data and machine learning, there aren’t many well-established best practices for data teams, and the few that do exist mostly apply to larger companies rather than startups. This can make it a minefield to navigate when building out a small team.
In this post, I wanted to share some lessons that we’ve learnt at Ophelos whilst building out our data function from scratch.
For a more general insight into the challenges that data teams face globally, this blog post is great:
In a data team at a large organisation, you’d typically see a combination of:
That’s 7 different job roles already. And I could go on. However, if your data team is as small as 2, 5 or even 10 people, you don’t have the luxury of covering everything with separate roles. One option is to select a few of the most pertinent roles for your objective. For example, in a team of 3, you might have 1 data engineer, 1 data scientist and 1 data analyst.
What we’ve chosen to do instead is build a team of generalists where each person can work across the whole stack and cover multiple roles. So for a team of 3, you would have 3 people capable of building data pipelines, machine learning models and analysing data.
We’ve seen multiple advantages of this team structure:
The paradigm shift of moving from on-premise data warehouses to a cloud-based architecture has brought with it a huge number of benefits around scalability. However, the modern data stack has continued evolving and is becoming more and more decentralised, to the point where it can be difficult for small teams to work with.
The problem isn’t that there are a lot of players in the market and options to choose from, but that each tool only handles a small part of the overall data stack, and the expectation is that you should piece together your architecture from many different providers.
For example, you would have a different cloud-based provider for:
To visualise just how much the data tooling industry has bloated recently, check out Matt Turck’s 2023 Data landscape.
With each new provider you add to your stack, you add another layer of complexity and more $$$. Of course, you can always build parts yourself, but this takes time away from the true function of your data team. Our data stack is predominantly built on Databricks, which is able to handle a large chunk of the overall stack.
When defining your architecture we believe there are three key principles to stick to:
This statement is important for anyone who works at a startup, but we’ve found that constantly reminding ourselves of this has helped our team to prioritise work. Everyone who works with software wants to write perfect code and build perfect products. However, it’s important to remember the function of your team and how you actually provide value to the organisation.
There is always some way you can optimise your data pipeline or refactor code to be cleaner. But it’s all about balancing perfectionism with progression, and sometimes things don’t need to be absolutely perfect in order to drive your startup forwards.
Ophelos is an applied technology company, not just a technology company. So the primary purpose of our data team is to solve business problems. We do this through running experiments, analysing data and building machine learning models.
As I mentioned earlier, the modern data stack consists of piecing together components from multiple third-party providers. More providers mean more integrations, and more integrations mean more integration testing. Furthermore, many of these tools focus heavily on usability but give little thought to how you’d actually write unit or integration tests against them. If you’ve ever tried to unit test a BI platform, you’ll understand the pain.
We have found success in prioritising observability over test coverage. By setting expectations of what our data should look like and alerting on job failures, we can quickly find out when issues occur and fix them.
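To make this concrete, here is a minimal sketch of what "setting expectations of what data should look like" might mean in practice. The table name, column names and checks are all hypothetical, and the `alert` function is a placeholder for whatever alerting channel a team actually uses (Slack, PagerDuty, etc.); this is an illustration of the pattern, not our production code.

```python
import pandas as pd

def check_payments_table(df: pd.DataFrame) -> list[str]:
    """Run lightweight expectations against a (hypothetical) payments
    table and return a list of human-readable failures."""
    failures = []

    # Expectation: required columns are present
    required = {"payment_id", "amount", "created_at"}
    missing = required - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
        return failures  # later checks depend on these columns

    # Expectation: primary key is non-null and unique
    if df["payment_id"].isna().any():
        failures.append("payment_id contains nulls")
    if df["payment_id"].duplicated().any():
        failures.append("payment_id contains duplicates")

    # Expectation: amounts fall in a plausible range
    if (df["amount"] < 0).any():
        failures.append("negative amounts found")

    return failures

def alert(failures: list[str]) -> None:
    """Placeholder: in production this might post to Slack or PagerDuty."""
    for f in failures:
        print(f"[DATA ALERT] {f}")

# Toy table with a duplicate key and a negative amount
df = pd.DataFrame({
    "payment_id": [1, 2, 2],
    "amount": [10.0, -5.0, 20.0],
    "created_at": ["2023-01-01"] * 3,
})
failures = check_payments_table(df)
if failures:
    alert(failures)
```

Running a handful of checks like this after each pipeline run, and alerting on any failure, catches most real-world data issues with a fraction of the effort that comprehensive test coverage would demand.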
Also, to build on Lesson 3: test as much as you can, especially mission-critical components, but don’t go over the top. It’s important to remember what the true purpose of your data team is. If it takes you days or weeks to write comprehensive tests for an experimental feature which might be redundant in 3 months, then your time would have been better spent elsewhere.
It’s hard to tell what the future of data and AI holds. It’s also difficult to predict if the ways of working for data teams will slowly adopt the principles of software engineering teams. In any case, I hope the lessons we’ve learnt navigating this landscape provide value to other new teams out there!