How do you manage data ownership on a big data platform?
In this blog we’re going to be looking at how to manage data ownership on a big data platform. This question was put to me on LinkedIn a while back, and it was a really good one. It came in response to a debate I’d been taking part in about my very strong belief that data should only have one data owner.
I feel quite categorically from my many years of experience that you really cannot have more than one Data Owner per data set. It really doesn’t work, and I don’t recommend you try it. There’s no exception to this rule (believe me, I’ve been there, done it, still have the scars). Instead, I believe that what you need to do is find one senior person within your organisation who is going to take overall accountability for that particular data, wherever it is within your organisation.
This prompted the person to get in contact and ask: ‘How do you manage this on a big data platform when you’re bringing in data from source systems which clearly have a single data owner, but are combining it with data that’s owned by somebody else on a big data platform and applying some kind of algorithm to it to create some new data?’
Now that’s a big question. And I can understand that if you’re setting up a big data platform in your organisation for the first time, this can seem fairly daunting from a data ownership point of view. But actually, if you stick to my simplified approach of one data owner per data set, it’s quite easy to work through.
So, let’s break it down. In this scenario it’s quite clear that this second set of data isn’t the same data, it’s new data. If we took some data perhaps owned by sales and we combined it with some data owned by finance, we can apply some logic or an algorithm to it to create some new data.
Now, that new data didn’t previously exist so that new data can have its own data owner, and the data owner is the person who asked for that data set to be created, because they are the only person who can give you the requirements for how to create that data.
They’re also the only person who really knows what that data is going to be used for, so they’re the only ones in a position to be able to tell you what it means, what it should be used for, and if necessary, what its data quality rules should be.
That’s why I think as a general rule – it doesn’t matter whether it’s on a big data platform or any of your other source systems – you should always consider whether or not the data has changed. If that data is combined with one or more other sets of data to create a new data set, it can have a new data owner.
Now, you might be wondering ‘what about the source data owners?’ but in my opinion, for simplicity, you have to think of this new data set as exactly that – it’s new – and you need to find a new data owner, or agree a data owner, based on who asked for the data to be created and who’s going to be using it.
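If you keep a record of your data sets in some kind of catalogue, the rule above could be written down along these lines. This is only a minimal sketch in Python, not a definitive implementation: the `DataSet` class, the `derive` function, and the example data set names and job titles are all made up for illustration and don’t come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """A catalogue entry with exactly one accountable owner (hypothetical model)."""
    name: str
    owner: str                 # one person or role, deliberately not a list
    sources: tuple = ()        # names of the data sets this one was built from
    stakeholders: tuple = ()   # interested parties who are not the owner


def derive(name, requested_by, *inputs, stakeholders=()):
    """Register a new data set created by combining existing ones.

    The owner is whoever asked for the data set to be created.
    It is not inherited from the owners of the inputs.
    """
    if not inputs:
        raise ValueError("a derived data set needs at least one source")
    return DataSet(
        name=name,
        owner=requested_by,
        sources=tuple(ds.name for ds in inputs),
        stakeholders=tuple(stakeholders),
    )


# Illustrative source data sets, each with its own single owner.
sales = DataSet("sales_orders", owner="Head of Sales")
finance = DataSet("ledger_entries", owner="Finance Director")

# Combining them creates new data, owned by the person who requested it.
margin = derive("customer_margin", "Head of Pricing", sales, finance)

print(margin.owner)    # Head of Pricing
print(margin.sources)  # ('sales_orders', 'ledger_entries')
print(sales.owner)     # Head of Sales (source ownership is unchanged)
```

The point of making `owner` a single value rather than a list is that the record simply can’t hold two owners. Anyone else with an interest goes in `stakeholders`, and the source data sets keep their own owners untouched.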
Now, if you have two or more stakeholders interested in the same data set, you have two options. Firstly, you can get them together and facilitate a discussion to come to a conclusion as to which of them is the most appropriate person to own it; the others will then be key stakeholders.
The other option is to consider splitting that data set into subsets, and to keep splitting until everybody is happy that they own and are responsible for the data that they really should be. Doing it any other way, I can guarantee you, is not going to work. It’s going to cause you loads of pain and is going to result in people telling you that Data Governance doesn’t work or doesn’t help them.
So, I really cannot impress upon you enough… Only one data owner per data set, and if necessary it’s often better to break your data sets down into smaller ones so you can achieve that.