Enormous volumes of data are being gathered and analyzed at the enterprise level, and of course organizations need a place to store that information. Increasingly, these ever-growing data repositories are referred to as ‘data lakes’. But what data sources do pharma companies keep in these data lakes, what are they missing, and how can they convert all that valuable information into insights capable of informing strategy?
We caught up with Jason Smith, chief technology officer, AI and Analytics at Within3, to get to the bottom of the data lake.
Q: Let’s start off with a nice softball question: What do we mean by the term ‘data lake’?
“When we talk about data lakes, we mean an aggregated collection of one or more data sources, kept somewhere within the enterprise. So when we discuss enterprise data repositories – where companies are taking real-world data, trial data, documents, and collaborating to put all that together – that’s what’s meant by the term data lake.”
Q: And what information do pharma companies generally want to keep in their data lakes?
“It depends on the company and what phase they’re at. If you’re a large pharma company, you’re probably putting in all the data you buy from third-party sources – that’s a lot of your real-world data. There’s the script data that shows how you’re doing in sales, and you’re probably including your speaker program data, presentations, content – everything that’s approved from marketing. You might have clinical trial data in there. You probably have strategic presentations. If you’re a larger company, you may have exports of your CRM notes and notes from third-party apps detailing interactions between individuals from your organization and healthcare providers, researchers, or key opinion leaders out in the industry.”
“It really can be anything or everything.”
– Jason Smith, Chief Technology Officer, AI and Analytics, Within3
“But I always like to look at data lakes as the source. It’s then about where and how you break that lake down, and move it to a smaller body that can be parsed and leveraged for action. That might be a two-gallon tank, a five-gallon bucket, or a two-ounce bottle – depending on what you want to do with that data.”
Q: So if these organizations have access to publicly-available data, what’s missing from their data lakes?
“What you’re interrogating – what you need to get out of the data – that’s number one. Number two is additional data from third parties.”
“Generally, these data lakes are populated with a lot of internally-generated data, as well as any third-party data sets they may have acquired. But they aren’t often buying and storing social media data, and not all companies have the capacity to go out to PubMed, and to clinical trials, and to these other third-party data repositories, and pull that in too. So you end up with this disconnect between what’s happening within the four walls of the organization, and what’s happening outside those four walls.”
“That’s where companies like Within3 ultimately add value. We have a data lake composed of ‘outside the four walls’ conversations, and you have a data lake of internal data. We can connect those two lakes with our technology, and from there, extract into smaller ponds the data you need to analyze to answer your business questions.”
– Jason Smith, Chief Technology Officer, AI and Analytics, Within3
Q: It sounds like some organizations fall down by failing to gather data with sufficient intent.
“There’s sometimes this idea that gathering data will drive automatic insights. That we’ll get all the data together, we’ll analyze it, and we’ll get some magic out of that. It’s not always done in a pointed way.”
“Different business units need views into that data, too, and they have different problems to solve and different outcomes they need. Having a comprehensive understanding of all the business cases helps you inform the type of data hierarchy you need, the type of data lake you employ – as well as the types of data ponds, data rivers, and water bottles you’re going to put it in downstream.”
Q: What about this idea of pollution? How can teams prevent their data lakes from becoming ‘polluted’?
“Polluted data is data that’s added without understanding the outcome or relevance to the problem. Let’s say you have an additional claims data set that differs from your existing one. Now you have a source of truth problem, and you can’t be sure which one is at fault.”
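To make the source-of-truth problem concrete, here is a minimal Python sketch that compares two claims data sets and reports where they disagree. The field names (`claim_id`, `amount`), the sample rows, and the `compare_claims` helper are all hypothetical and chosen purely for illustration; they do not describe Within3's tooling or any vendor's actual format.

```python
def compare_claims(existing, incoming, key="claim_id", fields=("amount",)):
    """Report records present on only one side and records whose fields disagree.

    All names here are illustrative assumptions, not a real vendor schema.
    """
    a = {row[key]: row for row in existing}
    b = {row[key]: row for row in incoming}
    return {
        # Claims that only the existing data set knows about
        "only_existing": sorted(a.keys() - b.keys()),
        # Claims that only the newly added data set knows about
        "only_incoming": sorted(b.keys() - a.keys()),
        # Claims present in both, but with at least one differing field
        "conflicts": sorted(
            k for k in a.keys() & b.keys()
            if any(a[k].get(f) != b[k].get(f) for f in fields)
        ),
    }


# Made-up sample rows for the sketch
existing = [
    {"claim_id": "C1", "amount": 120.0},
    {"claim_id": "C2", "amount": 75.5},
]
incoming = [
    {"claim_id": "C2", "amount": 80.0},
    {"claim_id": "C3", "amount": 42.0},
]

report = compare_claims(existing, incoming)
print(report)
```

A report like this does not say which data set is correct. It only surfaces the disagreement so that someone can decide which source to trust before both are loaded into the lake.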
“Because we have this need to collect as much data as possible, we actually end up polluting the data lakes – and so the analytics we’re running on top to get those answers and drive those outcomes have now become a problem, because we don’t have clean data. The other aspect is to do with data storage. A lot of pollution happens with poor data governance: how you actually store the data; the formatting of the data; the fundamentals of data collection, cleaning, and storage.”
“Finally, the third part is to do with data lineage. Where does the data come from, and how often does it refresh? Does it refresh in the same format? Are the analytics downstream able to continuously access that data? And, if that data changes at the source, do we have a lineage and governance process so it doesn’t accidentally become dirty?”
“Pollution can come from any one of those issues, but generally I’d say it’s derived from poor data governance or just a fundamental lack of governance.”
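The lineage questions above (where the data comes from, when it refreshed, and whether the format stayed the same) can be sketched as a simple check run on each refresh. This is only an illustration under stated assumptions: the `LineageRecord` type, the `schema_drift` helper, the source name, and the column names are all hypothetical, not part of any product mentioned in this interview.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical record of one refresh of one data source."""
    source: str          # where the data comes from
    refreshed_on: date   # when this refresh arrived
    columns: tuple       # the format it arrived in


def schema_drift(previous, current):
    """Describe column changes between two refreshes of the same source."""
    prev, curr = set(previous.columns), set(current.columns)
    return {"added": sorted(curr - prev), "removed": sorted(prev - curr)}


# Made-up refreshes: the vendor renamed a column between January and February
january = LineageRecord(
    "vendor_claims", date(2024, 1, 1), ("claim_id", "amount", "ndc_code")
)
february = LineageRecord(
    "vendor_claims", date(2024, 2, 1), ("claim_id", "amount_usd", "ndc_code")
)

drift = schema_drift(january, february)
if drift["added"] or drift["removed"]:
    # Flag the refresh for review instead of letting it flow downstream
    print(f"Hold {february.source} refresh for review: {drift}")
```

Holding a drifted refresh for review, rather than loading it silently, is one way to keep downstream analytics from breaking when a source changes format.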
Q: So what are the pros and cons of building your own data set, versus supplementing it with third-party data?
“Again, we’re talking about inside the four walls versus outside the four walls here. You really do need to acquire data for your data lake to be successful. I don’t know of any company that produces enough of that internal data without the lens of a third party to add context to it all.”
“With your internal data, your ‘inside the four walls’ information, the pro is that you know exactly what you have. You know your sales figures, for example, and can pull those directly into your data lake. The con is that you don’t get to see the broader data sets that speak to your competitors, and the market more broadly. So the value in buying data is to add breadth and depth.”
“When you’re looking at making strategic decisions, you’re going to need social listening. You need data from patient advocacy groups. You need claims data. You need to see competitors’ data quarter-over-quarter. You need to understand that because that helps you paint the entire landscape.”
“From a build versus buy perspective, I don’t know if there are really pros and cons. I think it’s just a necessity. It doesn’t make sense for companies to try to build their own technology when there’s so much available off-the-shelf that can get them there more quickly.”
– Jason Smith, Chief Technology Officer, AI and Analytics, Within3
“Very few pharma companies have the understanding and discipline of data governance, data analytics, machine learning – all the processes that go into answering those strategic questions on the data lake – to build an insights solution from scratch.”
Q: It feels like those viral videos where people try to make their own Big Macs from scratch: milling the grain for the bread, growing the cattle for the beef, etc…
“That’s a great way to put it – I might use that for my next Reuters talk! You can try to make your own Big Mac, or it takes you five minutes to go to McDonald’s…”
Q: So finally, where does Within3 fit into this picture?
“We don’t focus on the technology solution of data lakes. We focus on the outcomes that we can extract from our own large data lake of publicly-amalgamated and available third-party data in conjunction with your enterprise data lake.”
“Our models, our technology, and, more importantly, our solution drive that outcome. We’re not providing you with reported analytics. We offer a higher order of reporting, machine learning, and AI that no-one else in the industry has.”
Ultimately, the value of an enterprise data lake isn’t exclusively determined by what you put into it. It’s dependent on the outcomes you have in mind when you’re gathering that data, the strategic intent with which you collect it, and how you analyze and report on the data you’ve collected to generate valuable, actionable insights.
Within3 is the industry leader in AI-powered insights reporting for life science companies. We’re uniquely positioned to supplement and augment your internal data lake with our own third-party data, and apply AI insights reporting to extract powerful insights capable of informing your medical and commercial strategies. To find out more about Within3’s AI, read our blog post on how we use artificial intelligence to support insights reporting, or book a demo today.