Article January 28, 2025

Data lakes done right: challenges & mistakes to avoid in enterprise data management

We understand that building a data-driven enterprise is about more than just technology — it’s about assembling the right team to create a strong foundation for innovation, agility, and informed decision-making. With 25+ years of experience in forming high-performing tech teams, we partner with organizations to unlock the full potential of their data through expertly designed solutions.


Our dedicated teams, made up of data engineers, DevOps specialists, and data scientists, bring expertise to every step of your data journey. From architecting streamlined data lake solutions to consolidating data into secure, unified repositories, and implementing governance frameworks for accuracy and compliance, our experts are equipped to handle the complexities of modern enterprise data needs. These scalable, resilient solutions adapt as your business grows, transforming your data infrastructure into a strategic advantage.

But expertise doesn’t stop with delivery — we’re passionate about sharing our knowledge to help businesses thrive in their data-driven transformations.

Data — the game changer

Data is a powerful enabler across industries. Hospitals use it to maximize resources and save lives, while financial organizations leverage it to prevent IT failures, ensuring smooth operations. For businesses, more data equals more opportunities to learn, save, and improve. However, as enterprises generate vast amounts of data — over 500 petabytes for some — the question arises: Where do they store it all?

An increasingly popular solution is a data lake — a centralized repository capable of storing raw, diverse data in various formats, with the flexibility needed for advanced analytics, machine learning, and other data-driven innovations. In this article, I’ll share our insights into this powerful approach to data management. Together, we’ll explore the key challenges in building and managing data lakes, uncover common mistakes that can derail projects, and highlight how expert teams can help you overcome these obstacles to achieve lasting success. Let’s dive in!

What is an enterprise data lake?

The potential for data lake innovation is endless, with the following examples of data types being stored and managed across industries:

Healthcare: Electronic health records, medical device logs, research data, audio and written doctor notes and interviews, DNA sequencing and diagnostic imaging data.
Financial services: Stock prices, forex rates, economic indicators, transaction and ATM logs, bank and credit history, call centre recordings, reports, surveys, and forecasts.
Travel: Booking details, traveller profiles, demographics, hotel occupancy, flight schedules and delays, transportation logs, ticketing and check-in data, behavioural data, GPS data, beacons in airports and hotels, weather forecasts, etc.
Transportation: GPS location, fuel consumption, engine performance, real-time traffic flow data, road conditions, IoT device data, delivery routes and shipment tracking logs.
Manufacturing: Machine logs and sensor readings, assembly line throughput, quality control data, supplier performance metrics, shipment tracking, route optimisation, energy consumption patterns, CAD files and blueprints, market demand forecasts, regulatory compliance documents, order history and trends.
Retail and e-commerce: Invoices and receipts, product images, marketing collateral, IoT in-store sensor data, inventory logs from warehouse management systems, user browsing history and product views, etc.
Software development: Event logs, error reports, crash analytics, user interaction data, API logs, firewall, router and network traffic monitoring logs, and machine learning data sets.
Government and public sector: Population demographics, text and audio from public hearings and community meetings, imagery from mapping and urban planning, scanned historical documents and multimedia content.

The role of data lakes in enterprise data management

What do these have in common: an insurer adjusting premiums for weather risks, a traveler’s app rebooking a delayed flight, and a marketer fine-tuning local campaigns and discounts for e-commerce? They all rely on data lakes to power these capabilities.

Unlike traditional data warehouses that only work with structured, organized data, data lakes can store raw, unprocessed data in its original format. This flexibility allows companies to save time and reduce costs by skipping the need for extensive data extraction, transformation, and loading (ETL). Data lakes also support real-time use of structured and semi-structured data, making them a strong foundation for data analytics and AI-driven services.

Moreover, data lakes make data accessible to non-technical users. Without requiring SQL knowledge or constant reliance on IT teams, they enable business users to explore and leverage data through self-service dashboards, query tools, or visualization applications. Features like data catalogs and glossaries further simplify the process, making it easy to browse, search, and analyze data sets using indexes and keywords.
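To make the catalog idea concrete, here is a minimal sketch of keyword search over dataset metadata. It is only an illustration: the catalog entries, the field names (description, tags), and the functions build_keyword_index and search are hypothetical, and real catalog products offer far richer search than this.

```python
from collections import defaultdict

# Hypothetical catalog entries: dataset name -> metadata.
CATALOG = {
    "sales_orders": {
        "description": "Order history and trends from the ERP system.",
        "tags": ["sales", "erp"],
    },
    "web_clickstream": {
        "description": "User browsing history and product views.",
        "tags": ["web", "marketing"],
    },
}


def build_keyword_index(catalog):
    """Map each lowercase keyword to the set of datasets that mention it."""
    index = defaultdict(set)
    for name, entry in catalog.items():
        words = entry["description"].lower().split()
        words += [tag.lower() for tag in entry["tags"]]
        for word in words:
            index[word.strip(".,")].add(name)
    return index


def search(index, query):
    """Return the datasets that match every keyword in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


index = build_keyword_index(CATALOG)
matches = search(index, "history erp")
```

A business user typing "history" would see both datasets, while "history erp" narrows the result to the order data.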

Why centralized data lake storage is so important

Data lake architecture is essential for modern enterprises aiming to make decisions based on data, anticipate trends and user behavior, respond dynamically to changes and events, and handle huge volumes of streaming data at scale. Without it, enterprise data use is restricted to structured information, which massively hinders innovation and progress.

In healthcare, for example, only 20% of electronic health records are structured, making the remaining 80% of insights completely useless if no solution is available to pull, store, and analyze them. With data lakes that combine diverse data types, healthcare organizations unlock invaluable insights from data query and management platforms, social media, IoT and connected devices, and other systems spanning patients, medical providers, hospital operators, clinical researchers, and healthcare insurers.

This capability is particularly impactful in pharmacogenomics, for example, where vast genomic data and clinical trials inform personalized medicine, or in medical imaging analytics, working with large, unstructured image files from MRI, CT, and X-ray scans.

In retail, data lake architectures support omnichannel experiences by creating comprehensive buyer profiles favored by 7 out of 10 shoppers. They also collect and collate data from point-of-sales systems, IoT sensors, and transportation platforms for real-time inventory management and tracking, as well as power dynamic pricing and promotions.

Data lakes also enhance real-time traffic management by integrating live feeds from cameras and sensors, improving traffic flow, safety, and public transport reliability, and reducing congestion and pollution. Another interesting use of unstructured data unlocked by enterprise data lakes is precision farming, integrating satellite imagery, IoT sensors, and market data to optimize crops and reduce waste.


Key capabilities enabled by enterprise data lakes:

Anticipating trends and behaviors
Dynamically responding to changes
Building insights from complex, real-time information
Scalable, high-volume data streaming
Breaking down information silos
Integrating with various apps and systems
Wider access to data for non-tech users

Five challenges in building & managing data lakes

Data lakes enable faster decision-making, providing real-time or near-real-time access to information, empowering teams, and offering flexible growth. They can be more cost-effective than traditional data warehouses, allowing for high-volume data storage and raw data ingestion.

However, implementing a data lake solution in an enterprise isn’t free from challenges. Like all innovations, it requires a multidisciplinary team of data engineers, architects, governance experts, security professionals, cloud architects, and data scientists for effective setup and implementation. Strong project leadership and change management are essential to ensure alignment and progress. I’ll get to how to find and build these capabilities later, but let’s first look at the challenges this team will face.

Data lake challenge 1: Data integration

A seamless flow of real-time, complex data is the X factor in the race for efficiency and data-driven decisions. Yet, outdated solutions are often an obstacle, incapable of handling all the complexity a modern enterprise deals with daily.

One of our clients, a global mining leader, faced this frequent scenario before adopting a data lake solution. They were using an old Azure Data Factory setup to collect data for efficiency and sustainability decisions.

To unify data from globally distributed departments and embed scalability and flexibility into the client’s enterprise data management infrastructure, Nortal assembled a multifunctional, scalable expert team. The team implemented a modern enterprise data lake, a Databricks solution running in Microsoft Azure cloud. Mixing different data types, like JSON logs from web servers with SQL tables from an ERP system, required complex transformations, powerful ETL (Extract, Transform, Load) tools, and a reliable framework for security and governance to avoid the risk of unauthorized access and misuse.
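As a rough illustration of what mixing JSON logs with SQL tables involves, the sketch below parses JSON log lines and joins them with a relational table. It is not the client's pipeline: it uses Python's built-in json and sqlite3 modules as stand-ins, and the log fields, the customers table, and the function names are all assumptions made for the example.

```python
import json
import sqlite3

# Hypothetical web-server log lines (one JSON event per line).
RAW_LOGS = [
    '{"ts": "2025-01-28T10:00:00Z", "customer_id": 1, "path": "/orders", "status": 200}',
    '{"ts": "2025-01-28T10:00:05Z", "customer_id": 2, "path": "/orders", "status": 500}',
    '{"ts": "2025-01-28T10:00:09Z", "path": "/health", "status": 200}',
]


def load_erp_customers(conn):
    """Create and fill a stand-in for an ERP customer table."""
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Acme", "EU"), (2, "Globex", "US")],
    )


def parse_logs(lines):
    """Flatten JSON events into tuples; a missing customer_id becomes None."""
    rows = []
    for line in lines:
        event = json.loads(line)
        rows.append((event["ts"], event.get("customer_id"), event["path"], event["status"]))
    return rows


def join_logs_with_erp(lines):
    """Load both sources into one store and enrich each event with ERP data."""
    conn = sqlite3.connect(":memory:")
    load_erp_customers(conn)
    conn.execute(
        "CREATE TABLE web_events (ts TEXT, customer_id INTEGER, path TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO web_events VALUES (?, ?, ?, ?)", parse_logs(lines))
    result = conn.execute(
        "SELECT e.ts, c.name, c.region, e.path, e.status "
        "FROM web_events e LEFT JOIN customers c ON c.id = e.customer_id "
        "ORDER BY e.ts"
    ).fetchall()
    conn.close()
    return result


enriched = join_logs_with_erp(RAW_LOGS)
```

The LEFT JOIN keeps events that have no matching customer, such as the health check, so nothing is silently dropped during integration.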

Approaching the integration strategically and knowledgeably ensured data quality and consistency, which are essential for reliable and timely data analyses. Centralizing the previously distributed and siloed data into a single lake with Unity Catalog improved control over access, metadata, and governance.

Workflow automation, impossible with the previous setup, reduced operational costs by 50% while improving data ingestion and distribution across operational systems, business intelligence and ETL tools, social media platforms, ERP, CRM, and other platforms in real time or near real time.

Data lake challenge 2: Data governance and compliance

Without proper governance, a centralized data lake can become a significant liability instead of an asset. Consider this scenario:

A large retail chain collects vast amounts of customer data — purchase histories, loyalty program data, heatmaps, clickstreams, and inventory records — to enhance customer experiences. While this approach sounds promising, it introduces significant risks without governance measures like access controls, data quality standards, and metadata management. For example, customers might receive duplicate promotional emails, irrelevant offers, or emails addressed to them incorrectly. These minor issues are just the tip of the iceberg.

The more severe consequences include unauthorized access to sensitive customer information, exposure of purchase and payment histories, and costly regulatory fines for non-compliance with standards like HIPAA, GDPR, or PCI-DSS.

Data governance experts must prioritize transparency and traceability across all data types and formats to mitigate these risks. A robust governance framework can address legal and financial risks while ensuring data is both secure and usable. Here's how we helped a mining industry client strengthen their data governance during a major infrastructure upgrade:

We established clear governance goals, assigning responsibility for each dataset. This ensures compliance and transparency across all levels.
To enhance security, we introduced role-based permissions. This controls who can view, retrieve, or manage specific data at any given time.
We created a user-friendly metadata system that allows users to easily search and understand available data. Data classification policies were added to protect sensitive information.
To safeguard data, we applied encryption for both data at rest and in transit, regularly updating protocols to counter emerging threats.
We provided comprehensive training on governance policies and best practices, empowering all users to handle data responsibly.
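The ownership, role-based permission, and classification measures above can be sketched in a few lines. This is a simplified model under stated assumptions, not the client's configuration: the role names, the classification levels, and the is_allowed function are hypothetical, and in practice such rules live in the platform's own access control layer.

```python
from dataclasses import dataclass

# Hypothetical roles and the actions each one may perform.
ROLE_PERMISSIONS = {
    "viewer": {"view"},
    "analyst": {"view", "retrieve"},
    "steward": {"view", "retrieve", "manage"},
}

# Hypothetical classification levels, from least to most sensitive.
CLASSIFICATION_RANK = {"public": 0, "internal": 1, "restricted": 2}


@dataclass(frozen=True)
class Dataset:
    name: str
    owner: str            # team responsible for the dataset
    classification: str   # key of CLASSIFICATION_RANK


@dataclass(frozen=True)
class User:
    name: str
    role: str        # key of ROLE_PERMISSIONS
    clearance: str   # highest classification the user may access


def is_allowed(user, dataset, action):
    """Allow an action only if the role permits it and the clearance is high enough."""
    if action not in ROLE_PERMISSIONS.get(user.role, set()):
        return False
    return CLASSIFICATION_RANK[user.clearance] >= CLASSIFICATION_RANK[dataset.classification]


payroll = Dataset("payroll", owner="hr-team", classification="restricted")
site_map = Dataset("site_map", owner="ops-team", classification="public")
analyst = User("dana", role="analyst", clearance="internal")
steward = User("sam", role="steward", clearance="restricted")
```

Here the analyst can retrieve the public dataset but cannot even view the restricted one, while the steward can manage both.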

With these measures in place, the client transformed their data lake into a secure, transparent, and compliant resource that supports both operational and strategic goals.

From data mess to data lakes

A global mining giant went from data chaos to a sleek, scalable solution that slashed costs by 50%, with nearshore data engineers streamlining pipelines, rethinking data governance, and automating infrastructure.


Data lake challenge 3: Security risks

As single points of failure, data lakes are a major target for cyber threats and accidental breaches. They are meant to hold vast volumes of information and scale infinitely, making it impossible to apply one-size-fits-all security measures. Moreover, different data types and formats also require different protections, which further adds to the complexity of securing a lake.

Many things could go wrong when working with data stored in a lake. For example, a financial analyst who pulls up a sales report for an external vendor can accidentally share a dataset containing sensitive customer information. Similarly, a marketing manager in a hospital might view confidential patient health records due to misconfigured access settings. Additionally, ransomware can target data lakes, threatening all company operations by locking out access to all data.

Securing a data lake against these scenarios starts with thoughtful design. Services like Active Directory, Okta, and AWS IAM help manage user permissions, while encryption tools like Azure Key Vault and AWS Key Management Service help keep your information safe. Tokenization replaces sensitive data with tokens for added security. But the list of safeguards for enterprise data lakes doesn’t stop there.
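To show what tokenization means in practice, here is a minimal sketch that swaps sensitive values for random tokens and keeps the mapping in a separate vault. It is an in-memory illustration only: the TokenVault class, the tokenize_record helper, and the field names are hypothetical, and a production setup would rely on a hardened, access-controlled tokenization service.

```python
import secrets


class TokenVault:
    """Keeps the mapping between real values and their random tokens."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        """Return the existing token for a value, or issue a new random one."""
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token):
        """Recover the original value; only trusted services should call this."""
        return self._value_by_token[token]


def tokenize_record(record, sensitive_fields, vault):
    """Copy a record, replacing the sensitive fields with tokens."""
    return {
        key: vault.tokenize(value) if key in sensitive_fields else value
        for key, value in record.items()
    }


vault = TokenVault()
record = {"customer": "Jane Doe", "card_number": "4111111111111111", "amount": 42.5}
safe_record = tokenize_record(record, {"customer", "card_number"}, vault)
```

Analysts can still count and join on the tokens, because the same value always maps to the same token, but the real card number never leaves the vault.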

You’ve also got data loss prevention systems, classification tools, monitoring and auditing platforms, backup and disaster recovery apps, and network security tools, all working together to ensure secure data handling.

Managing these security elements can be complex, and human factors remain a common cause of breaches, with three in four events caused by human error. To make matters worse, 21% of enterprise cyberattacks target remote employees’ home environments, making it essential to embed relevant protections in your data lake setup to protect online teams, too.

Ultimately, protecting your data lake isn’t just about embedding the right tools; it also requires regular training, raising awareness, and constant communication. Your team of data lake experts should lead the way, ensuring everyone understands the risks and how to minimize them.

Data lake challenge 4: Scalability & performance

Scalability is one of the promises of data lakes, but your infrastructure needs to be properly configured and maintained to realize it. Otherwise, you may encounter problems that can slow down your data work and affect reliability. This can lead to flawed business decisions, missed opportunities, and slower innovation, affecting your market share and profits:

Storage bottlenecks: Not allocating enough storage from the outset can lead to problems down the line, especially as your data grows. This can slow down access times and disrupt business operations and decision-making. Plus, you may end up having to put in extra effort and money to add more storage unexpectedly.
Compute bottlenecks: Sometimes, the computing power of a data lake can be an issue (affecting the processing and analysis of data stored within it). If you don’t plan your computing capacity well, you might encounter problems when handling complex queries or large datasets. This can make data processing slow and frustrating over time.
Slow data ingestion: If your data lake solution doesn’t scale well, it can also slow down the collection and bringing of data from different sources. Inefficient data transformations and loading processes can hinder how quickly you can use the data, and without quality checks in place, inconsistencies may arise.
Inefficient queries: As time goes on and the volume and complexity of data grow, the speed and efficiency of your queries can decrease if you don’t optimize them.

To keep your data lake running smoothly, no matter how much data you have, it’s important to use strategies like indexing, dividing your data into manageable parts, and optimizing your queries. This requires a mix of tools – from cloud storage solutions and large-scale data processing frameworks to tools for data ingestion and workflow management. Pulling everything together effectively requires a combination of expert skills, including data engineering, cloud computing, data science, analytics, and big data management.
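Dividing data into manageable parts usually means partitioning it by a column that queries filter on, so a query reads only the relevant slice. The sketch below imitates this in memory with year and month partitions. The directory-style key format, the event_date field, and the function names are assumptions for illustration; real engines apply the same idea to files in cloud storage.

```python
from collections import defaultdict
from datetime import date


def partition_key(event_date):
    """Build a directory-style partition key from a date."""
    return f"year={event_date.year}/month={event_date.month:02d}"


def write_partitioned(records):
    """Group records by partition key, as a lake would group files by folder."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_key(record["event_date"])].append(record)
    return dict(partitions)


def query_month(partitions, year, month):
    """Read a single partition instead of scanning every record."""
    return partitions.get(f"year={year}/month={month:02d}", [])


EVENTS = [
    {"event_date": date(2025, 1, 28), "sensor": "A", "value": 7.1},
    {"event_date": date(2025, 1, 3), "sensor": "B", "value": 6.4},
    {"event_date": date(2024, 12, 31), "sensor": "A", "value": 6.9},
]

partitions = write_partitioned(EVENTS)
january = query_month(partitions, 2025, 1)
```

A query for January 2025 touches one partition and skips the rest, which is what keeps response times stable as the lake grows.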

Data lake challenge 5: Cross-functional collaboration

Building an enterprise data lake is complex. It’s not a one-person job. You need a diverse team of skilled professionals to succeed:

On the technical side, experienced experts in data pipelines, big data platforms, and cloud computing are essential.
Data scientists play a crucial role, too. They handle data exploration and modelling to ensure efficient storage and querying. They also apply machine learning to automate processing, improve governance, and enhance insights.
To ensure the optimal performance of data pipelines, your team must include DevOps engineers.
As with most enterprise software undertakings, data lake projects involve business analysts who translate business requirements into data and technology terms.
To guarantee compliance and data security, you’ll also need data governance and security specialists.
On top of that, engaging business stakeholders is important. They provide insights into use cases and promote data lake solutions across their teams.
IT managers oversee the implementation, ensuring it integrates with existing systems while managing the budget and resources.

Given the scarcity of talent and the complexities of modern tech hiring, assembling a cohesive team of experts with all the necessary roles and skills can be a significant challenge. This is particularly true for positions like Data Scientists, Big Data Engineers, and Data Security Analysts, who are among the most in-demand roles in IT and tech. Meanwhile, the European tech talent shortage persists.

In these circumstances, partnering with a trusted, dedicated team provider can be a smart move. It allows you to quickly fill roles with experienced, vetted tech teams (within 4-8 weeks, on average) committed to your data lake project and business goals.

Data engineering in action

To find out what it’s really like on the frontlines of data engineering, we sat down with one of Nortal's seasoned experts. Spoiler: it's a world of tangled codebases, scalable solutions, and the occasional "anti-pattern" story.


Common mistakes to avoid when implementing data lakes

Challenges abound in enterprise data lake architectures, but so do the advantages. To unlock the latter, you need the skills and talent to tackle the former.

Based on my experience with clients, I’ve noticed several common mistakes in data lake implementations that often end in failed projects or poor performance and high costs. The good news is that these issues can be avoided by getting expert guidance before starting your data lake project.

Mistake 1: Underestimating team expertise

Data lakes offer a simpler alternative to traditional enterprise data management systems, but deploying them isn’t as easy as it seems. While they can be set up faster than other solutions, the implementation still requires careful planning and an experienced team.

A common pitfall for companies is failing to involve DevOps experts and to hire the right data engineers. This can lead to slow development and errors, resulting in issues with data quality and performance. A lack of automation increases the risk of mistakes and potential data loss.

Moreover, your company may struggle with compliance and governance without automated monitoring. Relying on manual processes can raise project costs, often exceeding the cost of hiring skilled DevOps professionals. Therefore, it’s important to invest in the right expertise for a successful data lake implementation.

Mistake 2: Poor data governance planning

A big mistake in setting up data lakes is not having a governance framework. Without proper oversight, data issues like inconsistencies, missing values, and duplicates can arise, creating 'data swamps' – chaotic repositories that make data access difficult instead of easy.
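A basic automated check for two of the issues named above, duplicates and missing values, might look like the sketch below. It is a starting point under stated assumptions rather than a full governance framework: the quality_report function, the record layout, and the choice of id as the key field are hypothetical.

```python
def quality_report(records, key_field, required_fields):
    """Flag duplicate keys and missing required values, by record position."""
    seen_keys = set()
    duplicates = []
    missing = []
    for position, record in enumerate(records):
        key = record.get(key_field)
        if key in seen_keys:
            duplicates.append(position)
        else:
            seen_keys.add(key)
        # Treat both absent fields and empty strings as missing.
        absent = [field for field in required_fields if record.get(field) in (None, "")]
        if absent:
            missing.append((position, absent))
    return {"duplicates": duplicates, "missing": missing}


CUSTOMERS = [
    {"id": 1, "email": "jane@example.com"},
    {"id": 1, "email": "jane@example.com"},
    {"id": 2, "email": ""},
]

report = quality_report(CUSTOMERS, key_field="id", required_fields=["id", "email"])
```

Running checks like this at ingestion time, and rejecting or quarantining the flagged records, is one way to stop a lake from drifting toward a swamp.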

Poor governance can also weaken security. If access controls and procedures are unclear, it can lead to unauthorized access and data loss, resulting in expensive compliance fines.

These governance failures lead to flawed decision-making and delays. Employees waste time retrieving, validating, and analyzing data, which reduces their productivity. As a result, organizations face operational inefficiencies, resulting in slower, less reliable products and services. Customer trust erodes as a consequence, which ultimately undermines stakeholder and investor confidence and seriously threatens your project and teams.

Mistake 3: Overcomplicating the architecture

Many companies make their data lake systems too complex, often due to a lack of expertise. This complexity can lead to slower data processing and more errors.

For example, having too many data layers slows data processing and increases the likelihood of errors, while developing complex data ingestion pipelines requires more time and effort to build and maintain. Additionally, overly strict security measures can limit access to essential data for users who need it.

In the end, these issues can make it harder to access important insights, slow down operations, and increase costs. By working with skilled data engineers and DevOps teams, companies can simplify their data pipelines, create a more manageable security model, and design a clearer data structure. This makes systems more efficient, increases flexibility, and lowers costs.

Building a data-driven enterprise with Nortal

Data lakes offer immense potential, but navigating their complexity can be daunting. To unlock their value, you need a strong team and a well-defined plan. At Nortal, we've spent 25+ years assembling high-performing tech teams for global companies across industries. With expertise gained from over 300 team partnerships, we understand the integration, governance, and security challenges of building and managing data lake architectures.

By partnering with us, you can access expert data engineers, DevOps specialists, and data scientists who will integrate seamlessly with your organization to help you:

Design and implement optimized data lake architectures tailored to your needs, without unnecessary complexity.
Consolidate data from multiple sources into a unified repository for fast and secure access.
Establish robust data quality and security frameworks to ensure accuracy, compliance, and reliability.
Build scalable, resilient solutions that grow alongside your business.

Transform your data landscape with the right team

With no upfront costs and a refined hiring process, we connect you with the top-tier talent you need to transform your data landscape.

Ready to build a team that delivers real value? Contact us today.
