Data Engineering in the Age of AI
by Predrag Djojic, Head of Engineering
To assess the role of data engineering today and in the future, we first need to look at the undisputed force behind the latest tech wave: AI, and more specifically, generative AI.
Victor Hugo once said, “Nothing is more powerful than an idea whose time has come.” It was a long time ago, in a very different context. Still, anyone who has closely watched the AI landscape and the developments around generative AI over the past few years will probably find some analogy. It took ChatGPT just two months after its public release to reach 100M active users, making it one of the fastest-growing applications ever. Major cloud vendors and the largest tech companies, such as Amazon, Microsoft, Google, Meta, and Apple, have focused their R&D efforts on AI and dedicated their biggest tech conferences almost entirely to new AI-infused services and products. New generative AI models are revealed daily, each claiming huge advancements in accuracy and operational performance.
On the other hand, the battle of arguments around ethics and legal frameworks for the technology is in full swing, involving governments, experts, “experts,” and the media. As first results, the EU has agreed on the Artificial Intelligence Act, while the White House has announced the first government-wide policy instructing every federal agency to appoint a chief AI officer.
Under an avalanche of available information, it is difficult to understand how much of it is pure hype and how much is a genuine technological breakthrough.
There are arguments on the other side too. This is not the first dawn of AI; generative AI is just its latest incarnation. Over the past decades, we have seen several of these “Hype Cycles,” as Gartner famously named them, always followed by proportionally intense and prolonged periods of disappointment and disinvestment – AI winters. Some believe this time is not going to be any different. According to the 2023 Gartner Hype Cycle™ for AI, we find ourselves at “the Peak of Inflated Expectations” regarding generative AI.
This is not entirely without basis. According to a recent O’Reilly survey on the use of generative AI in the enterprise, the most common reason for not using (more of) it is the difficulty of finding appropriate business cases. On the R&D side, most of the new models, including the leading one from OpenAI, are still based on refinements of the Transformer, a technology more than six years old that relies on massive volumes of training data. The bad news is that we have already “used” most of the available and meaningful training data on those models – not to mention the environmental impact of running previously unseen volumes of GPUs in data centers.
Capitalizing on GenAI
For organizations aiming to stay relevant in the coming years, this opens up some important questions and topics to (re)consider.
First, a fundamental one. Is the generative AI advancement we are witnessing truly the next industrial revolution, maybe even the final one, or is it just another hype cycle reaching its peak, soon to be followed by another AI winter?
I believe that the answer is a bit of both. Undeniably, there is lots of hype, and there will be disappointments ahead, mostly among those expecting generative AI to reach the Artificial General Intelligence (AGI) level. Still, if we tone down our expectations, the technology is already solid, operational at scale, developing quickly, and providing significant benefits in many places (survey).
The second question is more practical. What should organizations and businesses do to stay relevant in the coming years, mindful of AI advancements but without the ambition and resources to compete with OpenAI or the hyperscalers?
To grab opportunities in a timely manner and identify threats before it is too late, organizations need to start (or continue) experimenting with AI in their respective domains, building knowledge, and following developments. In parallel, to truly benefit from those newly acquired insights and knowledge, organizations need to focus on identifying, collecting, and processing their specific business data.
On a technical level, most of the organizations and businesses mentioned above will rely on foundation AI models provided by OpenAI, Anthropic, or some of the cloud hyperscalers, consumed as a service. It is safe to assume that these already advanced models will improve over time in all critical areas: correctness, comprehension, performance, and cost. However, as those advancements will be available to virtually everyone with a credit card, they cannot be considered a competitive advantage on their own. These AI services, delivered in a SaaS model by large vendors, are going to be “well-behaved” and “well-educated” on many topics (trained on a large, generic set of data).
Still, to turn them into a real and sustainable competitive advantage, organizations using these services need to make the AI models experts in a specific domain and apply them in a specific operational context. Without relevant and specific data at volume, most experiments with AI will underdeliver on their promises or remain limited to relatively generic use cases.
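What this looks like in practice varies, but a common pattern is to ground a general-purpose, hosted model in the organization’s own data at query time (retrieval-augmented generation). The sketch below is a minimal, illustrative example assuming the OpenAI Python client; the documents, model names, and retrieval logic are invented for the example and are not a specific recommendation.

```python
# Minimal retrieval-augmented generation sketch: a generic foundation model,
# consumed as a service, is grounded in organization-specific documents.
# Assumes the OpenAI Python client (pip install openai numpy) and an
# OPENAI_API_KEY environment variable; all data below is hypothetical.
import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical, organization-specific snippets collected by the data platform.
domain_docs = [
    "Warranty claims for product line X are handled by the Tartu service desk.",
    "Orders above 10,000 EUR require a second approval from the finance team.",
    "Sensor readings older than 90 days are archived to cold storage.",
]

def embed(texts):
    """Turn texts into embedding vectors via a hosted embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

doc_vectors = embed(domain_docs)

def answer(question: str, top_k: int = 2) -> str:
    """Retrieve the most relevant documents and pass them to the model as context."""
    q_vec = embed([question])[0]
    # Cosine similarity between the question and every stored document.
    scores = doc_vectors @ q_vec / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n".join(domain_docs[i] for i in scores.argsort()[::-1][:top_k])
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content

print(answer("Who handles warranty claims for product line X?"))
```

At production scale, the in-memory list would typically be replaced by a vector database and, where the use case warrants it, complemented by fine-tuning, which is exactly where well-governed, high-quality domain data becomes the differentiator.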
The (data) pillars carrying AI
This leads us back to our main topic: the role of data and data processing in the age of AI.
Collecting and storing data at volume is often neither simple nor cheap. Even then, volume alone will not cut it, as data quality is equally important, which further increases the need for advanced data processing capabilities.
Next to volume and quality, requirements on data format and variety are evolving too. The best generative AI models are multimodal, relying on multiple data types and sources to provide insights and generate content. For the first time, we can process video, sound, images, and text in combination with relative ease, on top of all the “traditional” data formats. Heterogeneous data sets will grow in both size and usefulness, requiring even more sophisticated data processing techniques and more extensive storage and computing capacity.
Finally, we have privacy and security requirements and some truly new challenges there.
As a recent research paper shows, it is possible to extract training data from most of today’s available generative AI models at a rate 150× higher than expected (tested on Pythia, Llama, Falcon, ChatGPT, and others). The extracts contained a broad set of personally identifiable information (PII), as well as financial data and even security-related information taken from source code repositories used in training. On the other hand, the emergence of regulatory frameworks such as the recent EU Artificial Intelligence Act increases the need to establish data lineage wherever data is used for training, fine-tuning, or querying on top of generative AI models.
As we store and use more data, and since there is no easy way to remove it from an already trained model, we also need to care more about what data we store and how it is used.
The new generation of AI is arriving quickly and will undeniably have a significant and broad impact on the economy. Still, data-related questions – what we use, where, and how – remain at the center of “everything.”
From a practical standpoint, organizations and businesses aiming to stay relevant in the future need, in addition to AI, to master their respective data domains and capture specific operational, business, and domain knowledge through data.
Once captured, data must be cleaned, deduplicated, potentially anonymized, enriched and augmented, classified, stored, and secured. Even if we can safely assume that Data Engineering itself will advance significantly as AI is infused into the data processing toolchain, there is still a lot to cover, and the effort required should not be underestimated.
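To make this concrete, here is a minimal, hypothetical sketch of a few of those steps (deduplication, basic anonymization, classification, storage) using pandas; the column names, redaction rule, and categories are invented for illustration and are far from a production-grade pipeline.

```python
# Minimal data preparation sketch: deduplicate, mask obvious PII, classify,
# and store. Uses pandas (pip install pandas pyarrow); all data is hypothetical.
import re
import pandas as pd

records = pd.DataFrame(
    {
        "customer": ["Alice", "Alice", "Bob"],
        "message": [
            "Call me at +372 5555 1234 about the invoice.",
            "Call me at +372 5555 1234 about the invoice.",  # exact duplicate
            "The delivery arrived damaged, please advise.",
        ],
    }
)

# 1. Deduplicate exact repeats.
records = records.drop_duplicates()

# 2. Anonymize: mask phone-number-like patterns (a stand-in for a real PII scrubber).
phone_pattern = re.compile(r"\+?\d[\d\s-]{7,}\d")
records["message"] = records["message"].str.replace(phone_pattern, "[REDACTED]", regex=True)

# 3. Classify/enrich: a naive keyword rule standing in for a real classification step.
def classify(text: str) -> str:
    if "invoice" in text.lower():
        return "billing"
    if "damaged" in text.lower():
        return "complaint"
    return "other"

records["category"] = records["message"].map(classify)

# 4. Store the cleaned, enriched data in a columnar format for downstream use
#    (to_parquet requires pyarrow or fastparquet).
records.to_parquet("cleaned_records.parquet", index=False)
print(records)
```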
If the main goal is to be future-proof, then a large part of a good AI Strategy is an up-to-date Data Strategy, with modern Data Engineering as more than a necessary enabler.
Want to know more?
We have been working with data and AI for more than 20 years, offering our customers comprehensive, end-to-end data and analytics services. For a customer, this can cover the entire AI lifecycle, from data collection and preparation, through model development, testing, deployment, and monitoring, all the way to maintenance and support.
Are you familiar with Nortal Tark?
Before large language models, with ChatGPT as the frontrunner, reached wide public adoption, we had already built our data and analysis solution, Tark. Tark has helped our customers analyze and visualize their structured and unstructured data and generate insights and recommendations for continuous improvement.
Get in touch
Let us offer you a new perspective.