Organizations face significant challenges when moving operational data into data warehouses and data lakes

  • The top three challenges identified in the new study are scaling infrastructure that performs reliably, protecting sensitive data, and synchronizing multiple data sources into lakes and warehouses.
  • Organizations use a wide range of data lakes, notably Amazon S3 and/or Lake Formation, Databricks Delta Lake, and Google Cloud Platform.
  • The three biggest pain points are time efficiency, schema changes and data complexity, and parallel architectures.

More organizations now rely on data streaming to support both operational and artificial intelligence (AI) systems, yet the wide range of tools that they use to ingest data into data lakes and warehouses causes significant challenges. This is one finding from a new research report by Conduktor, the intelligent data hub for streaming data and AI.

The survey polled 200 senior IT and data executives at large companies with annual revenue of $50 million or more. When asked about moving operational data into warehouses and data lakes, respondents cited the following challenges:

  • Infrastructure: Scaling and managing pipelines that deliver data reliably
  • Security: Protecting sensitive data as it flows into lakes and warehouses
  • Integration: Connecting and synchronizing multiple sources into lakes and warehouses
  • Governance: Controlling, validating, and tracking data as it enters storage
  • An internal skills gap: Building and maintaining ingestion pipelines without in-house expertise

A wide variety of data lakes, warehouses, and ingestion tools

Respondents reported using a wide range of data lakes, notably Amazon S3 and/or Lake Formation, Databricks Delta Lake, and Google Cloud Platform. They also use an assortment of data warehouses, including Google BigQuery, Amazon Redshift, Azure Synapse Analytics, and IBM Db2 Warehouse.

When it comes to moving data from streaming systems into lakes or warehouses:

  • 73% of respondents said their organizations build custom pipelines with Spark or Flink for streaming ingestion (a minimal sketch of this approach follows the list)
  • 69% use Kafka Connect or similar tools
  • 50% use a fully managed service such as Firehose or Snowpipe
  • 49% micro-batch streaming data before loading it
  • 28% use Extract, Load, Transform (ELT) or Extract, Transform, Load (ETL) tools such as Fivetran or Airbyte
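To make the first approach concrete, the sketch below shows a minimal PySpark Structured Streaming job that reads events from a Kafka topic and appends them to a data lake path as Parquet. The broker address, topic, event schema, and storage URIs are hypothetical placeholders rather than anything taken from the report, and the one-minute trigger also illustrates the micro-batching pattern cited above.

```python
# Minimal sketch of a custom Spark streaming-ingestion pipeline.
# All connection details and paths below are hypothetical placeholders.
# Running this requires the spark-sql-kafka connector package on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = (
    SparkSession.builder
    .appName("kafka-to-lake-ingestion")  # hypothetical job name
    .getOrCreate()
)

# Expected shape of each event; in practice this schema would be managed
# centrally (e.g. via a schema registry) to cope with schema changes.
event_schema = StructType([
    StructField("order_id", StringType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

# Read the raw event stream from Kafka (hypothetical broker and topic).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
)

# Parse the JSON payload into typed columns.
events = raw.select(
    from_json(col("value").cast("string"), event_schema).alias("event")
).select("event.*")

# Append to the lake; the output path and checkpoint location are placeholders
# (an S3, ADLS, or GCS URI in practice). The processing-time trigger means each
# write is effectively a micro-batch.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://example-lake/orders/")
    .option("checkpointLocation", "s3a://example-lake/_checkpoints/orders/")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```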

This variety of tools, lakes, and warehouses creates severe pain points that slow teams' ability to deliver data to the people who need it, the research found. The top three pain points were time efficiency (difficulty collecting, connecting, and analyzing data in a centralized, streamlined way); schema changes, where shifts in how data is organized drive up data complexity; and parallel architectures, which require additional resources to manage.

Nicolas Orban, CEO of Conduktor, said: “As data streaming adoption grows (especially for AI), organizations need to address the importance of governance. Using many different data lakes and tools, with various governance models, schema formats and latency profiles, can be difficult to manage.

“Fragmented data creates chaos, including missed signals, duplicated work, and poor decisions. With Conduktor, organizations can unify operational data into one platform for full visibility and control, improving productivity of IT teams significantly.”

According to Dataintelo, the global market size for streaming data processing system software was valued at approximately USD 9.5 billion in 2023 and is projected to reach around USD 23.8 billion by 2032, reflecting a compound annual growth rate (CAGR) of 10.8% over the forecast period.
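For readers who want to sanity-check that projection, the implied growth rate can be recomputed from the two figures cited; a quick sketch, using only the numbers above (the small gap to the quoted 10.8% is rounding):

```python
# Recompute the implied CAGR from the Dataintelo figures cited above.
start_value = 9.5    # USD billion, 2023
end_value = 23.8     # USD billion, 2032 (projected)
years = 2032 - 2023  # nine-year forecast horizon

cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")  # roughly 10.7%, in line with the ~10.8% quoted
```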

Dataintelo says that: “The surge in the need for real-time data processing capabilities, driven by the exponential growth of data from various sources such as social media, IoT devices, and enterprise data systems, is a significant growth factor for this market.”

Learn more about Conduktor’s streaming data hub here: https://conduktor.io/