Reinvigorating the Data Warehouse Through the Data Pipeline
How can analysts ask all the questions they need and act on the answers in minutes, as data flows through Data Warehouses? As a legacy technology sometimes considered too slow for the modern Big Data and Real-Time Analytics world, what role do Data Warehouses have in enterprises today? In a recent DATAVERSITY® interview and DMRadio webinar entitled Modern Data Pipelines – Improving Speed, Governance, and Analysis, Taylor Brown, COO and Co-founder at Fivetran, speaks about how a fully managed Data Pipeline supercharges data flow through a Data Warehouse in a way that is, as Brown puts it, “better, faster and cheaper.”
Taylor Brown said that he and his Co-founder, George Fraser, found Data Scientists and Data Analysts “floundering with data because they lacked the computer programming skills required to build custom ETL pipelines.” Moreover, they discovered that before adopting Fivetran, analytics teams “spent 40 percent of work time pulling data from different locations, 20 percent of a job analyzing this data and 40 percent of the labor getting this information out to the company.” With Fivetran, teams spend almost none of their time wrangling data sources and instead have more time to draw insights from their data.
In the old world, data storage costs were high. Brown remarked:
“People had to optimize their data stack to store as little data as possible in order to make the queries using this data fast. To accomplish this, companies used Extract, Transform, and Load, or ETL. Most transformations entail aggregation, which means you have to plan all the queries you might want to run ahead of time. With aggregations you don’t have access to the raw data anymore, so predicting your data questions, a difficult task, is essential.”
The most technical people at the company, usually engineers, would build that transformation layer. However, many analysts did not, and still do not, have access to those engineering resources, leaving them puzzling over how to get the data they need into their analytics systems.
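The loss Brown describes can be made concrete with a small, hypothetical sketch (the data and names below are illustrative, not from the interview): once a planned-ahead transformation aggregates the raw rows, any question the plan did not anticipate becomes unanswerable.

```python
# Hypothetical illustration of classic ETL: aggregate before loading,
# then discard the raw rows. Only pre-planned questions survive.
from collections import defaultdict

raw_orders = [  # raw source rows, dropped after the transform
    {"day": "2020-01-01", "product": "A", "amount": 30},
    {"day": "2020-01-01", "product": "B", "amount": 20},
    {"day": "2020-01-02", "product": "A", "amount": 50},
]

# The planned-ahead transformation: revenue per day.
daily_revenue = defaultdict(int)
for row in raw_orders:
    daily_revenue[row["day"]] += row["amount"]

print(dict(daily_revenue))  # {'2020-01-01': 50, '2020-01-02': 50}

# An unplanned question, e.g. revenue per product, cannot be
# recovered from daily_revenue alone once raw_orders is gone.
```

This is why, as the quote says, predicting your data questions up front is essential under the old model.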
“Around 2014 and 2015, Amazon’s Redshift became very popular and available,” said Brown. Redshift combines a relational database in the Cloud with massively parallel processing. This made the Data Warehouse much faster, simpler, and cheaper to use.
With Cloud Data Warehouses more accessible, Business Intelligence tools such as Looker, Periscope, and Tableau became more adept, said Brown, at making it user friendly to explore and analyze data in relational databases and Data Warehouses. They present better visualizations of the data and do the complex querying behind the scenes.
While new Business Intelligence (BI) tools were developed that streamlined analysis, Brown noted that organizations were stuck on how to integrate an endless variety of data sources into their analysis.
Replicate-All: A Step up from ETL
Fully managed Data Pipelines, like Fivetran, take a “replicate all” approach and swap the order of the L and the T in ETL. By loading all the data into the warehouse first, analysts can access their data without waiting on engineers. And by removing the transformation step before replicating data into the Data Warehouse, an organization can sidestep the risky up-front query planning that ETL requires.
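The swapped order can be sketched in a few lines. This is a minimal illustration of the ELT pattern, not Fivetran's implementation; the source function and the in-memory SQLite database standing in for a Cloud warehouse are assumptions for the example.

```python
# Minimal ELT sketch: extract, load everything raw, transform later in SQL.
import sqlite3

def extract():
    # Stand-in for pulling rows from a SaaS source API.
    return [("2020-01-01", "A", 30), ("2020-01-01", "B", 20)]

warehouse = sqlite3.connect(":memory:")  # stand-in for a Cloud warehouse
warehouse.execute("CREATE TABLE orders (day TEXT, product TEXT, amount INT)")

# Load: replicate everything, no transformation up front.
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", extract())

# Transform: analysts aggregate in the warehouse, against the raw data,
# so new questions can be answered without re-extracting from the source.
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 50
```

Because the raw rows stay in the warehouse, the transformation is just a query that can be changed at any time.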
This is all possible because columnar, Cloud-based Data Warehouses, like Snowflake, Redshift, BigQuery, or SQL Data Warehouse on Azure, provide a structure suited to BI and Analytics at low storage cost.
These fast tools and columnar Cloud-based Data Warehouses split the extraction from the transformation step, “simplifying the whole process and making tasks like error finding much easier,” said Brown. “Even more amazing, this combined technology allows for little thought on what data to extract as everything can be extracted.”
With the low storage costs that come with Cloud Data Warehousing, it makes sense to take advantage of replicate-all data connectors. Brown noted that:
“Fivetran connects to all the different available sources, replicates the data and puts it into a Cloud Data Warehouse. Fivetran already supports replication from over 100 source types and streamlines that data into a Data Warehouse.”
Data Pipeline-as-a-Service companies are able to take advantage of technologies that make sense at scale. For example, said Brown, “we have automations in place that ensure data delivery when source schemas change.”
With the Fivetran approach, “businesses can essentially connect all their different sources in a matter of minutes,” stated Brown. This type of technology replicates data over a business’s existing schema, creating and updating tables and layouts (e.g., by adding columns) as the data sources change. When Fivetran deploys, noted Brown:
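A toy version of that schema evolution can show the idea. This is a hypothetical sketch, not Fivetran's code: when a source row arrives with a field the destination table lacks, the pipeline adds a matching column instead of failing (SQLite again stands in for the warehouse).

```python
# Hypothetical sketch of schema evolution: grow the destination table
# when the source grows a new field.
import sqlite3

wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE users (id INT, name TEXT)")

def sync(row: dict):
    # Compare the incoming row's fields with the table's current columns.
    existing = {c[1] for c in wh.execute("PRAGMA table_info(users)")}
    for col in row.keys() - existing:                  # new source fields
        wh.execute(f"ALTER TABLE users ADD COLUMN {col} TEXT")
    cols = ", ".join(row)
    marks = ", ".join("?" * len(row))
    wh.execute(f"INSERT INTO users ({cols}) VALUES ({marks})",
               list(row.values()))

sync({"id": 1, "name": "Ada"})
sync({"id": 2, "name": "Lin", "plan": "pro"})  # source added a field
print([c[1] for c in wh.execute("PRAGMA table_info(users)")])
# ['id', 'name', 'plan']
```

Earlier rows simply carry a NULL in the new column, so loading never has to stop and wait for an engineer to update the schema.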
“It uses an Apple-type approach; high quality products and a seamless user experience. We do everything from making very serious decisions on how the data gets pulled over, how it’s taken out of the source, along with some transformations to cleanse the data depending on the source API. Users are then empowered with ready-to-query schemas from Fivetran to jump start their analytics projects.”
Data Pipeline-as-a-Service vs. ETL Orchestration Tool
Consider two approaches to the Data Pipeline. In the first, a company is provided an orchestration tool. Like a tool kit, it gives customers the pieces to construct connections from data source to storage. Brown declared that “these businesses need to keep building until they replicate data from everywhere.” Should something fail in this DIY kit, the company must fix it. When a new marketing service or other source is adopted, it is the company’s responsibility to build the connection, which can typically take around six months before the data is available and loaded into the warehouse. The obvious downside to this approach is time. It also discourages companies from adopting new tools and services even when they are best for the business.
In the second approach, the Data Pipeline comes pre-assembled, as a service, with a “promise of data delivery day in and day out,” stated Brown. Companies become more agile not only in their analytics but also in their daily operations.
Fivetran sees itself as an enabler to add hundreds or thousands of different connections through its work, which differentiates Fivetran from others in the Data Pipeline space. By supporting a large number of systems, Fivetran promises to deliver data regularly through a plug-and-play method. “Business owners just add the new data streams. These data files are pulled and dropped into the Data Warehouse,” remarked Brown. Should a connection break down, Fivetran fixes it for the customer.
To provide an effective Data Pipeline service, keeping up with changes is crucial. Fivetran adds between two and five new connections per month and rewrites existing connections as the source systems change. Connections also need to get faster, reducing lag time from five minutes toward one minute. Fivetran “constantly hammers away at these three aspects,” said Brown.
The Future of Data Warehouse Technologies
As a conclusion to the interview, Brown discussed three changes ahead for Data Warehouse technologies:
- Data Warehouses will keep getting better, faster, and cheaper, leading more and more people to use Data Warehouses as Data Lakes.
- More customers will build Data Lakes as cold storage, using services such as Amazon S3 or Google Cloud Storage as an intermediary and loading small portions into a local, hot, columnar-style Data Warehouse.
- Customers will push columnar analytical stores and production Data Warehouses to be bound together. “That way they can run Analytics and production applications from the same tool, resulting in less overhead,” remarked Brown.
Photo Credit: Photobank gallery/Shutterstock.com