Bringing Data Lakes to Everyone in the Enterprise
Should Data Lakes be accessible to the typical, less technically savvy business user? Most executives would say yes. After all, companies have implemented, and continue to implement, Data Lakes to meet business needs, specifically to stay afloat in a fast-moving marketplace with ever-changing data patterns. In a recent DATAVERSITY® interview and a DMRadio webinar titled Are Data Lakes for Business Users?, Steve Wooledge, the Vice President of Marketing at Arcadia Data, discussed Data Lakes and why they are important for the entire organization, not just the IT department.
Data Lakes have proven, and continue to prove, their benefits to organizations, providing one place where vast amounts of data of various types and structures can be ingested, stored, assessed, and analyzed. “As an on-demand sandbox, Data Lakes contain hidden opportunities and nuggets of insights, guiding an organization to greater profit and growth,” said Wooledge.
Typically, however, only highly skilled Data Analysts and Data Scientists have been able to make sense of the raw data that makes up a Data Lake. While these specialists experiment with and explore the Data Lake, the average business user may feel left ashore, commented Wooledge. At best, business users rely on a Data Scientist or Data Analyst to dive into the Data Lake and uncover information pertaining to business queries. But, asked Wooledge, doesn’t this dependence on a Data Scientist or Data Analyst defeat the purpose of having real-time access to data, so that a business user can theorize and respond? After all, the marketplace may change in a matter of minutes, requiring near-immediate action. The non-technical employee is likely the one on the front lines with the skills to handle the actual problem.
So, how can managers, executives, and front-line workers get value from the Data Lake too? A Self-Service Business Intelligence (BI) tool that connects all workers to the Data Lake for Analytics would be ideal. Architectures using Apache Hadoop, a typical Data Lake standard, have scaled up Big Data, allowing many people to access many different types of data quickly, said Wooledge. But these types of solutions do not always satisfy the requirement for Self-Service BI, because moderate skill is still needed to put the pieces of data together in a useful way.
Constructing a few different kinds of specialized processing engines within the Data Lake, each connected to an interface suiting a different type of business need, may sound attractive. However, this approach can be unwieldy, may not handle many users, and can slow down queries. A better design scales out the analysis the same way the Data Lake has scaled out the data, and then provides an application in which the typical user can load the data first and explore it afterward, applying a schema on read, on demand. This architectural problem is central to Arcadia Data, a company driven to connect business users to Big Data from a self-service Business Intelligence and analytics perspective.
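To make the schema-on-read idea concrete, here is a minimal Python sketch. It is not Arcadia Data's implementation; the sample records, the field names, and the `read_with_schema` helper are all hypothetical, chosen only to show that raw records can be stored as they arrive and given a structure when a user queries them.

```python
import json

# Hypothetical raw records, stored as-is; fields vary from record to record.
RAW_LINES = [
    '{"user": "ana", "event": "click", "ms": 120}',
    '{"user": "bo", "event": "view"}',
    '{"user": "ana", "event": "click", "ms": 80, "device": "tv"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep and cast only the requested fields."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Fields missing from a record become None instead of failing.
            row[field] = cast(value) if value is not None else None
        yield row


# Two different views of the same raw data, with no upfront transformation.
latency_view = list(read_with_schema(RAW_LINES, {"user": str, "ms": int}))
device_view = list(read_with_schema(RAW_LINES, {"event": str, "device": str}))
```

Each view is defined by the person asking the question rather than by a pipeline built in advance, which is the property the article attributes to discovery on a Data Lake.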
An Accelerated Native BI Analytics Platform Within the Data Lake
Arcadia Data exists to produce a scaled-out, distributed BI Analytics platform in the Data Lake. Wooledge, who is responsible for overall go-to-market strategy and marketing for Arcadia Data, emphasized that Arcadia Data’s design puts the Data Lake in the hands of the business user.
Wooledge is a 15-year veteran of Enterprise Software in both large public companies and early-stage start-ups and has a passion for bringing innovative technology to market. As an engineer, Wooledge understands the technology needed to bring a wide range of unstructured and dynamic data to a multitude of concurrent regular users who need quick analysis.
In the webinar, he discussed an ongoing assessment conducted by Eckerson Group and Bloor Group to find the Data Lake’s value to the regular user, the business person, and how Arcadia Data’s native BI tool, built inside the Data Lake, meets typical enterprise BI needs.
Wooledge explained how Arcadia Data’s native Data Lake application provides the acceleration that makes it uniquely valuable. He said that four aspects set Arcadia Data apart:
- Scalability: Wooledge reflected, “When it comes to scale, we are the first and only truly distributed BI platform. We run where the data sits. And all our processing is pushed down to the individual data nodes where the data sits.” He added that, based on the actual queries people execute, “we can recommend [how] to more intelligently store that data back on the cluster, [in addition to leveraging] memory constructs to [speed up] those queries over time.” This, he said, allows an organization to support ten to a hundred times more users than a typical system.
- Ability to Handle Data Variety: It is now possible to use unstructured data, e.g., JSON and other data formats in which the metadata and schemas are defined within the structure. Arcadia Data, according to Wooledge, can “interpret and visualize those without requiring ETL (Extract, Transform, Load) in advance.” As a result, a stream of data coming from a set-top box, say via Apache Kafka, can be read and visualized in real time. There is no need to pre-process the data.
- End User Agility: Wooledge believes that enterprises typically try to implement two BI standards, one for the Data Warehouse and the other for the Data Lake. To enhance end user agility, firms try to take the BI tool built for the Data Warehouse and apply it to the Data Lake. This tactic, though, requires sampling the data and moving it out of the Data Lake environment for data formatting reasons. Wooledge advised against doing so: “By running a Business Analytics application natively in the Data Lake, both self-service exploration and production reports generated on dashboards, accessible to hundreds or thousands of users, become possible.”
- Less to Manage and Maintain: Arcadia Data’s solution provides significant benefits from an IT Governance and developer’s perspective. Wooledge noted that BI tools built for Data Warehouses often require maintaining and managing two separate hardware environments. Since Arcadia Data’s applications allow for inheritance, only one environment needs to be updated. For example, if an employee changes roles within the organization, security does not need to be administered from two different locations. Wooledge emphasized that this means lower cost overall, not to mention enhanced performance.
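The Scalability point above, where processing is “pushed down to the individual data nodes where the data sits,” can be illustrated with a small Python sketch. This is only an illustration of the general pushdown pattern, not Arcadia Data’s engine; the node names, the sample rows, and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical data nodes, each holding its own slice of (event, amount) rows.
NODES = {
    "node-1": [("click", 3), ("view", 5), ("click", 2)],
    "node-2": [("view", 1), ("click", 4)],
    "node-3": [("purchase", 7)],
}


def local_aggregate(rows):
    """Runs where the data sits: reduce local rows to one partial total per key."""
    partial = Counter()
    for key, amount in rows:
        partial[key] += amount
    return partial


def merge_partials(partials):
    """Runs on the coordinator: combine the small partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


partials = [local_aggregate(rows) for rows in NODES.values()]
totals = merge_partials(partials)
```

Only the compact per-node summaries travel to the coordinator, not the raw rows, which is why this pattern tends to scale with the number of data nodes.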
Using Machine Learning to Enhance BI Performance
Wooledge explained the two new ways Arcadia Data employs Machine Learning in its upcoming BI Tools. On the backend,
“Smart Acceleration uses Machine Learning to interpret queries and recommend caching strategies and physical data models, speeding up data discovery and all the queries.”
On the front end, Wooledge described Instant Visuals: “Instant Visuals look at usefully collected data dimensions and measures, as well as data’s cardinality.” Screen displays will be customized to the six to nine different ways business users visualize their data, based on Data Visualization best practices. As Wooledge put it, Instant Visuals are a “recommendation engine for Data Visualization.”
In addition, Arcadia Data has built real-time integration with Apache Kafka through KSQL, Confluent’s streaming SQL engine. Wooledge used the example of cybersecurity to illustrate its power. Say a Security Analyst wants to see a denial-of-service attack or another breach in real time. The analyst will also want to “drill down [in the] details” (e.g., the history across all the network traffic, endpoints, and users). From this analysis, declared Wooledge, it can be determined “if there’s other people involved, potentially or bad actors, or different clusters of nodes or systems that act as an entry point for some of these attacks.” The combination of viewing information in real time and drilling down, through Apache Kafka and Confluent, gives additional power to Arcadia Data’s BI Analytics solution.
The Future: Augmented Analytics
Wooledge sees Augmented Analytics as further enhancing Arcadia Data’s BI tool by helping business users query information in the Data Lake more effectively. Augmented Analytics uses natural language in searching, giving managers, executives, and front-line workers a different way to see data beyond the typical generated reports, charts, and drill-downs. As a result, he noted, end users can discover information more quickly and provide more value.
With enterprises choosing a new BI standard for their Data Lake environments, Augmented Analytics and Machine Learning have much to offer. Fundamentally, he stated:
“I think that the laws of physics require if you are going to have distributed scale-out data Platform you need to have distributed scale-out analytic platforms. And I think that’s the inflection point that we’re creating a whole new level of intelligence, more real-time than was ever possible before with traditional technology.”
The podcast associated with this topic, which ties in with Steve Wooledge’s webinar, is also on DMRadio. It was titled Data Warehousing and Data Lakes: The Big Picture and included guests Steve Wooledge of Arcadia Data, Wayne Eckerson of Eckerson Group, Lakshmi Randall of Denodo, and Alexandra Gutow of Cloudera.
Photo Credit: whiteMocca/Shutterstock.com