Take a closer look at your Spark implementation

Apache Spark, the extremely popular data analytics execution engine, was initially released in 2012. It wasn’t until 2015 that Spark really saw an uptick in support, but by November 2015 the project showed 50 percent more activity than the core Apache Hadoop project itself, with more than 750 contributors from hundreds of companies participating in its development in one form or another.

Spark is a hot new commodity for a reason. Its performance, general-purpose applicability, and programming flexibility combine to make it a versatile execution engine. Yet that versatility also leads to varying levels of support for the product and to differences in how solutions that include it are delivered.

While evaluating analytic software products that support Spark, customers should look closely under the hood and examine four key facets of how the support for Spark is implemented:

  • How Spark is utilized inside the platform
  • What you get in a packaged product that includes Spark
  • How Spark is exposed to you and your team
  • How you perform analytics with the different Spark libraries

Spark can be used as a developer tool via its APIs, or it can be used by BI tools via its SQL interface. Or Spark can be embedded in an application, providing access to business users without requiring programming skills and without limiting Spark’s utility through a SQL interface. I examine each of these options below and explain why all Spark support is not the same.

Programming on Spark

If you want the full power of Spark, you can program directly against its processing engine through APIs exposed in Java, Python, Scala, and R. In addition to stream and graph processing components, Spark offers a machine-learning library (MLlib) as well as Spark SQL, which lets data tools connect to a Spark engine and query structured data, and lets programmers access data through SQL queries they write themselves.
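
To make that concrete, here is a minimal Scala sketch of programming directly against those APIs; the file path and column names are hypothetical, and the SparkSession entry point shown is the Spark 2.x style:

    // Minimal sketch: querying the same data through the DataFrame API and
    // through Spark SQL. The path and column names are hypothetical placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("spark-api-sketch").getOrCreate()

    val events = spark.read.json("hdfs:///data/events.json")

    // The DataFrame API...
    events.groupBy("userId").count().show()

    // ...or the equivalent Spark SQL query against a temporary view.
    events.createOrReplaceTempView("events")
    spark.sql("SELECT userId, COUNT(*) AS cnt FROM events GROUP BY userId").show()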

A number of vendors offer standalone Spark implementations; the major Hadoop distribution suppliers also offer Spark within their platforms. Access is typically exposed through a command-line shell or a notebook interface.

But performing analytics on core Spark with its APIs is a time-consuming, programming-intensive process. While Spark offers an easier programming model than, say, native Hadoop, it still requires developers. Even for organizations with developer resources, deploying them on lengthy data analytics projects may amount to an intolerable hidden cost. For this reason, programming on Spark is simply not a practical option for many organizations.

BI on Spark

Spark SQL is a standards-based way to access data in Spark. It has been relatively easy for BI products to add support for Spark SQL to query tabular data in Spark. The dialect of SQL used by Spark is similar to that of Apache Hive, making Spark SQL akin to earlier SQL-on-Hadoop technologies.

Although Spark SQL uses the Spark engine behind the scenes, it suffers from the same disadvantage as Hive and Impala: Data must be in a structured, tabular format to be queried. This forces Spark to be treated as if it were a relational database, which cripples many of the advantages of a big data engine. Simply put, layering BI on top of Spark requires first transforming the data into a reasonable tabular format that the BI tools can consume.
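
As a rough illustration of that preparation step, the following Scala sketch flattens nested JSON into a tabular view that a BI tool connected over Spark SQL could then query; the path and field names are hypothetical:

    // Hypothetical sketch: flattening nested records into a flat, typed table
    // before exposing it to BI tools through Spark SQL.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.explode

    val spark = SparkSession.builder().appName("bi-prep-sketch").getOrCreate()

    val orders = spark.read.json("hdfs:///data/orders.json")   // nested line items

    val flat = orders
      .select(orders("orderId"), explode(orders("items")).as("item"))
      .select("orderId", "item.sku", "item.qty", "item.price")

    // Expose the flattened result as a table that SQL-based tools can see.
    flat.createOrReplaceTempView("order_items")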

Embedding Spark

Another way to leverage Spark is to abstract away its complexity by embedding it deep into a product and taking full advantage of its power behind the scenes. This allows users to leverage the speed and power of Spark without needing developers.

This architecture brings up three key questions. First, does the platform truly hide all of the technical complexities of Spark? As a customer, you need to examine how you would carry out each step of the analytic cycle: integration, preparation, analysis, visualization, and operationalization. A number of products offer self-service capabilities that abstract away Spark’s complexities, but others force the analyst to dig down and code, for example when performing integration and preparation. These products may also require you to first ingest all your data into the Hadoop file system for processing. This adds extra length to your analytic cycles, creates fragile and fragmented analytic processes, and requires specialized skills.

Second, how does the platform take advantage of Spark? It’s critical to understand how Spark is used in the execution framework. Spark is sometimes embedded in a fashion that does not have the full scalability of a true cluster. This can limit overall performance as the volume of analytic jobs increases.

Third, how are you protected for the future? The strength of being tightly coupled with the Spark engine is also a weakness. The big data industry moves quickly. MapReduce was the predominant engine in Hadoop for six years. Apache Tez became mainstream in 2013, and now Spark has become a major engine. Assuming the technology curve continues to produce new engines at the same rate, Spark will almost certainly be supplanted by a new engine within 18 months, forcing products tightly coupled to Spark to be reengineered, which is far from a trivial undertaking. Even setting that effort aside, you must consider whether the redesigned product will be fully compatible with what you’ve built in the older version.

The first step to uncovering the full power of Spark is to understand that not all Spark support is created equal. It’s crucial that organizations grasp the differences in Spark implementations and what each approach means for their overall analytic workflow. Only then can they make a strategic buying decision that will meet their needs over the long haul.

Andrew Brust is senior director of market strategy and intelligence at Datameer.

[Source: InfoWorld]

Apache Spark powers live SQL analytics in SnappyData

The team behind Pivotal’s GemFire in-memory transactional data store recently unveiled SnappyData, a new database solution powered by GemFire and Apache Spark.

SnappyData is another recent example of Spark employed as a component in a larger database solution, with or without other pieces from Apache Hadoop.

Snap and spark

SnappyData — the name of both the new database and the organization producing it — was built to span two worlds. It uses the Apache Spark in-memory data-analytics engine so that it can perform live SQL analytics on both static data sets and streams. Queries against SnappyData can be written as conventional SQL or as Spark abstractions, so existing work done in both paradigms can be reused, alone or together, on the same data.

To store and retrieve the data, SnappyData has a distributed data store called Snappy-Store, derived from a variant of Pivotal’s GemFire technology. It works as either its own data store or as a sort of asynchronous write-back cache to other data sources, such as Hadoop/HDFS. This implies that existing data sets could be accessed through SnappyData without having to be formally migrated.

SnappyData also tries to offer novel solutions to problems that can arise when using streaming data. For instance, if too much data is coming through to answer a query in a timely fashion, SnappyData uses approximate query processing (AQP), a method of sampling the streaming data to generate an answer.

The results are less exact than operating on the entire data set, and AQP isn’t available for every kind of query. That said, AQP queries are intended to be faster to run and are less demanding of CPU and memory than working on the full data set.
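
SnappyData’s AQP engine itself isn’t shown here, but the core idea, trading exactness for speed by working on a sample, can be sketched with plain Spark DataFrame operations (the table name and sampling fraction are hypothetical):

    // Generic illustration of sampling-based approximation, not SnappyData's AQP.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("aqp-idea-sketch").getOrCreate()

    val fraction = 0.01  // examine roughly 1 percent of the rows

    val sample = spark.table("events").sample(withReplacement = false, fraction)

    // Scale the sampled count back up to estimate the true row count.
    val approxCount = (sample.count() / fraction).toLong
    println(s"Approximate row count: $approxCount")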

One among many

This isn’t the first time Spark has been used at the heart of a data analysis solution that covers both OLTP and OLAP workloads. In-memory database system Splice Machine was originally built on top of Hadoop components and leveraged them to scale out and run both OLTP and OLAP jobs within the same system. Version 2.0 of that product added Spark as an OLAP processing engine.

Where SnappyData diverges from Splice Machine, though, is in how Spark is used. SnappyData claims it’s extending Spark Streaming in various ways, such as allowing streams to be manipulated and queried as though they were tables, including operations like joins.
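
SnappyData’s own extensions aren’t reproduced here, but the general “streams as tables” idea can be illustrated with a stream-to-static join in plain Spark Structured Streaming; the host, port, and reference data are hypothetical:

    // Illustrative only: plain Spark Structured Streaming, not SnappyData's API.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stream-join-sketch").getOrCreate()
    import spark.implicits._

    // Static reference data (hypothetical).
    val users = Seq(("u1", "Alice"), ("u2", "Bob")).toDF("userId", "name")

    // A streaming source read as an unbounded table, one userId per line.
    val clicks = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()
      .withColumnRenamed("value", "userId")

    // The stream is joined with the static table exactly like a table join.
    val enriched = clicks.join(users, "userId")

    enriched.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()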

SnappyData also seems like a good environment to leverage changes that are slated for Apache Spark in the near term. For instance, Spark 2.0, scheduled to come out later this year, will heavily rework how Spark handles memory management and introduce changes to its streaming system that make it easier to pull down streaming data.

[Source: InfoWorld]

Spark 2.0 takes an all-in-one approach to big data

Apache Spark, the in-memory processing system that’s fast become a centerpiece of modern big data frameworks, has officially released its long-awaited version 2.0.

Aside from some major usability and performance improvements, Spark 2.0’s mission is to become a total solution for streaming and real-time data. This comes as a number of other projects — including others from the Apache Foundation — provide their own ways to boost real-time and in-memory processing.

Easier on top, faster underneath

Most of Spark 2.0’s big changes have been known well in advance, which has made them even more hotly anticipated.

One of the largest and most technologically ambitious additions is Project Tungsten, a reworking of how Spark handles memory and code generation. Pieces of Project Tungsten have shown up in earlier releases, but 2.0 adds more, such as applying Tungsten’s memory management to both caching and runtime execution.

For users, these changes, plus a great many other under-the-hood improvements, provide across-the-board performance gains. Spark’s developers claim a two-to-tenfold increase in speed for common DataFrame and SQL operations, thanks to a new code generation system. Window functions, used for tasks like computing moving averages over data, have been reimplemented natively for further speedups.
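
For readers unfamiliar with window functions, here is a small Scala sketch of the kind of moving-average query referred to above; the column names and values are hypothetical:

    // Hypothetical sketch of a moving average expressed with a window function.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.avg

    val spark = SparkSession.builder().appName("window-sketch").getOrCreate()
    import spark.implicits._

    val prices = Seq(("2016-07-01", 10.0), ("2016-07-02", 12.0), ("2016-07-03", 11.0))
      .toDF("day", "price")

    // Average over the current row and the two rows before it, ordered by day.
    val movingWindow = Window.orderBy("day").rowsBetween(-2, 0)
    prices.withColumn("moving_avg", avg($"price").over(movingWindow)).show()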

Spark 2.0 also brings a major shift in programming APIs. DataFrames and Datasets, previously two different ways of accessing structured data, are now the same under the hood; DataFrames are now “just a type alias for Dataset of Row,” per Spark’s release notes. R language users can also now write a small range of user-defined functions and leverage better support for existing Spark features.
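
A brief Scala sketch of what that unification looks like in practice (the case class and data are hypothetical):

    // In Spark 2.0, DataFrame is just an alias for Dataset[Row]; the same data
    // can also be viewed as a strongly typed Dataset via a case class.
    import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
    import spark.implicits._

    val df: DataFrame = Seq(("Ann", 34L), ("Ben", 28L)).toDF("name", "age")

    val rows: Dataset[Row] = df            // identical type under the hood
    val people: Dataset[Person] = df.as[Person]
    people.filter(_.age > 30).show()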

These changes make Spark more powerful without unnecessary complexity, since Spark’s straightforward APIs are one of its biggest attractions.

Spark has streaming — and company

Spark has been refining its metaphors for streamed and real-time data as well, and Structured Streaming makes its proper debut in 2.0. It repurposes Spark’s existing DataFrame/Dataset API to connect with streaming data sources like Kafka 0.10, so such data can be processed live.
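
As a rough sketch of what that looks like from the API side, the following Scala snippet reads a Kafka topic as an unbounded table and aggregates it live; the broker address and topic name are hypothetical, and the spark-sql-kafka-0-10 connector package is assumed to be on the classpath:

    // Minimal Structured Streaming sketch against a Kafka source.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-stream-sketch").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()

    // Kafka records arrive as binary key/value columns alongside metadata such
    // as topic and partition; here we simply count records per partition.
    val counts = stream.groupBy("partition").count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()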

Streaming has long been considered one of Spark’s weaker features because it’s harder to debug and keep running than it is to set up. But it has emerged as a contender to another major streaming-data solution, Apache Storm, in large part because Spark is much easier to use overall.

With version 2.0, Spark is making a bid to be an all-in-one processing framework accessed by a few overarching APIs. But in the run-up to Spark 2.0, other projects have emerged with their own conceits for how to approach streaming and batch processing — Twitter’s Heron, Apache Apex, and Apache Flink, to name a few.

All these projects have their advantages. Heron reuses Apache Storm’s metaphors for streaming to make it easier for Storm users to get on board. Apex is even easier than Spark to work with, especially when it comes to fault tolerance or event ordering. And Flink uses a native streaming model rather than a retrofitted version of Spark’s existing data model.

Still, Spark has managed to establish itself solidly over the past couple of years as an ingredient in third-party software products (SnappyData, Splice Machine) and cloud-native data systems (IBM and more). Spark 2.0 is set on making that legacy harder to displace.

[Source: JavaWorld]