Take a closer look at your Spark implementation

Apache Spark, the extremely popular data analytics execution engine, emerged from UC Berkeley’s AMPLab and was open-sourced in 2010. It wasn’t until 2015 that Spark really saw an uptick in support, but by November 2015, Spark saw 50 percent more activity than the core Apache Hadoop project itself, with more than 750 contributors from hundreds of companies participating in its development in one form or another.

Spark is a hot commodity for a reason. Its performance, general-purpose applicability, and programming flexibility combine to make it a versatile execution engine. Yet that same versatility leads to varying levels of support among products and widely different ways of delivering Spark-based solutions.

When evaluating analytic software products that support Spark, customers should look closely under the hood and examine four key facets of how that support is implemented:

  • How Spark is utilized inside the platform
  • What you get in a packaged product that includes Spark
  • How Spark is exposed to you and your team
  • How you perform analytics with the different Spark libraries

Spark can be used as a developer tool via its APIs, or it can be used by BI tools via its SQL interface. Or Spark can be embedded in an application, providing access to business users without requiring programming skills and without limiting Spark’s utility through a SQL interface. I examine each of these options below and explain why all Spark support is not the same.

Programming on Spark

If you want the full power of Spark, you can program directly against its processing engine. APIs are exposed through Java, Python, Scala, and R. In addition to stream and graph processing components, Spark offers a machine-learning library (MLlib) as well as Spark SQL, which lets data tools connect to a Spark engine and query structured data, and lets programmers access data via SQL queries they write themselves.
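
As a rough sketch of what this looks like in practice (the file, column names, and app name below are hypothetical), the same data can be touched through the core DataFrame API, through Spark SQL, and through MLlib:

```python
# A minimal PySpark sketch: one data set touched via the DataFrame API,
# Spark SQL, and MLlib. File, column, and app names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("spark-facets-demo").getOrCreate()

# Core DataFrame API: load and filter structured data.
events = spark.read.json("events.json")
active = events.filter(events["status"] == "active")

# Spark SQL: the same filter, phrased as a query a SQL user could write.
events.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) FROM events WHERE status = 'active'").show()

# MLlib: cluster the active events on a numeric column.
features = VectorAssembler(inputCols=["amount"], outputCol="features")
model = KMeans(k=2).fit(features.transform(active))
print(model.clusterCenters())
```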

A number of vendors offer standalone Spark implementations, and the major Hadoop distribution suppliers also offer Spark within their platforms. Access is exposed through either a command-line or notebook interface.

But performing analytics on core Spark with its APIs is a time-consuming, programming-intensive process. While Spark offers an easier programming model than, say, native Hadoop, it still requires developers. Even for organizations with developer resources, deploying them to work on lengthy data analytics projects may amount to an intolerable hidden cost. For this reason, programming on Spark is not a viable course for many organizations.

BI on Spark

Spark SQL is a standards-based way to access data in Spark. It has been relatively easy for BI products to add support for Spark SQL to query tabular data in Spark. The dialect of SQL used by Spark is similar to that of Apache Hive, making Spark SQL akin to earlier SQL-on-Hadoop technologies.

Although Spark SQL uses the Spark engine behind the scenes, it suffers from the same disadvantage as Hive and Impala: Data must be in a structured, tabular format to be queried. This forces Spark to be treated as if it were a relational database, which cripples many of the advantages of a big data engine. In short, running BI on top of Spark first requires transforming the data into a reasonable tabular format that the BI tools can consume.
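
A sketch of that transformation step, assuming hypothetical nested order records, shows why it is unavoidable: nested structures have to be projected into flat rows before a SQL-speaking BI tool can see them:

```python
# A sketch of the flattening BI-on-Spark requires: nested records are
# projected into a flat view before SQL tools can consume them. The file
# and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("bi-flatten-demo").getOrCreate()

# e.g. {"id": 1, "customer": {"name": "Acme"}, "items": [{"sku": "A", "price": 9.5}]}
raw = spark.read.json("orders.json")

flat = (raw
        .withColumn("item", explode(col("items")))
        .select(col("id"),
                col("customer.name").alias("customer_name"),
                col("item.sku").alias("sku"),
                col("item.price").alias("price")))

# This flat projection is what a BI tool would query through Spark SQL,
# typically via the Thrift JDBC/ODBC server.
flat.createOrReplaceTempView("orders_flat")
spark.sql("SELECT customer_name, SUM(price) AS total "
          "FROM orders_flat GROUP BY customer_name").show()
```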

Embedding Spark

Another way to leverage Spark is to abstract away its complexity by embedding it deep into a product and taking full advantage of its power behind the scenes. This allows users to leverage the speed and power of Spark without needing developers.

This architecture brings up three key questions. First, does the platform truly hide all of the technical complexities of Spark? As a customer, you need to examine how you would carry out each step of the analytic cycle: integration, preparation, analysis, visualization, and operationalization. A number of products offer self-service capabilities that abstract away Spark’s complexities, but others force the analyst to dig down and code — for example, when performing integration and preparation. These products may also require you to first ingest all your data into the Hadoop file system for processing. That adds extra length to your analytic cycles, creates fragile and fragmented analytic processes, and requires specialized skills.

Second, how does the platform take advantage of Spark? It’s critical to understand how Spark is used in the execution framework. Spark is sometimes embedded in a fashion that does not have the full scalability of a true cluster. This can limit overall performance as the volume of analytic jobs increases.

Third, how are you protected for the future? The strength of being tightly coupled to the Spark engine is also a weakness. The big data industry moves quickly. MapReduce was the predominant engine in Hadoop for six years. Apache Tez became mainstream in 2013, and now Spark has become a major engine. Assuming the technology curve continues to produce new engines at the same rate, Spark will almost certainly be supplanted by a new engine within 18 months, forcing products tightly coupled to Spark to be reengineered — a far from trivial undertaking. Even setting that effort aside, you must consider whether the redesigned product will be fully compatible with what you’ve built on the older version.

The first step to uncovering the full power of Spark is to understand that not all Spark support is created equal. It’s crucial that organizations grasp the differences in Spark implementations and what each approach means for their overall analytic workflow. Only then can they make a strategic buying decision that will meet their needs over the long haul.

Andrew Brust is senior director of market strategy and intelligence at Datameer.

[Source:- IW]

Google Cloud SQL provides easier MySQL for all

Google Cloud SQL aims to provide easier MySQL for all

With the general availability of Google Cloud Platform’s latest database offerings — the second generation of Cloud SQL, Cloud Bigtable, and Cloud Datastore — Google is setting up a cloud database strategy founded on a basic truth of software: Don’t get in the customer’s way.

For an example, look no further than the new iteration of Cloud SQL, a hosted version of MySQL for Google Cloud Platform. MySQL is broadly used by cloud applications, and Google is trying to keep it fuss-free — no small feat for any piece of software, let alone a database notorious for the tweaking it needs to work well.

Most of the automation around MySQL in Cloud SQL involves items that should be automated anyway, such as updates, automatic scaling to meet demand, autofailover between zones, and backup/roll-back functionality. This automation all comes via a recent version of MySQL, 5.7, not via an earlier version that’s been heavily customized by Google to support these features.

The other new offerings, Cloud Datastore and Cloud Bigtable, are fully managed incarnations of NoSQL and HBase/Hadoop systems. These systems have fewer users than MySQL but are likely to store gobs more data than MySQL typically holds. One of MySQL 5.7’s new features, support for JSON data, provides NoSQL-like functionality for existing MySQL users, but users who are truly serious about NoSQL are likely to do that work on a platform designed to support it from the ground up.
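
That JSON support is plain MySQL 5.7 behavior, so it works the same against Cloud SQL as against any other 5.7 instance. A minimal sketch, with hypothetical host, credentials, and schema:

```python
# A sketch of MySQL 5.7's JSON columns, which Cloud SQL inherits unchanged
# since it runs stock 5.7. Host, credentials, and schema are hypothetical.
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="app",
                               password="secret", database="demo")
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS events (id INT PRIMARY KEY, doc JSON)")
cur.execute("""REPLACE INTO events VALUES (1, '{"type": "click", "page": "/home"}')""")

# JSON_EXTRACT gives document-style field access inside a relational table.
cur.execute("SELECT JSON_EXTRACT(doc, '$.page') FROM events WHERE id = 1")
print(cur.fetchone())

conn.commit()
cur.close()
conn.close()
```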

The most obvious competition for Cloud SQL is Amazon’s Aurora service. When reviewed by InfoWorld’s Martin Heller in October 2015, it supported a recent version of MySQL (5.6) and had many of the same self-healing and self-maintaining features as Cloud SQL. Where Google has a potential edge is in the overall simplicity of its platform — a source of pride in other areas, such as a far less sprawling and complex selection of virtual machine types.

Another competitor is Snowflake, the cloud data warehousing solution designed to require little user configuration or maintenance. Snowflake’s main drawback is that it’s a custom-built database, even if it is designed to be highly compatible with SQL conventions. Cloud SQL, by contrast, is simply MySQL, a familiar product with well-understood behaviors.

[Source:- IW]

MySQL zero-day exploit puts some servers at risk of hacking

A zero-day exploit could be used to hack MySQL servers.

A publicly disclosed vulnerability in the MySQL database could allow attackers to completely compromise some servers.

The vulnerability affects “all MySQL servers in default configuration in all version branches (5.7, 5.6, and 5.5) including the latest versions,” as well as the MySQL-derived databases MariaDB and Percona DB, according to Dawid Golunski, the researcher who found it.

The flaw, tracked as CVE-2016-6662, can be exploited to modify the MySQL configuration file (my.cnf) and cause an attacker-controlled library to be executed with root privileges if the MySQL process is started with the mysqld_safe wrapper script.

The exploit can be executed if the attacker has an authenticated connection to the MySQL service, which is common in shared hosting environments, or through an SQL injection flaw, a common type of vulnerability in websites.

Golunski reported the vulnerability to the developers of all three affected database servers, but so far only MariaDB and Percona DB have received patches. Oracle, which develops MySQL, was informed on July 29, according to the researcher, but has yet to fix the flaw.

Oracle releases security updates on a quarterly schedule, and the next batch is expected in October. However, because the MariaDB and Percona patches have been public since the end of August, the researcher decided to release details about the vulnerability Monday so that MySQL admins can take action to protect their servers.

Golunski’s advisory contains a limited proof-of-concept exploit, but some parts have been intentionally left out to prevent widespread abuse. The researcher also reported a second vulnerability to Oracle, CVE-2016-6663, that could further simplify the attack, but he hasn’t published details about it yet.

The disclosure of CVE-2016-6662 was met with some criticism on specialized discussion forums, where some users argued that it’s actually a privilege escalation vulnerability and not a remote code execution one as described, because an attacker would need some level of access to the database.

“As temporary mitigations, users should ensure that no mysql config files are owned by mysql user, and create root-owned dummy my.cnf files that are not in use,” Golunski said in his advisory. “These are by no means a complete solution and users should apply official vendor patches as soon as they become available.”
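
In practice, that mitigation amounts to an ownership audit of the MySQL config files. A minimal sketch (the paths below follow common Linux defaults and may differ on your system; this supplements, not replaces, vendor patches):

```python
# A rough check along the lines of Golunski's mitigation advice: flag
# MySQL config files owned by the mysql user. Paths are common Linux
# defaults and are assumptions; adjust for your distribution.
import os
import pwd

CANDIDATE_PATHS = ["/etc/my.cnf", "/etc/mysql/my.cnf",
                   "/var/lib/mysql/my.cnf"]

for path in CANDIDATE_PATHS:
    if not os.path.exists(path):
        continue
    owner = pwd.getpwuid(os.stat(path).st_uid).pw_name
    if owner == "mysql":
        print("WARNING: %s is owned by 'mysql'; reassign it to root" % path)
    else:
        print("OK: %s is owned by %s" % (path, owner))
```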

Oracle didn’t immediately respond to a request for comment on the vulnerability.

[Source:- IW]

SQL Server 2016 heads for release, but Linux version is still under wraps

Linux version of SQL Server 2016 still under wraps

SQL Server 2016, Microsoft’s newest database software, is set to become available on June 1 along with a no-cost, developers-only version.

With its new features and revised product editions, Microsoft is determined to expand SQL Server’s appeal to the largest possible number of customers running in a range of environments. But there’s still no word on the promised SQL Server for Linux, a version of the popular database that Microsoft hopes will open SQL Server to an entirely new audience.

A broader SQL Server market awaits

Much of what’s new in SQL Server 2016 is aimed at roughly two classes of users: those doing their data collection and storage in the cloud (or moving to the cloud) and those doing analytics work that benefits from being performed in-memory. Features like Stretch Database will appeal to the former, as SQL Server tables can be expanded incrementally into Microsoft Azure — a more appealing option than a disruptive all-or-nothing migration.
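
A hedged sketch of what enabling Stretch on a single table looks like, assuming an instance already linked to Azure (the server, database, and table names are hypothetical, and the T-SQL follows the Stretch syntax as documented for SQL Server 2016):

```python
# A sketch of enabling Stretch Database on one table via pyodbc. It
# assumes the database is already linked to Azure; names, credentials,
# and driver version are hypothetical.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 13 for SQL Server};"
                      "SERVER=localhost;DATABASE=Sales;UID=sa;PWD=...",
                      autocommit=True)
cur = conn.cursor()

# Allow the instance to use remote data archiving (Stretch).
cur.execute("EXEC sp_configure 'remote data archive', 1; RECONFIGURE;")

# Begin migrating eligible rows of one table to Azure incrementally.
cur.execute("ALTER TABLE dbo.OrderHistory SET "
            "(REMOTE_DATA_ARCHIVE = ON (MIGRATION_STATE = OUTBOUND))")
```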

Big data features include expanded capabilities for the Hekaton in-memory functions introduced in SQL Server 2014, plus in-memory columnstore functions for real-time analytics. And SQL Server’s close integration with the R language tools that Microsoft recently acquired opens up the database to a range of new applications from a thriving software ecosystem.

The forthcoming Linux version of SQL Server, though, is how Microsoft really plans to expand to an untapped market. And not just Linux users, but a specific kind of Linux user: those who use Oracle on Linux but are tired of Oracle’s unpredictable licensing. Oracle has been trying to change its tune, but there’s a lot to be said for being able to run SQL Server without also needing to run Windows.

Which versions and when?

Two big questions still remain about SQL Server for Linux. The first is when it will see the light of day; Microsoft hasn’t provided a timeframe yet. (A Microsoft spokesperson could provide no new comment.)

The second is what its pricing and SKUs will look like: Will the feature set match what’s available on Windows, or will it be a stripped-down version? Microsoft has versions of SQL Server to match most any workload or budget, from the free-to-use Express edition to the full-blown Enterprise variety.

With SQL Server 2014 — and now with 2016 as well — the company introduced a free-to-use developer version of the Enterprise SKU intended solely for dev and testing work. It’s unclear whether SQL Server on Linux will also include a developer version or only include editions specifically for commercial use.

Whatever happens with SQL Server on Linux, Microsoft is already making aggressive efforts to woo Oracle users into its camp. The company has a limited-time Oracle-to-SQL-Server migration offer, under which Microsoft Software Assurance customers can swap Oracle licenses for SQL Server licenses at no cost. It will be intriguing to see whether a similar offer pops up after Microsoft releases SQL Server for Linux.

[Source:- Infoworld]

Apache Spark powers live SQL analytics in SnappyData

The team behind Pivotal’s GemFire in-memory transactional data store recently unveiled a new database solution powered by GemFire and Apache Spark, called SnappyData.

SnappyData is another recent example of Spark employed as a component in a larger database solution, with or without other pieces from Apache Hadoop.

Snap and spark

SnappyData — the name of both the new database and the organization producing it — was built to span two worlds. It uses the Apache Spark in-memory data-analytics engine so that it can perform live SQL analytics on both static data sets and streams. Queries against SnappyData can be written as conventional SQL or as Spark abstractions, so existing work done in both paradigms can be reused, alone or together, on the same data.
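
SnappyData’s own session APIs aside, the dual paradigm is easy to picture in plain Spark, where the same data set answers to both the DataFrame API and conventional SQL (the data set and names below are hypothetical):

```python
# A plain-Spark illustration of the dual paradigm SnappyData exposes:
# the same data queried through the DataFrame API and as conventional
# SQL. SnappyData's actual entry points differ; names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dual-paradigm-demo").getOrCreate()

trades = spark.read.parquet("trades.parquet")

# Spark abstraction:
trades.groupBy("symbol").avg("price").show()

# Conventional SQL over the same data:
trades.createOrReplaceTempView("trades")
spark.sql("SELECT symbol, AVG(price) FROM trades GROUP BY symbol").show()
```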

To store and retrieve the data, SnappyData has a distributed data store called Snappy-Store, derived from Pivotal’s GemFire technology. It works either as its own data store or as a sort of asynchronous write-back cache for other data sources, such as Hadoop/HDFS. This means existing data sets can be accessed through SnappyData without having to be formally migrated.

SnappyData also tries to offer novel solutions to problems that can arise when using streaming data. For instance, if too much data is coming through to answer a query in a timely fashion, SnappyData uses approximate query processing (AQP), a method of sampling streaming data, to generate an answer.

The results are less exact than operating on the entire data set, and AQP isn’t available for every kind of query. That said, AQP queries are intended to be faster to run and are less demanding of CPU and memory than working on the full data set.
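
The intuition behind AQP can be reduced to sampling and scaling. A toy sketch in plain Spark (SnappyData’s real AQP adds error bounds and stream-aware sampling, so this shows only the flavor of the idea):

```python
# The idea behind AQP at its simplest: answer from a small sample and
# scale up, trading exactness for speed. Data set name is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqp-intuition").getOrCreate()
events = spark.read.parquet("events.parquet")

FRACTION = 0.01
sampled = events.sample(withReplacement=False, fraction=FRACTION)
approx_count = int(sampled.count() / FRACTION)
print("approximate row count: %d" % approx_count)
```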

One among many

This isn’t the first time Spark has been used at the heart of a data analysis solution that covers both OLTP and OLAP workloads. In-memory database system Splice Machine was originally built on top of Hadoop components and leveraged them to scale out and be able to run both OLTP and OLAP jobs under the same hood. Version 2.0 of that product added Spark as an OLAP processing engine.

Where SnappyData diverges from Splice Machine, though, is in how Spark is used. SnappyData claims it extends Spark Streaming in several ways, such as allowing streams to be manipulated and queried as though they were tables, including operations like joins.

SnappyData also seems like a good environment to leverage changes that are slated for Apache Spark in the near term. For instance, Spark 2.0, scheduled to come out later this year, will heavily rework how Spark handles memory management and introduce changes to its streaming system that make it easier to pull down streaming data.

[Source:- Infoworld]

Microsoft SQL Server 2016 finally gets a release date

Database fans, start your clocks: Microsoft announced Monday that its new version of SQL Server will be out of beta and ready for commercial release on June 1.

The news means that companies waiting to pick up SQL Server 2016 until its general availability can start planning their adoption.

SQL Server 2016 comes with a suite of new features over its predecessor, including a new Stretch Database function that allows users to store some of their data in a database on-premises and send infrequently used data to Microsoft’s Azure cloud. An application connected to a database using that feature can still see all the data from different sources, though.

Another marquee feature is the new Always Encrypted function, which makes it possible for users to encrypt data at the column level both at rest and in memory. That’s still only scratching the surface of the software, which also supports creating mobile business intelligence dashboards and new functionality for big data applications.

SQL Server 2016 will come in four editions: Enterprise, Standard, Developer and Express. The latter two will be available for free, similar to what Microsoft offered with SQL Server 2014.

In addition to its on-premises release, Microsoft will also have a virtual machine available on June 1 through its Azure cloud platform that will make it easy for companies to deploy SQL Server 2016 in the cloud.

Many of the new features in SQL Server 2016 like Always Encrypted and Stretch Database are already available in Microsoft’s Azure SQL Database managed service, but the virtual machine will be useful for companies that prefer to manage their own database infrastructure or that plan to roll out SQL Server 2016 on premises and want to test it in the cloud.

All of this comes a few months after Microsoft shocked the world by announcing that it would also release SQL Server on Linux in the future. That’s a powerful sign of Microsoft’s strategy of making its tools available to users on a wide variety of platforms, even those that the company doesn’t control.

[Source:- Infoworld]

CrateDB packs NoSQL flexibility, SQL familiarity

CrateDB, an open source, clustered database designed for missions like fast text search and analytics, released its first full 1.0 version last week after three years in development.

It’s built upon several existing open source technologies — Elasticsearch and Lucene, for instance — but no direct knowledge of them is needed to deploy it, because CrateDB is more than a repackaging of those products.

The database caught the attention of InfoWorld’s Peter Wayner back in 2015 because it promised “a search engine like [Apache] Lucene [and ‘its larger, scalable, and distributed cousin Elasticsearch’], but with the structure and querying ease of SQL.”

The idea is to provide more than a full-text search system. CrateDB’s use cases include big data analytics and scalable aggregations across large data sets. It allows querying via standard ANSI SQL, but it uses a distributed, horizontally scalable architecture, so that any number of nodes can be spun up and run side by side with minimal work.

CrateDB gets two major advantages from the NoSQL side. One is support for unstructured data via JSON documents and BLOB storage, with JSON data queryable through SQL as well. Another is support for high-speed writing, to make the database a suitable target for high-speed data ingestion a la Hadoop.

But CrateDB’s biggest draw may be the setup process and the overall level of get-in-and-go usability. The only prerequisite is Java 8, or you can use Docker to run a provided container image. Nodes automatically discover each other as long as they’re on a network that supports multicast. The web UI can bootstrap a cluster with sample data (courtesy of Twitter), and the command-line shell uses conventional SQL syntax for inserting and querying data. Also included is support for PostgreSQL’s wire protocol, although any actual SQL commands sent through it need to adhere to CrateDB’s implementation of SQL.
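
A quick-start sketch using the crate Python client (pip install crate) against a local node gives a feel for that SQL-over-documents blend; the table and fields below are hypothetical:

```python
# A sketch against a local CrateDB node via the crate Python client.
# Table and field names are hypothetical; note the object column that
# lets SQL reach into JSON-style documents.
from crate import client

conn = client.connect("http://localhost:4200")  # default HTTP endpoint
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        id INT PRIMARY KEY,
        payload OBJECT(DYNAMIC)
    )
""")
cur.execute("INSERT INTO sensor_readings (id, payload) VALUES (?, ?)",
            (1, {"unit": "celsius", "value": 21.5}))

# CrateDB makes writes visible after a refresh.
cur.execute("REFRESH TABLE sensor_readings")
cur.execute("SELECT payload['value'] FROM sensor_readings WHERE id = 1")
print(cur.fetchone())
```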

CrateDB is one of a flood of recent database products, each addressing specific issues that have sprung up: scalability, resiliency, mixed modalities (NoSQL vs. SQL, document vs. graph), high-speed writes, and so on. The philosophy behind such products generally runs like this: Existing solutions are too old, hidebound, or legacy-oriented to solve current and future problems, so we need a clean slate. The trick will be to see whether the benefits of the clean slate outweigh the difficulties of moving to it — hence CrateDB’s emphasis on usability and quick starts.

[Source:- Infoworld]

91% off Microsoft Certified Solutions Associate: SQL Server Certification Bundle – Deal Alert

Whether or not you’ve dabbled with queries or databases, earning an MCSA certification will attract the eyes and wallets of company execs across the country and beyond. SQL Server is a go-to platform for implementing data warehouses and for efficiently managing massive amounts of data. In this bundle, currently discounted 91 percent, you’ll get access to three courses:

  • Microsoft 70-461: Querying Microsoft SQL Server 2012
  • Microsoft 70-462: Administering Microsoft SQL Server 2012 Databases
  • Microsoft 70-463: Implementing A Data Warehouse With Microsoft SQL Server 2012

This $438 course bundle is available, for a limited time, for just $35.99. Learn more about this bundle, the courses included, the instructor, and how to purchase.

[Source:- Infoworld]

Microsoft rolls out SQL Server 2016 with a special deal to woo Oracle customers

Microsoft has released SQL Server 2016.

The next version of Microsoft’s SQL Server relational database management system is now available, and along with it comes a special offer designed specifically to woo Oracle customers.

Until the end of this month, Oracle users can migrate their databases to SQL Server 2016 and receive the necessary licenses for free with a subscription to Microsoft’s Software Assurance maintenance program.

Microsoft announced the June 1 release date for SQL Server 2016 early last month. Among the more notable enhancements it brings are updateable, in-memory column stores and advanced analytics. As a result, applications can now deploy sophisticated analytics and machine learning models within the database at performance levels as much as 100 times faster than what they’d be outside it, Microsoft said.

The software’s new Always Encrypted feature helps protect data at rest and in memory, while Stretch Database aims to reduce storage costs while keeping data available for querying in Microsoft’s Azure cloud. The new PolyBase tool allows you to run queries on external data in Hadoop or Azure Blob storage.

Also included are JSON support, “significantly faster” geospatial query support, a feature called Temporal Tables for “traveling back in time” and a Query Store for ensuring performance consistency.
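
Of those, Temporal Tables is the most self-explanatory in code. A hedged sketch via pyodbc (connection details and table names are hypothetical; the T-SQL follows the system-versioning syntax SQL Server 2016 documents):

```python
# A sketch of SQL Server 2016 Temporal Tables: system-versioned history
# plus "time travel" queries. Connection string, table, and timestamp
# are hypothetical.
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 13 for SQL Server};"
                      "SERVER=localhost;DATABASE=demo;UID=sa;PWD=...",
                      autocommit=True)
cur = conn.cursor()

cur.execute("""
CREATE TABLE dbo.Prices (
    Id        INT PRIMARY KEY,
    Price     DECIMAL(10,2),
    ValidFrom DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.PricesHistory))
""")

# "Travel back in time": read the table as it stood at a past instant.
cur.execute("SELECT Id, Price FROM dbo.Prices "
            "FOR SYSTEM_TIME AS OF '2016-06-01T00:00:00'")
print(cur.fetchall())
```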

SQL Server 2016 features were first released in Microsoft Azure and stress-tested through more than 1.7 million Azure SQL DB databases. The software comes in Enterprise and Standard editions along with free Developer and Express versions.

Support for SQL Server 2005 ended in April.

Though Wednesday’s announcement didn’t mention it, Microsoft previously said it’s planning to bring SQL Server to Linux. That version is now due to be released in the middle of next year, Microsoft said.

[Source:- Infoworld]