Whether or not you’ve dabbled with queries or databases, earning an MCSA certification will attract the eyes and wallets of company execs across the States and beyond. SQL Server is a go-to platform for implementing data warehouses and for efficiently managing massive amounts of data. In this bundle, currently discounted 91 percent, you’ll access three courses:
Microsoft 70-461: Querying Microsoft SQL Server 2012
Microsoft 70-462: Administering Microsoft SQL Server 2012 Databases
Microsoft 70-463: Implementing a Data Warehouse with Microsoft SQL Server 2012
This $438 course bundle is available, for a limited time, for just $35.99. Learn more about this bundle, the courses included, the instructor, and how to purchase.
The next version of Microsoft’s SQL Server relational database management system is now available, and along with it comes a special offer designed specifically to woo Oracle customers.
Until the end of this month, Oracle users can migrate their databases to SQL Server 2016 and receive the necessary licenses for free with a subscription to Microsoft’s Software Assurance maintenance program.
Microsoft announced the June 1 release date for SQL Server 2016 early last month. Among the more notable enhancements it brings are updateable, in-memory column stores and advanced analytics. As a result, applications can now deploy sophisticated analytics and machine learning models within the database at performance levels as much as 100 times faster than what they’d be outside it, Microsoft said.
The software’s new Always Encrypted feature helps protect data at rest and in memory, while Stretch Database aims to reduce storage costs while keeping data available for querying in Microsoft’s Azure cloud. A new PolyBase tool allows you to run queries on external data in Hadoop or Azure blob storage.
Also included are JSON support, “significantly faster” geospatial query support, a feature called Temporal Tables for “traveling back in time” and a Query Store for ensuring performance consistency.
SQL Server 2016 features were first released in Microsoft Azure and stress-tested through more than 1.7 million Azure SQL DB databases. The software comes in Enterprise and Standard editions along with free Developer and Express versions.
Support for SQL Server 2005 ended in April.
Though Wednesday’s announcement didn’t mention it, Microsoft previously said it’s planning to bring SQL Server to Linux. That version is now due to be released in the middle of next year, Microsoft said.
Apache Spark, the extremely popular data analytics execution engine, was initially released in 2012. It wasn’t until 2015 that Spark really saw an uptick in support, but by November 2015 Spark was seeing 50 percent more activity than the core Apache Hadoop project itself, with more than 750 contributors from hundreds of companies participating in its development in one form or another.
Spark is a hot new commodity for a reason. Its performance, general-purpose applicability, and programming flexibility combine to make it a versatile execution engine. Yet that variety also leads to varying levels of support for the product and different ways solutions are delivered.
While evaluating analytic software products that support Spark, customers should look closely under the hood and examine four key facets of how the support for Spark is implemented:
How Spark is utilized inside the platform
What you get in a packaged product that includes Spark
How Spark is exposed to you and your team
How you perform analytics with the different Spark libraries
Spark can be used as a developer tool via its APIs, or it can be used by BI tools via its SQL interface. Or Spark can be embedded in an application, providing access to business users without requiring programming skills and without limiting Spark’s utility through a SQL interface. I examine each of these options below and explain why all Spark support is not the same.
Programming on Spark
If you want the full power of Spark, you can program directly to its processing engine. There are APIs that are exposed through Java, Python, Scala, and R. In addition to stream and graph processing components, Spark offers a machine-learning library (MLlib) as well as Spark SQL, which allows data tools to connect to a Spark engine and query structured data, or programmers to access data via SQL queries they write themselves.
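To make that programming model concrete, here is a minimal sketch in plain Python. The toy `LocalRDD` class below imitates the chained-transformation shape of Spark’s RDD API on an ordinary list; it is an illustration of the style, not Spark itself, and the class name and data are invented:

```python
from functools import reduce

class LocalRDD:
    """Toy stand-in for a Spark RDD: same chained-transformation shape,
    but it runs over a plain Python list instead of a cluster."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return LocalRDD(f(x) for x in self.data)

    def filter(self, pred):
        return LocalRDD(x for x in self.data if pred(x))

    def reduce(self, f):
        return reduce(f, self.data)

lines = ["spark makes analytics fast", "spark is flexible"]
word_count = (LocalRDD(lines)
              .map(str.split)              # tokenize each line
              .map(len)                    # words per line
              .reduce(lambda a, b: a + b)) # total across lines
print(word_count)  # → 7
```

In real Spark code the same pipeline would run against a cluster-backed RDD or DataFrame, but the developer-facing shape — small composable transformations ending in an action — is the same, which is exactly why it still demands programming skill.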
A number of vendors offer standalone Spark implementations; the major Hadoop distribution suppliers also offer Spark within their platforms. Access is exposed either through a command-line or notebook interface.
But performing analytics on core Spark with its APIs is a time-consuming, programming-intensive process. While Spark offers an easier programming model than, say, native Hadoop, it still requires developers. Even for organizations with developer resources, deploying them to work on lengthy data analytics projects may amount to an intolerable hidden cost. For many organizations, programming on Spark is not an actionable course for this reason.
BI on Spark
Spark SQL is a standards-based way to access data in Spark. It has been relatively easy for BI products to add support for Spark SQL to query tabular data in Spark. The dialect of SQL used by Spark is similar to that of Apache Hive, making Spark SQL akin to earlier SQL-on-Hadoop technologies.
Although Spark SQL uses the Spark engine behind the scenes, it suffers from the same disadvantages as Hive and Impala: Data must be in a structured, tabular format to be queried. This forces Spark to be treated as if it were a relational database, which cripples many of the advantages of a big data engine. Simply put, running BI on top of Spark requires transforming the data into a reasonable tabular format that the BI tools can consume.
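What that transformation looks like in miniature: a hypothetical sketch in Python that flattens nested event records into the one-row-per-fact tabular shape a BI tool expects. The field names and records are invented for illustration:

```python
# Nested event records, as they might land in a data lake.
events = [
    {"user": {"id": 1, "country": "US"},
     "purchases": [{"sku": "A", "amount": 9.99}]},
    {"user": {"id": 2, "country": "DE"},
     "purchases": [{"sku": "B", "amount": 4.50},
                   {"sku": "C", "amount": 12.00}]},
]

# Flatten to one row per purchase so a SQL layer or BI tool can query it.
rows = [
    {"user_id": e["user"]["id"], "country": e["user"]["country"],
     "sku": p["sku"], "amount": p["amount"]}
    for e in events
    for p in e["purchases"]
]
print(len(rows))  # → 3 flat rows from 2 nested documents
```

Every nesting level that gets flattened away is modeling work someone has to do before the BI tool can run its first query — which is the hidden cost the paragraph above describes.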
Embedding Spark

Another way to leverage Spark is to abstract away its complexity by embedding it deep into a product and taking full advantage of its power behind the scenes. This lets users tap the speed and power of Spark without needing developers.
This architecture brings up three key questions. First, does the platform truly hide all of the technical complexities of Spark? As a customer, you need to examine all aspects of how you would create each step of the analytic cycle — integration, preparation, analysis, visualization, and operationalization. A number of products offer self-service capabilities that abstract away Spark’s complexities, but others force the analyst to dig down and code — for example, in performing integration and preparation. These products may also require you to first ingest all your data into the Hadoop file system for processing. This adds extra length to your analytic cycles, creates fragile and fragmented analytic processes, and requires specialized skills.
Second, how does the platform take advantage of Spark? It’s critical to understand how Spark is used in the execution framework. Spark is sometimes embedded in a fashion that does not have the full scalability of a true cluster. This can limit overall performance as the volume of analytic jobs increases.
Third, how are you protected for the future? The strength of being tightly coupled with the Spark engine is also a weakness. The big data industry moves quickly. MapReduce was the predominant engine in Hadoop for six years. Apache Tez became mainstream in 2013, and now Spark has become a major engine. Assuming the technology curve continues to produce new engines at the same rate, Spark will almost certainly be supplanted by a new engine within 18 months, forcing products tightly coupled to Spark to be reengineered — a far from trivial undertaking. Even setting that effort aside, you must consider whether the redesigned product will be fully compatible with what you’ve built in the older version.
The first step to uncovering the full power of Spark is to understand that not all Spark support is created equal. It’s crucial that organizations grasp the differences in Spark implementations and what each approach means for their overall analytic workflow. Only then can they make a strategic buying decision that will meet their needs over the long haul.
A publicly disclosed vulnerability in the MySQL database could allow attackers to completely compromise some servers.
The vulnerability affects “all MySQL servers in default configuration in all version branches (5.7, 5.6, and 5.5) including the latest versions,” as well as the MySQL-derived databases MariaDB and Percona DB, according to Dawid Golunski, the researcher who found it.
The flaw, tracked as CVE-2016-6662, can be exploited to modify the MySQL configuration file (my.cnf) and cause an attacker-controlled library to be executed with root privileges if the MySQL process is started with the mysqld_safe wrapper script.
The exploit can be executed if the attacker has an authenticated connection to the MySQL service, which is common in shared hosting environments, or through an SQL injection flaw, a common type of vulnerability in websites.
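For readers less familiar with the SQL injection vector mentioned above, a minimal sketch using Python’s stdlib sqlite3 (not MySQL, but the principle is identical) shows how concatenated user input rewrites a query while a bound parameter does not. The table and payload are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'user')")

user_input = "x' OR '1'='1"  # classic injection payload

# Vulnerable: string concatenation lets the payload become part of the SQL,
# turning the WHERE clause into a tautology that matches every row.
injected = conn.execute(
    "SELECT name FROM users WHERE name = '" + user_input + "'").fetchall()

# Safe: a bound parameter is treated strictly as data, never as SQL.
bound = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)).fetchall()

print(len(injected), len(bound))  # → 2 0
```

In the CVE-2016-6662 scenario, an injection flaw like the first query is one of the two footholds (alongside an authenticated connection) that lets an attacker reach the vulnerable configuration-file logic at all.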
Golunski reported the vulnerability to the developers of all three affected database servers, but so far only MariaDB and Percona DB have received patches. Oracle, which develops MySQL, was informed on July 29, according to the researcher, but has yet to fix the flaw.
Oracle releases security updates on a quarterly schedule, and the next batch is expected in October. However, since the MariaDB and Percona patches have been public since the end of August, the researcher decided to release details about the vulnerability Monday so that MySQL admins can take action to protect their servers.
Golunski’s advisory contains a limited proof-of-concept exploit, but some parts have been intentionally left out to prevent widespread abuse. The researcher also reported a second vulnerability to Oracle, CVE-2016-6663, that could further simplify the attack, but he hasn’t published details about it yet.
The disclosure of CVE-2016-6662 was met with some criticism on specialized discussion forums, where some users argued that it’s actually a privilege escalation vulnerability and not a remote code execution one as described, because an attacker would need some level of access to the database.
“As temporary mitigations, users should ensure that no mysql config files are owned by mysql user, and create root-owned dummy my.cnf files that are not in use,” Golunski said in his advisory. “These are by no means a complete solution and users should apply official vendor patches as soon as they become available.”
Oracle didn’t immediately respond to a request for comment on the vulnerability.
Snowflake, the cloud-based data warehouse solution co-founded by Microsoft alumnus Bob Muglia, is lowering storage prices and adding a self-service option, meaning prospective customers can open an account with nothing more than a credit card.
These changes also raise an intriguing question: How long can a service like Snowflake expect to reside on Amazon, which itself offers services that are more or less in direct competition — and where the raw cost of storage undercuts Snowflake’s own pricing for the same?
Open to the public
The self-service option, called Snowflake On Demand, is a change from Snowflake’s original sales model. Rather than calling a sales representative to set up an account, Snowflake users can now provision services themselves with no more effort than would be needed to spin up an AWS EC2 instance.
In a phone interview, Muglia explained that the reason for only now transitioning to this model was more technical than anything else. Before self-service could be offered, Snowflake had to put safeguards in place to protect both the service itself and its customers from everything from malice (denial-of-service attacks) to incompetence (honest customers submitting massively malformed queries).
“We wanted to make sure we had appropriately protected the system,” Muglia said, “before we opened it up to anyone, anywhere.”
This effort was further complicated by Snowflake’s relative lack of hard usage limits, which Muglia characterized as being one of its major standout features. “There is no limit to the number of tables you can create,” Muglia said, but he further pointed out that Snowflake has to strike a balance between what it can offer any one customer and protecting the integrity of the service as a whole.
“We get some crazy SQL queries coming in our direction,” Muglia said, “and regardless of what comes in, we need to continue to perform appropriately for that customer as well as other customers. We see SQL queries that are a megabyte in size — the query statements [themselves] are a megabyte in size.” (Many such queries are poorly formed, auto-generated SQL, Muglia claimed.)
Fewer costs, more competition
The other major change is a reduction in storage pricing for the service — $30/TB/month for capacity storage, $50/TB/month for on-demand storage, and uncompressed storage at $10/TB/month.
It’s enough of a reduction in price that Snowflake will be unable to rely on storage costs as a revenue source, since those prices barely pay for the use of Amazon’s services as a storage provider. But Muglia is confident Snowflake is profitable enough overall that such a move won’t impact the company’s bottom line.
“We did the data modeling on this,” said Muglia, “and our margins were always lower on storage than on compute running queries.”
According to the studies Snowflake performed, “when customers put more data into Snowflake, they run more queries…. In almost every scenario you can imagine, they were very much revenue-positive and gross-margin neutral, because people run more queries.”
The long-term implications for Snowflake continuing to reside on Amazon aren’t clear yet, especially since Amazon might well be able to undercut Snowflake by directly offering competitive services.
Muglia, though, is confident that Snowflake’s offering is singular enough to stave off competition for a good long time, and is ready to change things up if need be. “We always look into the possibility of moving to other cloud infrastructures,” Muglia said, “although we don’t have plans to do it right now.”
He also noted that Snowflake competes with Amazon’s Redshift right now, but “we have a very different shape of product relative to Redshift…. Snowflake is storing multiple petabytes of data and is able to run hundreds of simultaneous concurrent queries. Redshift can’t do that; no other product can do that. It’s that differentiation that allows us to effectively compete with Amazon, and for that matter Google and Microsoft and Oracle and Teradata.”
SQL Server Analysis Services, one of the key features of Microsoft’s relational database enterprise offering, is going to the cloud. The company announced Tuesday that it’s launching the public beta of Azure Analysis Services, which gives users cloud-based access to semantic data modeling tools.
The news is part of a host of announcements the company is making at the Professional Association for SQL Server Summit in Seattle this week. On top of the new cloud service, Microsoft also released new tools for migrating to the latest version of SQL Server and an expanded free trial for Azure SQL Data Warehouse. On the hardware side, the company revealed a new reference architecture for using SQL Server 2016 with active data sets of up to 145TB.
The actions are all part of Microsoft’s continued investment in the company’s relational database product at a time when it’s trying to get customers to move to its cloud.
Azure Analysis Services is designed to help companies get the benefits of cloud processing for semantic data modeling, while still being able to glean insights from data that’s stored either on-premises or in the public cloud. It’s compatible with databases like SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Oracle and Teradata. Customers that already use SQL Server Analysis Services in their private data centers can take the models from that deployment and move them to Azure, too.
One of the key benefits to using Azure Analysis Services is that it’s a fully managed service. Microsoft deals with the work of figuring out the compute resources underpinning the functionality, and users can just focus on the data.
Like its on-premises predecessor, Azure Analysis Services integrates with Microsoft’s Power BI data visualization tools, providing additional modeling capabilities that go beyond what that service can offer. Azure AS can also connect to other business intelligence software, like Tableau.
Microsoft also is making it easier to migrate from an older version of its database software to SQL Server 2016. To help companies evaluate the difference between their old version of SQL Server and the latest release, Microsoft has launched the Database Experimentation Assistant.
Customers can use the assistant to run experiments across different versions of the software, so they can see what, if any, benefits they’ll get out of the upgrade process while also reducing risk. The Data Migration Assistant, which is meant to help move workloads, is also being upgraded.
For companies that have large amounts of data they want to store in a cloud database, Microsoft is offering an expanded free trial of Azure SQL Data Warehouse. Users can sign up starting on Tuesday, and get a free month of use. Those customers who want to give it a shot will have to move quickly, though: Microsoft is only taking trial sign-ups until December 31.
Microsoft Corporate Vice President Joseph Sirosh said in an interview that the change to the Azure SQL Data Warehouse trial was necessary because setting up the system to work with actual data warehouse workloads would blow through the typical Azure free trial. Giving people additional capacity to work with should let them have more of an opportunity to test the service before committing to a large deployment.
All of this SQL news comes a little more than a month before AWS re:Invent, Amazon’s big cloud conference in Las Vegas. It’s likely that we’ll see Amazon unveil some new database products at that event, continuing the ongoing cycle of competition among database vendors in the cloud.
Apache Ignite is an in-memory computing platform that can be inserted seamlessly between a user’s application layer and data layer. Apache Ignite loads data from the existing disk-based storage layer into RAM, improving performance by as much as six orders of magnitude (1 million-fold).
The in-memory data capacity can be easily scaled to handle petabytes of data simply by adding more nodes to the cluster. Further, both ACID transactions and SQL queries are supported. Ignite delivers performance, scale, and comprehensive capabilities far above and beyond what traditional in-memory databases, in-memory data grids, and other in-memory-based point solutions can offer by themselves.
Apache Ignite does not require users to rip and replace their existing databases. It works with RDBMS, NoSQL, and Hadoop data stores. Apache Ignite enables high-performance transactions, real-time streaming, and fast analytics in a single, comprehensive data access and processing layer. It uses a distributed, massively parallel architecture on affordable, commodity hardware to power existing or new applications. Apache Ignite can run on premises, on cloud platforms such as AWS and Microsoft Azure, or in a hybrid environment.
The Apache Ignite unified API supports SQL, C++, .Net, Java, Scala, Groovy, PHP, and Node.js. The unified API connects cloud-scale applications with multiple data stores containing structured, semistructured, and unstructured data. It offers a high-performance data environment that allows companies to process full ACID transactions and generate valuable insights from real-time, interactive, and batch queries.
Users can keep their existing RDBMS in place and deploy Apache Ignite as a layer between it and the application layer. Apache Ignite automatically integrates with Oracle, MySQL, Postgres, DB2, Microsoft SQL Server, and other RDBMSes. The system automatically generates the application domain model based on the schema definition of the underlying database, then loads the data. In-memory databases typically provide only a SQL interface, whereas Ignite supports a wider group of access and processing paradigms in addition to ANSI SQL. Apache Ignite supports key/value stores, SQL access, MapReduce, HPC/MPP processing, streaming/CEP processing, clustering, and Hadoop acceleration in a single integrated in-memory computing platform.
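The schema-driven generation idea can be sketched in a few lines of Python — here against stdlib sqlite3 rather than Ignite’s actual generator, with an invented `person` table — by introspecting the table definition, deriving a record type from it, and loading the rows into an in-memory cache:

```python
import sqlite3
from collections import namedtuple

# Stand-in for the "underlying database" with an existing schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Ann', 'Oslo'), (2, 'Ben', 'Lima')")

# Introspect the schema and generate a domain record type from it --
# a toy version of "generate the application domain model, then load the data".
columns = [row[1] for row in conn.execute("PRAGMA table_info(person)")]
Person = namedtuple("Person", columns)

# Load the table into an in-memory key-value cache keyed by primary key.
cache = {p.id: p for p in (Person(*row) for row in conn.execute("SELECT * FROM person"))}
print(cache[2].name)  # → Ben
```

Ignite performs the equivalent steps automatically against Oracle, MySQL, Postgres, DB2, or SQL Server, so the existing RDBMS keeps its role as the system of record while reads hit memory.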
GridGain Systems donated the original code for Apache Ignite to the Apache Software Foundation in the second half of 2014. Apache Ignite was rapidly promoted from an incubating project to a top-level Apache project in 2015. In the second quarter of 2016, Apache Ignite was downloaded nearly 200,000 times. It is used by organizations around the world.
Apache Ignite is JVM-based distributed middleware based on a homogeneous cluster topology implementation that does not require separate server and client nodes. All nodes in an Ignite cluster are equal, and they can play any logical role per runtime application requirement.
A service provider interface (SPI) design is at the core of Apache Ignite. The SPI-based design makes every internal component of Ignite fully customizable and pluggable. This enables tremendous configurability of the system, with adaptability to any existing or future server infrastructure.
Apache Ignite also provides direct support for parallelization of distributed computations based on fork-join, MapReduce, or MPP-style processing. Ignite uses distributed parallel computations extensively, and they are fully exposed at the API level for user-defined functionality.
In-memory data grid. Apache Ignite includes an in-memory data grid that handles distributed in-memory data management, including ACID transactions, failover, advanced load balancing, and extensive SQL support. The Ignite data grid is a distributed, object-based, ACID transactional, in-memory key-value store. In contrast to traditional database management systems, which utilize disk as their primary storage mechanism, Ignite stores data in memory. By utilizing memory rather than disk, Apache Ignite is up to 1 million times faster than traditional databases.
SQL support. Apache Ignite supports free-form ANSI SQL-99 compliant queries with virtually no limitations. Ignite can use any SQL function, aggregation, or grouping, and it supports distributed, noncolocated SQL joins and cross-cache joins. Ignite also supports the concept of field queries to help minimize network and serialization overhead.
In-memory compute grid. Apache Ignite includes a compute grid that enables parallel, in-memory processing of CPU-intensive or other resource-intensive tasks such as traditional HPC, MPP, fork-join, and MapReduce processing. Support is also provided for standard Java ExecutorService asynchronous processing.
In-memory service grid. The Apache Ignite service grid provides complete control over services deployed on the cluster. Users can control how many service instances should be deployed on each cluster node, ensuring proper deployment and fault tolerance. The service grid guarantees continuous availability of all deployed services in case of node failures. It also supports automatic deployment of multiple instances of a service, of a service as a singleton, and of services on node startup.
In-memory streaming. In-memory stream processing addresses a large family of applications for which traditional processing methods and disk-based storage, such as disk-based databases or file systems, are inadequate. These applications are extending the limits of traditional data processing infrastructures.
Streaming support allows users to query rolling windows of incoming data. This enables users to answer questions such as “What are the 10 most popular products over the last hour?” or “What is the average price in a certain product category for the past 12 hours?”
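A rolling-window query of this kind can be sketched in plain Python — an illustration of the concept, not Ignite’s streaming API — by keeping a deque of timestamped events and expiring anything older than the window before answering:

```python
from collections import Counter, deque

WINDOW = 3600             # the rolling one-hour window, in seconds

events = deque()          # (timestamp, product) pairs currently in the window
counts = Counter()        # live per-product counts over that window

def record(ts, product):
    """Ingest one purchase event and expire anything outside the window."""
    events.append((ts, product))
    counts[product] += 1
    while events and events[0][0] <= ts - WINDOW:
        _, old_product = events.popleft()
        counts[old_product] -= 1

for ts, product in [(0, "tv"), (10, "tv"), (20, "phone"), (4000, "phone")]:
    record(ts, product)

# "What is the most popular product over the last hour?"
print(counts.most_common(1))  # → [('phone', 1)]
```

The two TV purchases and the early phone purchase fall out of the window by the time the last event arrives, so the query reflects only the most recent hour — the same semantics a streaming engine provides, just at toy scale.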
Another common stream processing use case is pipelining a distributed events workflow. As events are coming into the system at high rates, the processing of events is split into multiple stages, each of which has to be properly routed within a cluster for processing. These customizable event workflows support complex event processing (CEP) applications.
In-memory Hadoop acceleration. The Apache Ignite Accelerator for Hadoop enables fast data processing in existing Hadoop environments via the tools and technology an organization is already using.
Ignite in-memory Hadoop acceleration is based on the first dual-mode, high-performance in-memory file system that is 100 percent compatible with Hadoop HDFS, plus an in-memory optimized MapReduce implementation. Delivering up to 100 times faster performance, in-memory HDFS and in-memory MapReduce provide easy-to-use extensions to disk-based HDFS and traditional MapReduce. This plug-and-play feature requires minimal to no integration. It works with any open source or commercial version of Hadoop 1.x or Hadoop 2.x, including Cloudera, Hortonworks, MapR, Apache, Intel, and AWS. The result is up to 100-fold faster performance for MapReduce and Hive jobs.
Distributed in-memory file system. A unique feature of Apache Ignite is the Ignite File System (IGFS), which is a file system interface to in-memory data. IGFS delivers similar functionality to Hadoop HDFS. It includes the ability to create a fully functional file system in memory. IGFS is at the core of the Apache Ignite In-Memory Accelerator for Hadoop.
The data from each file is split on separate data blocks and stored in cache. Data in each file can be accessed with a standard Java streaming API. For each part of the file, a developer can calculate an affinity and process the file’s content on corresponding nodes to avoid unnecessary networking.
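A rough sketch of the block-split-plus-affinity idea, in plain Python rather than IGFS itself (the block size, node names, and hash-based affinity function are all invented for illustration):

```python
import hashlib

BLOCK_SIZE = 4                          # bytes per block; tiny, for illustration
NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster nodes

data = b"hello ignite file system"

# Split the file into fixed-size blocks, as IGFS does with cached file data.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def affinity(block_index):
    """Map a block to a node deterministically, so computation over that
    part of the file can be routed to wherever the block lives."""
    digest = hashlib.md5(str(block_index).encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

placement = {i: affinity(i) for i in range(len(blocks))}
print(len(blocks))  # → 6 blocks, each assigned to one node
```

Because the mapping is deterministic, any node can compute where a given block resides and ship the work to it, which is the networking saving the paragraph above describes.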
Unified API. The Apache Ignite unified API supports a wide variety of common protocols for the application layer to access data. Supported protocols include SQL, Java, C++, .Net, PHP, MapReduce, Scala, Groovy, and Node.js. Ignite supports several protocols for client connectivity to Ignite clusters, including Ignite Native Clients, REST/HTTP, SSL/TLS, and Memcached.
Advanced clustering. Apache Ignite provides one of the most sophisticated clustering technologies on JVMs. Ignite nodes can automatically discover each other, which helps scale the cluster when needed without having to restart the entire cluster. Developers can also take advantage of Ignite’s hybrid cloud support, which allows users to establish connections between private clouds and public clouds such as AWS or Microsoft Azure.
Additional features. Apache Ignite provides high-performance, clusterwide messaging functionality. It allows users to exchange data via publish-subscribe and direct point-to-point communication models.
The distributed events functionality in Ignite allows applications to receive notifications about cache events occurring in a distributed grid environment. Developers can use this functionality to be notified about the execution of remote tasks or any cache data changes within the cluster. Event notifications can be grouped and sent in batches and at timely intervals. Batching notifications helps attain high cache performance and low latency.
Ignite allows for most of the data structures from the java.util.concurrent framework to be used in a distributed fashion. For example, you could add to a double-ended queue (java.util.concurrent.BlockingDeque) on one node and poll it from another node. Or you could have a distributed primary key generator, which would guarantee uniqueness on all nodes.
Ignite distributed data structures include support for these standard Java APIs: Concurrent map, distributed queues and sets, AtomicLong, AtomicSequence, AtomicReference, and CountDownLatch.
Apache Spark. Apache Spark is a fast, general-purpose engine for large-scale data processing. Ignite and Spark are complementary in-memory computing solutions. They can be used together in many instances to achieve superior performance and functionality.
Apache Spark and Apache Ignite address somewhat different use cases and rarely compete for the same task. The table below outlines some of the key differences.
Apache Spark doesn’t provide shared storage, so data from HDFS or other disk storage must be loaded into Spark for processing. State can be passed from Spark job to job only by saving the processed data back into external storage. Ignite can share Spark state directly in memory, without storing the state to disk.
One of the main integrations for Ignite and Spark is the Apache Ignite Shared RDD API. Ignite RDDs are essentially wrappers around Ignite caches that can be deployed directly inside of executing Spark jobs. Ignite RDDs can also be used with the cache-aside pattern, where Ignite clusters are deployed separately from Spark, but still in-memory. The data is still accessed using Spark RDD APIs.
Spark supports a fairly rich SQL syntax, but it doesn’t support data indexing, so it must do full scans all the time. Spark queries may take minutes even on moderately small data sets. Ignite supports SQL indexes, resulting in much faster queries, so using Spark with Ignite can accelerate Spark SQL more than 1,000-fold. The result set returned by Ignite Shared RDDs also conforms to the Spark Dataframe API, so it can be further analyzed using standard Spark dataframes. Both Spark and Ignite natively integrate with Apache YARN and Apache Mesos, so it’s easier to use them together.
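The index-versus-full-scan gap is easy to demonstrate in miniature. This sketch in plain Python (a stand-in for the two engines, not code from either) compares scanning every row per query against a hash index built once up front:

```python
import random
import timeit

random.seed(0)
rows = [{"id": i, "city": random.choice(["oslo", "lima", "pune"])}
        for i in range(50_000)]

# Full scan, as an engine without indexes must do: touch every row per query.
def scan(city):
    return [r for r in rows if r["city"] == city]

# Hash index, built once; subsequent queries jump straight to the matches.
index = {}
for r in rows:
    index.setdefault(r["city"], []).append(r)

def lookup(city):
    return index.get(city, [])

assert scan("lima") == lookup("lima")  # same answer either way
t_scan = timeit.timeit(lambda: scan("lima"), number=20)
t_index = timeit.timeit(lambda: lookup("lima"), number=20)
print(t_scan > t_index)  # the per-query full scan is the slow path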
When working with files instead of RDDs, it’s still possible to share state between Spark jobs and applications using the Ignite In-Memory File System (IGFS). IGFS implements the Hadoop FileSystem API and can be deployed as a native Hadoop file system, exactly like HDFS. Ignite plugs in natively to any Hadoop or Spark environment. IGFS can be used with zero code changes in plug-and-play fashion.
Apache Cassandra. Apache Cassandra can serve as a high-performance solution for structured queries. But the data in Cassandra should be modeled such that each predefined query results in one row retrieval. Thus, you must know what queries will be required before modeling the data.
Those who wondered what it would be like to run Microsoft SQL Server on Linux now have an answer. Microsoft has released the first public preview of the long-promised product.
Microsoft also wants to make clear this isn’t a “SQL Server Lite” for those satisfied with a reduced feature set. Microsoft has a four-point plan to make this happen.
First is through broad support for all major enterprise-grade Linux editions: Red Hat Enterprise Linux, Ubuntu Linux, and soon Suse Linux Enterprise Server. “Support” means behaving like other Linux applications on the distributions, not requiring a Microsoft-only methodology for installing or running the app. An introductory video depicts SQL Server installed on RHEL through the system’s yum package manager, and a white paper describes launching SQL Server’s services via systemd.
Second, Microsoft promises the full set of SQL Server 2016’s features for Linux users—not only support for the T-SQL command set, but high-end items like in-memory OLTP, always-on encryption, and row-level security. It will be a first-class citizen on Linux, as SQL Server has been on Windows itself.
Third is Linux support for the tooling around SQL Server—not SQL Server Management Studio alone, but also the Migration Assistant for relocating workloads to Linux systems and the sqlps PowerShell module. This last item is in line with a possibility introduced when PowerShell was initially open-sourced: Once ported to Linux, it would become part of the support structure for other big-name Microsoft applications as they, too, showed up on the OS. That’s now happening.
By bringing SQL Server to Linux, Microsoft can compete more directly with Oracle, which has long provided its product on Linux. Oracle may be blunting the effects of the strategy by shifting customers toward a cloud-based service model, but any gains are likely to be hard-won.
The other, immediate benefit is to provide Microsoft customers with more places to run SQL Server. Enterprises have historically run mixes of Linux and Windows systems, and SQL Server on Linux will let them shave the costs of running some infrastructure.
Most of all, Microsoft is striving to prove a Microsoft shop can lose little, and preferably nothing, by making a switch—and a new shop eyeing SQL Server has fewer reasons to opt for a competing database that’s Linux-first.
CrateDB, an open source, clustered database designed for workloads like fast text search and analytics, released its first full 1.0 version last week after three years in development.
It’s built upon several existing open source technologies — Elasticsearch and Lucene, for instance — but no direct knowledge of them is needed to deploy it, as CrateDB offers more than a repackaging of those products.
The database caught the attention of InfoWorld’s Peter Wayner back in 2015 because it promised “a search engine like [Apache] Lucene [and ‘its larger, scalable, and distributed cousin Elasticsearch’], but with the structure and querying ease of SQL.”
The idea is to provide more than a full-text search system. CrateDB’s use cases include big data analytics and scalable aggregations across large data sets. It allows querying via standard ANSI SQL, but it uses a distributed, horizontally scalable architecture, so that any number of nodes can be spun up and run side by side with minimal work.
CrateDB gets two major advantages from the NoSQL side. One is support for unstructured data via JSON documents and BLOB storage, with JSON data queryable through SQL as well. Another is support for high-speed writing, to make the database a suitable target for high-speed data ingestion a la Hadoop.
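The "JSON queryable through SQL" idea can be approximated, purely as an illustration, with SQLite's built-in JSON functions (this assumes a SQLite build with the JSON1 functions, which ships with recent Python distributions, and it is not CrateDB's own dialect):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc TEXT)")

# Store unstructured JSON documents as-is...
docs = [{"user": "ada", "kind": "login"},
        {"user": "bob", "kind": "login"},
        {"user": "ada", "kind": "purchase", "amount": 42}]
conn.executemany("INSERT INTO events VALUES (?)",
                 [(json.dumps(d),) for d in docs])

# ...then reach into them with ordinary SQL, no fixed schema required.
rows = conn.execute(
    "SELECT json_extract(doc, '$.user') AS u, COUNT(*) "
    "FROM events WHERE json_extract(doc, '$.kind') = 'login' "
    "GROUP BY u ORDER BY u").fetchall()
print(rows)
```

CrateDB's pitch is essentially this convenience—schemaless documents addressed with familiar SQL—spread across a horizontally scaled cluster rather than a single file.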
But CrateDB’s biggest draw may be the setup process and the overall level of get-in-and-go usability. The only prerequisite is Java 8, or you can use Docker to run a provided container image. Nodes automatically discover each other as long as they’re on a network that supports multicast. The web UI can bootstrap a cluster with sample data (courtesy of Twitter), and the command-line shell uses conventional SQL syntax for inserting and querying data. Also included is support for PostgreSQL’s wire protocol, although any actual SQL commands sent through it need to adhere to CrateDB’s implementation of SQL.
CrateDB is one of a flood of recent database products, each addressing specific issues that have sprung up: scalability, resiliency, mixing modalities (NoSQL vs. SQL, document vs. graph), high-speed writes, and so on. The philosophy behind such products generally runs like this: Existing solutions are too old, hidebound, or legacy-oriented to solve current and future problems, so we need a clean slate. The trick will be to see whether the benefits of the clean slate outweigh the difficulties of moving to it — hence CrateDB's emphasis on usability and quick starts.
A new automated setup procedure will allow users to configure SQL Server on Microsoft Azure without the aid of a database administrator.
“The new wizard for building and configuring a new virtual machine with SQL Server 2014 is very well put together,” said Denny Cherry, founder and principal consultant for Denny Cherry and Associates Consulting. “It helps solve a lot of the complexity of building a new SQL Server, specifically around how you need to configure the storage in order to get a high-performing SQL Server VM.”
Joseph D’Antoni, principal consultant at Denny Cherry and Associates Consulting, said that one of the major challenges with Azure was allocating storage. For instance, he said, to configure SQL Server on an Azure VM, you needed to allocate disks manually to get the needed amount of IOPS. This meant you had to know exactly what your application’s storage needs were for optimal performance, and many people were “kind of guessing,” D’Antoni said. With the new wizard, all you have to do is enter the required number of IOPS and storage is allocated automatically.
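The manual arithmetic the wizard replaces looked roughly like this: divide the target IOPS by the per-disk limit and round up, then stripe that many data disks together. (The 500-IOPS-per-disk figure matches Azure standard-storage data disks of that era; treat all numbers as illustrative.)

```python
import math

def disks_needed(required_iops, iops_per_disk=500):
    """How many data disks to stripe together to hit a target IOPS."""
    return math.ceil(required_iops / iops_per_disk)

print(disks_needed(4000))   # 8 disks at 500 IOPS each
print(disks_needed(4100))   # partial disks round up: 9
```

Guessing either number wrong meant re-provisioning storage later, which is exactly the pain the wizard's "enter the required IOPS" field removes.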
Automating SQL Server setup for an Azure VM means that no longer does everything have to be done manually: Connectivity, performance, security and storage are configured automatically during setup. “I think it does simplify what was a pretty complex process,” D’Antoni said.
You can now use the Internet to set up SQL Server connectivity and enable SQL Server authentication through the Azure Web portal. Previously, connecting SQL Server to an Azure VM via the Internet required a multistep process using SQL Server Management Studio. The new automated configuration process lets you pick whether to expand connectivity to the whole Azure virtual network or to connect only within the individual VM.
The new process for configuring SQL Server on an Azure virtual machine also includes automated patching and automated backup. The automated patching allows you to pick a time when you want all your patches to occur, so users can schedule patches to minimize the impact they’ll have on the workload. Automated backup allows you to specify how long to keep backups.
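A retention setting of that kind boils down to pruning anything older than the configured window. A minimal, hypothetical sketch of the bookkeeping (not Azure's actual implementation):

```python
from datetime import date, timedelta

def backups_to_prune(backup_dates, retention_days, today):
    """Return the backups that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [d for d in backup_dates if d < cutoff]

backups = [date(2016, 11, d) for d in (1, 10, 20, 28)]
old = backups_to_prune(backups, retention_days=14, today=date(2016, 11, 30))
print(old)  # the Nov 1 and Nov 10 backups miss the 14-day window
```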
“I think that these are a great enhancement on the old process of having to know how to configure these components manually within the VM,” Cherry said, “because these configurations can get tricky to configure.”
D’Antoni added that this innovation is going to affect smaller organizations the most, because it means that they won’t need an expert to move SQL Server onto an Azure virtual machine. “[The simplified configuration] gives the power to someone who is deploying a VM when they would have needed an administrator or a DBA before. To that extent, it’s kind of a big deal.”