Complex 3-D data on all devices

A new web-based software platform is swiftly bringing the visualization of 3-D data to every device, optimizing the use of, for example, virtual reality and augmented reality in industry. In this way, Fraunhofer researchers have brought the ideal of “any data on any device” a good deal closer.

If you want to be sure that the person you are sending documents and pictures to will be able to open them on their computer, then you send them in PDF and JPG format. But what do you do with 3-D content? “A standardized option hasn’t existed before now,” says Dr. Johannes Behr, head of the Visual Computing System Technologies department at the Fraunhofer Institute for Computer Graphics Research IGD. In particular, industry lacks a means of taking the very large, increasingly complex volumes of 3-D data that arise and rendering them useful – and of being able to use the data on every device, from smartphones to VR goggles. “The data volume is growing faster than the means of visualizing it,” reports Behr. Fraunhofer IGD is presenting a solution to this problem in the form of its “instant3DHub” software, which allows engineers, technicians and assemblers to use spatial design and assembly plans without any difficulty on their own devices. “This will enable them to inspect industrial plants or digital buildings, etc. in real time and find out what’s going on there,” explains Behr.

Software calculates only visible components

On account of the gigantic volumes of data that have to be processed, such an undertaking has thus far been either impossible or possible only with a tremendous amount of effort. After all, users had to manually choose in advance which data should be processed for the visualization, a task then executed by expensive special software. Not exactly a cost-effective method, and a time-consuming one as well. With the web-based Fraunhofer solution, every company can adapt the visualization tool to its requirements. The software autonomously selects the data to be prepared, by intelligently calculating, for example, that only views of visible parts are transmitted to the user’s device. Citing the example of a power plant, Behr explains: “Out of some 3.5 million components, only the approximately 3,000 visible parts are calculated on the server and transmitted to the device.”

Such visibility calculations are especially useful for VR and AR applications, as the objects being viewed at any given moment appear in the display in real time. At CeBIT, researchers will be showing how well this works, using the example of car maintenance. In a VR application, it is necessary to load up to 120 images per second onto data goggles. In this way, several thousand points of 3-D data for a vehicle model can be transmitted from a central database to a device in just one second. The process is so fast because the complete data does not have to be loaded onto the device, as used to be the case, but is streamed over the web. A huge variety of 3-D web applications are delivered on the fly, without permanent storage, so that even mobile devices such as tablets and smartphones can make optimal use of them. One key feature of this process is that for every access to instant3DHub, the data is assigned, prepared and visualized for the specific application. “As a result, the system fulfills user- and device-specific requirements, and above all is secure,” says Behr. BMW, Daimler and Porsche already use instant3DHub at over 1,000 workstations. Even medium-sized companies such as SimScale and thinkproject have successfully implemented “instantreality” and instant3DHub and are developing their own individual software solutions on that basis.
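
instant3DHub’s internals are not public, so purely as an illustration of the idea described above, here is a hypothetical Java sketch of server-side culling: the server holds the full component list but returns only the parts near the current viewpoint, so only a small fraction of the model is ever streamed to the device. All class and method names are invented for this example, and the simple distance test stands in for the real visibility computation, which would also account for occlusion and screen coverage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; instant3DHub's real API is not public.
public class VisibilityCullingSketch {

    // A stand-in for one of the ~3.5 million plant components.
    static class Component {
        final String id;
        final double centerX, centerY, centerZ; // center of its bounding volume
        Component(String id, double x, double y, double z) {
            this.id = id; this.centerX = x; this.centerY = y; this.centerZ = z;
        }
    }

    // Server-side culling: keep only the parts close enough to the viewer to
    // matter, so only a few thousand meshes ever travel to the client device.
    static List<Component> visibleFrom(List<Component> all,
                                       double viewX, double viewY, double viewZ,
                                       double maxDistance) {
        List<Component> visible = new ArrayList<>();
        for (Component c : all) {
            double dx = c.centerX - viewX, dy = c.centerY - viewY, dz = c.centerZ - viewZ;
            if (Math.sqrt(dx * dx + dy * dy + dz * dz) <= maxDistance)
                visible.add(c);
        }
        return visible;
    }
}
```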

Augmented reality is a key technology for Industrie 4.0

Technologies that create a link between CAD data and the real production environment are also relevant for the domain of augmented reality. “Augmented reality is a key technology for Industrie 4.0, because it constantly compares the digital target situation in real time against the actual situation as captured by cameras and sensors,” adds Dr. Ulrich Bockholt, head of the Virtual and Augmented Reality department at Fraunhofer IGD. Ultimately, however, the solution is of interest to many sectors, he explains, even in the construction and architecture field, where it can be used to help visualize building information models on smartphones, tablet computers or data goggles.

 

[Source:- Phys.org]

 

 

Upcoming Windows 10 update reduces spying, but Microsoft is still mum on which data it specifically collects

There’s some good news for privacy-minded individuals who haven’t been fond of Microsoft’s data collection policy with Windows 10. When the upcoming Creators Update drops this spring, it will overhaul Microsoft’s data collection policies. Terry Myerson, executive vice president of Microsoft’s Windows and Devices Group, has published a blog post with a list of the changes Microsoft will be making.

First, Microsoft has launched a new web-based privacy dashboard with the goal of giving people an easy, one-stop location for controlling how much data Microsoft collects. Your privacy dashboard has sections for Browse, Search, Location, and Cortana’s Notebook, each covering a different category of data MS might have received from your hardware. Personally, I keep the Digital Assistant side of Cortana permanently deactivated and already set telemetry to minimal, but if you haven’t taken those steps you can adjust how much data Microsoft keeps from this page.

Second, Microsoft is condensing its telemetry options. Currently, there are four options — Security, Basic, Enhanced, and Full. Most consumers only have access to three of these settings — Basic, Enhanced, and Full. The fourth, Security, is reserved for Windows 10 Enterprise or Windows 10 Education. Here’s how Microsoft describes each category:

Security: Information that’s required to help keep Windows, Windows Server, and System Center secure, including data about the Connected User Experience and Telemetry component settings, the Malicious Software Removal Tool, and Windows Defender.

Basic: Basic device info, including: quality-related data, app compatibility, app usage data, and data from the Security level.

Enhanced: Additional insights, including: how Windows, Windows Server, System Center, and apps are used, how they perform, advanced reliability data, and data from both the Basic and the Security levels.

Full: All data necessary to identify and help to fix problems, plus data from the Security, Basic, and Enhanced levels.

That’s the old system. Going forward, Microsoft is collapsing the number of telemetry levels to two. Here’s how Myerson describes the new “Basic” level:

[We’ve] further reduced the data collected at the Basic level. This includes data that is vital to the operation of Windows. We use this data to help keep Windows and apps secure, up-to-date, and running properly when you let Microsoft know the capabilities of your device, what is installed, and whether Windows is operating correctly. This option also includes basic error reporting back to Microsoft.

Windows 10 will also include an enhanced privacy section that will show during start-up and offer much better granularity over privacy settings. Currently, many of these controls are buried in various menus that you have to manually configure after installing the operating system.

It’s nice that Microsoft is cutting back on telemetry collection at the basic level. The problem is, as Steven J. Vaughan-Nichols writes, Microsoft is still collecting a creepy amount of information on “Full,” and it still defaults to sharing all this information with Cortana — which means Microsoft has data files on people it can be compelled to turn over by a warrant from an organization like the NSA or FBI. Given the recent expansion of the NSA’s powers, this information can now be shared with a variety of other agencies without filtering it first. And while Microsoft’s business model doesn’t directly depend on scraping and selling customer data the way Google’s does, the company is still gathering an unspecified amount of information. Full telemetry, for example, may “unintentionally include parts of a document you were using when a problem occurred.” Vaughan-Nichols isn’t thrilled about that idea, and neither am I.

The problem with Microsoft’s disclosure is that it mostly doesn’t disclose. Even Basic telemetry is described only as including “data that is vital to the operation of Windows.” Okay, but what does that mean?

I’m glad to see Microsoft taking steps towards restoring user privacy, but these are small steps that only modify policies around the edges. Until the company actually and meaningfully discloses what telemetry is collected under Basic settings and precisely what Full settings do and don’t send in the way of personally identifying information, the company isn’t explaining anything so much as it’s using vague terms and PR in place of a disclosure policy.

As I noted above, I’d recommend turning Cortana (the assistant) off. If you don’t want to do that, you should regularly review the information MS has collected about you and delete any items you don’t want to be part of the company’s permanent record.

 

 

[Source:- Extremetech]

Attackers start wiping data from CouchDB and Hadoop databases

Data-wiping attacks have hit exposed Hadoop and CouchDB databases.

It was only a matter of time until ransomware groups that wiped data from thousands of MongoDB databases and Elasticsearch clusters started targeting other data storage technologies. Researchers are now observing similar destructive attacks hitting openly accessible Hadoop and CouchDB deployments.

Security researchers Victor Gevers and Niall Merrigan, who monitored the MongoDB and Elasticsearch attacks so far, have also started keeping track of the new Hadoop and CouchDB victims. The two have put together spreadsheets on Google Docs where they document the different attack signatures and messages left behind after data gets wiped from databases.

In the case of Hadoop, a framework used for distributed storage and processing of large data sets, the attacks observed so far can be described as vandalism.

That’s because the attackers don’t ask for payments to be made in exchange for returning the deleted data. Instead, their message instructs the Hadoop administrators to secure their deployments in the future.

According to Merrigan’s latest count, 126 Hadoop instances have been wiped so far. The number of victims is likely to increase because there are thousands of Hadoop deployments accessible from the internet — although it’s hard to say how many are vulnerable.

The attacks against MongoDB and Elasticsearch followed a similar pattern. The number of MongoDB victims jumped from hundreds to thousands in a matter of hours and to tens of thousands within a week. The latest count puts the number of wiped MongoDB databases at more than 34,000 and that of deleted Elasticsearch clusters at more than 4,600.

A group called Kraken0, responsible for most of the ransomware attacks against databases, is trying to sell its attack toolkit and a list of vulnerable MongoDB and Elasticsearch installations for the equivalent of US$500 in bitcoins.

The number of wiped CouchDB databases is also growing rapidly, reaching more than 400 so far. CouchDB is a NoSQL-style database platform similar to MongoDB.

Unlike the Hadoop vandalism, the CouchDB attacks are accompanied by ransom messages, with attackers asking for 0.1 bitcoins (around $100) to return the data. Victims are advised against paying because, in many of the MongoDB attacks, there was no evidence that attackers had actually copied the data before deleting it.

Researchers from Fidelis Cybersecurity have also observed the Hadoop attacks and have published a blog post with more details and recommendations on securing such deployments.

The destructive attacks against online database storage systems are not likely to stop soon because there are other technologies that have not yet been targeted and that might be similarly misconfigured and left unprotected on the internet by users.

 

 

[Source:- JW]

Apache Beam unifies batch and streaming for big data

Apache Beam, a unified programming model for both batch and streaming data, has graduated from the Apache Incubator to become a top-level Apache project.

Aside from becoming another full-fledged widget in the ever-expanding Apache tool belt of big-data processing software, Beam addresses ease of use and dev-friendly abstraction, rather than simply offering raw speed or a wider array of included processing algorithms.

Beam us up!

Beam provides a single programming model for creating batch and stream processing jobs (the name is a hybrid of “batch” and “stream”), and it offers a layer of abstraction for dispatching to various engines used to run the jobs. The project originated at Google, where it’s currently a service called GCD (Google Cloud Dataflow). Beam uses the same API as GCD, and it can use GCD as an execution engine, along with Apache Spark, Apache Flink (a stream processing engine with a highly memory-efficient design), and now Apache Apex (another stream engine for working closely with Hadoop deployments).

The Beam model involves five components: the pipeline (the pathway for data through the program); the “PCollections,” or data streams themselves; the transforms, for processing data; the sources and sinks, where data is fetched and eventually sent; and the “runners,” or components that allow the whole thing to be executed on an engine.
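
For a concrete sense of how those five pieces map onto code, here is a minimal word-count pipeline written against Beam’s Java SDK. The input path, output prefix, and whichever runner the options select (for example, the direct runner) are placeholder assumptions, and the exact API has shifted slightly between Beam releases, so treat this as a sketch rather than copy-paste-ready code.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalWordCount {
    public static void main(String[] args) {
        // The options decide which runner executes the pipeline
        // (direct runner, Cloud Dataflow, Spark, Flink, Apex, ...).
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);                 // the pipeline

        p.apply(TextIO.read().from("input.txt"))               // source -> PCollection<String>
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                 .via((String line) -> Arrays.asList(line.split("\\s+"))))  // transform
         .apply(Count.perElement())                            // transform -> PCollection<KV<String, Long>>
         .apply(MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("word-counts"));             // sink

        p.run().waitUntilFinish();                             // hand off to the chosen runner
    }
}
```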

Apache says it separated concerns in this fashion so that Beam can “easily and intuitively express data processing pipelines for everything from simple batch-based data ingestion to complex event-time-based stream processing.” This is in line with reworking tools like Apache Spark to support stream and batch processing within the same product and with similar programming models. In theory, it’s one fewer concept for prospective developers to wrap their head around, but that presumes Beam is used in lieu of Spark or other frameworks, when it’s more likely it’ll be used — at first — to augment them.

Hands off

One possible drawback to Beam’s approach is that while the layers of abstraction in the product make operations easier, they also put the developer at a distance from the underlying layers. A good case in point: Beam’s current level of integration with Apache Spark; the Spark runner doesn’t yet use Spark’s more recent DataFrames system, and thus may not take advantage of the optimizations those can provide. But this isn’t a conceptual flaw, it’s an issue with the implementation, which can be addressed in time.

The big payoff of using Beam, as noted by Ian Pointer in his discussion of Beam in early 2016, is that it makes migrations between processing systems less of a headache. Likewise, Apache says Beam “cleanly [separates] the user’s processing logic from details of the underlying engine.”

Separation of concerns and ease of migration will be good to have if the ongoing rivalry between the various big data processing engines continues. Granted, Apache Spark has emerged as one of the undisputed champs of the field and become a de facto standard choice. But there’s always room for improvement or an entirely new streaming or processing paradigm. Beam is less about offering a specific alternative than about providing developers and data wranglers with more breadth of choice.

 

 

[Source:- Javaworld]

Snowflake now offers data warehousing to the masses

Snowflake, the cloud-based data warehouse solution co-founded by Microsoft alumnus Bob Muglia, is lowering storage prices and adding a self-service option, meaning prospective customers can open an account with nothing more than a credit card.

These changes also raise an intriguing question: How long can a service like Snowflake expect to reside on Amazon, which itself offers services that are more or less in direct competition — and where the raw cost of storage undercuts Snowflake’s own pricing for same?

Open to the public

The self-service option, called Snowflake On Demand, is a change from Snowflake’s original sales model. Rather than calling a sales representative to set up an account, Snowflake users can now provision services themselves with no more effort than would be needed to spin up an AWS EC2 instance.

In a phone interview, Muglia discussed how the reason for only just now transitioning to this model was more technical than anything else. Before self-service could be offered, Snowflake had to put protections into place to ensure that both the service itself and its customers could be protected from everything from malice (denial-of-service attacks) to incompetence (honest customers submitting massively malformed queries).

“We wanted to make sure we had appropriately protected the system,” Muglia said, “before we opened it up to anyone, anywhere.”

This effort was further complicated by Snowflake’s relative lack of hard usage limits, which Muglia characterized as being one of its major standout features. “There is no limit to the number of tables you can create,” Muglia said, but he further pointed out that Snowflake has to strike a balance between what it can offer any one customer and protecting the integrity of the service as a whole.

“We get some crazy SQL queries coming in our direction,” Muglia said, “and regardless of what comes in, we need to continue to perform appropriately for that customer as well as other customers. We see SQL queries that are a megabyte in size — the query statements [themselves] are a megabyte in size.” (Many such queries are poorly formed, auto-generated SQL, Muglia claimed.)

Fewer costs, more competition

The other major change is a reduction in storage pricing for the service — $30/TB/month for capacity storage, $50/TB/month for on-demand storage, and uncompressed storage at $10/TB/month.

It’s enough of a reduction in price that Snowflake will be unable to rely on storage costs as a revenue source, since those prices barely pay for the use of Amazon’s services as a storage provider. But Muglia is confident Snowflake is profitable enough overall that such a move won’t impact the company’s bottom line.

“We did the data modeling on this,” said Muglia, “and our margins were always lower on storage than on compute running queries.”

According to the studies Snowflake performed, “when customers put more data into Snowflake, they run more queries…. In almost every scenario you can imagine, they were very much revenue-positive and gross-margin neutral, because people run more queries.”

The long-term implications for Snowflake continuing to reside on Amazon aren’t clear yet, especially since Amazon might well be able to undercut Snowflake by directly offering competitive services.

Muglia, though, is confident that Snowflake’s offering is singular enough to stave off competition for a good long time, and is ready to change things up if need be. “We always look into the possibility of moving to other cloud infrastructures,” Muglia said, “although we don’t have plans to do it right now.”

He also noted that Snowflake competes with Amazon and Redshift right now, but “we have a very different shape of product relative to Redshift…. Snowflake is storing multiple petabytes of data and is able to run hundreds of simultaneous concurrent queries. Redshift can’t do that; no other product can do that. It’s that differentiation that allows us to effectively compete with Amazon, and for that matter Google and Microsoft and Oracle and Teradata.”

 

 

[Source:- IW]

Fire up big data processing with Apache Ignite

Apache Ignite is an in-memory computing platform that can be inserted seamlessly between a user’s application layer and data layer. Apache Ignite loads data from the existing disk-based storage layer into RAM, improving performance by as much as six orders of magnitude (1 million-fold).

The in-memory data capacity can be easily scaled to handle petabytes of data simply by adding more nodes to the cluster. Further, both ACID transactions and SQL queries are supported. Ignite delivers performance, scale, and comprehensive capabilities far above and beyond what traditional in-memory databases, in-memory data grids, and other in-memory-based point solutions can offer by themselves.

Apache Ignite does not require users to rip and replace their existing databases. It works with RDBMS, NoSQL, and Hadoop data stores. Apache Ignite enables high-performance transactions, real-time streaming, and fast analytics in a single, comprehensive data access and processing layer. It uses a distributed, massively parallel architecture on affordable, commodity hardware to power existing or new applications. Apache Ignite can run on premises, on cloud platforms such as AWS and Microsoft Azure, or in a hybrid environment.

[Figure: Apache Ignite architecture]

The Apache Ignite unified API supports SQL, C++, .Net, Java, Scala, Groovy, PHP, and Node.js. The unified API connects cloud-scale applications with multiple data stores containing structured, semistructured, and unstructured data. It offers a high-performance data environment that allows companies to process full ACID transactions and generate valuable insights from real-time, interactive, and batch queries.

Users can keep their existing RDBMS in place and deploy Apache Ignite as a layer between it and the application layer. Apache Ignite automatically integrates with Oracle, MySQL, Postgres, DB2, Microsoft SQL Server, and other RDBMSes. The system automatically generates the application domain model based on the schema definition of the underlying database, then loads the data. In-memory databases typically provide only a SQL interface, whereas Ignite supports a wider group of access and processing paradigms in addition to ANSI SQL. Apache Ignite supports key/value stores, SQL access, MapReduce, HPC/MPP processing, streaming/CEP processing, clustering, and Hadoop acceleration in a single integrated in-memory computing platform.

GridGain Systems donated the original code for Apache Ignite to the Apache Software Foundation in the second half of 2014. Apache Ignite was rapidly promoted from an incubating project to a top-level Apache project in 2015. In the second quarter of 2016, Apache Ignite was downloaded nearly 200,000 times. It is used by organizations around the world.

Architecture

Apache Ignite is JVM-based distributed middleware based on a homogeneous cluster topology implementation that does not require separate server and client nodes. All nodes in an Ignite cluster are equal, and they can play any logical role per runtime application requirement.

A service provider interface (SPI) design is at the core of Apache Ignite. The SPI-based design makes every internal component of Ignite fully customizable and pluggable. This enables tremendous configurability of the system, with adaptability to any existing or future server infrastructure.

Apache Ignite also provides direct support for parallelization of distributed computations based on fork-join, MapReduce, or MPP-style processing. Ignite uses distributed parallel computations extensively, and they are fully exposed at the API level for user-defined functionality.

Key features

In-memory data grid. Apache Ignite includes an in-memory data grid that handles distributed in-memory data management, including ACID transactions, failover, advanced load balancing, and extensive SQL support. The Ignite data grid is a distributed, object-based, ACID transactional, in-memory key-value store. In contrast to traditional database management systems, which utilize disk as their primary storage mechanism, Ignite stores data in memory. By utilizing memory rather than disk, Apache Ignite is up to 1 million times faster than traditional databases.

[Figure: Apache Ignite data grid]
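
A minimal sketch of the data grid through Ignite’s public Java API follows; the cache name and values are placeholders, and the node is started with default configuration. The same put/get and transaction calls work unchanged whether one node or a hundred have joined the cluster, since keys are partitioned across whatever nodes are running.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.transactions.Transaction;

public class DataGridExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // A transactional cache behaves as a distributed key-value store.
            CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("accounts");
            cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
            IgniteCache<Integer, String> cache = ignite.getOrCreateCache(cfg);

            // Simple puts and gets are spread across the cluster nodes.
            cache.put(1, "alice");
            cache.put(2, "bob");

            // An ACID transaction spanning multiple keys.
            try (Transaction tx = ignite.transactions().txStart()) {
                String a = cache.get(1);
                cache.put(2, a + "-updated");
                tx.commit();
            }

            System.out.println(cache.get(2)); // -> "alice-updated"
        }
    }
}
```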

SQL support. Apache Ignite supports free-form ANSI SQL-99 compliant queries with virtually no limitations. Ignite can use any SQL function, aggregation, or grouping, and it supports distributed, noncolocated SQL joins and cross-cache joins. Ignite also supports the concept of field queries to help minimize network and serialization overhead.
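SQL access goes through the same cache API. The sketch below uses a placeholder Person type, annotates its fields for querying and indexing, and runs a parameterized ANSI SQL statement with an ORDER BY over the in-memory data.

```java
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.cache.query.annotations.QuerySqlField;
import org.apache.ignite.configuration.CacheConfiguration;

public class SqlQueryExample {
    // Fields annotated for SQL become queryable (and optionally indexed) columns.
    static class Person {
        @QuerySqlField(index = true) long id;
        @QuerySqlField String name;
        @QuerySqlField double salary;
        Person(long id, String name, double salary) {
            this.id = id; this.name = name; this.salary = salary;
        }
    }

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Long, Person> cfg = new CacheConfiguration<>("persons");
            cfg.setIndexedTypes(Long.class, Person.class); // expose Person to the SQL engine
            IgniteCache<Long, Person> cache = ignite.getOrCreateCache(cfg);

            cache.put(1L, new Person(1, "Ada", 120_000));
            cache.put(2L, new Person(2, "Grace", 140_000));

            // Free-form SQL over the in-memory data, with an argument and ordering.
            List<List<?>> rows = cache.query(
                new SqlFieldsQuery(
                    "SELECT name, salary FROM Person WHERE salary > ? ORDER BY salary DESC")
                    .setArgs(125_000)).getAll();

            rows.forEach(r -> System.out.println(r.get(0) + " earns " + r.get(1)));
        }
    }
}
```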

In-memory compute grid. Apache Ignite includes a compute grid that enables parallel, in-memory processing of CPU-intensive or other resource-intensive tasks such as traditional HPC, MPP, fork-join, and MapReduce processing. Support is also provided for standard Java ExecutorService asynchronous processing.

[Figure: Apache Ignite compute grid]
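
A short sketch of the compute grid, closely following the style of Ignite’s own examples: a closure is broadcast to every node, then a set of callables is fanned out across the cluster and reduced locally, MapReduce-style. The strings being measured are arbitrary placeholders.

```java
import java.util.ArrayList;
import java.util.Collection;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.lang.IgniteCallable;

public class ComputeGridExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Broadcast a closure to every node in the cluster.
            ignite.compute().broadcast(() -> System.out.println("Hello from a cluster node"));

            // Fork-join / MapReduce style: one job per word, executed wherever
            // the cluster decides, then reduced locally into a single result.
            Collection<IgniteCallable<Integer>> jobs = new ArrayList<>();
            for (String word : "fast parallel in-memory compute".split(" "))
                jobs.add(word::length);

            int totalChars = ignite.compute().call(jobs)
                                   .stream().mapToInt(Integer::intValue).sum();
            System.out.println("Total characters counted across the cluster: " + totalChars);
        }
    }
}
```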

In-memory service grid. The Apache Ignite service grid provides complete control over services deployed on the cluster. Users can control how many service instances should be deployed on each cluster node, ensuring proper deployment and fault tolerance. The service grid guarantees continuous availability of all deployed services in case of node failures. It also supports automatic deployment of multiple instances of a service, of a service as a singleton, and of services on node startup.

In-memory streaming. In-memory stream processing addresses a large family of applications for which traditional processing methods and disk-based storage, such as disk-based databases or file systems, are inadequate. These applications are extending the limits of traditional data processing infrastructures.

[Figure: Apache Ignite streaming]

Streaming support allows users to query rolling windows of incoming data. This enables users to answer questions such as “What are the 10 most popular products over the last hour?” or “What is the average price in a certain product category for the past 12 hours?”

Another common stream processing use case is pipelining a distributed events workflow. As events are coming into the system at high rates, the processing of events is split into multiple stages, each of which has to be properly routed within a cluster for processing. These customizable event workflows support complex event processing (CEP) applications.
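
One hedged sketch of how a rolling window can be approximated with the streaming API: events are ingested through an IgniteDataStreamer into a cache whose expiry policy evicts anything older than an hour, so queries against that cache see only recent data. The cache name and generated "product" values are placeholders, and a real deployment would tune window handling rather than rely on this minimal setup.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;

public class StreamingWindowExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Entries expire one hour after creation, so the cache itself
            // approximates a rolling one-hour window over the incoming stream.
            CacheConfiguration<Long, String> cfg = new CacheConfiguration<>("pageViews");
            cfg.setExpiryPolicyFactory(
                CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.HOURS, 1)));
            IgniteCache<Long, String> window = ignite.getOrCreateCache(cfg);

            // The data streamer batches updates for high-throughput ingestion.
            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("pageViews")) {
                for (long i = 0; i < 100_000; i++)
                    streamer.addData(i, "product-" + ThreadLocalRandom.current().nextInt(10));
            }

            // SQL, scan, or continuous queries against "pageViews" now see
            // only events from roughly the last hour.
            System.out.println("Events currently in the window: " + window.size());
        }
    }
}
```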

In-memory Hadoop acceleration. The Apache Ignite Accelerator for Hadoop enables fast data processing in existing Hadoop environments via the tools and technology an organization is already using.

[Figure: Apache Ignite in-memory Hadoop acceleration]

Ignite in-memory Hadoop acceleration is based on the first dual-mode, high-performance in-memory file system that is 100 percent compatible with Hadoop HDFS and an in-memory optimized MapReduce implementation. Delivering up to 100 times faster performance, in-memory HDFS and in-memory MapReduce provide easy-to-use extensions to disk-based HDFS and traditional MapReduce. This plug-and-play feature requires minimal to no integration. It works with any open source or commercial version of Hadoop 1.x or Hadoop 2.x, including Cloudera, Hortonworks, MapR, Apache, Intel, and AWS. The result is up to 100-fold faster performance for MapReduce and Hive jobs.

Distributed in-memory file system. A unique feature of Apache Ignite is the Ignite File System (IGFS), which is a file system interface to in-memory data. IGFS delivers similar functionality to Hadoop HDFS. It includes the ability to create a fully functional file system in memory. IGFS is at the core of the Apache Ignite In-Memory Accelerator for Hadoop.

The data from each file is split into separate data blocks and stored in cache. Data in each file can be accessed with a standard Java streaming API. For each part of the file, a developer can calculate an affinity and process the file’s content on corresponding nodes to avoid unnecessary networking.

Unified API. The Apache Ignite unified API supports a wide variety of common protocols for the application layer to access data. Supported protocols include SQL, Java, C++, .Net, PHP, MapReduce, Scala, Groovy, and Node.js. Ignite supports several protocols for client connectivity to Ignite clusters, including Ignite Native Clients, REST/HTTP, SSL/TLS, and Memcached.

Advanced clustering. Apache Ignite provides one of the most sophisticated clustering technologies on JVMs. Ignite nodes can automatically discover each other, which helps scale the cluster when needed without having to restart the entire cluster. Developers can also take advantage of Ignite’s hybrid cloud support, which allows users to establish connections between private clouds and public clouds such as AWS or Microsoft Azure.

Additional features. Apache Ignite provides high-performance, clusterwide messaging functionality. It allows users to exchange data via publish-subscribe and direct point-to-point communication models.

The distributed events functionality in Ignite allows applications to receive notifications about cache events occurring in a distributed grid environment. Developers can use this functionality to be notified about the execution of remote tasks or any cache data changes within the cluster. Event notifications can be grouped and sent in batches and at timely intervals. Batching notifications helps attain high cache performance and low latency.

Ignite allows for most of the data structures from the java.util.concurrent framework to be used in a distributed fashion. For example, you could add to a double-ended queue (java.util.concurrent.BlockingDeque) on one node and poll it from another node. Or you could have a distributed primary key generator, which would guarantee uniqueness on all nodes.

Ignite distributed data structures include support for these standard Java APIs: Concurrent map, distributed queues and sets, AtomicLong, AtomicSequence, AtomicReference, and CountDownLatch.
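
A brief sketch of two of those structures follows: a distributed queue, which one node can fill and another can drain, and an atomic sequence used as a cluster-wide ID generator. The structure names and values are placeholders.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteAtomicSequence;
import org.apache.ignite.IgniteQueue;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CollectionConfiguration;

public class DataStructuresExample {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // A distributed queue: one node can offer work items, another can poll them.
            IgniteQueue<String> tasks =
                ignite.queue("tasks", 0 /* unbounded */, new CollectionConfiguration());
            tasks.offer("resize-image-42");
            System.out.println(tasks.poll()); // any node in the cluster could do this

            // A cluster-wide sequence: increments are unique across all nodes,
            // which makes it usable as a distributed primary-key generator.
            IgniteAtomicSequence ids =
                ignite.atomicSequence("order-ids", 0, true /* create if absent */);
            System.out.println("Next order id: " + ids.incrementAndGet());
        }
    }
}
```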

Key integrations

Apache Spark. Apache Spark is a fast, general-purpose engine for large-scale data processing. Ignite and Spark are complementary in-memory computing solutions. They can be used together in many instances to achieve superior performance and functionality.

Apache Spark and Apache Ignite address somewhat different use cases and rarely compete for the same task. Some of the key differences are outlined below.

Apache Spark doesn’t provide shared storage, so data from HDFS or other disk storage must be loaded into Spark for processing. State can be passed from Spark job to job only by saving the processed data back into external storage. Ignite can share Spark state directly in memory, without storing the state to disk.

One of the main integrations for Ignite and Spark is the Apache Ignite Shared RDD API. Ignite RDDs are essentially wrappers around Ignite caches that can be deployed directly inside of executing Spark jobs. Ignite RDDs can also be used with the cache-aside pattern, where Ignite clusters are deployed separately from Spark, but still in-memory. The data is still accessed using Spark RDD APIs.

Spark supports a fairly rich SQL syntax, but it doesn’t support data indexing, so it must do full scans all the time. Spark queries may take minutes even on moderately small data sets. Ignite supports SQL indexes, resulting in much faster queries, so using Spark with Ignite can accelerate Spark SQL more than 1,000-fold. The result set returned by Ignite Shared RDDs also conforms to the Spark Dataframe API, so it can be further analyzed using standard Spark dataframes. Both Spark and Ignite natively integrate with Apache YARN and Apache Mesos, so it’s easier to use them together.
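
A compressed sketch of the Shared RDD integration using the ignite-spark module’s Java wrappers is shown below. The Spring configuration path, cache name, and sample data are placeholders, and the exact wrapper class and method names have varied a little across Ignite releases, so check the version in use before borrowing from this.

```java
import java.util.Arrays;
import org.apache.ignite.spark.JavaIgniteContext;
import org.apache.ignite.spark.JavaIgniteRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SharedRddExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("ignite-shared-rdd").setMaster("local[2]"));

        // The Ignite context wires Spark workers to an Ignite cluster described
        // by the given configuration file (placeholder path).
        JavaIgniteContext<Integer, Integer> ic =
            new JavaIgniteContext<>(sc, "config/example-shared-rdd.xml");

        // An IgniteRDD is a live view over an Ignite cache: writes from this job...
        JavaIgniteRDD<Integer, Integer> shared = ic.fromCache("sharedNumbers");
        shared.savePairs(sc.parallelize(Arrays.asList(1, 2, 3, 4, 5))
                           .mapToPair(i -> new Tuple2<>(i, i * i)));

        // ...are immediately visible to other Spark jobs (or plain Ignite clients)
        // that open the same cache, with no round trip through external storage.
        System.out.println("Stored pairs: " + shared.count());

        sc.stop();
    }
}
```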

When working with files instead of RDDs, it’s still possible to share state between Spark jobs and applications using the Ignite In-Memory File System (IGFS). IGFS implements the Hadoop FileSystem API and can be deployed as a native Hadoop file system, exactly like HDFS. Ignite plugs in natively to any Hadoop or Spark environment. IGFS can be used with zero code changes in plug-and-play fashion.

Apache Cassandra. Apache Cassandra can serve as a high-performance solution for structured queries. But the data in Cassandra should be modeled such that each predefined query results in one row retrieval. Thus, you must know what queries will be required before modeling the data.

 

 

[Source:- Infoworld]

Department of Labor sues Google over wage data

Google's Mountain View, California headquarters

The U.S. Department of Labor has filed a lawsuit against Google, with the company’s ability to win government contracts at risk.

The agency is seeking what it calls “routine” information about wages and the company’s equal opportunity program. The agency filed a lawsuit with its Office of Administrative Law Judges to gain access to the information, it announced Wednesday.

Google, as a federal contractor, is required to provide the data as part of a compliance check by the agency’s Office of Federal Contract Compliance Programs (OFCCP), according to the Department of Labor. The inquiry is focused on Google’s compliance with equal employment laws, the agency said.

“Like other federal contractors, Google has a legal obligation to provide relevant information requested in the course of a routine compliance evaluation,” OFCCP Acting Director Thomas Dowd said in a press release. “Despite many opportunities to produce this information voluntarily, Google has refused to do so.”

Google said it’s provided hundreds of thousands of records to the agency over the past year, including some related to wages. However, a handful of OFCCP data requests were “overbroad” or would reveal confidential data, the company said in a statement.

“We’ve made this clear to the OFCCP, to no avail,” the statement added. “These requests include thousands of employees’ private contact information which we safeguard rigorously.”

Google must allow the federal government to inspect and copy records relevant to compliance, the Department of Labor said. The agency requested the information in September 2015, but Google provided only partial responses, an agency spokesman said by email.

 

 

[Source:- Javaworld]


Azure SQL Data Warehouse brings MPP to Microsoft cloud

“It is an extremely high-performance MPP service, with column store indexing,” according to Andrew Snodgrass, analyst at Directions on Microsoft in Kirkland, Wash. “Azure SQL Data Warehouse can put numerous processors to work on queries, returning results much faster than any single server.”

For many smaller companies, a data warehouse is still new, and dedicating staff to nurse and feed the warehouse is a burden. Cloud can be a benefit there. But even large companies with established data warehousing programs are currently reviewing their options.

That is one reason Microsoft is promoting the new offering as an alternative to on-premises data warehouses, especially those that focus on producing monthly reports. Today, these systems may have low utilization over much of the month, and then find high use as monthly reports come due.

That is a perfect case for cloud, a Microsoft data leader asserts. In an online blog last week discussing Azure SQL Data Warehouse’s move to general availability, Joseph Sirosh, corporate vice president overseeing Microsoft’s Data Group, described Azure elastic cloud computing as a means to efficiently marry processing with workload requirements for data warehousing. The service has been available in preview releases since June of last year.

Broad shoulders of column architecture

While some column-based analytics systems go back over 20 years, broad use of the column data architecture was still fairly new in 2012, when Amazon tapped such technology for its Redshift data warehouse in the cloud. Such software was principally useful when data warehouses were required to support large numbers of user queries against their data stores.

Redshift seriously upped the ante in cloud data warehouses, making them more than commodity-type products. As Redshift became a prominent part of Amazon’s cloud portfolio, it put pressure on Microsoft to add similar capabilities to its Azure line while still supporting basic SQL Server compatibility. Although it has taken a while to achieve, observers said this release of a Microsoft cloud-based data warehouse is still timely.

“Microsoft is playing catchup to some extent. But it was required to make some significant changes. Their strategy now is cloud first,” said Ben Harden, principal for data and analytics at Richmond, Va.-based services provider CapTech Ventures Inc.

Harden said cloud computing is a very influential trend and that CapTech is now seeing demand for both Amazon and Microsoft cloud implementations.

The wait for Microsoft’s cloud data warehouse may have been worthwhile, according to Joe Caserta, president at Caserta Concepts LLC, a New York-based data consultancy that has partner agreements with both Amazon and Microsoft analytics.

“I am kind of glad they waited until they were ready,” he said. “They now have a good core set of tools.”

Scalability on a scale of one to ten

The ability to scale up to handle peak data loads is a plus for Azure SQL Data Warehouse, according to Paul Ohanian, CTO at Pound Sand, an El Segundo, Calif.-based electronic game developer that has worked with the new Microsoft software. A major use has been to produce analytics to track players’ behavior, identify trends and create projections. Scaling up was a concern, he said.

“Our game was featured in the iOS App Store for seven weeks over Christmas. We went from testing a game with about 1,000 people overseas to all of a sudden getting half-a-million users in six days. But Azure SQL Data Warehouse allowed us to very easily scale from something like 1,000 users to 100,000 users,” he said. “Literally, we saw that rise in eight hours.”

Handling such issues was his team’s original goal, according to Ohanian. “When we shipped our game, the scaling totally worked,” he said.
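
Resizing compute in this way is normally done by changing the warehouse’s service objective, i.e., its data warehouse unit level. As a rough sketch of what that looks like from Java, the snippet below issues the T-SQL statement through Microsoft’s JDBC driver; the server name, credentials, warehouse name, and the DW400 target are all placeholder assumptions, and such statements are commonly run against the logical server’s master database.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ScaleWarehouseSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for the Azure SQL logical server.
        String url = "jdbc:sqlserver://myserver.database.windows.net:1433;"
                   + "databaseName=master;user=dwadmin@myserver;password=<secret>;encrypt=true;";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Raise the warehouse to 400 data warehouse units ahead of a usage
            // peak; a matching statement can scale it back down afterwards.
            stmt.execute("ALTER DATABASE MyWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW400')");
        }
    }
}
```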

Ohanian said his group has been an Azure cloud user for a number of years, but that it looked at other cloud and analytics alternatives. Affinities between APIs already in use and APIs for Azure SQL Data Warehouse were a factor in choosing the Microsoft software, he said.

By his estimation, the cloud data warehouse service is not too late to the fair, and it is full-fledged in important areas such as management.

“Maybe it’s because it is coming later, but it seems to have streamlined some of the complexity,” Ohanian said. “It is easy to get things up and going, and to manage.”

Ohanian found favor with Azure SQL Data Warehouse pricing, which separates expenses for storage and computing. Starting Sept. 1, data storage will be charged based on Azure Premium Storage rates of $122.88 per 1TB per month; at that time, compute pricing will be about $900 per month per 100 data warehouse units, according to Microsoft. Data warehouse units are the company’s measure for underlying resources such as CPU and memory.

It is important for Microsoft to field data warehouse products suited for its established users, so, naturally, C# and .NET developers continue to be a target of Azure cloud updates.

“We are seeing pretty equal demand between Amazon and Azure clouds. It is boiling down to what skill sets users have today,” said consultant Harden. “People are taking the path of least resistance where skills are concerned.”

Path of least resistance leads me on

The availability of Microsoft’s massively parallel processing cloud data warehouse augurs greater competition in cloud data, which is welcomed by some.

An example is Todd Hinton, vice president of product strategy at RedPoint Global in Wellesley, Mass., a maker of data management and digital marketing tools. He is not alone in saying that greater competition in the cloud data space could be good, or in being unwilling to pick a winner just yet.

“I think they are going to be head-to-head competitors. You have Amazon fans and you have Microsoft fans. It’s almost like the old operating system battle between Linux and Windows. For our part, we are data agnostic. We will be interested to see how Azure SQL Data Warehouse shakes out.”

Like others, his company’s software supports both Amazon and Azure. The company offers direct integration with AWS Redshift already, and he said he expects it will offer native support for Azure SQL Data Warehouse later this year.

Competition in cloud data warehouses goes further too, with players ranging from IBM, Informatica, Oracle and Teradata to Cazena, Snowflake Computing, Treasure Data and others. While it is not as hot as the Hadoop or Spark data management cauldrons, releases like Microsoft’s show the cloud data warehouse space is heating up.

 

[Source:- techtarget]