Spark Archives - ScaleOut Software

Unlocking New Capabilities for Azure Digital Twins with Real-Time Analytics

Tue, 09 Nov 2021 14:00:55 +0000

The Need for Real-Time Analytics with Digital Twins

In countless applications that track live systems, real-time analytics plays a key role in identifying problems (or finding opportunities) and responding fast enough to make a difference. Consider a software telematics application that tracks a nationwide fleet of trucks to ensure timely deliveries. Dispatchers receive telemetry from trucks every few seconds detailing location, speed, lateral acceleration, engine parameters, and cargo viability. In a classic needle-and-haystack scenario, dispatchers must continuously sift through telemetry from thousands of trucks to spot issues, such as lost or fatigued drivers, engines requiring maintenance, or unreliable cargo refrigeration. They must intervene quickly to keep the supply chain running smoothly. Real-time analytics can help dispatchers tackle this seemingly impossible task by automatically sifting through telemetry as it arrives, analyzing it for anomalies needing attention, and alerting dispatchers when conditions warrant.

By using a process of divide and conquer, digital twins can dramatically simplify the construction of applications that implement real-time analytics for telematics or other applications. A digital twin for each truck can track that truck’s parameters (for example, maintenance and driver history) and its dynamic state (location, speed, engine and cargo condition, etc.). The digital twin can analyze telemetry from the truck to update this state information and generate alerts when needed. It can encapsulate analytics code or use machine learning techniques to look for anomalies. Running simultaneously, thousands of digital twins can track all the trucks in a fleet to keep dispatchers informed while reducing their workload.

Applying the digital twin model to real-time analytics expands its range of uses from its traditional home in product lifecycle management and infrastructure tracking to managing time-critical, live systems with many data sources. Examples include preventive maintenance, health-device tracking, logistics, physical and cyber security, IoT for smart cities, ecommerce shopping, financial services, and many others. But how can we integrate real-time analytics with digital twins and ensure high performance combined with straightforward application development?

Message Processing with Azure Digital Twins

Microsoft’s Azure Digital Twins provides a compelling platform for creating digital twin models with a rich set of features for describing their contents, including properties, components, inheritance, and more. The Azure Digital Twins Explorer GUI tool lets users view digital twin models and instances, as well as their relationships.

Azure digital twins can host dynamic properties that track the current state of physical data sources. Users can create serverless functions using Azure Functions to ingest messages generated by data sources and delivered to digital twins via Azure IoT Hub (or other message hubs). These functions update the properties of Azure digital twins using APIs provided for this purpose. Here’s a redrawn tutorial example that shows how Azure functions can process messages from a thermostat and update both its digital twin and a parent digital twin that models the room in which the thermostat is located. Note that the first Azure function’s update triggers the Azure Event Grid to run a second function that updates the room’s property:

The challenge in using serverless functions to process messages and perform real-time analytics is that they add overhead and complexity. By their nature, serverless functions are stateless and must obtain their state from external services; this adds latency. In addition, they are subject to scheduling and authentication overheads on each invocation, and this adds delays that limit scalability. The use of multiple serverless functions and associated mechanisms, such as Event Grid topics and routes, also adds complexity in developing analytics code.

Adding Real-Time Analytics Using In-Memory Computing

Integrating an in-memory computing platform with the Azure Digital Twins infrastructure addresses both of the challenges. This technology runs on a cluster of virtual servers and hosts application-defined software objects in memory for fast access along with a software-based compute engine that can run application-defined methods with extremely low latency. By storing each Azure digital twin instance’s properties in memory and routing incoming messages to an in-memory method for processing, both latency and complexity can be dramatically reduced, and real-time analytics can be scaled to handle thousands or even millions of data sources.

ScaleOut Software’s newly announced Azure Digital Twins Integration does just this. It integrates the ScaleOut Digital Twin Streaming Service , an in-memory computing platform running on Microsoft Azure (or on premises), with the Azure Digital Twins service to provide real-time streaming analytics. It accelerates message processing using in-memory computing to ensure fast, scalable performance while simultaneously streamlining the programming model.

The ScaleOut Azure Digital Twins Integration creates a component within an Azure Digital Twin model in which it hosts “real-time” properties for each digital twin instance of the model. These properties track dynamic changes to the instance’s physical data source and provide context for real-time analytics.

To implement real-time analytics code, application developers create a message-processing method for an Azure digital twin model. This method can be written in C# or Java, using an intuitive rules-based language, or by configuring machine learning (ML) algorithms implemented by Microsoft’s ML.NET library. It makes use of each instance’s real-time properties, which it stores in a memory-based object called a real-time digital twin, and the in-memory compute engine automatically persists these properties in the Azure digital twin instance.

Here’s a diagram that illustrates how real-time digital twins integrate with Azure digital twins to provide real-time streaming analytics:

This diagram shows how each real-time digital twin instance maintains in-memory properties, which it retrieves when deployed, and automatically persists these properties in its corresponding Azure digital twin instance. The real-time digital twin connects to Azure IoT Hub or other message source to receive and then analyze incoming messages from its corresponding data source. Fast, in-memory processing provides sub-millisecond access to real-time properties and completes message processing with minimal latency. It also avoids repeated authentication delays every time a message is processed by authenticating once with the Azure Digital Twins service at startup.

All real-time analytics performed during message processing can run within a single in-memory method that has full access to the digital twin instance’s properties. This code also can access and update properties in other Azure digital twin instances. These features simplify design by avoiding the need to split functionality across multiple serverless functions and by providing a straightforward, object-oriented design framework with advanced, built-in capabilities, such as ML.

To further accelerate development, ScaleOut provides tools that automatically generate Azure digital twin model definitions for real-time properties. These model definitions can be used either to create new digital twin models or to add a real-time component to an existing model. Users just need to upload the model definitions to the Azure Digital Twins service.

Here’s how the tutorial example for the thermostat would be implemented using ScaleOut’s Azure Digital Twins Integration:

Note that the ScaleOut Digital Twins Streaming Service takes responsibility for ingesting messages from Azure IoT Hub and for invoking analytics code for the data source’s incoming messages. Multiple, pipelined connections with Azure IoT Hub ensure high throughput. Also note that the two serverless functions and use of Event Grid have been eliminated since the in-memory method handles both message processing and updates to the parent object (Room 21).

Combining the ScaleOut Digital Twin Streaming Service with Azure Digital Twins gives users the power of in-memory computing for real-time analytics while leveraging the full spectrum of Azure services and tools, as illustrated below for the thermostat example:

Users can view real-time properties with the Azure Digital Twins Explorer tool and track changes due to message processing. They also can take advantage of Azure’s ecosystem of big data analytics tools like Spark to perform batch processing. ScaleOut’s real-time data aggregation, continuous query, and visualization tools for real-time properties enable second-by-second tracking of live systems that boosts situational awareness for users.

Example of Real-Time Analytics with Azure Digital Twins

Incorporating real-time analytics using ScaleOut’s Azure Digital Twins Integration unlocks a wide array of applications for Azure Digital Twins. For example, here’s how the telematics software application discussed above could be implemented:

Each truck has a corresponding Azure digital twin which tracks its properties including a subset of real-time properties held in a component of each instance. When telemetry messages flow in to Azure IoT Hub, they are processed and analyzed by ScaleOut’s in-memory computing platform using a real-time digital twin that holds a truck’s real-time properties in memory for fast access and a message-processing method that analyzes telemetry changes, updates properties, and signals alerts when needed.

Real-time analytics can run ML algorithms that continuously examine telemetry, such as engine parameters, to detect anomalies and signal alerts. Digital twin analytics, combined with data aggregation and visualization powered by the in-memory platform, enable dispatchers to quickly spot emerging issues and take corrective action in a timely manner.

Summing Up

Digital twins offer a powerful means to model and visualize a population of physical devices. Adding real-time analytics to digital twins extends their reach into live, production systems that perform time-sensitive functions. By enabling managers to continuously examine telemetry from thousands or even millions of data sources and immediately identify emerging issues, they can avoid costly problems and capture elusive opportunities.

Azure Digital Twins has emerged as a compelling platform for hosting digital twin models. With the integration of in-memory computing technology using the ScaleOut Digital Twin Streaming Service, Azure Digital Twins gains the ability to analyze incoming telemetry with low latency, high scalability, and a straightforward development model. The combination of these two technologies has the potential to unlock a wide range of important new use cases for digital twins.

The post Unlocking New Capabilities for Azure Digital Twins with Real-Time Analytics appeared first on ScaleOut Software.

]]>

Using Real-Time Digital Twins for Aggregate Analytics

Mon, 15 Jun 2020 18:14:37 +0000

Maintain and aggregate dynamic information for
thousands of data sources.

When analyzing telemetry from a large population of data sources, such as a fleet of rental cars or IoT devices in “smart cities” deployments, it’s difficult if not impossible for conventional streaming analytics platforms to track the behavior of each individual data source and derive actionable information in real time. Instead, most applications just sift through the telemetry for patterns that might indicate exceptional conditions and forward the bulk of incoming messages to a data lake for offline scrubbing with a big data tool such as Spark.

Maintain State Information for Each Data Source

An innovative new software technique called “real-time digital twins” leverages in-memory computing technology to turn the Lambda model for streaming analytics on its head and enable each data source to be independently tracked and responded to in real time. A real-time digital twin is a software object that encapsulates dynamic state information for each data source combined with application-specific code for processing incoming messages from that data source. This state information gives the code the context it needs to assess the incoming telemetry and generate useful feedback within 1-3 milliseconds.

For example, suppose an application analyzes heart rate, blood pressure, oxygen saturation, and other telemetry from thousands of people wearing smart watches or medical devices. By holding information about each user’s demographics, medical history, medications, detected anomalies, and current activity, real-time digital twins can intelligently assess this telemetry while updating their state information to further refine their feedback to each person. Beyond just helping real-time digital twins respond more effectively in the moment, maintaining context improves feedback over time.

Use State Information for Aggregate Analytics

State information held within real-time digital twins also provides a repository of significant data that can be analyzed in aggregate to spot important trends. With in-memory computing, aggregate analysis can be performed continuously every few seconds instead of waiting for offline analytics in a data lake. In this usage, relevant state information is computed for each data source and updated as telemetry flows in. It is then repeatedly extracted from all real-time digital twins and aggregated to highlight emerging patterns or issues that may need attention. This provides a powerful tool for maximizing overall situational awareness.

Consider an emergency monitoring system during the COVID-19 crisis that tracks the need for supplies across the nation’s 6,100+ hospitals and attempts to quickly respond when a critical shortage emerges. Let’s assume all hospitals send messages every few minutes to this system running in a central command center. These messages provide updates on various types and amounts of shortages (for example, of PPE, ventilators, and medicines) that the hospitals need to quickly rectify. Using state information, a real-time digital twin for each hospital can both track and evaluate these shortages as they evolve. It can look at key indicators, such as the relative importance of each supply type and the rate at which the shortages are increasing, to create a dynamic measure of urgency that the hospital receive attention from the command center. All of this data is continuously updated within the real-time digital twin as messages arrive to give personnel the latest status.

Aggregate analysis can then compare this data across all hospitals by region to identify which regions have the greatest immediate need and track how fast and where overall needs are evolving. Personnel can then query state information within the real-time digital twins to quickly determine which specific hospitals should receive supplies and what specific supplies should be immediately delivered to them. Using real-time digital twins, all of this can be accomplished in seconds or minutes.

This analysis flow is illustrated in the following diagram:

As this example shows, real-time digital twins provide both a real-time filter and aggregator of the data stream from each data source to create dynamic information that is continuously extracted for aggregate analysis. Real-time digital twins also track detailed information about the data source that can be queried to provide a complete understanding of evolving conditions and enable appropriate action.

Numerous Applications Need Real-Time Monitoring

This new paradigm for streaming analytics can be applied to numerous applications. For example, it can be used in security applications to assess and filter incoming telemetry (such as likely false positives) from intrusion sensors and create an overall likelihood of a genuine threat from a given location within a large physical or cyber system. Aggregate analysis combined with queries can quickly evaluate the overall threat profile, pinpoint the source(s), and track how the threat is changing over time. This information enables personnel to assess the strategic nature of the threat and take the most effective action.

Likewise, disaster-recovery applications can use real-time digital twins to track assets needed to respond to emergencies, such as hurricanes and forest fires. Fleets of rental cars or trucks can use real-time digital twins to track vehicles and quickly identify issues, such as lost drivers or breakdowns. IoT applications can use real-time digital twins to implement predictive analytics for mission-critical devices, such as medical refrigerators. The list goes on.

Summing Up: Do More in Real Time

Conventional streaming analytics only attempt to perform superficial analysis of aggregated data streams and defer the bulk of analysis to offline processing. Because of their ability to maintain dynamic, application-specific information about each data source, real-time digital twins offer breathtaking new capabilities to track thousands of data sources in real time, provide intelligent feedback, and combine this with immediate, highly focused aggregate analysis. By harnessing the scalable power of in-memory computing, real-time digital twins are poised to usher in a new era in streaming analytics.

We invite you to learn more about the ScaleOut Digital Twin Streaming Service, which is available for evaluation and production use today.

The post Using Real-Time Digital Twins for Aggregate Analytics appeared first on ScaleOut Software.

]]>

Real-Time Digital Twins: A New Approach to Streaming Analytics

Sat, 01 Feb 2020 00:40:58 +0000

Real-time digital twins offer a compelling new software model for tracking and analyzing telemetry from large numbers of data sources. Consider the typical, conventional streaming analytics pipeline available on popular cloud platforms:

A conventional pipeline combines telemetry from all data sources into a single stream which is queried by the user’s streaming analytics application. This code often takes the form of a set of SQL queries (extended with time-windowing semantics) running continuously to select interesting events from the stream. These query results are then forwarded to a data lake for offline analytics using tools such as Spark and for data visualization. Query results also might be forwarded to cloud-based serverless functions to trigger alerts or other actions in conjunction with access to a database or blob store.

These techniques are highly effective for analyzing telemetry in aggregate to identify unusual situations which might require action. For example, if the telemetry is tracking a fleet of rental cars, a query could report all cars by make and model that have reported a mechanical problem more than once over the last 24 hours so that follow-up inquiries can be made. In another application, if the telemetry stream contains key-clicks from an e-commerce clothing site, a query might count how many times a garment of a given type or brand was viewed in the last hour so that a flash sale can be started.

A key limitation of this approach is that it is difficult to separately track and analyze the behavior of each individual data source, especially when they number in the thousands or more. It’s simply not practical to create a unique query tailored for each data source. Fine-grained analysis by data source must be relegated to offline processing in the data lake, making it impossible to craft individualized, real-time responses to the data sources.

For example, a rental car company might want to alert a driver if she/he strays from an allowed region or appears to be lost or repeatedly speeding. An e-commerce company might want to offer a shopper a specific product based on analyzing the click-stream in real time with knowledge of the shopper’s brand preferences and demographics. These individualized actions are impractical using the conventional tools of real-time streaming analytics.

However, real-time digital twins easily bring these capabilities within reach. Take a look at how the streaming pipeline differs when using real-time digital twins:

The first important difference to note is that the execution platform automatically correlates telemetry events by data source. This avoids the need for the application to select events by data source using queries (which is impractical in any case when using a conventional pipeline with many data sources). The second difference is that real-time digital twins maintain immediately accessible (in-memory) state information for each data source which is used by message-processing code to analyze incoming events from that data source. This enables straightforward application code to immediately react to telemetry information in the context of knowledge about the history and state of each data source.

For example, the rental car application can keep each driver’s contract, location history, and the car’s known mechanical issues and service history within the corresponding digital twin for immediate reference to help detect whether an alert is needed. Likewise, the e-commerce application can keep each shopper’s recent product searches along with brand preferences and demographics in her/his digital twin, enabling timely suggestions targeted to each shopper.

The power of real-time digital twins lies in their ability to make fine-grained analysis and responses possible in real time for thousands of data sources. They are made possible by scalable, in-memory computing technology hosted on clusters of cloud-based servers. This provides the fast response times and scalable throughput needed to support many thousands of data sources.

Lastly, real-time digital twins open the door to real-time aggregate analytics that analyze state data across all instances to spot emerging patterns and trends. Instead of waiting for the data lake to provide insights, aggregate analytics on real-time digital twins can immediately surface patterns of interest, maximizing situational awareness and assisting in the creation of response strategies.

With aggregate analytics, the rental car company can identify regions with unusual delays due to weather or highway blockages and then alert the appropriate drivers to suggest alternative routes. The e-commerce company can spot hot-selling products perhaps due to social media events and respond to ensure that inventory is made available.

Real-time digital twins create exciting new capabilities that were not previously possible with conventional techniques. You can find detailed information about ScaleOut Software’s cloud service for real-time digital twins here.

The post Real-Time Digital Twins: A New Approach to Streaming Analytics appeared first on ScaleOut Software.

]]>

The Digital Twin: A Foundational Concept for Stateful Stream Processing

Fri, 04 May 2018 23:30:03 +0000

Traditional stream-processing and complex event processing systems, such as Apache Storm and Software AG’s Apama, have focused on extracting interesting patterns from incoming data with stateless applications. While these applications maintain state information about the data stream itself, they don’t generally make use of information about the data sources or their context. For example, if an IoT application is attempting to detect whether data from a temperature sensor is predicting the failure of the medical freezer to which it is attached, it looks at patterns in the temperature changes, such as sudden spikes or a continuously upward trend, without regard to the freezer’s usage or service history.

The following diagram depicts a typical stream processing pipeline processing events from many data sources:

More recent stream-processing platforms, such as Apache Flink, have incorporated stateful stream processing into their architectures in the form of key-value stores or databases that the application can make use of to enhance its analysis. But they do not offer a specific semantic model which applications can leverage to organize and track useful state information and thereby deepen their ability to analyze data streams.

The answer to this challenge may be found in the digital twin model. While this term was coined by Dr. Michael Grieves (U. Michigan) in 2002 for use in product life cycle management, it was recently popularized for IoT by Gartner in a 2017 report. This model offers key insights into how state data can be organized within stream-processing applications for maximum effectiveness. In particular, it suggests that applications implement a stateful model of the physical data sources that generate event streams, and that the application maintain separate state information for each data source. For example, using the digital twin model, a rental car company can track and analyze telemetry from each car in its fleet with digital twins:

The digital twin model thereby provides an intuitive approach to organizing state data, and, by shifting the focus of analysis from the event stream to the data sources, it potentially enables much deeper introspection than previously possible. With the digital twin model, an application can conveniently track all relevant information about the evolving state of physical data sources. It can then analyze incoming events in this rich context to provide high quality insights, alerting, and feedback. For example, digital twins of medical freezers could track detailed facts about the specific model, its service history, environmental conditions, and usage patterns for each physical unit to help analyze telemetry from a temperature sensor and make more informed predictions about possible impending failures.

Beyond providing a powerful semantic model for stateful stream processing, digital twins also offer advantages for software engineering because they can take advantage of well understood object-oriented programming techniques. A digital twin can be implemented as a data class which encapsulates both state data (including a time-ordered event collection) and methods for updating and analyzing that data. Analytics methods can range from simple sequential code to machine learning algorithms or rules engines. These methods also can reach out to databases to access and update historical data sets.

For each physical data source, an instance of a digital twin model is created by the stream-processing system to receive and analyze events. It is the responsibility of the system to correlate data from a given data source for delivery to each instance of a physical twin. In many applications, a stream-processing system may host thousands (or more) digital twins to handle the workload from its data sources. In an upcoming blog, we will look at how in-memory data grids provide a highly scalable platform for hosting digital twins.

One parting thought concerns the granularity of a digital twin. Does it encompass a model of a single sensor or that of a subsystem comprising multiple sensors? As with object-oriented programming in general, the answer is in the hands of the application developer, who must make choices about which data (and event streams) are logically related and need to be encapsulated in a single entity for analysis to meet the application’s goals.

The digital twin model provides a powerful organizational tool that focuses on the state of data sources instead of just the data within event streams. With this additional context, it magnifies the developer’s ability to implement deep introspection and represents a new way of thinking about stateful stream processing.

The post The Digital Twin: A Foundational Concept for Stateful Stream Processing appeared first on ScaleOut Software.

]]>

Using In-Memory Data Grids for ETL on Streaming Data

Mon, 10 Mar 2014 17:00:40 +0000

The Hadoop stack offers a compelling set of technologies and tools that can be deployed to serve as the core of next-generation data warehouses. The combination of scalable MapReduce to analyze petabyte data sets, parallel SQL query using Hive or Impala, and data visualization tools give the analyst powerful resources for mining strategically important data. The Hadoop Distributed File System (HDFS) serves as a highly scalable data repository for hosting this data and efficiently feeding it into Hadoop’s parallel analysis engine. With industrial strength support from companies like Cloudera and others, the time is now right for deploying a Hadoop-based data warehouse:

Using ETL to Feed the Data Warehouse

A key challenge for any data warehouse is to supply data to it in a format that can be readily ingested and analyzed, and this is the role of the well-known process called “extract-transform-load” (ETL). In the case of Hadoop, this usually means extracting data from external sources and transforming them into a form that can be stored in HDFS for use by MapReduce applications. When incoming data arrives as collections of files, it’s a straightforward matter either to just copy them into HDFS or to periodically run a batch MapReduce application which reads in the files, transforms the data as needed, and outputs it to HDFS.

Consider a company that sends end-of-day reports from its field offices to the data warehouse for aggregate analysis. The data warehouse can start up a MapReduce application after the last report has been uploaded to read from an external file system, reorganize it, and then output the results to HDFS. For example, this application might use the keys output from the mappers to join data for various fields (such as, revenue, volume, etc.) across all offices so that the reducers can output this data to HDFS by field instead of by office.

Implementing ETL using MapReduce offers several advantages. It makes use of the data warehouse’s parallel infrastructure to quickly process the data on a cluster of servers. It also leverages the development team’s skill sets in developing MapReduce applications to minimize overall cost. Lastly, it avoids the need to deploy a variety of technologies, which creates unnecessary complexity and headaches for system administrators.

The Challenge: Real-Time ETL

Running ETL using a batch MapReduce job works fine for static data, such as file-based, end-of-day reports. But what about streaming data that continuously flows into the data warehouse? For example, consider an e-commerce website that accepts orders which flow to the data warehouse for analysis to identify patterns and issues. The website generates a continuous stream of orders which must be stored as HDFS files by an ETL processing step, as illustrated by the following diagram:

The simplest possible approach to this problem is to store the incoming orders as individual files in HDFS. Of course, this does not allow for any data translation prior to saving the files in the data warehouse. Also, this creates many file I/O operations both when loading HDFS and later when reading large numbers of small files during each analysis.

A better solution would be to run a MapReduce application that reads the input stream and outputs to HDFS. This enables the translation step to reorganize and consolidate the data as necessary and to efficiently output it to HDFS. By using standard MapReduce instead of another stream processing platform, such as Spark or Storm, the skill sets already employed for the data warehouse can be used instead of requiring a different software stack to perform ETL.

However, the data warehouse’s batch-oriented MapReduce execution environment incurs high scheduling latency (typically 15 seconds or longer) that makes it unsuitable for processing an incoming data stream. Furthermore, this MapReduce application would need to run continuously, tying up resources that were intended for data analysis, not ongoing ETL.

The Solution: Offload to an In-Memory Data Grid Running MapReduce

The streaming ETL challenge can be met by deploying an in-memory data grid with an integrated MapReduce engine, such as ScaleOut hServer, to capture the data stream in real time, perform ETL, and offload the data warehouse. Let’s take a look at how this works.

IMDGs host data in memory and distribute it across a cluster of commodity servers. Using an object-oriented data storage model, they provide APIs for storing, accessing, and updating data objects in well under a millisecond (depending on the size of the object). This enables operational systems to use IMDGs for storing fast-changing, “live” data, such as the data warehouse’s incoming order stream.

An IMDG provides an ideal repository for the data stream, buffering orders as objects within the grid and running the ETL application using built-in MapReduce (more on that below). The IMDG matches the arrival rate of the incoming data stream by adding servers as needed to its cluster, ensuring that both storage capacity and update throughput scale linearly while keeping update times fast. Also, the IMDG maintains high availability using data replication so that if a server fails, the IMDG can continue to handle update requests without delay.

Because IMDGs store data in memory distributed across a cluster of servers, they can easily perform data-parallel computations on stored data, such as the ETL function needed by the data warehouse; they simply make use of the cluster’s processing power to analyze data “in place,” that is, without the need to migrate it to other servers. This enables IMDGs to complete ETL fast (possibly in less than a second) with minimal overhead.

Some IMDGs, such as ScaleOut hServer, can execute standard Hadoop MapReduce applications (i.e., applications which are fully code-compatible with Apache Hadoop), allowing these applications to access in-memory data from the grid and output to HDFS. This enables the ETL function to be deployed as a conventional MapReduce application within the IMDG. The application extracts orders from the grid’s memory, transforms them as required for storage in the data warehouse, and then outputs them to HDFS using standard MapReduce techniques, as illustrated in the following diagram:

Ensuring Continuous Processing

The use of an IMDG offloads the data warehouse, allowing the MapReduce application performing ETL to run continuously. It also dramatically reduces the latency required to start up each iteration from 15+ seconds to a few milliseconds. Buffering orders in memory while simultaneously migrating them to HDFS ensures that ETL processing seamlessly keeps up with the incoming data stream.

To show how continuous processing can be achieved, the following diagram depicts the use of a “double buffering” strategy to perform ETL processing. IMDGs organize collections of objects within name spaces that can be identified and used as input to a MapReduce application. In this case, while incoming orders are added to one name space, which serves as an input buffer, the MapReduce application extracts orders from a second name space that was previously filled; it then organizes them into an appropriate format and outputs the data to HDFS. Upon completion, the extracted orders are cleared from the associated name space, the name spaces are switched, and the MapReduce application is restarted on the other name space:

This technique uses the memory of the IMDG to allow orders to flow smoothly into the IMDG while processing by the MapReduce ETL application is ongoing. It requires that sufficient memory be available in the IMDG to buffer incoming order objects during the processing time of the application. Because the IMDG can scale memory capacity by adding servers and because the IMDG fast start-up and data-parallel execution minimize the ETL application’s processing time, continuous processing of incoming orders is ensured.

Summing Up

Hadoop’s powerful analytics capabilities are rapidly making it the centerpiece of next-generation data warehouses. The ability of IMDGs to implement ETL for streaming data enables them to serve as a vital component of these infrastructures. IMDGs which can run MapReduce applications provide the threefold benefits of meeting the low latency requirements for ingesting streaming data, offloading the data warehouse’s execution environment, and leveraging existing Hadoop skills. ETL on streaming data is yet another example of real-time analytics and a prime application for IMDGs.

Perhaps most exciting is that hosting ETL in an IMDG’s real-time analytics engine opens the door to analyzing the order stream (or a clickstream) in real time and generating instant feedback for web users. Over time, the ETL function can evolve to perform real-time analysis, provide guidance, and thereby drive incremental sales. The IMDG’s analytics engine forms a bridge from the data warehouse to customers, helping push the benefits of data analytics to the point of sale where it can have maximum impact.

The post Using In-Memory Data Grids for ETL on Streaming Data appeared first on ScaleOut Software.

]]>

How Do In-Memory Data Grids Differ from Spark?

Tue, 25 Feb 2014 19:29:31 +0000

As an in-memory computing vendor, we’ve found that our products often get confused with some popular open-source, in-memory technologies. Perhaps the three technologies we are most often confused with are Spark/Spark Streaming, Storm, and complex event processing (CEP). These innovative technologies are great at what they’re built for, but in-memory data grids (IMDGs) were created for a distinct use case. In this blog post, we will take a look at how IMDGs differ from Spark and Spark Streaming.

The Basics: IMDGs Provide Fast, Scalable, and Highly Available Data Storage

IMDGs host data in memory and distribute it across a cluster of commodity servers. Using an object-oriented data storage model, they provide APIs for updating data objects typically in well under a millisecond (depending on the size of the object). This enables operational systems to use IMDGs for storing, accessing, and updating fast-changing data, while maintaining fast access times even as the storage workload grows. For example, an e-commerce website can store session state and shopping carts within an IMDG, and a financial services application can store stock portfolios. In both cases, stored data must be frequently updated and accessed.

Data storage needs can easily grow as more users store data within an IMDG. IMDGs accommodate this growth by adding servers to the cluster and automatically rebalancing stored data across the servers. This ensures that both capacity and throughput grow linearly with the size of the workload and that access and update times remain low regardless of the workload’s size.

Moreover, IMDGs maintain stored data with high availability using data replication. They typically create one or more replicas of each data object on different servers so that they can continue to access all stored data even after a server (or network component) fails; they do not have to pause to recreate data after a failure. IMDGs also self-heal to automatically create new replicas during recovery. All of this is critically important to operational systems which must continuously handle access and update requests without delay.

IMDGs Add Data-Parallel Computation for Analytics

Because IMDGs store data in memory distributed across a cluster of servers, they easily can perform data-parallel computations on stored data; they simply make use of the cluster’s processing power to analyze data “in place,” that is, without the need to migrate it to other servers. This enables IMDGs to provide fast results with minimum overhead. For example, a recent demonstration of ScaleOut hServer running a MapReduce calculation for a financial services application generated analysis results in about 330 milliseconds compared to 15+ seconds for Apache Hadoop.

A significant aspect of the IMDG’s architecture for data analytics is that it performs its computations on data hosted in memory – not on an incoming data stream. This memory-based storage is continuously updated by an incoming data stream, so the computation has access to the latest changes to the data. However, the computation also has access to the history of changes as manifested by the state of the data stored in the grid. This gives the computation a much richer data set for performing an analysis than it would have if it could only see the incoming data stream. We call it “stateful” real-time analytics.

Take a look at the following diagram, which illustrates the architecture for ScaleOut Analytics Server and ScaleOut hServer. The diagram shows a stream of incoming changes which are applied to the grid’s memory-based data store using API updates. The real-time analytics engine performs data parallel computation on the stored data, combines the results across the cluster, and outputs a combined stream of alerts to the operational system.

The power of stateful analytics is that the computation can provide deeper insights than otherwise. For example, an e-commerce website can analyze not just browser actions but also interpret these actions in terms of a history of customer preferences and shopping history to offer feedback. Likewise, a financial services application can analyze market price fluctuations to determine trading strategies based on the trading histories for individual portfolios tuned after several trades and influenced by preferences.

Comparison to Spark

The Berkeley Spark project has developed a data-parallel execution engine designed to accelerate Hadoop MapReduce calculations (and add related operators) by staging data in memory instead of by moving it from disk to memory and back for each operator. Using this technique and other optimizations, it has demonstrated impressive performance gains over Hadoop MapReduce. This project’s stated goal (quoting from a tutorial slide deck from U.C. Berkeley’s amplab is to “extend the MapReduce model to better support two common classes of analytics apps: iterative algorithms (machine learning, graphs) [and] interactive data mining [and] enhance programmability: integrate into Scala programming language.”

A key new mechanism that supports Spark’s programming model is the resilient distributed dataset (RDD) to “allow apps to keep working sets in memory for efficient reuse.” They are “immutable, partitioned collections of objects created through parallel transformations.” To support fault tolerance, “RDDs maintain lineage information that can be used to reconstruct lost partitions.”

You can see the key differences between using an IMDG hosting data-parallel computation and Spark to perform MapReduce and similar analyses. IMDGs analyze updatable, highly available, memory-based collections of objects, and this makes them ideal for operational environments in which data is being constantly updated even while analytics computations are ongoing. In contrast, Spark was designed to create, analyze, and transform immutable collections of data hosted in memory. This makes Spark ideal for optimizing the execution of a series of analytics operators.

The following diagram illustrates Spark’s use of memory-hosted RDDs to hold data accessed by its analytics engine:

However, Spark is not well suited to operational environments for two reasons. First, data cannot be updated. In fact, if Spark inputs data from HDFS, changes have to propagated to HDFS from another data source since HDFS files only can be appended, not updated. Second, RDDs are not highly available. Their fault-tolerance results from reconstructing them from their recorded lineage, which may take substantially more time to complete than server failover by an IMDG. This represents an appropriate tradeoff for Spark because, unlike IMDGs, it focuses on analytics computations on data that does not need to be constantly available.

Even though Spark makes different design tradeoffs than IMDGs to support fast analytics, IMDGs can still deliver comparable speedup over Hadoop. For example, we measured Apache Spark executing the well-known Hadoop “word count” benchmark on a 4-server cluster running 9.6X faster than CDH5 Hadoop MapReduce for a 10 GB dataset hosted in HDFS. On this same benchmark, ScaleOut hServer ran 14X faster than Hadoop when executing standard Java MapReduce code.

What about Spark Streaming?

Spark Streaming extends Spark to handle streams of input data and was motivated by the need to “process large streams of live data and provide results in near-real-time” (quoting from the slide deck referenced above). It “run[s] a streaming computation as a series of very small, deterministic batch jobs” by chopping up an input stream into a sequence of RDDs which it feeds to Spark’s execution engine. “The processed results of the RDD operations are returned in batches.” Computations can create or update other RDDs in memory which hold information regarding the state or history of the stream.

The representation of input and output streams as RDDs can be illustrated as follows:

This model of computation overcomes Spark’s basic limitation of working only on immutable data. Spark Streaming offers stateful operators that enable incoming data to be combined with in-memory state. However, it employs a distinctly stream-oriented approach with parallel operators that does not match the typical, object-oriented usage model of asynchronous, individual updates to memory-based objects implemented by IMDGs for operational environments. It also uses Spark’s fault-tolerance which does not support high availability for individual objects.

For example, IMDGs apply incoming changes to individual objects within a stateful collection by using straightforward object updates, and they simultaneously run data-parallel operations on the collection as a whole to perform analytics. We theorize that when using Spark Streaming, the same computation would require that each collection of updates represented by an incoming RDD be applied to the appropriate subset of objects within another “stateful” RDD held in memory. This in turn would require that the two RDDs be aligned to perform a parallel operation, which could add complexity to the original algorithm, especially if updates need to be applied to more than one object in the stateful collection. Also, fault-tolerance might require checkpointing to disk since the collection’s lineage could grow lengthy over time.

Summing Up

IMDGs offer a platform for scalable, memory-based storage and data-parallel computation which was specifically designed for use in operational systems, such as the ones we looked at above. Because it incorporates API support for accessing and updating individual data objects with integrated high availability, IMDGs are easily integrated into the business logic of these systems. Although Spark and Spark Streaming, with their use of memory-based storage and accelerated MapReduce execution times, bear a resemblance to IMDGs such as ScaleOut hServer, they were not intended for use in operational systems and do not provide the feature set needed to make this feasible. We will take a look at how IMDGs differ from Storm and CEP in an upcoming blog.

The post How Do In-Memory Data Grids Differ from Spark? appeared first on ScaleOut Software.

]]>