The post Using Real-Time Digital Twins for Aggregate Analytics appeared first on ScaleOut Software.
]]>
When analyzing telemetry from a large population of data sources, such as a fleet of rental cars or IoT devices in “smart cities” deployments, it’s difficult if not impossible for conventional streaming analytics platforms to track the behavior of each individual data source and derive actionable information in real time. Instead, most applications just sift through the telemetry for patterns that might indicate exceptional conditions and forward the bulk of incoming messages to a data lake for offline scrubbing with a big data tool such as Spark.
Maintain State Information for Each Data Source
An innovative new software technique called “real-time digital twins” leverages in-memory computing technology to turn the Lambda model for streaming analytics on its head and enable each data source to be independently tracked and responded to in real time. A real-time digital twin is a software object that encapsulates dynamic state information for each data source combined with application-specific code for processing incoming messages from that data source. This state information gives the code the context it needs to assess the incoming telemetry and generate useful feedback within 1-3 milliseconds.
For example, suppose an application analyzes heart rate, blood pressure, oxygen saturation, and other telemetry from thousands of people wearing smart watches or medical devices. By holding information about each user’s demographics, medical history, medications, detected anomalies, and current activity, real-time digital twins can intelligently assess this telemetry while updating their state information to further refine their feedback to each person. Beyond just helping real-time digital twins respond more effectively in the moment, maintaining context improves feedback over time.
Use State Information for Aggregate Analytics
State information held within real-time digital twins also provides a repository of significant data that can be analyzed in aggregate to spot important trends. With in-memory computing, aggregate analysis can be performed continuously every few seconds instead of waiting for offline analytics in a data lake. In this usage, relevant state information is computed for each data source and updated as telemetry flows in. It is then repeatedly extracted from all real-time digital twins and aggregated to highlight emerging patterns or issues that may need attention. This provides a powerful tool for maximizing overall situational awareness.
Consider an emergency monitoring system during the COVID-19 crisis that tracks the need for supplies across the nation’s 6,100+ hospitals and attempts to quickly respond when a critical shortage emerges. Let’s assume all hospitals send messages every few minutes to this system running in a central command center. These messages provide updates on various types and amounts of shortages (for example, of PPE, ventilators, and medicines) that the hospitals need to quickly rectify. Using state information, a real-time digital twin for each hospital can both track and evaluate these shortages as they evolve. It can look at key indicators, such as the relative importance of each supply type and the rate at which the shortages are increasing, to create a dynamic measure of urgency that the hospital receive attention from the command center. All of this data is continuously updated within the real-time digital twin as messages arrive to give personnel the latest status.
Aggregate analysis can then compare this data across all hospitals by region to identify which regions have the greatest immediate need and track how fast and where overall needs are evolving. Personnel can then query state information within the real-time digital twins to quickly determine which specific hospitals should receive supplies and what specific supplies should be immediately delivered to them. Using real-time digital twins, all of this can be accomplished in seconds or minutes.
This analysis flow is illustrated in the following diagram:
As this example shows, real-time digital twins provide both a real-time filter and aggregator of the data stream from each data source to create dynamic information that is continuously extracted for aggregate analysis. Real-time digital twins also track detailed information about the data source that can be queried to provide a complete understanding of evolving conditions and enable appropriate action.
Numerous Applications Need Real-Time Monitoring
This new paradigm for streaming analytics can be applied to numerous applications. For example, it can be used in security applications to assess and filter incoming telemetry (such as likely false positives) from intrusion sensors and create an overall likelihood of a genuine threat from a given location within a large physical or cyber system. Aggregate analysis combined with queries can quickly evaluate the overall threat profile, pinpoint the source(s), and track how the threat is changing over time. This information enables personnel to assess the strategic nature of the threat and take the most effective action.
Likewise, disaster-recovery applications can use real-time digital twins to track assets needed to respond to emergencies, such as hurricanes and forest fires. Fleets of rental cars or trucks can use real-time digital twins to track vehicles and quickly identify issues, such as lost drivers or breakdowns. IoT applications can use real-time digital twins to implement predictive analytics for mission-critical devices, such as medical refrigerators. The list goes on.
Summing Up: Do More in Real Time
Conventional streaming analytics only attempt to perform superficial analysis of aggregated data streams and defer the bulk of analysis to offline processing. Because of their ability to maintain dynamic, application-specific information about each data source, real-time digital twins offer breathtaking new capabilities to track thousands of data sources in real time, provide intelligent feedback, and combine this with immediate, highly focused aggregate analysis. By harnessing the scalable power of in-memory computing, real-time digital twins are poised to usher in a new era in streaming analytics.
We invite you to learn more about the ScaleOut Digital Twin Streaming Service, which is available for evaluation and production use today.
The post Using Real-Time Digital Twins for Aggregate Analytics appeared first on ScaleOut Software.
]]>The post How Do In-Memory Data Grids Differ from Storm? appeared first on ScaleOut Software.
]]>(The following description of in-memory data grids (IMDGs) is excerpted from last week’s blog post; see that post for more details.)
IMDGs host data in memory and distribute it across a cluster of commodity servers. Using an object-oriented data storage model, they provide APIs for updating data objects typically in well under a millisecond (depending on the size of the object). This enables operational systems to use IMDGs for storing, accessing, and updating fast-changing, “live” data, while maintaining fast access times even as the storage workload grows.
Data storage needs can easily grow as more users store data within an IMDG. IMDGs accommodate this growth by adding servers to the cluster and automatically rebalancing stored data across the servers. This ensures that both capacity and throughput scale linearly with growth in the workload, and access and update times remain low regardless of the workload’s size. Moreover, IMDGs maintain stored data with high availability using data replication so that if a server fails, operational systems can continuously handle access requests and update requests without delay.
IMDGs Perform Data-Parallel Computation
Because IMDGs store data in memory distributed across a cluster of servers, they easily can perform data-parallel computations on stored data; they simply make use of the cluster’s processing power to analyze data “in place,” that is, without the need to migrate it to other servers. This enables IMDGs to provide fast results (often in milliseconds) with minimal overhead.
The following diagram of the architecture used by ScaleOut Analytics Server and ScaleOut hServer shows a stream of incoming changes which are applied to the grid’s memory-based data store using API updates. The real-time analytics engine performs data parallel computation on stored data, combines the results across the cluster, and outputs a combined stream of alerts to the operational system.
A significant aspect of the IMDG’s architecture for data analytics is that it performs computations on data hosted in memory – not specifically on an incoming data stream. This memory-based storage is continuously updated by an incoming data stream, so the computation has access to the latest changes to the data. However, the computation also has access to the history of changes manifested by the current state of data stored in the grid. This gives the computation a rich data set for analysis that includes both the incoming data stream and the application’s persistent state.
Storm originally was developed by Nathan Marz at Backtype to overcome the limitations of Hadoop in analyzing streams of incoming data, such as Twitter streams and web log files. Its goal was to provide real-time, continuous computation that is both scalable and fault tolerant. Described both as stream processing and event processing, its computation model incorporates a combination of task parallelism and pipelining. The developer describes two basic entities: “spouts,” which generate streams of data in the form of ordered tuples, and “bolts,” which process incoming streams and optionally generate outgoing streams for other bolts. Spouts and bolts are organized into an acyclic, directed graph to create an executable configuration. (See this slide deck, among many available, for a more detailed explanation.)
The following diagram illustrates a Storm configuration of streams and bolts processing a set of input streams and generating a set of output streams. The green circles represent tuples within an input stream, and the blue boxes represent bolts. Note that spouts which generate the input streams are not shown in the diagram. The orange circles represent an optional output data stream, which may be implemented by the bolts in an arbitrary manner (e.g., as API calls to an external agent instead of as a stream of tuples).
Application developers specify several aspects of the configuration, such as the number of tasks that can be spawned to execute each bolt, and the manner in which an incoming stream’s tuples are distributed across these tasks. Various groupings implement characteristics that correspond to behaviors found in Hadoop MapReduce. For example, the shuffle grouping implements a random distribution of tuples to tasks akin to input to mappers, and the field grouping implements a key-based partitioning very close to that used as input to reducers. Other groupings also are available, such as “all,” which is equivalent to multicast.
Storm implements and executes a specified configuration using a hierarchy of nodes whose state and fault-tolerance are maintained by the open-source Zookeeper cluster manager. A master node (called Nimbus) manages a set of worker nodes (called Supervisors), which run tasks. Strategies are available to handle failures of each of these components and to ensure that stream tuples are reliably processed.
A major strength of Storm is its continuous execution model. Once a configuration has been deployed, incoming data streams can be processed without scheduling delays, thereby providing uninterrupted, real-time results. This overcomes a major drawback of Hadoop MapReduce, which processes data in batch jobs with significant latency (often 15+ seconds) in starting up each job.
IMDGs approximate Storm’s continuous execution model in two ways. First they allow continuous, overlapped updates to in-memory state, enabling them to handle high arrival rates of incoming data (e.g., 1000s of updates per second for each IMDG server in a cluster). Both IMDGs and Storm scale out to increase throughput. Second, some IMDGs allow data-parallel operations to be performed continuously with very low startup delay (typically a few milliseconds). This allows IMDGs to output a stream of analysis results that matches the low latency required by operational systems. (Unlike Storm, IMDGs such as ScaleOut hServer also precisely match Hadoop’s MapReduce semantics, which require that reducers be able to process all key-value pairs emitted by the mappers in a given computation.)
Storm’s data model describes a set of tuple streams. Bolts analyze and filter these streams, creating new streams to hold their results. While bolts are unconstrained in their ability to access and update external stores, such as IMDGs or file-based NoSQL stores (e.g., Mongo DB or Cassandra), this is not a central aspect of their processing model. Put another way, Storm does not provide any particular semantics for managing stateful data.
In contrast, IMDGs are organized around a stateful data model implemented by an object-oriented, in-memory store which is both scalable and highly available. This store is intended to hold ongoing, business-logic state implemented by collections of objects representing fast-changing data used in operational environments. In previous blog posts, we have seen examples in e-commerce (e.g., session-state and shopping carts) and financial services (e.g., portfolios and stock histories). Incoming data streams update these entities, which hold information that persists and evolves over their lifetimes. Making these entities “first class” citizens in the computation model simplifies the design of business logic while enabling stream processing using a combination of object-oriented updates and data-parallel computation to both modify and analyze this state.
Where IMDGs and Storm really differ is in their approaches to managing the complexity of the computation model. Like Microsoft Dryad and other parallel execution platforms with task precedence graphs, Storm defines a computation using a directed graph of execution nodes, each of which has a variable number of tasks. While the modular nature of an execution pipeline has appeal, its complexity can quickly become daunting. One reason for this is that the configuration’s graph is represented by sequential code describing bolts and the streams to which they are connected. As the number of bolts and streams grows, it becomes increasingly difficult to visualize their relationships and grasp the application’s overall behavior.
Other parallel systems like Storm with task precedence graphs, such as messaging passing systems and actor models, have demonstrated substantial complexity over the last few decades. Also, the Storm application developer must specify the number of tasks executed by each bolt. As the number of bolts and streams increase, it becomes challenging for the developer to manage the graph, predict the dynamics of its execution, and tune for best performance.
A central reason that IMDGs employ a data-parallel computation model is its simplicity, both in exposition and execution. (Another key reason is that data-parallel computation minimizes data motion which limits scalability. Storm’s data motion between bolts may incur more network overhead than IMDGs and impact scalability, but we have not evaluated this.) Since their application code is inherently straightforward, data-parallel programs are relatively easy to understand, and they don’t need extensive tuning for high performance. Also, separating updates to business logic state from data-parallel analytics simplifies integration into operational systems.
IMDGs offer a platform for scalable, memory-based storage and data-parallel computation which was specifically designed for use in operational systems. Because it incorporates API support for accessing and updating individual data objects and data-parallel analytics, IMDGs are easily integrated into the business logic of these systems.
Storm was designed for a different purpose, namely to analyze streams of data using a continuously running execution pipeline. Its more complex computation model fits this purpose well, and, as a result, Storm embodies a different set of tradeoffs than IMDGs. Clearly, the term “real-time analytics” encompasses a variety of solutions designed to meet diverse business requirements.
The post How Do In-Memory Data Grids Differ from Storm? appeared first on ScaleOut Software.
]]>The post How Do In-Memory Data Grids Differ from Spark? appeared first on ScaleOut Software.
]]>IMDGs host data in memory and distribute it across a cluster of commodity servers. Using an object-oriented data storage model, they provide APIs for updating data objects typically in well under a millisecond (depending on the size of the object). This enables operational systems to use IMDGs for storing, accessing, and updating fast-changing data, while maintaining fast access times even as the storage workload grows. For example, an e-commerce website can store session state and shopping carts within an IMDG, and a financial services application can store stock portfolios. In both cases, stored data must be frequently updated and accessed.
Data storage needs can easily grow as more users store data within an IMDG. IMDGs accommodate this growth by adding servers to the cluster and automatically rebalancing stored data across the servers. This ensures that both capacity and throughput grow linearly with the size of the workload and that access and update times remain low regardless of the workload’s size.
Moreover, IMDGs maintain stored data with high availability using data replication. They typically create one or more replicas of each data object on different servers so that they can continue to access all stored data even after a server (or network component) fails; they do not have to pause to recreate data after a failure. IMDGs also self-heal to automatically create new replicas during recovery. All of this is critically important to operational systems which must continuously handle access and update requests without delay.
IMDGs Add Data-Parallel Computation for Analytics
Because IMDGs store data in memory distributed across a cluster of servers, they easily can perform data-parallel computations on stored data; they simply make use of the cluster’s processing power to analyze data “in place,” that is, without the need to migrate it to other servers. This enables IMDGs to provide fast results with minimum overhead. For example, a recent demonstration of ScaleOut hServer running a MapReduce calculation for a financial services application generated analysis results in about 330 milliseconds compared to 15+ seconds for Apache Hadoop.
A significant aspect of the IMDG’s architecture for data analytics is that it performs its computations on data hosted in memory – not on an incoming data stream. This memory-based storage is continuously updated by an incoming data stream, so the computation has access to the latest changes to the data. However, the computation also has access to the history of changes as manifested by the state of the data stored in the grid. This gives the computation a much richer data set for performing an analysis than it would have if it could only see the incoming data stream. We call it “stateful” real-time analytics.
Take a look at the following diagram, which illustrates the architecture for ScaleOut Analytics Server and ScaleOut hServer. The diagram shows a stream of incoming changes which are applied to the grid’s memory-based data store using API updates. The real-time analytics engine performs data parallel computation on the stored data, combines the results across the cluster, and outputs a combined stream of alerts to the operational system.
The power of stateful analytics is that the computation can provide deeper insights than otherwise. For example, an e-commerce website can analyze not just browser actions but also interpret these actions in terms of a history of customer preferences and shopping history to offer feedback. Likewise, a financial services application can analyze market price fluctuations to determine trading strategies based on the trading histories for individual portfolios tuned after several trades and influenced by preferences.
The Berkeley Spark project has developed a data-parallel execution engine designed to accelerate Hadoop MapReduce calculations (and add related operators) by staging data in memory instead of by moving it from disk to memory and back for each operator. Using this technique and other optimizations, it has demonstrated impressive performance gains over Hadoop MapReduce. This project’s stated goal (quoting from a tutorial slide deck from U.C. Berkeley’s amplab is to “extend the MapReduce model to better support two common classes of analytics apps: iterative algorithms (machine learning, graphs) [and] interactive data mining [and] enhance programmability: integrate into Scala programming language.”
A key new mechanism that supports Spark’s programming model is the resilient distributed dataset (RDD) to “allow apps to keep working sets in memory for efficient reuse.” They are “immutable, partitioned collections of objects created through parallel transformations.” To support fault tolerance, “RDDs maintain lineage information that can be used to reconstruct lost partitions.”
You can see the key differences between using an IMDG hosting data-parallel computation and Spark to perform MapReduce and similar analyses. IMDGs analyze updatable, highly available, memory-based collections of objects, and this makes them ideal for operational environments in which data is being constantly updated even while analytics computations are ongoing. In contrast, Spark was designed to create, analyze, and transform immutable collections of data hosted in memory. This makes Spark ideal for optimizing the execution of a series of analytics operators.
The following diagram illustrates Spark’s use of memory-hosted RDDs to hold data accessed by its analytics engine:
However, Spark is not well suited to operational environments for two reasons. First, data cannot be updated. In fact, if Spark inputs data from HDFS, changes have to propagated to HDFS from another data source since HDFS files only can be appended, not updated. Second, RDDs are not highly available. Their fault-tolerance results from reconstructing them from their recorded lineage, which may take substantially more time to complete than server failover by an IMDG. This represents an appropriate tradeoff for Spark because, unlike IMDGs, it focuses on analytics computations on data that does not need to be constantly available.
Even though Spark makes different design tradeoffs than IMDGs to support fast analytics, IMDGs can still deliver comparable speedup over Hadoop. For example, we measured Apache Spark executing the well-known Hadoop “word count” benchmark on a 4-server cluster running 9.6X faster than CDH5 Hadoop MapReduce for a 10 GB dataset hosted in HDFS. On this same benchmark, ScaleOut hServer ran 14X faster than Hadoop when executing standard Java MapReduce code.
Spark Streaming extends Spark to handle streams of input data and was motivated by the need to “process large streams of live data and provide results in near-real-time” (quoting from the slide deck referenced above). It “run[s] a streaming computation as a series of very small, deterministic batch jobs” by chopping up an input stream into a sequence of RDDs which it feeds to Spark’s execution engine. “The processed results of the RDD operations are returned in batches.” Computations can create or update other RDDs in memory which hold information regarding the state or history of the stream.
The representation of input and output streams as RDDs can be illustrated as follows:
This model of computation overcomes Spark’s basic limitation of working only on immutable data. Spark Streaming offers stateful operators that enable incoming data to be combined with in-memory state. However, it employs a distinctly stream-oriented approach with parallel operators that does not match the typical, object-oriented usage model of asynchronous, individual updates to memory-based objects implemented by IMDGs for operational environments. It also uses Spark’s fault-tolerance which does not support high availability for individual objects.
For example, IMDGs apply incoming changes to individual objects within a stateful collection by using straightforward object updates, and they simultaneously run data-parallel operations on the collection as a whole to perform analytics. We theorize that when using Spark Streaming, the same computation would require that each collection of updates represented by an incoming RDD be applied to the appropriate subset of objects within another “stateful” RDD held in memory. This in turn would require that the two RDDs be aligned to perform a parallel operation, which could add complexity to the original algorithm, especially if updates need to be applied to more than one object in the stateful collection. Also, fault-tolerance might require checkpointing to disk since the collection’s lineage could grow lengthy over time.
IMDGs offer a platform for scalable, memory-based storage and data-parallel computation which was specifically designed for use in operational systems, such as the ones we looked at above. Because it incorporates API support for accessing and updating individual data objects with integrated high availability, IMDGs are easily integrated into the business logic of these systems. Although Spark and Spark Streaming, with their use of memory-based storage and accelerated MapReduce execution times, bear a resemblance to IMDGs such as ScaleOut hServer, they were not intended for use in operational systems and do not provide the feature set needed to make this feasible. We will take a look at how IMDGs differ from Storm and CEP in an upcoming blog.
The post How Do In-Memory Data Grids Differ from Spark? appeared first on ScaleOut Software.
]]>