data-parallel Archives - ScaleOut Software
In-Memory Data Grids for the Enterprise

The Power of an In-Memory Data Grid for Hosting Digital Twin Models
(May 23, 2018)

As we saw in a previous blog, the digital twin model popularized by Gartner provides a conceptual breakthrough for stateful stream-processing and opens the door to a new way of analyzing event streams. Whether the application is in IoT, financial services, logistics, medical systems, or mobile apps, this model offers a new technique enabling deeper introspection on streaming data than previously possible.

The secret to the power of the digital twin model is its focus on organizing real-time data by the data source to which it corresponds. What this means is that all event messages from each data source are correlated into a time-ordered collection, which is associated with both dynamic state information and historical knowledge of that data source. This gives the stream-processing application a rich context for analyzing event messages and determining what actions need to be taken in real time.

For example, consider an application which tracks a rental car fleet to look for drivers who are lost or driving recklessly. This application can use the digital twin model to correlate real-time telemetry (e.g., location, speed) for each car in the fleet and combine that with data about the driver’s contract, safety record with the rental car company, and possibly driving history. Instead of just examining the latest incoming events, the application now has much more information instantly available to judge when and whether to signal an alert.

When implementing a stateful stream-processing application using the digital twin model, an object-oriented approach offers obvious leverage in managing the data and event processing associated with this model. For each type of data source, the developer can create a data type (object “class”) which describes the event collection, real-time state data, and analysis code to be executed when a new event message arrives. Instances of this data type then can be created for each unique data source as its digital twin for stream-processing.

[Figure: a digital twin object and its associated methods]
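To make this concrete, here is a minimal C# sketch of what such a class could look like for the rental car example discussed below (the type and member names are illustrative assumptions, not ScaleOut APIs):

    using System;
    using System.Collections.Generic;

    // Hypothetical telemetry message sent by one rental car.
    public class TelemetryEvent
    {
        public DateTime Timestamp { get; set; }
        public double Speed { get; set; }        // current speed (mph)
        public double SpeedLimit { get; set; }   // posted limit at current location
    }

    // Hypothetical digital twin for one rental car; one instance is created
    // per data source and holds its correlated events and dynamic state.
    public class RentalCarTwin
    {
        public string VehicleId { get; set; }    // identifying key for this data source
        public string Region { get; set; }       // used later for aggregate queries

        // Time-ordered collection of event messages from this car.
        public List<TelemetryEvent> Events { get; } = new List<TelemetryEvent>();

        // Dynamic state updated as events arrive.
        public double MaxExcessiveSpeed { get; private set; }

        // Analysis code executed when a new event message arrives.
        public void ProcessEvent(TelemetryEvent e)
        {
            Events.Add(e);
            double excess = e.Speed - e.SpeedLimit;
            if (excess > MaxExcessiveSpeed)
                MaxExcessiveSpeed = excess;
            if (excess > 15.0)                   // alert threshold is arbitrary here
                Console.WriteLine($"ALERT: {VehicleId} exceeding limit by {excess:F0} mph");
        }
    }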

This object-oriented approach makes an in-memory data grid (IMDG) with integrated in-memory computing (e.g., ScaleOut StreamServer™) an excellent stream-processing platform for real-time digital twin models. IMDGs have a long history dating back almost two decades. They were originally created as middleware software to host large populations of fast-changing data objects, such as e-commerce session-state and shopping carts, in memory instead of database servers, thereby speeding up access. An IMDG implements a software-based, key-value store of serialized objects that spans a cluster of commodity servers (or cloud instances). Its architecture provides cost-effective scalability and high availability, while hiding the complexity of distributed in-memory storage from the applications that use it. It also can take full advantage of a cluster’s computing power to run application code within the IMDG — where the data lives — to maximize performance and avoid network bottlenecks.

Using an example from financial services, the following diagram illustrates an application’s logical view of an IMDG as a collection of objects (stock sectors in this case) and its physical implementation as in-memory storage spanning a cluster of servers:

[Figure: logical and physical views of an IMDG]

It should now be clear why IMDGs offer a great way to host digital twin models and perform stateful stream-processing. An IMDG can store instances of a digital twin model and scale its storage capacity and event-processing throughput by adding servers to hold many thousands of instances as needed for all the data sources. The grid’s key-value storage model automatically correlates incoming event messages with the corresponding digital twin instance based on the data source’s identifying key. When the grid delivers an event to its digital twin, it then runs the model’s application code for event ingestion, analysis, and alerting in real time.

As an example, the following diagram illustrates the use of an IMDG to host digital twin objects for rental cars. The arrows indicate that the IMDG directs events from each car to its corresponding digital twin for event-processing.

As part of event-processing, digital twins can create alerts for human attention and/or feedback directed at the corresponding data source. In addition, the collection of digital twin objects stored in the IMDG can be queried or analyzed using data-parallel techniques (e.g., MapReduce) to extract important aggregate patterns and trends. For example, the rental car application could alert managers when a driver repeatedly exceeds the speed limit according to criteria specific to the driver’s age and driving history. It also could allow a manager to query the status of a specific car or compute the maximum excessive speeding for all cars in a specified region.

[Figure: data flows for alerts, feedback, and data-parallel queries across digital twins in the IMDG]
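As a rough sketch of such an aggregate query, here is how “maximum excessive speeding for all cars in a specified region” might be computed over the collection of RentalCarTwin objects sketched earlier. PLINQ stands in for the grid’s data-parallel engine here; ScaleOut’s actual MapReduce and query APIs differ:

    using System.Collections.Generic;
    using System.Linq;

    public static class FleetQueries
    {
        // Data-parallel aggregate over the twin collection. In an IMDG this
        // computation would run on each grid server "where the data lives";
        // PLINQ on a local collection is used only to illustrate the shape.
        public static double MaxExcessiveSpeeding(
            IEnumerable<RentalCarTwin> twins, string region)
        {
            return twins.AsParallel()
                        .Where(t => t.Region == region)
                        .Select(t => t.MaxExcessiveSpeed)
                        .DefaultIfEmpty(0.0)
                        .Max();
        }
    }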

Because they are hosted in memory, digital twin models can react very quickly to incoming events, and the IMDG can scale by adding servers to keep event processing times fast even when the number of instances (objects) and/or event rates become very large. Although the in-memory state of a digital twin is restricted to the event and state data needed for real-time processing, the application can reference historical data from external database servers to broaden its context, as shown below. For example, the rental car application could access driving history only when incoming telemetry indicates a need for it. It also could store past events in a database for archival purposes.

What makes an IMDG an excellent fit for stateful stream-processing is its ability to transparently host both the state information and application code within a fast, highly scalable, in-memory computing platform and then automatically direct incoming events to their respective digital twin instances within the grid for processing. These two key capabilities, namely correlating events by data sources and analyzing these events within the context of the digital twin’s real-time state information, give applications powerful new tools and create an exciting new way to think about stateful stream-processing.

AppFabric Caching: What Now?
(November 12, 2015)

Posted by Chris Villinger, VP Business Development & Marketing

Eighteen months ago we posted a blog on the performance and feature shortcomings of Microsoft’s Windows Server AppFabric (WSAF) Caching. Since then, much has transpired. Microsoft announced earlier this year that it will end support for Windows Server AppFabric 1.1 by April 2017. AppFabric Caching users now have to determine the right next step in migrating to an alternative distributed cache.

Recommended alternatives found lacking

With its “mobile first, cloud first” strategy, it appears that Microsoft is pushing customers to its Microsoft Azure cloud platform by recommending that all Microsoft AppFabric customers using Cache “move to Microsoft Azure Redis Cache.” However, for many customers it is not yet practical to move to Azure, and a fully supported, on-premise solution is required for their distributed cache. Microsoft’s recommendation that customers move to Redis is both controversial and misleading, since the Redis open-source community does not recommend running on Windows.

Further, Redis is an in-memory database which works well on a single server but lacks many of the scalability and high-availability features of a mature, fully featured in-memory data grid, let alone real-time analytics and computing capabilities. This leaves customers with an uncertain and commercially unsupported future when migrating their on-premise, .NET-compatible distributed cache away from AppFabric Caching.

A better alternative

Luckily, ScaleOut Software has been loyally serving the .NET developer community with an industry-leading in-memory data grid for over a decade. ScaleOut’s architectural design philosophy focuses on delivering high performance with maximum ease-of-use. It employs a single, coherent architecture for integrating scalability and high availability; this architecture is transparently leveraged in all aspects of its in-memory data grid. We call this “scalable, highly available everything” — the platform goes beyond linear performance scaling for accessing grid objects and uses a common architecture for all features such as distributed locking, event processing, load-balancing, geographic replication, parallel computation and backup/restore.

The key benefits and advantages that set ScaleOut StateServer® apart from its industry peers are:

Peer-to-peer architecture:

  • ScaleOut StateServer uses a peer-to-peer architecture to avoid single points of failure and maximize ease of use. It avoids the need for a centralized configuration store (which is a single point of failure) by automatically replicating its configuration files on every cluster host. System administrators do not need to create and manage a configuration store; this is automatically handled by ScaleOut StateServer.
  • ScaleOut StateServer’s peer-to-peer architecture allows servers to be easily added and removed from the membership. Servers automatically form a new membership and rebalance the workload as needed. The membership self-heals after a server failure, restoring redundancy for high availability and redistributing the workload as necessary.

Ease-of-use:

  • ScaleOut StateServer’s acclaimed ease-of-use ensures that installation is extremely simple with minimum configuration steps. Unlike AppFabric Caching, it doesn’t require a highly qualified Windows system administrator to navigate the process of installing and configuring a distributed cache.
  • ScaleOut StateServer’s centralized, GUI-based management console dramatically simplifies management and provides important capabilities, such as cluster-wide control, dynamic performance charting, alerts, and a grid heat-map. Holistic visualization delivers at-a-glance monitoring, which is not possible with PowerShell and other text-based scripting approaches.
  • ScaleOut Object Browser enables detailed introspection on the contents of the distributed cache, including access to values for object properties.

Extended functionality:

ScaleOut StateServer (and its product extensions) go far beyond AppFabric Caching’s basic capabilities and add important functionality, including:

  • Distributed, property-based query using Microsoft LINQ (see the sketch following this list)
  • Numerous API extensions, such as support for object dependencies and sliding timeouts
  • Powerful object browser to directly browse data stored in the grid
  • Scalable event handling
  • Data-parallel computation, including the world’s first C# MapReduce
  • Extensible support for building server-based data structures
  • WAN-based data replication and synchronized, global data access
  • Integrated support for public clouds, including Amazon AWS and Microsoft Azure
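For a flavor of the property-based query capability, a LINQ query over grid objects might be shaped as follows. This is a hypothetical sketch: the way an IQueryable is obtained from ScaleOut StateServer is not shown, and the Order type is invented for illustration:

    using System;
    using System.Linq;

    // Invented type used only to illustrate a property-based query.
    public class Order
    {
        public string Status { get; set; }
        public DateTime DueDate { get; set; }
    }

    public static class QueryExample
    {
        // The IQueryable<Order> parameter stands in for the grid's LINQ
        // provider, which evaluates the query across the cluster.
        public static IQueryable<Order> OverdueOrders(IQueryable<Order> grid)
        {
            return grid.Where(o => o.Status == "Pending" &&
                                   o.DueDate < DateTime.Now);
        }
    }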

Furthermore, unlike Redis and some of the other open-source alternatives, ScaleOut StateServer is fully commercially supported by ScaleOut Software for use on Windows (or Linux).

Replacement and migration options

Customers have two paths to choose from when planning their migration off of AppFabric Caching to ScaleOut Software’s distributed cache: retain source-code compatibility for their existing legacy applications or migrate their applications to fully take advantage of native ScaleOut APIs.

Source-code compatible library

The ScaleOut Windows Server AppFabric (WSAF) Caching Compatibility Library is a source-code-compatible, drop-in replacement for Microsoft’s AppFabric Caching APIs. It allows existing customer applications using AppFabric to preserve the legacy AppFabric Caching API semantics and switch to ScaleOut StateServer without making any code changes, while continuing to use familiar PowerShell commands to manage the distributed cache. This library ships as part of ScaleOut StateServer release 5.4 and later, as described in the WSAF Caching Compatibility Library Reference.

Native ScaleOut APIs

Customers also can rewrite their applications to use ScaleOut StateServer’s native APIs, which allows applications to take full advantage of the extended functionality mentioned above. Using a hybrid approach, native ScaleOut StateServer APIs can be used alongside AppFabric APIs. We have developed a detailed technical AppFabric Caching migration guide to help developers through this process.

Using the WSAF Caching Compatibility Library

The WSAF Caching Compatibility Library is easy to integrate into applications that use AppFabric Caching’s APIs. Here is an outline of the required steps:

  1. Add the compatibility library’s assembly (soss_wsaf_compat.dll) as a reference to the project. This assembly can be found in the ScaleOut StateServer installation folder (typically C:\Program Files\ScaleOut_Software\StateServer) in the Compat\WSAF_Caching folder.
  2. Change the “using Microsoft.ApplicationServer.Caching;” statements in the source files to “using Soss.Compat.WSAF;”.
  3. Recompile the project to start using the WSAF Caching Compatibility Library.
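To illustrate step 2, typical AppFabric Caching client code compiles unchanged once the using directive is swapped. The cache name and key below are placeholders:

    // Before: using Microsoft.ApplicationServer.Caching;
    using Soss.Compat.WSAF;   // after the swap, the only source change needed

    public static class CacheClient
    {
        public static void Demo()
        {
            // The same AppFabric Caching API shapes, now backed by
            // ScaleOut StateServer through the compatibility library.
            var factory = new DataCacheFactory();
            DataCache cache = factory.GetCache("default");

            cache.Put("cart:42", "3 items");            // add or update
            string cart = (string)cache.Get("cart:42"); // retrieve
        }
    }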


For More Information

More information about the AppFabric Caching replacement and migration resources available can be found at www.scaleoutsoftware.com/appfabric, including a valuable offer for former AppFabric Caching users. We hope that the power of our platform as both a replacement for and an upgrade from Microsoft’s WSAF Caching has captured your interest. Whether you run on-premise or in the cloud, it may be the right next step for your application.

AppFabric Caching: Retry Later
(May 15, 2014)

We have spent a great deal of time at ScaleOut Software re-architecting the code base of our in-memory data grid (IMDG) to make best use of many cores and large memory. For example, the IMDG must be able to efficiently create millions of objects in each server to make use of its huge storage capacity. Likewise, object access paths must be heavily multi-threaded and avoid lock contention to minimize access latency and maximize throughput. Also, load-balancing after membership changes must be both multi-threaded and pipelined to drive the network at maximum bandwidth.

Given all this, we thought it would be a good opportunity to see how we are doing relative to the competition, and in particular, relative to Microsoft’s AppFabric caching for Windows on-premise servers. In addition to looking at performance differences, we also want to compare ScaleOut StateServer (SOSS) to AppFabric on qualitative measures, such as features, ease of installation, and management.

Testing Scale-Up Performance

Well, performance comparisons aren’t so easy since the AppFabric license agreement states: “You may not disclose the results of any benchmark tests of the software to any third party without Microsoft’s prior written approval.” So our comments will be confined to what testing we felt was valuable and how well SOSS performed.

In a recent customer engagement, the application needed to load 10M objects into each server of the IMDG’s cluster to make full use of high-end servers with 60GB memory capacity. Measuring object creation and access rates on a single server is a good test of the IMDG’s memory management and multithreading, and this is an area in which we have made several performance optimizations. Using a workload of random object sizes varying from 200B to 2KB, SOSS was able to load 2 million objects in 59.2 seconds and then sustain 18K read/update pairs per second to random objects. That’s actually quite fast. (We invite you to test AppFabric’s performance; contact us if you need a test application.)

We also looked at load-balancing and recovery times for this workload of 2M objects when adding a server to a 2-server cluster, removing the third server, and also just killing the third server. This measures how well the IMDG’s servers use multithreading to maximize network bandwidth during load-balancing, and it also evaluates failure detection and recovery algorithms. These are areas in which we have invested heavily to take advantage of 10 Gbps (and faster) networks and to handle intermittent network delays inherent in virtual server infrastructures. While handling an access load of 6K read/update pairs per second, SOSS was measured to complete load-balancing in less than 35 seconds for joins and leaves and also to complete recovery and restore full throughput in this amount of time after a server failure.

Unwanted Client Exceptions from AppFabric

We were surprised to discover that AppFabric throws exceptions to the client application during load-balancing and recovery and due to security issues, as described in other blog posts. During load-balancing, the client gets the following exception when accessing the cache:

ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. …)


You’ll also see a “Please retry later” exception (forever!) if the client runs with insufficient security authorization; we had to run the client as administrator to avoid these problems, rather than undertake a much deeper investigation. The client also throws these exceptions during “graceful” host shutdowns and during recovery.
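As a result, AppFabric client code typically has to wrap cache operations in retry logic along these lines (a sketch only; the retry policy shown is arbitrary):

    using System;
    using System.Threading;
    using Microsoft.ApplicationServer.Caching;

    public static class RetryHelper
    {
        // Sketch of the retry boilerplate an AppFabric client may need,
        // e.g., var value = WithRetry(() => cache.Get("myKey"));
        public static T WithRetry<T>(Func<T> cacheOp, int maxRetries = 5)
        {
            for (int attempt = 0; ; attempt++)
            {
                try
                {
                    return cacheOp();
                }
                catch (DataCacheException ex)
                {
                    if (ex.ErrorCode != DataCacheErrorCode.RetryLater ||
                        attempt >= maxRetries)
                        throw;
                    Thread.Sleep(100 * (attempt + 1)); // back off, then retry
                }
            }
        }
    }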

To keep application development as straightforward as possible, SOSS handles all exceptions that occur during membership changes and load-balancing, automatically retrying requests within the server as necessary. This avoids exposing the application to the details of the IMDG’s internal operations unless the IMDG is completely unreachable.

A Few Words on Design Philosophy: Keep It Simple

Our experience with customers consistently reinforces the need to keep middleware software as easy to install and use as possible, especially given that it is deployed as distributed software running on a cluster of servers. This philosophy manifests itself in all areas, including installation, application development, and cluster management. We believe that installing our software should be as straightforward as we can make it, requiring minimal knowledge of the host operating system and the fewest possible explicit configuration settings.

AppFabric caching does not share this design approach and requires the configuration of numerous components, including security policies, SQL Server or a network file share with the appropriate permissions, “lead” hosts, etc. Also, a long list of PowerShell commands for managing the cache must be learned and correctly used. Running AppFabric caching in production typically requires the use of a domain and is deeply tied into the Windows Server infrastructure. As expected, AppFabric requires the comprehensive knowledge of a Windows system administrator to install and manage.

In our testing, the net result was about a half-day investment in time on our part (and some frustration) getting AppFabric up and running. After spending an hour trying to join multiple cluster hosts in a Windows workgroup, we switched to using a domain controller to make this work; it just wasn’t worth the time to sort out the incorrect configuration settings. Head scratching was required in several other areas before our AppFabric cluster showed signs of life.

GUI Based Management Is Crucial for Distributed Software

A major source of frustration with AppFabric is its use of PowerShell commands to manage the cache cluster. It’s easy to forget that distributed software is running on multiple servers which need to be orchestrated as a group, and that’s hard to do with a sequence of shell commands because you can’t track the state of the cluster at a glance. It’s much easier to manage a cluster with a graphical user interface (GUI) which shows the status of all hosts and alerts you to dynamic situations, such as high usage, load-balancing, or network outages.

To take full advantage of the GUI approach, SOSS uses a Windows-based management console with intuitive controls that make management of the IMDG simple and easy to learn. The console also adds advanced visualization features, such as integrated performance charts and tabular usage charts, a “heat” map showing the availability and dynamic state of all regions within the distributed store, and an object browser that lets you see stored objects and examine both their metadata and contents.

The following screenshots illustrate SOSS’s performance charting and heat map. Note that the status of the IMDG and all cluster hosts is instantly visible in the tree list at the left:

[Screenshot: performance chart]

[Screenshot: heat map]

Because the GUI management console receives periodic updates from all cluster hosts, it stays tightly integrated with the cluster and dynamically updates the latest status. In contrast, the use of shell commands just gives you a snapshot of the state of the cluster at one instant in time. We also have observed that these results quickly can become out of sync with what client applications are actually experiencing. For example, a shell command can report that the cluster is in an unknown state when in fact the client is successfully completing access requests. (In AppFabric, be prepared to wait for several seconds for management commands to time out when a cluster host goes into an unknown state.)

Use Fully Peer-to-Peer Design for Simplicity and High Availability

AppFabric uses a single store, either a file share or SQL Server, to hold the cluster’s configuration, which adds complexity to installation and creates a single point of failure. Although SQL Server can be clustered, this adds even more cost than just using a single server. To avoid these problems, SOSS automatically replicates its configuration files on every cluster host to maximize availability with a fully peer-to-peer implementation. This approach also keeps the user from having to deal with managing a configuration store.

The peer-to-peer issue arises again in AppFabric’s requirement (presumably) of a majority of lead hosts to reconfigure the cluster after a membership change. Because some hosts are lead hosts and others are not, an AppFabric caching cluster can go down even when most hosts are healthy. Moreover, the administrator has to understand and ensure that a majority quorum of lead hosts exists, and the number of lead hosts varies with cluster size. For example, a small cluster of up to 20 hosts might use 3 lead hosts, requiring two of them to form a quorum. If 2 of the 3 lead hosts go down, the cluster will go down even if there are 18 healthy hosts.

With ScaleOut, you don’t need to know what the word “quorum” means. SOSS sidesteps these issues by using a fully peer-to-peer membership so that all hosts can participate in constructing the cluster membership. (SOSS makes use of a ScaleOut Software patent which eliminates the need for a majority hosts quorum by building a logical quorum on a uniform set of servers.) This means that SOSS can avoid the use of lead hosts and maintains service as long as any host survives. System administrators view all cluster hosts as peers and do not have to be aware of SOSS’s internal mechanisms for implementing the cluster membership.

Distributed Cache or In-Memory Data Grid?

It’s actually not clear to us whether AppFabric’s design philosophy regarding high availability is more closely aligned with “best effort” distributed caches like memcached or with mission-critical in-memory data grids like SOSS and others. Data replication is turned off by default and is explicitly set on a namespace-by-namespace basis using management commands. (High availability apparently is not available on Windows Server 2008R2 Standard Edition and requires Enterprise Edition, which can cost about $2,800 more per server, or you must upgrade to Windows Server 2012.) Also, since data is hosted in managed code, access delays due to garbage collection (as well as the exceptions noted above) are to be expected. (Microsoft recommends not storing more than 16GB in a cache host to avoid GC pauses.) Lastly, AppFabric’s client cache is not kept coherent with the distributed cache, and so the client cache cannot be used for transactional data.

In contrast, SOSS automatically replicates all stored objects on multiple hosts to maintain high availability at all times. Likewise, it uses an unmanaged heap for stored data to keep access times as predictable as possible and avoid GC pauses. It also keeps all client caches coherent with the IMDG so that multi-threaded code running on the cluster can coordinate access to transactional data using the well understood sequential consistency model.

In Summary: Design for Ease of Use and High Performance

It was tempting to write this blog post as a feature comparison between AppFabric and SOSS. It’s clear that, as a full in-memory data grid, SOSS — unlike AppFabric — incorporates many features that go well beyond distributed caching. For example, SOSS lets applications query stored data by property using Microsoft’s own LINQ, and ScaleOut Analytics Server (SOAS) can perform data-parallel analysis of queried objects using application code that SOAS automatically deploys on the cluster. ScaleOut hServer takes analytics a step further by hosting full Hadoop MapReduce on the IMDG.

That said, since AppFabric is targeted at distributed caching, we felt that AppFabric users likely are more focused on issues regarding deployment, performance, and availability. Beyond just evaluating how well products like AppFabric and SOSS extract performance from scale-up, it’s also important to examine how they stack up in their overall role of providing scalable, highly available in-memory storage.

When looking at other design approaches, we feel that ScaleOut’s philosophy of easy-to-use, fully peer-to-peer design with GUI-based management provides important dividends by simplifying deployment and maximizing visibility, while driving high performance and availability. Not surprisingly, all of this lowers the total cost of ownership, which — as we have seen — even for “free” software is never zero.

Reports of Scale-Out’s Demise Are Greatly Exaggerated
(May 4, 2014)

A recent blog post highlighted a Microsoft technical report which asserts that most Hadoop workloads are 100 GB or smaller, and for almost all workloads except the very largest “a single ‘scale-up’ server can process each of these jobs and do as well or better than a cluster in terms of performance, cost, power, and server density.”  It’s certainly true that Hadoop MapReduce seems to have focused more on clustering issues than on single-server optimizations. But — to paraphrase Mark Twain — reports of scale-out’s demise for all but the largest workloads are greatly exaggerated.

Scale-Up Versus Scale-Out

Scaling up is the process of migrating to an increasingly more powerful single server to process a workload faster or to handle a growing workload that fits within the server. Usually this means moving to a server with more CPU cores, greater memory capacity, and higher-end networking and storage options.

[Figure: scaling up on a single server]

At some point, scaling up becomes costly, and workloads grow beyond what even a high end server can handle. To overcome this limitation, scaling out distributes a workload across a cluster of servers working together. Now CPU, memory, and storage resources can grow without predefined limits. Unless network bandwidth also grows proportionally to the number of servers, it usually becomes the limiter in scaling. (Supercomputing systems use scalable networks such as torus interconnects; commodity clusters typically use Ethernet switches with fixed bandwidth.)

[Figure: scaling out across a cluster of servers]

There are several reasons why scaling out will continue to serve an important role, and to be sure, it’s more than the size of the workload that matters. First, scaling out enables computing capacity to be deployed incrementally and economically as the size of the workload increases using clusters of relatively small servers (instead of investing in a single, expensive multicore server to handle the highest anticipated workload). Second, scale-out’s ability to provide high availability will always be important to mission-critical applications, even if the problem size fits within one server. Third, technology changes so fast that today’s pricey, top-of-the-line server will quickly become tomorrow’s dusty objet d’art as it awaits recycling. (Cloud computing just changes that to a per-hour calculation – top-of-the-line servers may not be cost-effective.)

Moreover, in our experience, many analytics applications host data sets much larger than the 100 GB size Microsoft cites as a typical upper bound, and new applications drive this trend by taking advantage of newly available memory. (We often host data sets in the terabytes in our in-memory data grid.) These data sets require a cluster to keep in memory. For example, a recent data set used in an e-commerce analytics application held 40 million objects totaling about 2 TB of data including replicas. This data set could not be stored in one server on Amazon EC2, even using their largest instance type; it required a cluster of servers to hold the entire data set in memory. Also, it’s often not advisable to store such large data sets in the smallest possible cluster of servers; there are benefits to using a larger cluster of small servers (see below).

That said, it’s clear that scale-out infrastructures, including middleware execution platforms like Hadoop, need to evolve to make full use of large memory capacity and many cores available within each server. As Hadoop gained popularity over the last few years, software architects have been focused on efficiently scaling out with minimum overhead, and the Microsoft paper reminds us to rethink our scale-up algorithms and extract maximum value out of new technology as it enters the mainstream. For example, fully using available cores with multi-threading and minimizing latency for inter-process data transfers with memory-mapped files can squeeze more performance out of modern servers.

Finding the Right Balance

Once we accept that scale-out is an integral element of mission-critical deployments, it’s finding the right balance between memory, CPU, and network bandwidth that matters in driving overall performance. One or more of these resources tends to lag in performance at any point as technology evolves, and software design has to compensate for that. For example, right now network bandwidth in commodity networks tends to be the laggard, making it costly (in time) to ship the 100s of gigabytes a server can now hold. (For example, a 10 Gbps network requires 80 seconds at maximum bandwidth to send 100 gigabytes between servers; scatter/gather of many objects greatly increases that time.) As the amount of data stored in each server grows, load-balancing between servers takes longer, and the delays eventually can impact availability. Scaling out can help this by distributing the data set across more servers, thereby reducing network delays in rebalancing workloads after a server is added or removed.

In sum, it’s not “either-or” — scale-out is here to stay. However, it’s absolutely true that maintaining support for the latest memory and processor technology is crucial to reaping the benefits of scale-up, integrating it with scale-out, and thereby maximizing performance, availability, and cost-effectiveness.

How Do In-Memory Data Grids Differ from Storm?
(March 4, 2014)

In last week’s blog post, we talked about the fact that our in-memory computing technology often is confused with other popular “big data” technologies, in particular Spark / Spark Streaming, Storm, and complex event processing (CEP). As we mentioned, these innovative technologies are great at what they’re built for, but in-memory data grids (IMDGs) were created for a distinct use case. In this blog post, we will take a look at how IMDGs differ from Storm.

Quick Review: IMDGs Provide Fast Data Storage

(The following description of in-memory data grids (IMDGs) is excerpted from last week’s blog post; see that post for more details.)

IMDGs host data in memory and distribute it across a cluster of commodity servers. Using an object-oriented data storage model, they provide APIs for updating data objects typically in well under a millisecond (depending on the size of the object).  This enables operational systems to use IMDGs for storing, accessing, and updating fast-changing, “live” data, while maintaining fast access times even as the storage workload grows.

Data storage needs can easily grow as more users store data within an IMDG. IMDGs accommodate this growth by adding servers to the cluster and automatically rebalancing stored data across the servers. This ensures that both capacity and throughput scale linearly with growth in the workload, and access and update times remain low regardless of the workload’s size. Moreover, IMDGs maintain stored data with high availability using data replication so that if a server fails, operational systems can continuously handle access requests and update requests without delay.

IMDGs Perform Data-Parallel Computation

Because IMDGs store data in memory distributed across a cluster of servers, they easily can perform data-parallel computations on stored data; they simply make use of the cluster’s processing power to analyze data “in place,” that is, without the need to migrate it to other servers. This enables IMDGs to provide fast results (often in milliseconds) with minimal overhead.

The following diagram of the architecture used by ScaleOut Analytics Server and ScaleOut hServer shows a stream of incoming changes which are applied to the grid’s memory-based data store using API updates. The real-time analytics engine performs data parallel computation on stored data, combines the results across the cluster, and outputs a combined stream of alerts to the operational system.

[Figure: real-time analytics architecture for ScaleOut Analytics Server and ScaleOut hServer]

A significant aspect of the IMDG’s architecture for data analytics is that it performs computations on data hosted in memory – not specifically on an incoming data stream. This memory-based storage is continuously updated by an incoming data stream, so the computation has access to the latest changes to the data. However, the computation also has access to the history of changes manifested by the current state of data stored in the grid. This gives the computation a rich data set for analysis that includes both the incoming data stream and the application’s persistent state.

What is Storm?

Storm originally was developed by Nathan Marz at Backtype to overcome the limitations of Hadoop in analyzing streams of incoming data, such as Twitter streams and web log files. Its goal was to provide real-time, continuous computation that is both scalable and fault tolerant. Described both as stream processing and event processing, its computation model incorporates a combination of task parallelism and pipelining. The developer describes two basic entities: “spouts,” which generate streams of data in the form of ordered tuples, and “bolts,” which process incoming streams and optionally generate outgoing streams for other bolts. Spouts and bolts are organized into an acyclic, directed graph to create an executable configuration. (See this slide deck, among many available, for a more detailed explanation.)

The following diagram illustrates a Storm configuration of streams and bolts processing a set of input streams and generating a set of output streams. The green circles represent tuples within an input stream, and the blue boxes represent bolts. Note that spouts which generate the input streams are not shown in the diagram. The orange circles represent an optional output data stream, which may be implemented by the bolts in an arbitrary manner (e.g., as API calls to an external agent instead of as a stream of tuples).

[Figure: a Storm configuration of streams and bolts]

Application developers specify several aspects of the configuration, such as the number of tasks that can be spawned to execute each bolt, and the manner in which an incoming stream’s tuples are distributed across these tasks. Various groupings implement characteristics that correspond to behaviors found in Hadoop MapReduce. For example, the shuffle grouping implements a random distribution of tuples to tasks akin to input to mappers, and the field grouping implements a key-based partitioning very close to that used as input to reducers. Other groupings also are available, such as “all,” which is equivalent to multicast.

Storm implements and executes a specified configuration using a hierarchy of nodes whose state and fault-tolerance are maintained by the open-source Zookeeper cluster manager. A master node (called Nimbus) manages a set of worker nodes (called Supervisors), which run tasks. Strategies are available to handle failures of each of these components and to ensure that stream tuples are reliably processed.

Comparison of IMDGs to Storm: Providing Continuous Execution

A major strength of Storm is its continuous execution model. Once a configuration has been deployed, incoming data streams can be processed without scheduling delays, thereby providing uninterrupted, real-time results. This overcomes a major drawback of Hadoop MapReduce, which processes data in batch jobs with significant latency (often 15+ seconds) in starting up each job.

IMDGs approximate Storm’s continuous execution model in two ways. First, they allow continuous, overlapped updates to in-memory state, enabling them to handle high arrival rates of incoming data (e.g., 1000s of updates per second for each IMDG server in a cluster). Both IMDGs and Storm scale out to increase throughput. Second, some IMDGs allow data-parallel operations to be performed continuously with very low startup delay (typically a few milliseconds). This allows IMDGs to output a stream of analysis results that matches the low latency required by operational systems. (Unlike Storm, IMDGs such as ScaleOut hServer also precisely match Hadoop’s MapReduce semantics, which require that reducers be able to process all key-value pairs emitted by the mappers in a given computation.)

Stateless versus Stateful Data Model

Storm’s data model describes a set of tuple streams. Bolts analyze and filter these streams, creating new streams to hold their results. While bolts are unconstrained in their ability to access and update external stores, such as IMDGs or file-based NoSQL stores (e.g., MongoDB or Cassandra), this is not a central aspect of their processing model. Put another way, Storm does not provide any particular semantics for managing stateful data.

In contrast, IMDGs are organized around a stateful data model implemented by an object-oriented, in-memory store which is both scalable and highly available. This store is intended to hold ongoing, business-logic state implemented by collections of objects representing fast-changing data used in operational environments. In previous blog posts, we have seen examples in e-commerce (e.g., session-state and shopping carts) and financial services (e.g., portfolios and stock histories). Incoming data streams update these entities, which hold information that persists and evolves over their lifetimes. Making these entities “first class” citizens in the computation model simplifies the design of business logic while enabling stream processing using a combination of object-oriented updates and data-parallel computation to both modify and analyze this state.

Complexity of the Computation Model

Where IMDGs and Storm really differ is in their approaches to managing the complexity of the computation model. Like Microsoft Dryad and other parallel execution platforms with task precedence graphs, Storm defines a computation using a directed graph of execution nodes, each of which has a variable number of tasks. While the modular nature of an execution pipeline has appeal, its complexity can quickly become daunting. One reason for this is that the configuration’s graph is represented by sequential code describing bolts and the streams to which they are connected. As the number of bolts and streams grows, it becomes increasingly difficult to visualize their relationships and grasp the application’s overall behavior.

Other parallel systems with task precedence graphs like Storm’s, such as message-passing systems and actor models, have demonstrated substantial complexity over the last few decades. Also, the Storm application developer must specify the number of tasks executed by each bolt. As the number of bolts and streams increases, it becomes challenging for the developer to manage the graph, predict the dynamics of its execution, and tune for best performance.

A central reason that IMDGs employ a data-parallel computation model is its simplicity, both in exposition and execution. (Another key reason is that data-parallel computation minimizes data motion, which otherwise limits scalability. Storm’s data motion between bolts may incur more network overhead than IMDGs and impact scalability, but we have not evaluated this.) Since their application code is inherently straightforward, data-parallel programs are relatively easy to understand, and they don’t need extensive tuning for high performance. Also, separating updates to business-logic state from data-parallel analytics simplifies integration into operational systems.

Summing Up

IMDGs offer a platform for scalable, memory-based storage and data-parallel computation which was specifically designed for use in operational systems. Because they incorporate API support for accessing and updating individual data objects as well as data-parallel analytics, IMDGs are easily integrated into the business logic of these systems.

Storm was designed for a different purpose, namely to analyze streams of data using a continuously running execution pipeline. Its more complex computation model fits this purpose well, and, as a result, Storm embodies a different set of tradeoffs than IMDGs. Clearly, the term “real-time analytics” encompasses a variety of solutions designed to meet diverse business requirements.

How Do In-Memory Data Grids Differ from Spark?
(February 25, 2014)

As an in-memory computing vendor, we’ve found that our products often get confused with some popular open-source, in-memory technologies. Perhaps the three technologies we are most often confused with are Spark/Spark Streaming, Storm, and complex event processing (CEP). These innovative technologies are great at what they’re built for, but in-memory data grids (IMDGs) were created for a distinct use case. In this blog post, we will take a look at how IMDGs differ from Spark and Spark Streaming.

The Basics: IMDGs Provide Fast, Scalable, and Highly Available Data Storage

IMDGs host data in memory and distribute it across a cluster of commodity servers. Using an object-oriented data storage model, they provide APIs for updating data objects typically in well under a millisecond (depending on the size of the object). This enables operational systems to use IMDGs for storing, accessing, and updating fast-changing data, while maintaining fast access times even as the storage workload grows. For example, an e-commerce website can store session state and shopping carts within an IMDG, and a financial services application can store stock portfolios. In both cases, stored data must be frequently updated and accessed.
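As a sketch of this usage model, an application’s interaction with an IMDG looks roughly like a distributed dictionary. The IGridCache interface below is a stand-in invented for illustration; real IMDG client APIs, including ScaleOut’s, differ in their names and signatures:

    using System.Collections.Generic;

    // Stand-in for an IMDG client API (illustrative only).
    public interface IGridCache
    {
        object Get(string key);              // returns null if the key is absent
        void Put(string key, object value);  // adds or updates, with replication
    }

    public class ShoppingCart
    {
        public List<string> Skus { get; } = new List<string>();
    }

    public class ShoppingCartService
    {
        private readonly IGridCache grid;
        public ShoppingCartService(IGridCache grid) { this.grid = grid; }

        // Read-modify-write of one fast-changing object; in an IMDG each
        // access typically completes in well under a millisecond.
        public void AddItem(string sessionId, string sku)
        {
            var cart = (ShoppingCart)grid.Get("cart:" + sessionId)
                       ?? new ShoppingCart();
            cart.Skus.Add(sku);
            grid.Put("cart:" + sessionId, cart);
        }
    }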

Data storage needs can easily grow as more users store data within an IMDG. IMDGs accommodate this growth by adding servers to the cluster and automatically rebalancing stored data across the servers. This ensures that both capacity and throughput grow linearly with the size of the workload and that access and update times remain low regardless of the workload’s size.

Moreover, IMDGs maintain stored data with high availability using data replication. They typically create one or more replicas of each data object on different servers so that they can continue to access all stored data even after a server (or network component) fails; they do not have to pause to recreate data after a failure. IMDGs also self-heal to automatically create new replicas during recovery. All of this is critically important to operational systems which must continuously handle access and update requests without delay.

IMDGs Add Data-Parallel Computation for Analytics

Because IMDGs store data in memory distributed across a cluster of servers, they easily can perform data-parallel computations on stored data; they simply make use of the cluster’s processing power to analyze data “in place,” that is, without the need to migrate it to other servers. This enables IMDGs to provide fast results with minimum overhead. For example, a recent demonstration of ScaleOut hServer running a MapReduce calculation for a financial services application generated analysis results in about 330 milliseconds compared to 15+ seconds for Apache Hadoop.

A significant aspect of the IMDG’s architecture for data analytics is that it performs its computations on data hosted in memory – not on an incoming data stream. This memory-based storage is continuously updated by an incoming data stream, so the computation has access to the latest changes to the data. However, the computation also has access to the history of changes as manifested by the state of the data stored in the grid. This gives the computation a much richer data set for performing an analysis than it would have if it could only see the incoming data stream. We call it “stateful” real-time analytics.

Take a look at the following diagram, which illustrates the architecture for ScaleOut Analytics Server and ScaleOut hServer. The diagram shows a stream of incoming changes which are applied to the grid’s memory-based data store using API updates. The real-time analytics engine performs data parallel computation on the stored data, combines the results across the cluster, and outputs a combined stream of alerts to the operational system.

[Figure: real-time analytics architecture for ScaleOut Analytics Server and ScaleOut hServer]

The power of stateful analytics is that the computation can provide deeper insights than otherwise. For example, an e-commerce website can analyze not just browser actions but also interpret these actions in terms of a history of customer preferences and shopping history to offer feedback. Likewise, a financial services application can analyze market price fluctuations to determine trading strategies based on the trading histories for individual portfolios tuned after several trades and influenced by preferences.

Comparison to Spark

The Berkeley Spark project has developed a data-parallel execution engine designed to accelerate Hadoop MapReduce calculations (and add related operators) by staging data in memory instead of moving it from disk to memory and back for each operator. Using this technique and other optimizations, it has demonstrated impressive performance gains over Hadoop MapReduce. The project’s stated goal (quoting from a tutorial slide deck from U.C. Berkeley’s AMPLab) is to “extend the MapReduce model to better support two common classes of analytics apps: iterative algorithms (machine learning, graphs) [and] interactive data mining [and] enhance programmability: integrate into Scala programming language.”

A key new mechanism that supports Spark’s programming model is the resilient distributed dataset (RDD) to “allow apps to keep working sets in memory for efficient reuse.” They are “immutable, partitioned collections of objects created through parallel transformations.” To support fault tolerance, “RDDs maintain lineage information that can be used to reconstruct lost partitions.”

You can see the key differences between using an IMDG hosting data-parallel computation and Spark to perform MapReduce and similar analyses. IMDGs analyze updatable, highly available, memory-based collections of objects, and this makes them ideal for operational environments in which data is being constantly updated even while analytics computations are ongoing. In contrast, Spark was designed to create, analyze, and transform immutable collections of data hosted in memory. This makes Spark ideal for optimizing the execution of a series of analytics operators.

The following diagram illustrates Spark’s use of memory-hosted RDDs to hold data accessed by its analytics engine:

[Figure: Spark’s memory-hosted RDDs accessed by its analytics engine]

However, Spark is not well suited to operational environments for two reasons. First, data cannot be updated. In fact, if Spark inputs data from HDFS, changes have to be propagated to HDFS from another data source, since HDFS files can only be appended, not updated. Second, RDDs are not highly available. Their fault-tolerance results from reconstructing them from their recorded lineage, which may take substantially more time to complete than server failover by an IMDG. This represents an appropriate tradeoff for Spark because, unlike IMDGs, it focuses on analytics computations on data that does not need to be constantly available.

Even though Spark makes different design tradeoffs than IMDGs to support fast analytics, IMDGs can still deliver comparable speedup over Hadoop. For example, we measured Apache Spark executing the well-known Hadoop “word count” benchmark on a 4-server cluster running 9.6X faster than CDH5 Hadoop MapReduce for a 10 GB dataset hosted in HDFS. On this same benchmark, ScaleOut hServer ran 14X faster than Hadoop when executing standard Java MapReduce code.

What about Spark Streaming?

Spark Streaming extends Spark to handle streams of input data and was motivated by the need to “process large streams of live data and provide results in near-real-time” (quoting from the slide deck referenced above). It “run[s] a streaming computation as a series of very small, deterministic batch jobs” by chopping up an input stream into a sequence of RDDs which it feeds to Spark’s execution engine. “The processed results of the RDD operations are returned in batches.” Computations can create or update other RDDs in memory which hold information regarding the state or history of the stream.

The representation of input and output streams as RDDs can be illustrated as follows:

[Figure: input and output streams represented as RDDs in Spark Streaming]

This model of computation overcomes Spark’s basic limitation of working only on immutable data. Spark Streaming offers stateful operators that enable incoming data to be combined with in-memory state. However, it employs a distinctly stream-oriented approach with parallel operators that does not match the typical, object-oriented usage model of asynchronous, individual updates to memory-based objects implemented by IMDGs for operational environments. It also uses Spark’s fault-tolerance which does not support high availability for individual objects.

For example, IMDGs apply incoming changes to individual objects within a stateful collection by using straightforward object updates, and they simultaneously run data-parallel operations on the collection as a whole to perform analytics. We theorize that when using Spark Streaming, the same computation would require that each collection of updates represented by an incoming RDD be applied to the appropriate subset of objects within another “stateful” RDD held in memory. This in turn would require that the two RDDs be aligned to perform a parallel operation, which could add complexity to the original algorithm, especially if updates need to be applied to more than one object in the stateful collection. Also, fault-tolerance might require checkpointing to disk since the collection’s lineage could grow lengthy over time.

Summing Up

IMDGs offer a platform for scalable, memory-based storage and data-parallel computation which was specifically designed for use in operational systems, such as the ones we looked at above. Because they incorporate API support for accessing and updating individual data objects with integrated high availability, IMDGs are easily integrated into the business logic of these systems. Although Spark and Spark Streaming, with their use of memory-based storage and accelerated MapReduce execution times, bear a resemblance to IMDGs such as ScaleOut hServer, they were not intended for use in operational systems and do not provide the feature set needed to make this feasible. We will take a look at how IMDGs differ from Storm and CEP in an upcoming blog.

The post How Do In-Memory Data Grids Differ from Spark? appeared first on ScaleOut Software.

Transforming Retail with Real-Time Analytics
https://www.scaleoutsoftware.com/use-cases/transforming-retail-with-real-time-analytics/
Mon, 17 Feb 2014

Real-time analytics has the potential to transform operational systems by providing instant feedback that dramatically enhances how these systems respond to fast-changing events. For example, in a previous blog we saw how a hedge fund tracking its equity portfolios can respond to market fluctuations in milliseconds instead of minutes. However, these benefits are not restricted to financial services. In discussions with both e-commerce and brick-and-mortar retail companies, we have also identified opportunities to enhance their operational systems with real-time analytics. Let's take a look at a few examples after a quick review of in-memory data grids (IMDGs).

Operational Systems Need In-Memory Data Grids

Operational systems typically manage fast-changing client data that constantly streams in for processing by business logic, which updates existing state information and initiates appropriate responses. Some responses provide feedback to clients and others commit changes to persistent storage. For example, an e-commerce system receives requests to view products from web browsers, displays requested products and offers, and sends requested information back to clients. It also receives orders from clients, which it commits to permanent storage, and then it sends out messages to other systems to process these orders.

In-memory data grids (IMDGs) have been used for several years within operational systems to ensure fast responses and to scale throughput as workloads grow. They enable the execution of business logic to scale out across a cluster of servers while holding fast-changing application state in memory, where it is accessible to all servers. Memory-based data storage helps minimize response times, and adding servers provides the CPU capacity needed to handle incremental growth in the workload.

For example, an in-memory data grid can hold session state and shopping carts for an e-commerce web farm, enabling all web servers to quickly and seamlessly access this data as they handle incoming browser requests (which are distributed by an IP load-balancer to web servers):

[Figure: An IMDG holding session state and shopping carts for an e-commerce web farm]


In-Memory Computing: The Engine of Real-Time Analytics

The next step for operational systems is to add real-time analytics, and the easiest way to do this is to integrate the analytics with the system's business logic using an IMDG. By adding real-time analytics to an in-memory data grid, the analytics become instantly available to examine fast-changing data flowing through the system and produce immediate results:

[Figure: A real-time analytics engine integrated with an IMDG inside an operational system]


As we have explored in previous blogs, the key to fast response times for real-time analytics is data-parallel programming, that is, examining many data items in parallel using a single algorithm. This approach has two major strengths: (a) it enables the algorithm to be distributed across the grid’s cluster of servers for fast execution, and (b) it avoids moving data between servers for processing. The net result is that large, memory-based data sets can be quickly analyzed to generate timely responses.

Some IMDGs, such as ScaleOut Analytics Server, offer an integrated real-time analytics engine that automatically ships analytics code to all grid servers and then executes the code in parallel on a specified collection of data stored within the IMDG. This simplifies the task of embedding real-time analytics within an operational system and ensures high performance.

Real-time analytics also can be constructed using the Hadoop MapReduce programming model, which offers a very popular data-parallel design pattern. ScaleOut hServer hosts Hadoop MapReduce applications using its real-time analytics engine and eliminates the overheads of task scheduling and data motion usually associated with Hadoop, thereby opening the door to using MapReduce in operational systems.

Adding Real-Time Analytics to an E-Commerce System

Let’s look at how real-time analytics can be integrated into an e-commerce system. In addition to sending basic page requests to the system from clients browsing a website, the browser also can be instrumented to send detailed information about which products customers are examining and the time they are spending on each product. Combining all of this information, the system can build a history of site usage for each customer and collect a set of preferences for that customer. To support a large population of customers, customer information can be persisted in a database or NoSQL store and then brought into the IMDG when the customer starts browsing.

As illustrated in the following diagram, real-time analytics can continuously examine all active customers in parallel to identify special offers that are appropriate for the customer based on a combination of his/her preferences, shopping history, and current browsing behavior. By analyzing access patterns, the site also can determine if a customer is having difficulty finding products or services and suggest remedies. Inactive customers can be flagged and sent emails to remind them to complete purchases in their shopping carts. In addition, common patterns across customers can be identified and used to steer strategic decisions influenced by buying trends.

[Figure: Data-parallel analysis of active online customers generating offers and alerts]
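To illustrate the pattern, here is a minimal sketch of the per-customer analysis described above. Java's parallel streams stand in here for the grid's data-parallel engine (which would run the same per-object method on every server); the Customer fields and helper methods are illustrative assumptions, not an actual API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class Customer {
    String id;
    Instant lastActivity;
    List<String> viewedProducts;   // current browsing behavior
    List<String> preferences;      // derived from shopping history

    // Illustrative analysis: match current browsing against stored preferences.
    Optional<String> findSpecialOffer() {
        return viewedProducts.stream()
                .filter(preferences::contains)
                .findFirst()
                .map(p -> "Offer a discount on " + p);
    }

    boolean isInactive(Instant now) {
        return Duration.between(lastActivity, now).toMinutes() > 30;
    }
}

class CustomerAnalytics {
    // Examine all active customers in parallel and collect offer alerts.
    static List<String> generateOffers(List<Customer> customers) {
        Instant now = Instant.now();
        return customers.parallelStream()
                .filter(c -> !c.isInactive(now))
                .map(Customer::findSpecialOffer)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.toList());
    }
}
```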


Using Real-Time Analytics in a Brick and Mortar Retail Store

As e-commerce has gained increasing dominance with the shopping public, brick and mortar stores have responded by personalizing the shopping experience. High-end retailers are now beginning to send real-time information from the point of sale to back-office servers for analysis in order to provide immediate feedback to sales staff. This enables the retailer to dramatically enhance the shopping experience.

For example, opt-in customers can identify themselves to sales staff on arrival so that their preferences and history can be used to help suggest products of interest. Products can be tracked with RFID tags to alert the sales staff when an active customer’s size is not present on the sales floor and must be retrieved from the stockroom (preferably before the customer requests it). These tags also can identify which products are being taken from the shelves or racks so that buying trends can be tracked. This also helps the store determine which products are repeatedly left in the changing rooms and not purchased, increasing the store’s buying power with the manufacturer. These are some of the many potential uses for real-time analytics in brick and mortar retail.

As the following diagram illustrates, IMDGs with integrated real-time analytics provide a fast and highly scalable platform for hosting customer information and analytics algorithms used by brick and mortar stores. Streams of information regarding customer activity and product motion can be fed to an IMDG to update in-memory state information for customers and products. Using data-parallel execution, analytics algorithms can continuously analyze this in-memory state and generate alerts for the sales staff which are delivered to point of sale terminals or tablets.

[Figure: An IMDG hosting customer and product state with real-time analytics for a brick and mortar store]


Summing Up

These examples show the power of real-time analytics to enhance operational systems which manage retail purchases, whether online or in brick and mortar stores. By hosting real-time analytics within an IMDG, these systems easily can host customer and product information which is repeatedly updated by streams of activity data. Unlike pure streaming systems, IMDGs can integrate these two types of information to provide a more complete picture of customer activity, leading to a deeper understanding of behavior, preferences, and customer needs. Lastly, IMDGs which host data-parallel analytics algorithms can deliver fast results, avoiding the batch processing overheads of conventional analytics systems, while ensuring scalable performance to handle growing workloads.

 

The post Transforming Retail with Real-Time Analytics appeared first on ScaleOut Software.

How Object-Oriented Programming Simplifies Data-Parallel Analytics
https://www.scaleoutsoftware.com/technology/how-object-oriented-programming-simplifies-data-parallel-analytics/
Sat, 08 Feb 2014

In-memory computing enables real-time analytics to be integrated into operational systems so that fast-changing, “live” data can be instantly evaluated to provide feedback in milliseconds or seconds. As we have discussed in previous blogs, the key to scalable performance and fast response time lies in the use of data-parallel programming techniques. How can we structure these computations to ease their integration into operational systems?

The First Step: Object-Oriented Business Logic State

The evolution of in-memory data grids (IMDGs) over the past twelve or so years has paved the way for in-memory computing and analytics. IMDGs, such as ScaleOut StateServer, originally were developed to provide scalable, memory-based data storage for operational systems. For example, IMDGs host session-state and shopping carts for e-commerce web server farms and stock trades for financial systems. These applications and their associated IMDGs often run on clusters of physical or virtual servers to maintain fast response times for growing workloads.

Because web server applications, financial applications, and many others typically use object-oriented languages, such as Java, C#, or C++, to implement their business logic, they naturally structure fast-changing state information as collections of objects. For example, consider a website hosting an e-commerce application. This site’s shopping carts can be stored as instances of a “shopping cart” type, enabling business logic to directly manage these objects and leverage the benefits of object-oriented programming (e.g., data encapsulation and inheritance).

It is useful to think of a collection of objects as analogous to rows in a relational database table, with object properties corresponding to the table’s attributes. However, when objects are input from (or output to) a database or other persistent storage repository, a conversion to/from an object-oriented representation is performed. Tools such as Hibernate help perform these conversions.

To maximize ease of use, IMDGs are designed to integrate into an application’s object-oriented business logic and store fast-changing application state. They enable the application to seamlessly share data across servers and synchronize access by threads running on different servers. For example, consider a server cluster hosting a web server farm with a load-balancer distributing incoming browser connections to the web servers. If an e-commerce web application stores shopping carts within an IMDG, it can access the carts from any web server. Also, multiple browser connections can coordinate access to the shopping carts by using the IMDG’s APIs for distributed locking.
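As a concrete illustration, a minimal shopping-cart type and a locked read-modify-write might look like the following. The NamedCache-style client interface shown here is a hypothetical stand-in for a typical IMDG client library, not a verbatim ScaleOut API:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Business-logic state stored as an ordinary object.
class ShoppingCart implements Serializable {
    final List<String> skus = new ArrayList<>();
    void addItem(String sku) { skus.add(sku); }
}

class CartUpdater {
    // Hypothetical IMDG client interface (illustrative only).
    interface NamedCache {
        Object lockAndRead(String key);                  // acquire distributed lock, return object
        void updateAndUnlock(String key, Object value);  // store object, release lock
    }

    // Any web server in the farm can safely update any cart.
    static void addToCart(NamedCache carts, String cartKey, String sku) {
        ShoppingCart cart = (ShoppingCart) carts.lockAndRead(cartKey);
        cart.addItem(sku);                     // mutate under the distributed lock
        carts.updateAndUnlock(cartKey, cart);  // other servers now see the change
    }
}
```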

Note that IMDGs usually are not used for long term, persistent storage; that is the job of database and file servers. That said, IMDGs usually incorporate mechanisms to ensure the high availability of data after server outages so that they can be employed in mission-critical applications.

Partitioning Business Logic State into Object Collections

To enable IMDGs to store business logic state for scaled-out applications, their APIs provide an object-oriented view of data. IMDGs organize stored data as unstructured collections of objects, called “namespaces,” with each collection holding objects of a single type. These collections directly reflect mechanisms provided by object-oriented languages, such as Java maps and C# collections. To make objects globally accessible by the IMDG across a cluster of servers, the application supplies an identifying name, or “key,” for each stored object, usually in the form of a string, a number, or sometimes another object. For this reason, IMDGs are sometimes called “key-value stores.”

IMDGs automatically distribute objects within each namespace across the IMDG’s cluster of servers to maximize storage capacity and access throughput (by avoiding hot spots on individual servers). Well-designed IMDGs are “elastic” in the sense that they can transparently rebalance the storage workload as servers are added to handle more data or as servers are removed if the workload shrinks. In addition, IMDGs typically maintain redundant copies of stored objects on different servers so that data is not lost if a server fails or becomes unreachable. If such a failure occurs, the IMDG self-heals by restoring full redundancy to stored data on the surviving servers.

Take a look at the following diagram which depicts how an IMDG distributes collections of objects across a cluster of servers:

[Figure: An IMDG distributing collections of objects across a cluster of grid servers]


Because an IMDG’s memory is held in service processes running separately from applications, stored objects must be serialized into byte streams which are sent to the IMDG using inter-process or network communication. IMDGs typically store objects as “blobs” of serialized, uninterpreted data accessed by their keys. However, they also use client-side caches within the address space of applications to hold deserialized copies and thereby speed up retrieval.

Using Parallel Query to Select Objects

Because an IMDG runs on a cluster of servers, it incorporates scalable computing power that grows with storage capacity. IMDGs use this computing power both to scale access throughput and to select data using parallel query. IMDGs can query objects within a namespace by their object properties. Queries take the form of an expression which filters objects based on the values of these properties and generates a collection of keys identifying those objects which match the query criteria.

IMDG queries are similar to database queries except that they operate on object properties instead of relational attributes. For example, if an IMDG holds a collection describing equities in the stock market, a query could select all stocks which are in the high tech sector and have price/earnings ratios greater than 15. The following diagram depicts a parallel query in an IMDG:

[Figure: A property-based parallel query running across all grid servers]


ScaleOut StateServer takes advantage of Microsoft's language integrated query (LINQ) to implement C# queries with a straightforward syntax similar to SQL. It constructs Java queries by composing methods which filter property values and by creating Boolean expressions of these filters.
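In Java, the composed-filter style described above might look roughly like this for the equities example (a hypothetical sketch of the approach, not the actual ScaleOut query API):

```java
// Hypothetical filter-composition query: high-tech stocks with P/E > 15.
class Equity {
    String symbol;
    String sector;
    double peRatio;
}

interface Filter { boolean matches(Equity e); }

class Filters {
    static Filter sectorEquals(String sector) { return e -> sector.equals(e.sector); }
    static Filter peGreaterThan(double pe)    { return e -> e.peRatio > pe; }
    static Filter and(Filter a, Filter b)     { return e -> a.matches(e) && b.matches(e); }
}

// Usage: the grid would evaluate a filter like this in parallel on every
// server and return the keys of matching objects.
// Filter query = Filters.and(Filters.sectorEquals("HighTech"),
//                            Filters.peGreaterThan(15.0));
```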

Various techniques are available to index objects for fast query so that they do not have to be deserialized to evaluate their properties when a query is performed. The net effect is that IMDGs can take full advantage of the cluster’s computing power to quickly select the objects of interest within a large collection stored in the IMDG.

Parallel Queries Can Create Performance Bottlenecks

However, efficiently managing the results of a parallel query without creating a performance bottleneck is a major challenge. If a query matches many objects, a large set of keys must be collected and delivered to the requesting application, and this can consume a great deal of network bandwidth. For example, matching 1M objects with 40-byte keys requires delivering 40MB to an application, which can take half a second on a gigabit network. Moreover, the application has to retrieve all 1M objects to process the selected data, which can take several seconds depending on the amount and complexity of the data.

Another issue with parallel queries is that they must be used judiciously to avoid saturating the CPUs and network. Because each query generates work for all servers, if many application threads simultaneously perform queries, the total amount of work can grow quadratically and overload the servers or network.

The Solution: Data-Parallel Computing

While parallel queries provide a powerful means to identify the objects of interest for further analysis, the key to scalable speedup without performance bottlenecks is to analyze objects in place on the grid servers. Instead of shipping the keys and objects to an application thread for analysis, it is far more efficient to ship the analytics code to the IMDG’s servers for parallel analysis. This minimizes data motion, scales the computation, and ensures that results can be returned as quickly as possible.

When performing a data-parallel computation, a parallel query serves to filter the objects that are submitted for analysis on each grid server. This avoids the performance bottleneck caused by shipping keys and objects across the network. The data-parallel computation also reduces the usage of parallel queries by moving the analysis into the IMDG.

The object-oriented data model simplifies the construction of data-parallel analysis in several ways. First, it partitions the data set into a collection of logically related entities (instances of a type) which can be independently analyzed. These objects form the domain decomposition needed for parallel execution, as discussed in the previous blog. Second, it enables the analysis code to be conveniently represented as a method on the type. Lastly, it enables the results of the parallel analysis to be structured as another collection of objects, usually of a different type, which can be combined or fed to another data-parallel operation (as is the case with Hadoop MapReduce).

Because the IMDG distributes the objects within a collection to all grid servers, it automatically ensures that the domain for data-parallel execution is load-balanced across the cluster. This maximizes overall throughput by balancing the analysis workload across the servers.

For example, consider the financial services example we saw in the previous blog in which the IMDG holds a collection of “strategy” objects for a hedge fund, each of which represents a market sector, such as high tech or real estate, and holds the equity positions and rules for that market sector. A data-parallel computation can independently update each strategy object with a snapshot of market price changes and then evaluate the strategy to determine if stock trades are needed. By performing this analysis in parallel across all strategies, results can be generated in milliseconds instead of several minutes needed by conventional disk-based, sequential analysis.

The following diagram shows how each IMDG server within a cluster can analyze local objects resulting from a parallel query, thereby avoiding data motion across the network. This diagram illustrates the multi-core “parallel method invocation” engine implemented by ScaleOut StateServer Pro. Note that locally selected objects are analyzed using the application’s Analyze method; analysis results then are combined by the Merge method and flow back into the server’s local memory for optional combining across servers.

[Figure: Parallel method invocation on a grid server, with Analyze and Merge combining local results]
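In code, the Analyze and Merge methods shown above take roughly the following shape (an illustrative sketch of the pattern and its types, not the actual parallel method invocation signatures):

```java
// Each grid server runs analyze() on its locally stored, query-selected
// objects, then folds the per-object results together with merge().
class PortfolioRisk {
    double totalExposure;
    int positionsNeedingRebalance;

    // merge() combines two partial results into one; because it is
    // associative, results can be folded within a server and then across servers.
    static PortfolioRisk merge(PortfolioRisk a, PortfolioRisk b) {
        PortfolioRisk r = new PortfolioRisk();
        r.totalExposure = a.totalExposure + b.totalExposure;
        r.positionsNeedingRebalance = a.positionsNeedingRebalance + b.positionsNeedingRebalance;
        return r;
    }
}

class Strategy {
    double exposure;
    boolean needsRebalance;

    // analyze() produces a partial result from a single object.
    PortfolioRisk analyze() {
        PortfolioRisk r = new PortfolioRisk();
        r.totalExposure = exposure;
        r.positionsNeedingRebalance = needsRebalance ? 1 : 0;
        return r;
    }
}
```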


Each server in the IMDG's cluster runs a portion of the data-parallel computation on locally selected objects, as illustrated below:

 

[Figure: Each grid server running the data-parallel computation on its locally selected objects]


To simplify running data-parallel analyses, some IMDGs, such as ScaleOut StateServer Pro, can ship the application code to all grid servers and initialize the execution environment (i.e., start up JVMs or .NET runtimes). In addition, the IMDG’s use of object collections streamlines multiple data-parallel operations. For example, ScaleOut hServer uses the IMDG to run both the Map and Reduce phases as data-parallel operations, staging intermediate data within the IMDG to minimize overall execution time. Object collections also reduce the complexity and overhead of Hadoop’s input and output formats and enable automatic optimizations (e.g., automatically creating splits and partitions).

Summing Up

Object-oriented programming gives IMDGs an efficient and well understood means to hold business logic state, perform queries, and structure data-parallel computations. This allows IMDGs to be seamlessly integrated into operational systems and perform real-time analytics on “live” data, opening up many new opportunities to add value to these systems.

The post How Object-Oriented Programming Simplifies Data-Parallel Analytics appeared first on ScaleOut Software.

Creating Data-Parallel Computations for Real-Time Analytics
https://www.scaleoutsoftware.com/technology/creating-data-parallel-computations-for-real-time-analytics/
Sat, 01 Feb 2014

Real-time analytics offers enterprises the ability to examine “live,” fast-changing data within operational systems and obtain feedback in milliseconds to seconds. For example, a hedge fund in a financial services organization can track the effect of market fluctuations on its portfolios (“strategies”) of long and short equity positions in various market areas (high tech, real estate, etc.) and immediately identify strategies requiring rebalancing. As we have seen in previous blogs, the key to real-time performance, especially for growing workloads, is to use in-memory, data-parallel computing, which delivers scalable throughput and minimizes performance losses due to data motion. But how can we easily structure computations to take advantage of this “scale out” technology?

The Starting Point: Domain Decomposition

Going back to the 1980s, application developers realized that determining how to partition data so that each portion can be analyzed in parallel, a technique often called “domain decomposition,” is the fundamental design decision in data-parallel programming. Once this is accomplished, the domain can be parceled out among the servers within a computational cluster, and the analysis can proceed in parallel on all servers. The cluster speeds up the computation and also allows more servers to be added as the workload (i.e., the domain) grows in size.

The key to domain decomposition is determining which data within the state of an application to use as the domain for parallel analysis. In many cases, the domain is easy to identify, especially when the application is analyzing a physical entity. For example, in an earlier blog we saw how a climate simulation divides the atmosphere, land, and ocean into a grid of boxes which are analyzed independently at each time interval. Likewise, structural mechanics and fluid dynamics typically use grids that model physical systems.

The process of domain decomposition can be more challenging for applications whose data sets can be partitioned in different ways. Consider the hedge fund example above. Do we analyze the data by partitioning the stock symbols across the servers and then analyzing all strategies which hold a given stock? Alternatively, should we partition the strategies and analyze all positions within the strategy? Is there still another domain decomposition?

Minimize Data Motion: An Example in Financial Services

The best approach to choose usually is dictated by the need to minimize data motion. If a given decomposition induces data motion at each analysis step, this can kill performance while saturating the network. For example, if the hedge fund analysis partitions stock symbols, it would need to query all strategies affected by each stock symbol to obtain information needed to perform the analysis, and this requires substantial data motion for every stock symbol. (This assumes that the strategies as well as the stock symbols – and, in general, all data sets – are distributed across the cluster of servers performing the data-parallel computation, as is the case for data sets stored within an in-memory data grid (IMDG).)

Instead, the computation could partition the strategies across the servers and analyze all strategies in parallel. In this case, it would need to query the latest stock prices for all positions within each strategy; this also induces data motion. In either case, if the strategies and stock prices are stored in separate data sets, unacceptable data motion will be incurred by the data-parallel computation.

However, if the stock prices for all relevant positions are kept up to date within the strategies instead of storing this data elsewhere, no data motion is needed to perform the data-parallel analysis, and maximum performance is achieved. Bearing in mind that multiple strategies maintain positions in a given stock, how can we efficiently update the strategies as price changes flow in from a market feed? The answer lies in leveraging data-parallel computation to both analyze and update the strategies. We can distribute a snapshot of price changes to all strategies at the start of each analysis and use this information to update each strategy’s positions prior to performing the analysis. This easily can be accomplished by tracking all price changes since the last analysis step and efficiently distributing them to the servers as part of the invocation mechanism for the analysis.
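A minimal sketch of this update-then-analyze step might look like the following; the Strategy and Position types and the threshold rule are illustrative assumptions, not an actual API:

```java
import java.util.List;
import java.util.Map;

class Position {
    String symbol;
    long shares;
    double lastPrice;
}

class Strategy {
    List<Position> positions;
    double rebalanceThreshold;

    // Step 1: apply the snapshot of price changes distributed with the
    // data-parallel invocation -- no cross-server data motion is needed.
    void applySnapshot(Map<String, Double> priceChanges) {
        for (Position p : positions) {
            Double price = priceChanges.get(p.symbol);
            if (price != null) p.lastPrice = price;
        }
    }

    // Step 2: evaluate the strategy against its freshly updated positions.
    boolean needsRebalancing() {
        double exposure = positions.stream()
                .mapToDouble(p -> p.shares * p.lastPrice)
                .sum();
        return Math.abs(exposure) > rebalanceThreshold;
    }
}
```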

The following diagram illustrates how an IMDG can host a set of strategies and perform parallel analysis on the strategies while updating them with a live market feed containing snapshots of price changes. The analysis produces a stream of alerts to the trader (or to an automated trading system) for strategies that need rebalancing. In ScaleOut Analytics Server, the data parallel analysis can be implemented using a feature called parallel method invocation (PMI):

[Figure: An IMDG updating strategies from a live market feed while performing parallel analysis (PMI) and emitting rebalancing alerts]


The net effect is that the hedge fund can update its strategies in real time and obtain alerts regarding positions that require rebalancing based on current market conditions. Take a look at this demo of a proof-of-concept implementation for 2K strategies tracking and analyzing a total of 40K positions using a cluster of four servers. This demo was built on the real-time Hadoop MapReduce engine within ScaleOut hServer and delivered alerts in about 330 milliseconds (instead of more than 15 seconds for standard Hadoop).

Another Example: Inventory Reconciliation in E-Commerce

Consider another example of an e-commerce application that needs to reconcile orders and inventory in real time to avoid a shortfall in inventory. Orders flow into the system for a set of products (“SKUs”), and inventory changes flow into the system from several sources: orders to outside vendors are placed, products arrive at the warehouse, orders from customers are filled, defects are detected, etc. Especially with perishable goods, it’s vital to precisely track order commitments so that orders can be accurately filled (and customers stay happy).

Several different domain decompositions for reconciling orders to inventory could be used. Do we partition based on the orders, the inventory changes, or on some other basis? If we reconcile based on pending orders, we have to query the inventory changes for all items within each order, inducing substantial, performance-killing data motion. Likewise, if we partition the inventory changes, we have to query all orders that include a given inventory item, again inducing data motion.

To solve this problem, we can collect pending orders and inventory changes within their common SKUs and then use the SKUs to form a domain decomposition. As orders come into the system, we update the SKUs for all items within each order to track those orders. Likewise, as inventory changes flow in, we update the SKUs to maintain the latest inventory updates. This continuous process updates the SKUs on a real-time basis in a manner analogous to an ongoing database join operation. Now, a data-parallel analysis of the SKUs can reconcile all relevant orders and inventory changes for that SKU without causing data motion. This enables the cluster to perform a reconciliation across all SKUs in seconds instead of minutes.
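Here is a minimal sketch of a SKU object serving as the unit of domain decomposition; the fields and the shortfall rule are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Each SKU object accumulates the orders and inventory changes that touch it,
// forming the domain decomposition for the data-parallel reconciliation.
class Sku {
    long onHand;  // last reconciled inventory level
    final List<Long> pendingOrderQuantities = new ArrayList<>();
    final List<Long> inventoryChanges = new ArrayList<>();  // receipts (+), defects (-), etc.

    // Continuous updates as events flow in (analogous to an ongoing join).
    void recordOrder(long quantity)      { pendingOrderQuantities.add(quantity); }
    void recordInventoryChange(long qty) { inventoryChanges.add(qty); }

    // Data-parallel "Eval": reconcile this SKU with no cross-server data motion.
    boolean reconcile() {
        long committed = pendingOrderQuantities.stream().mapToLong(Long::longValue).sum();
        long available = onHand + inventoryChanges.stream().mapToLong(Long::longValue).sum();
        return available >= committed;  // false signals a potential shortfall
    }
}
```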

The following diagram illustrates an IMDG hosting a collection of data representing SKUs that are being updated by incoming orders and inventory changes. The IMDG performs data-parallel reconciliation (“Eval”) of the SKUs and combines the results (“Merge”) while operations are ongoing:

[Figure: Data-parallel reconciliation ("Eval"/"Merge") of SKU objects updated by incoming orders and inventory changes]


Note that in this example, we do not use the data-parallel invocation mechanism to perform updates to the application’s state as we did with the hedge fund example. Instead, individual order and inventory updates flow into the application on a continuous basis, and the data-parallel analysis is performed on the state information as it changes. In both cases, the integration of data-parallel analysis within an operational system handling live data demonstrates the value of real-time analytics: analysis results are continuously generated and fed back to the system to optimize its ongoing operations.

Using an IMDG to Host a Data-Parallel Application

Lastly, note that an IMDG serves as an ideal host for both the application’s state and the data-parallel analysis. By storing data in memory and distributing it across a cluster of servers, it can keep up with continuous updates and scale to handle growing workloads (while maintaining high availability) just by adding servers. And it can perform data-parallel analysis on domains of data distributed across the servers, leveraging the IMDG’s automatic load-balancing and avoiding data motion across the network during the analysis.

Interestingly, the object-oriented programming model supported by IMDGs helps us to structure, distribute, identify, query, and analyze domains that we construct for real time analysis. We will explore that in an upcoming blog.

 

The post Creating Data-Parallel Computations for Real-Time Analytics appeared first on ScaleOut Software.

Scaling Real-Time Analytics with an IMDG
https://www.scaleoutsoftware.com/technology/scaling-real-time-analytics-with-an-imdg/
Sat, 25 Jan 2014

In the last blog we discussed how in-memory data grids (IMDGs) share the same architecture as parallel supercomputers. Parallel supercomputers typically add computing power by scaling “out” across a cluster of servers. Likewise, IMDGs scale out their in-memory data storage and analytics engine across service processes running on a cluster of servers. Let’s take a little deeper look at the benefits of scaling out, especially for computations in real-time analytics.

Scale Out to Handle Growing Workloads

Scaling out contrasts to the alternative technique of scaling “up” computing power by adding processors or special purpose processing hardware, such as a vector processor or GPU, to a single, shared memory, multiprocessor system. This technique originally gained popularity with mainframe systems prior to the advent of parallel supercomputing. However, mainframe architect Gene Amdahl recognized that the slowest element of a computing system constrains its overall throughput when executing a fixed size problem; he codified this as Amdahl’s Law. For example, if half of the computation time for a problem is consumed by a vector calculation that could be offloaded to an infinitely fast vector processor, the overall computation only runs 2X faster. Even if 90% of the computation time could be accelerated in this manner, the computation speeds up by no more than 10X.
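Amdahl's Law can be stated compactly: if a fraction p of a fixed-size computation is sped up by a factor s, the overall speedup is bounded as follows (the numeric cases simply restate the examples above):

```latex
S_{\text{Amdahl}} = \frac{1}{(1 - p) + p/s}
% p = 0.5,\; s \to \infty \;\Rightarrow\; S \to 2
% p = 0.9,\; s \to \infty \;\Rightarrow\; S \to 10
```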

Scaling out solves this problem by avoiding it. Scaling out adds servers to a cluster so that it can handle larger problem sizes and continue to increase throughput without increasing computation time (as long as the overheads do not grow faster than the number of servers!). This benefit was first observed independently by Cleve Moler and John Gustafson and codified as Gustafson’s Law of scalable speedup. When scaling out, the goal changes from trying to reduce the computation time for a fixed size problem to maintaining the same computation time as the workload increases in size. This nicely matches the needs of system architects who need to scale their systems to keep up with fast growing workloads.
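Gustafson's Law expresses this scaled-speedup view: if the serial fraction (1 - p) of the work stays fixed while the parallel portion grows with the number of servers N, the achievable speedup grows nearly linearly (the numeric case is our illustration):

```latex
S_{\text{Gustafson}} = (1 - p) + p \cdot N
% e.g., p = 0.9,\; N = 64 \;\Rightarrow\; S = 0.1 + 57.6 = 57.7
```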

IMDGs Scale Out Access to Data

Scaling out enables in-memory data grids to scale both in-memory data storage capacity and computing power so that they can perform real-time analytics on fast changing data. For example, consider the basic function of an IMDG to let client applications seamlessly access memory-based objects distributed across a cluster of servers. To read an object stored in an IMDG, a client application calls an API which makes a network request to an IMDG server. Assuming the client library can keep track of which server holds the specified object, this request can be satisfied with a single network round trip, consuming a small amount of CPU and NIC bandwidth on the server. As more requests arrive at the server, they consume increasing amounts of CPU and NIC bandwidth until these resources are maxed out.

This is where scalable speedup takes over. Even though the overall request rate that a server can handle is constrained by its available resources, adding a second server doubles the request rate without increasing response time. Likewise, adding more servers to the cluster linearly increases the throughput while maintaining fixed response times until the cluster’s network infrastructure eventually is saturated. Scaling out enables IMDGs to handle growing workloads just by adding servers.

IMDGs Scale Out Updates to Data

Storing and updating objects in an IMDG requires more network bandwidth than accessing them. To maintain high availability, IMDGs store one or more copies of every object on additional servers so that if a server or network connection fails, data can be accessed from an alternative server within the cluster. So, at a minimum, an update request requires twice as much network bandwidth as an access request. As a result, an update-heavy workload tends to saturate the NICs and network infrastructure faster than an access-heavy workload. However, in both cases, the overhead is still proportional to the request rate and also proportional to the number of servers in the cluster. This means that we can expect linear throughput growth as we add IMDG servers until the network itself is saturated. (At that point, we need a faster network, such as 10 Gbps Ethernet or InfiniBand.)

It’s easy to see the importance of the IMDG’s automatic, dynamic load balancing of stored objects in maintaining scalable speedup. This mechanism evenly distributes the workload across all servers within the IMDG’s cluster to make sure that maximum throughput is obtained, avoiding “hot spots” which would direct too many requests to a few servers and limit throughput.

Maintaining Scalable Speedup in Analytics Computations

What happens if we create a workload that induces network overhead which grows faster than the number of servers? This problem was often encountered in parallel supercomputing and gave rise to the aphorism “embarrassingly parallel” for applications which minimize communication overhead. It is just as relevant today: real-time analytics applications must make sure this problem does not occur and kill scalable speedup as the workload grows.

For example, consider a data analytics computation distributed across a set of clients, one per server in an IMDG cluster with N servers, with the client application running on the grid servers. Also assume each server’s computation must access and analyze N or more objects. This makes the total number of accesses required to perform the computation proportional to N*N. If these objects are randomly distributed within the IMDG, the network overhead to process this workload will grow as N-squared and quickly saturate the network, drastically limiting scalable speedup. A “task-parallel” application which parcels out tasks to the clients (one task per object) to complete this analysis could be expected to exhibit this behavior since it randomly selects objects in the IMDG.

However, if we arrange to perform the same analysis as a “data-parallel” computation in which each client only analyzes the objects which are located on the same servers as the client, we can avoid data motion and thereby eliminate order-N-squared network overhead (although we still have order-N inter-process communication overhead between the IMDG’s service process and clients). This is accomplished by letting the IMDG perform the analysis tasks in parallel across all cluster servers on local objects within each server. (In the same manner, Hadoop MapReduce task trackers perform map operations on co-located “splits” within HDFS.)

IMDGs Avoid Data Motion with Data-Parallel Computing

ScaleOut StateServer Pro’s “parallel method invocation” (PMI) feature implements a data-parallel computation model. This feature lets the user specify a Java or C# method to analyze an object and another method to combine the results of two analyses. (These are analogous to a Hadoop mapper and combiner except that, unlike Hadoop, PMI automatically combines results across all servers into a single result object.) The user also specifies a parallel query to select the objects to be analyzed. To run a PMI, the IMDG pre-stages the code on all grid servers, performs the parallel query, and then analyzes all selected objects in place without moving them across the network. Lastly, it combines the results on each server and then combines the results across the servers using a binary combining tree which minimizes execution time.
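The binary combining tree can be pictured as a pairwise reduction over the per-server results; here is an illustrative sketch of that reduction (not the actual PMI engine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

class CombiningTree {
    // Combine partial results pairwise, level by level, so a cluster of N
    // servers needs only about log2(N) combining steps on the critical path.
    // Assumes at least one partial result and an associative merge function.
    static <R> R combineAll(List<R> partials, BinaryOperator<R> merge) {
        List<R> level = partials;
        while (level.size() > 1) {
            List<R> next = new ArrayList<>();
            for (int i = 0; i + 1 < level.size(); i += 2) {
                next.add(merge.apply(level.get(i), level.get(i + 1)));
            }
            if (level.size() % 2 == 1) next.add(level.get(level.size() - 1));
            level = next;
        }
        return level.get(0);
    }
}
```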

Performance Gain from Avoiding Data Motion

The performance benefits of the data-parallel approach are dramatic. To illustrate this, we captured performance measurements of a risk analysis computation in financial services called “back testing.” This analysis compares a variety of stock trading algorithms using recorded price histories for a collection of equities. Each price history was stored in a single object within the IMDG, and the clients were assigned equities to analyze. (Note that the IMDG’s object-oriented storage also could dynamically update the price histories from a ticker feed to enable real-time feedback to a trading system.)

We compared the task-parallel technique in which the clients analyzed a random set of equities to the data-parallel technique in which the clients only examined equities stored on the same server; the data-parallel technique was implemented using PMI. The following chart shows the throughput obtained for both the task-parallel and data-parallel techniques. Note how the data-parallel approach (red line) maintains linear performance scaling as the workload increases and IMDG servers are added to the cluster. In contrast, the task-parallel approach (blue line) fails to achieve performance scaling due to accessing objects from remote servers which creates substantial networking overhead.

[Figure: Throughput of data-parallel (PMI, red line) vs. task-parallel (blue line) analysis as servers are added; the data-parallel approach scales linearly]


IMDGs take full advantage of scalable speedup to scale their handling of access requests and to perform data-parallel computations. This enables them to run real-time analytics on fast-changing data held in the IMDG and maintain fast response time for large workloads. Unlike pure streaming systems, they combine the IMDG’s in-memory storage with scalable computation to implement complex applications in real-time analytics. We will explore some of these applications in an upcoming blog.

 

The post Scaling Real-Time Analytics with an IMDG appeared first on ScaleOut Software.
