Introducing Geospatial Mapping for Real-Time Digital Twins
The goal of real-time streaming analytics is to get answers fast. Mission-critical applications that manage large numbers of live data sources need to quickly sift through incoming telemetry, assess dynamic changes, and immediately pinpoint emerging issues that need attention. Examples abound: a telematics application tracking a fleet of vehicles, a vaccine distribution system managing the delivery of thousands of shipments, a security or safety application analyzing entry points in a large infrastructure (physical or cyber), a healthcare application tracking medical telemetry from a population of wearable devices, a financial services application watching wire transfers and looking for potential fraud — the list goes on. In all these cases, when a problem occurs (or an opportunity emerges), managers need answers now.
Conventional streaming analytics platforms are unable to separate messages from each data source and analyze them as they flow in. Instead, they ingest and store telemetry from all data sources, attempt a preliminary search for interesting patterns in the aggregated data stream, and defer detailed analysis to offline batch processing. As a result, they are unable to introspect on the dynamic, evolving state of each data source and immediately alert on emerging issues, such as the impending failure of a truck engine, an unusual pattern of entries and exits to a secure building, or a potentially dangerous pattern of telemetry for a patient with a known medical condition.
In-memory computing with software components called real-time digital twins overcomes these obstacles and enables continuous analysis of incoming telemetry for each data source with contextual information that deepens introspection. While processing each message in a few milliseconds, this technology automatically scales to simultaneously handle thousands of data sources. It also can aggregate and visualize the results of analysis every few seconds so that managers can graphically track the state of a complex live system and quickly pinpoint issues.
The ScaleOut Digital Twin Streaming Service is an Azure-based cloud service that uses real-time digital twins to perform continuous data ingestion, analysis by data source, aggregation, and visualization, as illustrated below. What’s key about this approach is that the system visualizes state information that results from real-time analysis — not raw telemetry flowing in from data sources. This gives managers curated data that intelligently focuses on the key problem areas (or opportunities). For example, instead of looking at fluctuating oil temperature, telematics dispatchers see the results of predictive analytics. There’s not enough time for managers to examine all the raw data, and not enough time to wait for batch processing to complete. Maintaining situational awareness requires real-time introspection for each data source, and real-time digital twins provide it.
In the ScaleOut Digital Twin Streaming Service, real-time data visualization can take the form of charts and tables. Dynamic charts effectively display the results of aggregate analytics that combine data from all real-time digital twins to show emerging patterns, such as the regions of the country with the largest delivery delays for a vaccine distribution system. This gives a comprehensive view that helps managers maintain the “big picture.” To pinpoint precisely which data sources need attention, users can query analytics results for all real-time digital twins and see the results in a table. This enables managers to ask questions like “Which vaccination centers in Washington state are experiencing delivery delays in excess of 1 hour and have seen more than 100 people awaiting vaccinations at least three times today?” With this information, managers can immediately determine where vaccine shipments should be delivered first.
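To make a query like this concrete, here is a minimal sketch in C# of how such a filter could be expressed over the state objects held by the real-time digital twins. The VaccinationCenterTwin class, its properties, and the allTwins collection are hypothetical and shown only for illustration; in the streaming service itself, equivalent queries are built over twin properties through its UI.

// Hypothetical state object maintained by each vaccination center's real-time digital twin.
public class VaccinationCenterTwin
{
    public string CenterName;
    public string State;
    public TimeSpan CurrentDeliveryDelay;    // computed from incoming shipment telemetry
    public int TimesOver100WaitingToday;     // updated as queue-length messages arrive
}

// A LINQ rendering of the tabular query described above (illustrative only):
IEnumerable<string> FindCentersNeedingAttention(IEnumerable<VaccinationCenterTwin> allTwins)
{
    return allTwins
        .Where(t => t.State == "WA"
                 && t.CurrentDeliveryDelay > TimeSpan.FromHours(1)
                 && t.TimesOver100WaitingToday >= 3)
        .Select(t => t.CenterName);
}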
With the latest release, the streaming service now offers geospatial mapping of query results combined with continuous queries that refresh the map every few seconds. For example, using this cloud service, a telematics system for a trucking fleet can continuously display the locations of specific trucks which have issues (the red dots on the map) in addition to watching aggregate statistics:
For applications like this, a mapped view of query results offers valuable insights about the locations where issues are emerging that would otherwise be more difficult to obtain from a tabular view. Note that the queried data shows the results of real-time analytics which are continuously updated as messages arrive and are processed. For example, instead of displaying the latest oil temperature from a truck, the query reports the results of a predictive analytics algorithm that makes use of several state variables maintained by the real-time digital twin. This declutters the dispatcher’s view so that only alertable conditions are highlighted and demand attention:
The following image shows an example of actual map output for a hypothetical security application that tracks possible intrusions within a nationwide power grid. The goal of the real-time digital twins is to assess telemetry from each of 20K control points in the power grid’s network, filter out false positives and known issues, and produce a quantitative assessment of the threat (“alert level”). Continuous queries map the results of this assessment so that managers can immediately spot a real threat, understand its scope, and take action to isolate it. The map shows the results of three continuous queries: high alerts requiring action, medium alerts that just need watching, and offline nodes (with the output suppressed here):
In this scenario a high alert has suddenly appeared in the grid at three locations (Seattle, New York, and Miami) indicating a serious, coordinated attack on the network. By zooming in and hovering over dots in the graph, users can display the detailed query results for each corresponding data source. Within seconds, managers have immediate, actionable information about threat assessments and can quickly visualize the locations and scope of specific threats.
In applications like these and many others, the power of in-memory computing with real-time digital twins gives managers a new means to digest real-time telemetry from thousands of data sources, combine it with contextual information that enhances the analysis, and then immediately visualize the results. This powerful technology boosts situational awareness and helps guide responses much better and faster than was previously possible.
ScaleOut Discusses Contact Tracing in The Record
Also read our Partner Perspective in the same issue, which explains how the Microsoft Azure cloud provides a powerful platform for hosting the ScaleOut Digital Twin Streaming Service and ensures high performance across a wide range of applications.
Voluntary Contact Self-Tracing for Companies
How Voluntary Self-Tracing Helps
In a previous blog post, we explored how voluntary contact self-tracing can assist other contact tracing techniques in alerting people who may have been exposed to the COVID-19 virus. This technique enables participants to log interactions with others so that they can discover if they are in a chain of contacts originating with someone, often a stranger, who subsequently tests positive for the virus. These contacts could include friends at a barbeque, grocery checkers, hairdressers, restaurant waitstaff, taxi drivers, and others.
In contrast to the highly publicized, proximity-based contact tracing performed by mobile devices, voluntary self-tracing avoids the security and privacy issues that have threatened widespread adoption. It also adds human judgment to the process so that the chain of contacts captures only potentially risky interactions. Contacts are logged in advance of a positive test, enabling immediate notifications when the need arises.
Voluntary self-tracing offers huge value in connecting strangers who otherwise would not be notified about the need for testing without arduous manual contact tracing. However, it requires that everyone participate in a common tracing system and consistently make the effort to log interactions. While this might restrict its appeal for public use, it could be readily adopted by companies, which have well-known, slowly changing populations and established working relationships and protocols.
Helping Companies Get Back to Work
Consider a company that has multiple departments distributed across several locations. As employees come back to work, they typically interact closely with colleagues in the same department. If anyone in the department tests positive for COVID-19, it’s likely that all of these colleagues have been exposed and need to get tested. In addition, employees occasionally interact with colleagues in other departments, both at the same site and at remote sites. These interactions also need to be tracked to contain exposure within the organization, as illustrated in the following diagram:
Voluntary contact self-tracing can handle the most common scenarios by using the company’s employee database to automatically connect colleagues who work in the same department and interact daily. Employees need only manually log contacts they make with employees in other departments. These interactions are relatively infrequent and tracked for a limited period of time (typically two weeks). This approach streamlines the work required to track contacts, while enabling the company to immediately identify all employees who need to be notified, tested, and possibly isolated after one person tests positive.
In addition, employees can manually track information about contacts they make while on business travel, such as during airline flights, taxi rides, and meals at restaurants. That way, when an employee tests positive, these external contacts can be immediately alerted of possible exposure. This enables companies to assist their communities in contact tracing and help contain the spread of COVID-19.
Enabling Technology: In-Memory Computing
Many large companies have tens of thousands of employees and need to perform fast, efficient contact tracing. They require both immediate notifications and up-to-the-moment statistics that identify emerging trends, such as hot spots at one of their offices. To make this possible, a technology called in-memory computing can be used to track contacts and immediately alert all affected employees (and community touchpoints, such as restaurants) when anyone tests positive and alerts the system. Using a mobile app connected to a cloud service, it creates and maintains a dynamic web of contacts that evolves as interactions occur and time passes.
For example, when an employee tests positive and alerts the system, all colleagues in the same department are quickly notified, as are employees in other departments with whom interactions have occurred. The contact tracing system follows the chain of contacts across departments at all locations within the company. It also notifies community contacts, such as airlines and taxi companies, of possible exposures so that they can take the appropriate action.
Within the cloud service, the in-memory computing system maintains a software-based real-time digital twin for each employee. This software twin records and maintains all contacts for the employee, as well as all community contacts. It also removes non-recurring contacts after sufficient time passes and exposure is no longer likely. When an employee tests positive, the mobile app notifies the corresponding real-time digital twin in the cloud. This sets off the chain of communication that alerts all connected twins to the exposure and notifies their real-world counterparts.
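As a rough illustration, here is what a minimal state object for such an employee twin might look like in C#. The class and property names are hypothetical, and the pruning helper simply shows how non-recurring contacts can be dropped after an assumed two-week exposure window passes.

public class EmployeeTwin : DigitalTwinBase
{
    public string Alias;
    public string Department;                    // departmental colleagues are contacts by default
    public string Site;
    public List<LoggedContact> ManualContacts;   // cross-department and external contacts
}

public class LoggedContact
{
    public string Alias;
    public DateTime ContactTime;
    public bool IsRecurring;                     // e.g., a departmental colleague refreshed on a schedule
}

// Drop non-recurring contacts once the exposure window (assumed here to be 14 days) has passed.
void PruneExpiredContacts(EmployeeTwin dt)
{
    dt.ManualContacts.RemoveAll(c => !c.IsRecurring &&
                                     DateTime.UtcNow - c.ContactTime > TimeSpan.FromDays(14));
}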
Maximizing Situational Awareness
Real-time digital twins contain a wealth of dynamic, up-to-the-minute information that can be tapped to help managers maintain situational awareness about rapidly evolving exposures and ongoing progress to contain them. In-memory computing technology can aggregate this data and visually present the results to help immediately spot outbreaks and identify significant problem areas. For example, the following chart, which can be updated every few seconds, shows the number of employees who have just tested positive at each site within a company:
The chart shows a jump in cases for employees at the Florida site. Managers can then investigate the source of the outbreak by department at this site:
Not surprisingly, most cases are occurring in the Retail department, most likely because of its large number of interactions with customers, and this department needs to take additional steps to limit exposure. With real-time aggregate analytics, managers can also track other important indicators, such as the number of employees and sites affected by an outbreak, the average number of interconnected contacts, and the percentage of affected employees who have received notifications and taken action.
Getting Back to Work Safely
As companies strive to return to a normal working environment, managers recognize the need to carefully track the occurrence of COVID-19 in the workplace and minimize its propagation throughout an organization. Immediately notifying and isolating all affected employees helps to limit the size of an outbreak, while analyzing the sources and evolution of incidents assists managers in the moment and as they develop new policies and strategies. With its ability to track and analyze fast-changing data in real time, in-memory computing technology offers a powerful and flexible toolset for contact tracing, helping employees get back to work safely.
Using Real-Time Digital Twins for Aggregate Analytics
When analyzing telemetry from a large population of data sources, such as a fleet of rental cars or IoT devices in “smart cities” deployments, it’s difficult if not impossible for conventional streaming analytics platforms to track the behavior of each individual data source and derive actionable information in real time. Instead, most applications just sift through the telemetry for patterns that might indicate exceptional conditions and forward the bulk of incoming messages to a data lake for offline scrubbing with a big data tool such as Spark.
Maintain State Information for Each Data Source
An innovative new software technique called “real-time digital twins” leverages in-memory computing technology to turn the Lambda model for streaming analytics on its head and enable each data source to be independently tracked and responded to in real time. A real-time digital twin is a software object that encapsulates dynamic state information for each data source combined with application-specific code for processing incoming messages from that data source. This state information gives the code the context it needs to assess the incoming telemetry and generate useful feedback within 1-3 milliseconds.
For example, suppose an application analyzes heart rate, blood pressure, oxygen saturation, and other telemetry from thousands of people wearing smart watches or medical devices. By holding information about each user’s demographics, medical history, medications, detected anomalies, and current activity, real-time digital twins can intelligently assess this telemetry while updating their state information to further refine their feedback to each person. Beyond just helping real-time digital twins respond more effectively in the moment, maintaining context improves feedback over time.
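For instance, the kind of contextual check a wearable-device twin might perform could look like the following C# sketch. The class, property names, and thresholds are hypothetical placeholders for illustration only, not medical guidance or the actual model.

public class PatientTwin : DigitalTwinBase
{
    public int Age;
    public bool OnBetaBlockers;
    public string CurrentActivity;     // e.g., "resting" or "exercising", updated from telemetry
    public int AnomalyCount;           // refined over time as feedback is generated
}

// Evaluate a heart-rate reading in the context of the state held by this user's twin.
bool IsHeartRateAlertable(PatientTwin dt, int heartRate)
{
    int threshold = dt.CurrentActivity == "exercising" ? 160 : 100;   // placeholder values
    if (dt.OnBetaBlockers) threshold -= 20;
    if (dt.Age > 65) threshold -= 10;

    if (heartRate > threshold)
    {
        dt.AnomalyCount++;             // state update that further refines future feedback
        return true;
    }
    return false;
}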
Use State Information for Aggregate Analytics
State information held within real-time digital twins also provides a repository of significant data that can be analyzed in aggregate to spot important trends. With in-memory computing, aggregate analysis can be performed continuously every few seconds instead of waiting for offline analytics in a data lake. In this usage, relevant state information is computed for each data source and updated as telemetry flows in. It is then repeatedly extracted from all real-time digital twins and aggregated to highlight emerging patterns or issues that may need attention. This provides a powerful tool for maximizing overall situational awareness.
Consider an emergency monitoring system during the COVID-19 crisis that tracks the need for supplies across the nation’s 6,100+ hospitals and attempts to quickly respond when a critical shortage emerges. Let’s assume all hospitals send messages every few minutes to this system running in a central command center. These messages provide updates on various types and amounts of shortages (for example, of PPE, ventilators, and medicines) that the hospitals need to quickly rectify. Using state information, a real-time digital twin for each hospital can both track and evaluate these shortages as they evolve. It can look at key indicators, such as the relative importance of each supply type and the rate at which the shortages are increasing, to create a dynamic measure of the urgency with which the hospital needs attention from the command center. All of this data is continuously updated within the real-time digital twin as messages arrive to give personnel the latest status.
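A minimal sketch of such an urgency calculation appears below. The HospitalTwin class, its properties, and the weighting scheme are assumptions made for illustration; a real model would be tuned by the command center.

public class HospitalTwin : DigitalTwinBase
{
    public string Region;
    public Dictionary<string, int> Shortfalls;        // supply type -> current shortfall
    public Dictionary<string, double> ShortfallRates; // supply type -> rate of increase per hour
    public double Urgency;                            // derived measure used for aggregate analysis
}

// Recompute the urgency measure whenever a status update arrives from the hospital.
void UpdateUrgency(HospitalTwin dt, IDictionary<string, double> supplyWeights)
{
    double urgency = 0.0;
    foreach (var shortfall in dt.Shortfalls)
    {
        double weight = supplyWeights.TryGetValue(shortfall.Key, out var w) ? w : 1.0;
        double rate = dt.ShortfallRates.TryGetValue(shortfall.Key, out var r) ? r : 0.0;
        urgency += weight * (shortfall.Value + rate);  // critical supplies and fast growth raise urgency
    }
    dt.Urgency = urgency;
}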
Aggregate analysis can then compare this data across all hospitals by region to identify which regions have the greatest immediate need and track how fast and where overall needs are evolving. Personnel can then query state information within the real-time digital twins to quickly determine which specific hospitals should receive supplies and what specific supplies should be immediately delivered to them. Using real-time digital twins, all of this can be accomplished in seconds or minutes.
This analysis flow is illustrated in the following diagram:
As this example shows, real-time digital twins provide both a real-time filter and aggregator of the data stream from each data source to create dynamic information that is continuously extracted for aggregate analysis. Real-time digital twins also track detailed information about the data source that can be queried to provide a complete understanding of evolving conditions and enable appropriate action.
Numerous Applications Need Real-Time Monitoring
This new paradigm for streaming analytics can be applied to numerous applications. For example, it can be used in security applications to assess incoming telemetry from intrusion sensors, filter out likely false positives, and create an overall likelihood of a genuine threat from a given location within a large physical or cyber system. Aggregate analysis combined with queries can quickly evaluate the overall threat profile, pinpoint the source(s), and track how the threat is changing over time. This information enables personnel to assess the strategic nature of the threat and take the most effective action.
Likewise, disaster-recovery applications can use real-time digital twins to track assets needed to respond to emergencies, such as hurricanes and forest fires. Fleets of rental cars or trucks can use real-time digital twins to track vehicles and quickly identify issues, such as lost drivers or breakdowns. IoT applications can use real-time digital twins to implement predictive analytics for mission-critical devices, such as medical refrigerators. The list goes on.
Summing Up: Do More in Real Time
Conventional streaming analytics platforms attempt only superficial analysis of aggregated data streams and defer the bulk of analysis to offline processing. Because of their ability to maintain dynamic, application-specific information about each data source, real-time digital twins offer breathtaking new capabilities to track thousands of data sources in real time, provide intelligent feedback, and combine this with immediate, highly focused aggregate analysis. By harnessing the scalable power of in-memory computing, real-time digital twins are poised to usher in a new era in streaming analytics.
We invite you to learn more about the ScaleOut Digital Twin Streaming Service, which is available for evaluation and production use today.
Announcing the ScaleOut Digital Twin Streaming Service™
A major challenge for stream-processing applications that track numerous data sources in real time is to analyze telemetry relevant to each specific data source and combine this with dynamic, contextual information about the data source to enable immediate action when necessary. For example, heart-rate telemetry from a smart watch cannot be effectively evaluated in isolation. Instead, it needs to be combined with knowledge of each person’s age, health, medications, and activity to determine when an alert should be generated.
A second and equally daunting challenge for live systems is to maintain real-time situational awareness about the state of all data sources so that strategic responses can be implemented, especially when a rapid sequence of events is unfolding. Whether it’s a rental car fleet with 100K vehicles on the road or a power grid with 40K nodes subject to outages, system managers need to quickly identify the scope of emerging problems and rapidly focus resources where most needed.
Traditional platforms for streaming analytics attempt to look at the entire telemetry pipeline using techniques such as SQL queries to uncover and act on patterns of interest. But this approach is complex and leads to superficial analysis in real time, forcing telemetry to be logged into a data lake for later analysis using Spark or other tools. How do you trigger an alert to the wearer of a smart watch at the exact moment that the combination of telemetry fluctuations and knowledge about the individual’s health indicates that an alert is needed?
The key to creating straightforward stream-processing applications that can deal with these challenges lies in a software concept called the “real-time digital twin model.” Borrowed from the field of product life-cycle management, real-time digital twins host application code that analyzes incoming telemetry (event messages) from each individual data source and maintains dynamically evolving information about the data source. This approach refactors and simplifies application code (which can be written in standard Java, C#, or JavaScript) to focus on just a single data source, introspect deeply, and better predict important events.
The following diagram illustrates how the ScaleOut Digital Twin Streaming Service hosts real-time digital twins that receive telemetry from individual data sources and send responses, including commands and alerts:
Because real-time digital twins maintain and dynamically update key information about each data source, aggregate analytics — essentially, continuous queries — can continuously look for patterns in this curated data instead of in just the raw telemetry. This enables immediate, focused insights that enhance situational awareness. For example, the streaming service can generate a bar chart every few seconds to aggregate and highlight alerts by region generated by examining properties of real-time digital twins for thousands of data sources:
The ScaleOut Digital Twin Streaming Service plugs into popular event hubs, such as Azure IoT Hub, AWS IoT Core, and Kafka, to extract event messages and forward them to real-time digital twin instances, one for each data source. It then triggers application code to process the messages and gives it immediate access to memory-based contextual information for the data source. Application code can generate alerts, command devices, update the contextual information, and read or update databases as needed. This code can be thought of as similar to a serverless function with the major distinction that it is automatically supplied contextual information and does not have to maintain it in an external data store.
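Here is a minimal sketch of what such message-handling code might look like, patterned after the coding style shown in the contact-tracing example later in this collection. The DeviceTwin state, the DeviceMessage type, its fields, and the alert logic are hypothetical; only the general shape (update in-memory context, then respond through the platform) is the point.

// Hypothetical handler for one device's messages; dt holds the twin's in-memory state.
foreach (var msg in newMessages)
{
    dt.LastTemperature = msg.Temperature;             // update contextual state for this device
    dt.ReadingCount++;

    if (msg.Temperature > dt.MaxAllowedTemperature)
    {
        dt.AlertCount++;                              // picked up later by aggregate analytics

        var alert = new DeviceMessage();              // hypothetical outbound message type
        alert.MsgType = MsgType.OverTemperatureAlert;
        alert.Temperature = msg.Temperature;
        context.SendToDataSource(alert);              // send an alert or command back to the device
    }
}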
This highly scalable cloud service is designed to simultaneously and cost-effectively track telemetry from millions of data sources and provide real-time feedback in milliseconds, while also performing continuous, aggregate analytics every few seconds. A powerful UI enables fast deployment of real-time digital twin models created using the ScaleOut Digital Twin Builder software toolkit. The UI lets users build graphical widgets which create and chart aggregate statistics. Under the hood, a powerful in-memory data grid with an integrated compute engine transparently ensures fast, predictable performance.
Given the current COVID-19 crisis, here’s a use case in which the streaming service can assist in prioritizing the distribution of critical medical supplies to the nation’s hospitals. Hospitals distributed across the United States can send status updates to the cloud service regarding their shortfall of supplies such as ventilators and personal protective equipment. Within milliseconds, a dedicated real-time digital twin instance for each hospital can analyze incoming messages to track and evaluate the need for supplies, determine the hospital’s overall shortfall, and assess the urgency for immediate action, as depicted below:
The streaming service can then simultaneously analyze these results across the population of digital twin instances to determine in seconds which regions are exhibiting the most critical shortfall. This alerts logistics managers, who can then query the digital twins to identify specific hospitals and implement a strategic response:
The real-time digital twin approach creates a breakthrough for application developers that both simplifies application development and enhances introspection. It’s ideal for a wide range of applications, including real-time intelligent monitoring (the example above), Industrial Internet of Things (IIoT), logistics, security and disaster recovery, e-commerce recommendations, financial services, and much more. The ScaleOut Digital Twin Streaming Service is available now. We invite interested users to contact us here to learn more.
Real-Time Digital Twins Simplify Development
Writing applications for streaming analytics can be complex and time consuming. Developers need to write code that extracts patterns out of an incoming telemetry stream and takes appropriate action. In most cases, applications just export interesting patterns for visualization or offline analysis. Existing programming models make it too difficult to perform meaningful analysis in real time.
This obstacle clearly presents itself when tracking the behavior of large numbers of data sources (thousands or more) and attempting to generate individualized feedback within several milliseconds after receiving a telemetry message. There’s no easy way to separately examine incoming telemetry from each data source, analyze it in the context of dynamically evolving information about the data source, and then generate an appropriate response.
An Example: Contact Self-Tracing
For example, consider the contact self-tracing application described in a previous blog post. This application tracks messages from thousands of users logging risky contacts who might transmit the COVID-19 virus to them. For each user, a list of contacts must be maintained. If a user signals the app that he or she has tested positive, the app successively notifies chained contacts until all connected users have been notified.
Implementing this application requires that state information be maintained for each user (such as the contact list) and updated whenever a message from that user arrives. In addition, when an incoming message signals that a user has tested positive, contact lists for all connected users must be quickly accessed to generate outgoing notifications.
In addition to this basic functionality, it would be highly valuable to compute real-time statistics, such as the number of users reporting positive by region, the average length of connection chains for all contacts, and the percentage of connected users who also report testing positive.
A Conventional Implementation
This application cannot be implemented by most streaming analytics platforms. As illustrated below, it requires standing up a set of cooperating services (encompassing a variety of skills), including a web front end, application servers, a database service, and an analytics/visualization service.
Implementing and integrating these services requires significant work. After that, issues of scaling and high availability must be addressed. For example, how do we keep the database service from becoming a bottleneck when the number of users and update rate grow large? Can the web front end process notifications (which involve recursive chaining) without impacting responsiveness? If not, do we need to offload this work to an application server farm? What happens if a web or application server fails while processing notifications?
Enter the Real-Time Digital Twin
The real-time digital twin (RTDT) model was designed to address these challenges and simplify application development and deployment while also tackling scaling, high availability, real-time analytics, and visualization. This model uses standard object-oriented techniques to let developers easily specify both the state information to be maintained for each data source and the code required to process incoming messages from that data source. The rest is automatically handled by the hosting platform, which makes use of scalable, in-memory computing techniques that ensure high performance and availability.
Another big plus of this approach is that the state information for each instance of an RTDT can be immediately analyzed to provide important statistics in real time. Because this live state information is held in memory distributed across a cluster of servers, the in-memory computing platform can analyze it and report results every few seconds. There’s no need to suffer the delay and complexity of copying it out to a data lake for offline analysis using an analytics platform like Spark.
The following diagram illustrates the steps needed to develop and deploy an RTDT model using the ScaleOut Digital Twin Streaming Service, which includes built-in connectors for exchanging messages with data sources:
Implementing Contact Self-Tracing Using a Real-Time Digital Twin Model
Let’s take a look at just how easy it is to build an RTDT model for tracking users in the contact self-tracing application. Here’s an example in C# of the state information that needs to be tracked for each user:
public class UserTwin : DigitalTwinBase
{
    public string Alias;
    public string Email;
    public string MobileNumber;
    public Status CurrentStatus;                  // Normal, SignaledPositive, or Notified
    public int NumHopsIfNotified;
    public List<Contact> Contacts;
    public Dictionary<string, Contact> Notifiers;
}
This simple set of properties is sufficient to track each user. In addition to phone and/or email, each user’s current status (i.e., normal, reporting a positive test, or notified of a contact who has tested positive) is maintained. In case the user is notified by another contact, the number of hops to the connected user who reported testing positive is also recorded to help create statistics. There’s also a list of immediate contacts, which is updated whenever the user reports a risky interaction. Lastly, there’s a dictionary of incoming notifications to help prevent sending out duplicates.
The RTDT stream-processing platform automatically creates an instance of this state object for every user when an account is created. The user can then send messages to the platform to either record an interaction or report having tested positive. Here’s the application code required to process these messages:
foreach (var msg in newMessages)
{
    switch (msg.MsgType)
    {
        case MsgType.AddContact:
            newContact = new Contact();
            newContact.Alias = msg.ContactAlias;
            newContact.ContactTime = DateTime.UtcNow;
            dt.Contacts.Add(newContact);
            break;

        case MsgType.SignalPositive:
            dt.CurrentStatus = Status.SignaledPositive;

            // signal all of our contacts that we have tested positive:
            notifyMsg = new UserTwinMessage();
            notifyMsg.Id = dt.Alias;
            notifyMsg.MsgType = MsgType.Notify;
            notifyMsg.ContactAlias = dt.Alias;
            notifyMsg.ContactTime = DateTime.UtcNow;
            notifyMsg.NumHops = 0;

            foreach (var contact in dt.Contacts)
            {
                msgResult = context.SendToTwin("UserTwin", contact.Alias, notifyMsg);
            }
            break;
    }
}
Note that when a user signals that he or she has tested positive, this code sends a Notify message to all RTDT instances corresponding to users in the contact list. Handling this message type requires adding one more case to the above switch statement, as follows:
case MsgType.Notify:
    if (dt.CurrentStatus != Status.SignaledPositive)
        dt.CurrentStatus = Status.Notified;

    // if we have already heard from the root contact, ignore the message:
    if (msg.ContactAlias == dt.Alias || dt.Notifiers.ContainsKey(msg.ContactAlias))
        break;

    // otherwise, add the notifier and signal the user:
    else
    {
        if (dt.NumHopsIfNotified == 0)
            dt.NumHopsIfNotified = msg.NumHops;

        newContact = new Contact();
        newContact.Alias = msg.ContactAlias;
        newContact.ContactTime = msg.ContactTime;
        newContact.NumHops = msg.NumHops + 1;
        dt.Notifiers.Add(msg.ContactAlias, newContact);

        notifyMsg = new UserTwinMessage();
        notifyMsg.Id = dt.Alias;
        notifyMsg.MsgType = MsgType.Notify;
        notifyMsg.ContactAlias = msg.ContactAlias;
        notifyMsg.ContactTime = msg.ContactTime;
        notifyMsg.NumHops = msg.NumHops + 1;
        msgResult = context.SendToDataSource(notifyMsg);

        // finally, notify all our contacts except the root if it's a contact:
        foreach (var contact in dt.Contacts)
        {
            if (contact.Alias != msg.ContactAlias)
                msgResult = context.SendToTwin("UserTwin", contact.Alias, notifyMsg);
        }
    }
    break;
That’s all there is to it. The key point is that the amount of code required to implement this application is small. Compare that to standing up web, application, database, and analytics services. In addition, integrated, real-time analytics within the RTDT platform can examine state variables to easily generate and visualize key statistics. For example, the CurrentStatus property and a Region property (not shown here) can be used to determine the average number of users who have tested positive by region. Likewise, the NumHopsIfNotified property can be used to determine the average number of connections traversed to notify users.
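In the streaming service itself, these statistics are produced by its built-in aggregate analytics and charting widgets. The LINQ fragment below is only a conceptual sketch of how the twin properties support them, assuming a hypothetical Region property and an in-memory collection of all UserTwin instances named allUserTwins.

// Conceptual only: count users reporting positive by region, and average notification chain length.
var positivesByRegion = allUserTwins
    .Where(u => u.CurrentStatus == Status.SignaledPositive)
    .GroupBy(u => u.Region)
    .Select(g => new { Region = g.Key, Count = g.Count() });

var averageHopsToNotify = allUserTwins
    .Where(u => u.CurrentStatus == Status.Notified)
    .Average(u => u.NumHopsIfNotified);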
Summing Up
There’s no doubt that it’s a daunting challenge to create streaming analytics applications that track large numbers of data sources and respond individually to each one while simultaneously generating aggregate statistics that help maintain situational awareness. As we have seen, real-time digital twins can cut through this complexity and enable powerful applications to be quickly built with minimal code. This simplicity also makes them “agile” in the sense that they can be easily modified or extended to handle evolving requirements. You can find detailed information here to help you learn more.
Voluntary & Anonymous Contact “Self”-Tracing at Scale
A New Approach: Log Our Own Contacts in Advance
As we all know, getting people back to work will require testing and contact tracing. The latter will need armies of people to track down all contacts of each person who tests positive for coronavirus. Leading software companies are building mobile apps to log the places we have been and determine possible contacts. However, these apps are complex, relying on Bluetooth wireless technology to detect proximity, and they raise concerns regarding both accuracy and privacy. Experts have stated that humans need to be in the loop for contact tracing to be effective. Maybe there’s a hybrid approach that offers promise in the near term.
What if we could easily and anonymously let people keep track, as they occur, of encounters they judge to be potentially risky, such as those with work colleagues, merchants, neighbors, and friends? As people log these contacts using aliases for privacy, cloud software could build a web of connections, often connecting strangers through intermediaries. Later, when a person tests positive, he or she could notify this software. It then could follow the breadcrumbs and anonymously alert, via text message or email, all people who recently came into contact with that person and should be tested and/or self-isolate.
Here’s an example. When a grocery checker with the alias “flash88” tests positive for COVID-19, he can anonymously alert a chain of people who are connected by direct or intermediate contacts who have logged significant, recent encounters, such as haircuts or backyard barbeques:
Although voluntarily “self-tracing” our contacts would require proactive effort for each of us to log individual encounters, it could be done quickly and simply by just entering a person’s anonymous alias into a mobile app or web page. Most people likely would choose to do this only for lengthy interactions, such as with colleagues working closely together, friends at a barbeque, or perhaps a hairdresser or server at a restaurant. Location information could be included if permitted. What emerges from countless individual actions is a massive, continuously evolving web of contacts that can be immediately called upon for alerting when anyone tests positive.
For self-tracing to be effective, people who come into mutual contact need to register aliases and contact information (email or mobile number) with the software. As each person encounters a potentially risky contact and logs that contact’s alias, the system builds a record of the contact just for the period of time that risk of infection is possible, possibly two weeks. Regular work contacts could automatically be refreshed on a schedule.
The power of self-tracing is its ability to give us automatic, immediate — and anonymous — notification of exposure risk from a chain of contacts. While this tool does not replace standard contact tracing, it can add important value in its simplicity, timeliness, and incorporation of human judgment. As an alternative to more complicated and invasive Bluetooth-based contact tracing software, it could help accelerate the return to work for colleagues and businesses while offloading some of the work for contact tracers.
Advanced Capabilities
Beyond just tracking contacts, cloud-hosted software can keep track of important statistics, such as when and where (if permitted) a person tested positive or was alerted and how many in a chain separate a person from the source of an alert. Real-time analytics can continuously evaluate this data to immediately identify important trends, such as the locations where an unusually large number of alerts are occurring. This allows personnel to quickly respond and institute a self-isolation protocol where necessary.
Real-time analytics also can generate other useful statistics, such as the average chain length that results in co-infection and the percentage of connected people who actually become co-infected. If anonymous demographics are maintained and submitted by users, statistics can be computed by gender, age, and other parameters useful to healthcare professionals.
The Tech: In-Memory Cloud Computing Makes Self-Tracing Fast and Scalable
Cloud-hosted software technology called in-memory computing makes this contact tracing technique fast, scalable and quickly deployable. (It also can be used to assist other contact tracing efforts and for asset tracking.) While simultaneously tracking contacts for millions of people, this software can alert all related contacts in a matter of seconds after notification, and it can send out many thousands of alerts every second.
When a person tests positive and notifies the software, a cloud-based system uses the record of contacts kept with each person to send out alerts, and it follows these records from contact to contact until all mutually connected people have been alerted. In addition, human contact tracers could take advantage of this network (if permitted by users) to aid in their investigations.
ScaleOut Software has developed an in-memory, cloud-hosted software technology called real-time digital twins which make hosting this application easy, fast, and flexible. As illustrated below, cloud software can maintain a real-time digital twin for each user being tracked in the system to keep dynamic data for both alerting and real-time analytics. In addition to processing alerts, this software continuously extracts dynamic data from the real-time digital twins and analyzes it every few seconds. This makes it possible to generate and visualize the statistics described above in real time. Using real-time digital twins enables an application like this to be implemented within just a few days instead of weeks or months.
Note: To the extent possible, ScaleOut Software will make its cloud-based ScaleOut Digital Twin Streaming Service available free of charge (except for fees from the cloud provider) for public institutions needing help tracking data to assist in the COVID19 crisis.
Summing Up
Anonymous, voluntary, contact self-tracing has the potential to fill an important gap as a hybrid approach between manual contact tracing and complex, fully automated, location-based technologies. In work settings, it could be especially useful in protecting colleagues and customers. It offers a powerful tool to help contain the spread of COVID19, maximize situational awareness for healthcare workers and contact tracers, and minimize risks from exposure for all of us.
Use Distributed Caching to Accelerate Online Web Sites
In this time of extremely high online usage, web sites and services have quickly become overloaded, clogged by the need to manage high volumes of fast-changing data. Most sites maintain a wide variety of this data, including information about logged-in users, e-commerce shopping carts, requested product specifications, or records of partially completed transactions. Maintaining rapidly changing data in back-end databases creates bottlenecks that impact responsiveness. In addition, repeatedly accessing back-end databases to serve up popular items, such as product descriptions and news stories, also adds to the bottleneck.
The Solution: Distributed Caching
The solution to this challenge is to use scalable, memory-based data storage for fast-changing data so that web sites can keep up with exploding workloads. A widely used technology called distributed caching meets this need by storing frequently accessed data in memory on a server farm instead of within a database. This speeds up accesses and updates while offloading back-end database servers. Also called in-memory data grids, distributed caches, such as ScaleOut StateServer®, use server farms to both scale storage capacity and accelerate access throughput, thereby maintaining fast data access at all times.
The following diagram illustrates how a distributed cache can store fast-changing data to accelerate online performance and offload a back-end database server:
The Technology Behind Distributed Caching
It’s not enough simply to lash together a set of servers hosting a collection of in-memory caches. To be reliable and easy to use, distributed caches need to incorporate technology that provides important attributes, including ease of integration, location transparency, transparent scaling, and high availability with strong consistency. Let’s take a look at some of these capabilities.
To make distributed caches easy to use and keep them fast, they typically employ a “NoSQL” key/value access model and store values as serialized objects. This enables web applications to store, retrieve, and update instances of their application-defined objects (for example, shopping carts) using a simple key, such as a user’s unique identifier. This object-oriented approach allows distributed caches to be viewed as more of an extension of an application’s in-memory data storage than as a separate storage tier.
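As a simple sketch of this usage pattern, the following C# fragment stores and updates a shopping cart by key. The IDistributedCache interface shown here is a hypothetical stand-in for illustration, not the actual ScaleOut StateServer API.

// Hypothetical client interface for a distributed cache that stores serialized objects by key.
public interface IDistributedCache
{
    void Store(string key, object value);
    T Retrieve<T>(string key) where T : class;
}

[Serializable]
public class ShoppingCart
{
    public string UserId;
    public List<string> ProductIds = new List<string>();
}

// Typical usage from web application code: read, modify, and write back by the user's key.
void AddToCart(IDistributedCache cache, string userId, string productId)
{
    var cart = cache.Retrieve<ShoppingCart>(userId) ?? new ShoppingCart { UserId = userId };
    cart.ProductIds.Add(productId);
    cache.Store(userId, cart);    // kept in memory on the server farm, not in the back-end database
}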
That said, a web application needs to interact with a distributed cache as a unified whole. It’s just too difficult for the application to keep track of which server within a distributed cache holds a given data object. For this reason, distributed caches handle all the bookkeeping required to keep track of where objects are stored. Applications simply present a key to the distributed cache, and the cache’s client library finds the object, regardless of which server holds it.
It’s also the distributed cache’s responsibility to distribute access requests across its farm of servers and scale throughput as servers are added. Linear scaling keeps access times low as the workload increases. Distributed caches typically use partitioning techniques to accomplish this. ScaleOut StateServer further integrates the cache’s partitioning with its client libraries so that scaling is both transparent to applications and automatic. When a server is added, the cache quietly rebalances the workload across all caching servers and makes the client libraries aware of the changes.
To enable their use in mission-critical applications, distributed caches need to be highly available, that is, to ensure that both stored data and access to the distributed cache can survive the failure of one of the servers. To accomplish this, distributed caches typically store each object on two (or more) servers. If a server fails, the cache detects this, removes the server from the farm, and then restores the redundancy of data storage in case another failure occurs.
When there are multiple copies of an object stored on different servers, it’s important to keep them consistent. Otherwise, stale data due to a missed update could inadvertently be returned to an application after a server fails. Unlike some distributed caches which use a simpler, “eventual” consistency model prone to this problem, ScaleOut StateServer uses a patented, quorum-based technique which ensures that all stored data is fully consistent.
There’s More: Parallel Query and Computing
Because a distributed cache stores memory-based objects on a farm of servers, it can harness the CPU power of the server farm to analyze stored data much faster than would be possible on a single server. For example, instead of just accessing individual objects using keys, it can query the servers in parallel to find all objects with specified properties. With ScaleOut StateServer, applications can use standard query mechanisms, such as Microsoft LINQ, to create parallel queries executed by the distributed cache.
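For example, a parallel query over the cached shopping carts sketched earlier might be expressed in LINQ roughly as follows. The cachedCarts queryable is a hypothetical handle onto the cache's contents, and the actual ScaleOut API details differ.

// Illustrative only: find users whose cached carts contain a given product,
// with the filtering evaluated in parallel across the caching servers.
var usersWithProduct =
    from cart in cachedCarts                  // hypothetical IQueryable<ShoppingCart> over the cache
    where cart.ProductIds.Contains("SKU-12345")
    select cart.UserId;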
Although they are powerful, parallel queries can overload both a requesting client and the network by returning a large number of query results. In many cases, it makes more sense to let the distributed cache perform the client’s work within the cache itself. ScaleOut StateServer provides an API called Parallel Method Invocation (and also a variant of .NET’s Parallel.ForEach called Distributed ForEach) which lets a client application ship code to the cache that processes the results of a parallel query and then returns merged results back to the client. Moving code to where the data lives accelerates processing while minimizing network usage.
Distributed Caches Can Help Now
Online web sites and services are now more vital than ever to keeping our daily activities moving forward. Since almost all large web sites use server farms to handle growing workloads, it’s no surprise that server farms can also offer a powerful and cost-effective hardware platform for hosting a site’s fast-changing data. Distributed caches harness the power of server farms to handle this important task and remove database bottlenecks. Also, with integrated parallel query and computing, distributed caches can now do much more to offload a site’s workload. This might be a good time to take a fresh look at the power of distributed caching.
Track Thousands of Assets in a Time of Crisis Using Real-Time Digital Twins
What are real-time digital twins and why are they useful here?
A “real-time digital twin” is a software concept that can track the key parameters for an individual asset, such as a box of masks or a ventilator, and update these parameters in milliseconds as messages flow in from personnel in the field (or directly from smart devices). For example, the parameters for a ventilator could include its identifier, make and model, current location, status (in use, in storage, broken), time in use, technical issues and repairs, and contact information. The real-time digital twin software tracks and updates this information using incoming messages whenever significant events affecting the ventilator occur, such as when it moves from place to place, is put in use, becomes available, encounters a mechanical issue, has an expected repair time, etc. This software can simultaneously perform these actions for hundreds of thousands of ventilators to keep this vital logistical information instantly available for real-time analysis.
With up-to-date information for all ventilators immediately at hand, analysts can ask questions about current availability, location, and status across the country.
These questions can be answered using the latest data as it streams in from the field. Within seconds, the software performs aggregate analysis of this data for all real-time digital twins. By avoiding the need to create or connect to complex databases and ship data to offline analytics systems, it can provide timely answers quickly and easily.
Besides just tracking assets, real-time digital twins also can track needs. For example, real-time digital twins of hospitals can track quantities of needed supplies in addition to supplies of assets on hand. This allows quick answers to questions such as which hospitals currently have the most urgent unmet needs.
Unlike powerful big data platforms which focus on deep and often lengthy analysis to make future projections, what real-time digital twins offer is timeliness in obtaining quick answers to pressing questions using the most current data. This allows decision makers to maximize their situational awareness in rapidly evolving situations.
Of course, keeping data up to date relies on the ability to send messages to the software hosting real-time digital twins whenever a significant event occurs, such as when a ventilator is taken out of storage or activated for a patient. Field personnel with mobile devices can send these messages over the Internet to the cloud service. It also might be possible for smart devices like ventilators to send their own messages automatically when activated and deactivated or if a problem occurs.
The following diagram illustrates how real-time digital twins running in a cloud service can receive messages from thousands of physical data sources across the country:
How Do Real-Time Digital Twins Work?
What gives real-time digital twins their agility compared to complex, enterprise-based data management systems is their simplicity. A real-time digital twin consists of two components: a software object describing the properties of the physical asset being tracked and a software method (that is, code) which describes how to update these properties when an incoming message arrives. This method also can analyze changes in the properties and send an alert when conditions warrant.
Consider a simple example in which a message arrives signaling that a ventilator has been activated for a patient. The software method receives the message and then records the activation time in a property within the associated object. When the ventilator is deactivated, the method can both record the time and update a running average of usage time for each patient. This allows analysts to ask, for example, “What is the average time in use for all ventilators by state?” which could serve as indication of increased severity of cases in some states.
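A minimal sketch of this logic in C# might look like the following. The class, message types, and property names are hypothetical and shown only to illustrate the update of state and the running average.

public class VentilatorTwin : DigitalTwinBase
{
    public string State;                // U.S. state where the unit is currently located
    public DateTime? ActivatedAt;
    public int CompletedUses;
    public double AverageUsageHours;    // running average of time in use per patient
}

void ProcessStatusMessage(VentilatorTwin dt, VentilatorMessage msg)
{
    if (msg.MsgType == MsgType.Activated)
    {
        dt.ActivatedAt = DateTime.UtcNow;           // record the activation time
    }
    else if (msg.MsgType == MsgType.Deactivated && dt.ActivatedAt.HasValue)
    {
        double hours = (DateTime.UtcNow - dt.ActivatedAt.Value).TotalHours;
        dt.CompletedUses++;
        // incrementally update the running average of usage time per patient:
        dt.AverageUsageHours += (hours - dt.AverageUsageHours) / dt.CompletedUses;
        dt.ActivatedAt = null;
    }
}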
Because of this extremely simple software formulation, real-time digital twins can be created and enhanced quickly and easily. For example, if analysts observe that several ventilators are being marked as failed, they could add properties to track the type of failure and the average time to repair.
The power of the real-time digital twin approach lies in the use of a scalable, in-memory computing system which can host thousands (or even millions) of twins to track very large numbers of assets in real time. The computing system also has the ability to perform aggregate analytics in seconds on the continuously evolving data held in the twins. This enables analysts to obtain immediate results with the very latest information and make decisions within minutes.
In this time of crisis, it’s likely the case that the technology of real-time digital twins has arrived at the right time to help our overtaxed medical professionals.
Note: To the extent possible, ScaleOut Software will make its cloud-based ScaleOut Digital Twin Streaming Service available free of charge (except for fees from the cloud provider) for institutions needing help tracking data to assist in the COVID19 crisis.
Real-Time Digital Twins: A New Approach to Streaming Analytics
A conventional pipeline combines telemetry from all data sources into a single stream which is queried by the user’s streaming analytics application. This code often takes the form of a set of SQL queries (extended with time-windowing semantics) running continuously to select interesting events from the stream. These query results are then forwarded to a data lake for offline analytics using tools such as Spark and for data visualization. Query results also might be forwarded to cloud-based serverless functions to trigger alerts or other actions in conjunction with access to a database or blob store.
These techniques are highly effective for analyzing telemetry in aggregate to identify unusual situations which might require action. For example, if the telemetry is tracking a fleet of rental cars, a query could report all cars by make and model that have reported a mechanical problem more than once over the last 24 hours so that follow-up inquiries can be made. In another application, if the telemetry stream contains key-clicks from an e-commerce clothing site, a query might count how many times a garment of a given type or brand was viewed in the last hour so that a flash sale can be started.
A key limitation of this approach is that it is difficult to separately track and analyze the behavior of each individual data source, especially when they number in the thousands or more. It’s simply not practical to create a unique query tailored for each data source. Fine-grained analysis by data source must be relegated to offline processing in the data lake, making it impossible to craft individualized, real-time responses to the data sources.
For example, a rental car company might want to alert a driver if she/he strays from an allowed region or appears to be lost or repeatedly speeding. An e-commerce company might want to offer a shopper a specific product based on analyzing the click-stream in real time with knowledge of the shopper’s brand preferences and demographics. These individualized actions are impractical using the conventional tools of real-time streaming analytics.
However, real-time digital twins easily bring these capabilities within reach. Take a look at how the streaming pipeline differs when using real-time digital twins:
The first important difference to note is that the execution platform automatically correlates telemetry events by data source. This avoids the need for the application to select events by data source using queries (which is impractical in any case when using a conventional pipeline with many data sources). The second difference is that real-time digital twins maintain immediately accessible (in-memory) state information for each data source which is used by message-processing code to analyze incoming events from that data source. This enables straightforward application code to immediately react to telemetry information in the context of knowledge about the history and state of each data source.
For example, the rental car application can keep each driver’s contract, location history, and the car’s known mechanical issues and service history within the corresponding digital twin for immediate reference to help detect whether an alert is needed. Likewise, the e-commerce application can keep each shopper’s recent product searches along with brand preferences and demographics in her/his digital twin, enabling timely suggestions targeted to each shopper.
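To make this concrete, here is a minimal sketch of what a per-driver digital twin might look like, written in plain Java with hypothetical class and method names (the actual ScaleOut APIs differ); it simply shows how message-processing code can use in-memory state kept for one data source:

```java
// Hypothetical sketch (not the actual ScaleOut API): a per-driver digital twin
// that analyzes each telemetry message in the context of state held in memory.
import java.util.ArrayDeque;
import java.util.Deque;

public class RentalCarTwin {
    // Contextual state kept in memory for this one data source (car/driver).
    private final String contractId;
    private final Deque<Double> recentSpeeds = new ArrayDeque<>();
    private int speedingEvents = 0;

    public RentalCarTwin(String contractId) {
        this.contractId = contractId;
    }

    // Called for every message from this car only; other cars never reach this instance.
    public void processTelemetry(double speedMph, double speedLimitMph) {
        recentSpeeds.addLast(speedMph);
        if (recentSpeeds.size() > 20) recentSpeeds.removeFirst();  // keep a short history

        if (speedMph > speedLimitMph + 10) {
            speedingEvents++;
        }
        // Alert only when this driver's history justifies it.
        if (speedingEvents >= 3) {
            raiseAlert("Repeated speeding on contract " + contractId);
            speedingEvents = 0;
        }
    }

    private void raiseAlert(String message) {
        System.out.println("ALERT: " + message);  // placeholder for a real alerting channel
    }
}
```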
The power of real-time digital twins lies in their ability to make fine-grained analysis and responses possible in real time for thousands of data sources. They are made possible by scalable, in-memory computing technology hosted on clusters of cloud-based servers. This provides the fast response times and scalable throughput needed to support many thousands of data sources.
Lastly, real-time digital twins open the door to real-time aggregate analytics that analyze state data across all instances to spot emerging patterns and trends. Instead of waiting for the data lake to provide insights, aggregate analytics on real-time digital twins can immediately surface patterns of interest, maximizing situational awareness and assisting in the creation of response strategies.
With aggregate analytics, the rental car company can identify regions with unusual delays due to weather or highway blockages and then alert the appropriate drivers to suggest alternative routes. The e-commerce company can spot hot-selling products perhaps due to social media events and respond to ensure that inventory is made available.
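As a simplified stand-in for the platform's built-in aggregate analytics (which run continuously inside the service), the following Java sketch shows the kind of roll-up involved, grouping each twin's analysis results by region; the DriverState record is a hypothetical stand-in for a twin's state object:

```java
// Simplified illustration only: group per-twin results by region to see where delays cluster.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

record DriverState(String region, boolean delayed) {}  // hypothetical twin state

public class AggregateSketch {
    public static Map<String, Long> delaysByRegion(List<DriverState> twinStates) {
        return twinStates.stream()
                .filter(DriverState::delayed)
                .collect(Collectors.groupingBy(DriverState::region, Collectors.counting()));
    }
}
```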
Real-time digital twins create exciting new capabilities that were not previously possible with conventional techniques. You can find detailed information about ScaleOut Software’s cloud service for real-time digital twins here.
The post Real-Time Digital Twins: A New Approach to Streaming Analytics appeared first on ScaleOut Software.
]]>The post The Next Generation in Logistics Tracking with Real-Time Digital Twins appeared first on ScaleOut Software.
]]>Traditional platforms for streaming analytics don’t offer the combination of granular data tracking and real-time aggregate analysis that logistics applications in operational environments such as these require. It’s not enough just to pick out interesting events from an aggregated data stream and then send them to a database for offline analysis using Spark. What’s needed is the ability to easily track incoming telemetry from each individual store so that issues can be quickly analyzed, prioritized, and handled. At the same time, it’s important to be able to combine data and generate analytics results for all stores in real time so that strategic decisions, such as running flash sales or replenishing hot-selling inventory, can be made without unnecessary delay.
The answer to these challenges is a new software concept called the “real-time digital twin.” This breakthrough approach for streaming analytics lets application developers separately track and analyze incoming telemetry from each individual store while the platform handles the task of filtering out telemetry associated with other stores. This dramatically simplifies application code and automatically scales its use by letting the execution platform run this code simultaneously for all stores. In addition, the platform provides fast, in-memory data storage so that the application can easily and quickly record both telemetry and analytics results for each store. Lastly, built-in, aggregate analysis tools make it easy to immediately roll up these analytics results across all stores and continuously keep track of the big picture.
The ScaleOut Digital Twin Streaming Service, which runs in the Microsoft Azure cloud, hosts real-time digital twins for applications like these that need to track thousands of data sources. It can receive telemetry from retail stores over the Internet using event delivery systems such as Azure IoT Hub, AWS, Kafka, and REST, and it can respond back to the retail stores in milliseconds. In addition, the cloud service hosts aggregate analytics that refresh results every few seconds, avoiding the need to wait for offline results from a data lake.
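For illustration, a store could publish its telemetry with a standard Kafka producer, keying each message by store ID so it can be correlated with that store's digital twin; the broker address, topic name, and JSON payload below are hypothetical:

```java
// Standard Kafka producer; the topic and payload shape are illustrative assumptions.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class StoreTelemetrySender {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key the message by store ID so the receiving side can route it to that store's twin.
            String storeId = "store-1042";
            String payload = "{\"inventoryLow\":[\"sku-778\"],\"checkoutWaitMinutes\":7}";
            producer.send(new ProducerRecord<>("store-telemetry", storeId, payload));
        }
    }
}
```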
The following diagram illustrates a nationwide chain of retail outlets streaming telemetry to their counterpart real-time digital twins running in the cloud service. It also shows real-time aggregate results being fed to displays for immediate consumption by operations managers.
With this ground-breaking technology, operations managers now can easily manage logistics for thousands of retail outlets, immediately identify and respond to issues, and optimize operations in real time. By harnessing the power and simplicity of the real-time digital twin model, application developers can quickly implement and customize streaming analytics to meet these mission-critical requirements. With the real-time digital twin model, the next generation of streaming analytics has arrived.
The post The Next Generation in Logistics Tracking with Real-Time Digital Twins appeared first on ScaleOut Software.
]]>The post Real-Time Digital Twins Simplify Code in Streaming Applications appeared first on ScaleOut Software.
]]>The problem with this approach is that important insights requiring quick action are not immediately available. In applications which require real-time responses, situational awareness suffers, and decision making must wait for human analysts to wade through the telemetry. For example, consider a large power grid with tens or hundreds of thousands of nodes, such as power poles, controllers, and other critical elements. When outages (possibly due to cyber-attacks) occur, they can originate simultaneously in numerous locations and spread quickly. It’s critical to be able to prioritize the telemetry from each data source and determine within a few seconds the scope of the outages. This maximizes situational awareness and enables a quick, strategic response.
The following diagram with 20,000 nodes in a simulated power grid gives you an idea of how many data sources need to be simultaneously tracked and analyzed:
The solution to this challenge is to refactor the problem so that application code can ignore the overall complexity of the event stream and focus on using domain-specific expertise to analyze telemetry from a single data source. The concept of “real-time digital twins” makes this possible. Borrowed from its usage in product life-cycle management and simulation, a digital twin provides an object-oriented container for hosting application code and data. In its usage in streaming analytics, a real-time digital twin hosts an application-defined method for analyzing event messages from a single data source combined with an associated data object:
The data object holds dynamic, contextual information about a single data source and the evolving results derived from analyzing incoming telemetry. Properties in the data objects for all data sources can be fed to real-time aggregate analysis (performed by the stream-processing platform) to immediately spot patterns of interest in the analytic results generated for each data source.
Here is an illustration of real-time digital twins processing telemetry from towers in a power grid from the above example application:
Real-time digital twins dramatically simplify application code in two ways. First, they automatically handle the correlation of incoming event messages so that the application just sees messages from a single data source. Second, they make contextual data immediately available to the application. This removes a significant amount of application plumbing (and a source of performance bottlenecks) and hides it in the execution platform. Note that the platform can seamlessly scale its performance by running thousands of real-time digital twins in parallel.
The net result is that the application developer only needs to write a single method that embeds domain-specific logic that can interpret telemetry from a given data source in the context of dynamic information about that specific data source. It can then signal alerts and/or send control messages back to the data source. Importantly, it also can record the results of its analysis in its data object and subject these results to real-time, aggregate analytics.
In the above power grid example, the application requires only a few lines of code to embed a set of rules for interpreting state changes from a node in the power grid. These state changes indicate whether a cyber-attack or outage is suspected or another abnormal condition exists. The rules decide whether this node's location, role, and history of state changes, which might include numerous false alarms, justify raising an alert status that's fed to aggregate analysis. The code runs with every incoming message and updates the alert status along with other useful information, such as the frequency of false alarms.
The real-time digital twin effectively filters incoming telemetry using domain-specific knowledge to generate real-time results, and aggregate analysis, also in real time, intelligently identifies the nodes which need immediate attention. It does this with a surprisingly small amount of code.
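A minimal, hypothetical sketch of such per-node rules in Java (not the actual platform API, and not the code from the original post) might look like this:

```java
// Illustrative per-node rules for a power grid digital twin; names are hypothetical.
public class GridNodeTwin {
    private int falseAlarms = 0;
    private int consecutiveAbnormal = 0;
    private boolean alertRaised = false;

    // Runs for every state-change message from this one node.
    public void processStateChange(String newState, boolean confirmedByNeighbor) {
        if ("abnormal".equals(newState)) {
            consecutiveAbnormal++;
            // Suppress isolated glitches; this node's history may include many false alarms.
            if (consecutiveAbnormal >= 3 || confirmedByNeighbor) {
                alertRaised = true;   // surfaced to aggregate analysis across all nodes
            }
        } else {
            if (consecutiveAbnormal > 0 && !alertRaised) falseAlarms++;
            consecutiveAbnormal = 0;
            alertRaised = false;
        }
    }

    public boolean isAlertRaised() { return alertRaised; }
    public int getFalseAlarmCount() { return falseAlarms; }
}
```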
This is one of many possible applications for real-time digital twins. Others include fleet and traffic management, healthcare, financial services, IoT, and e-commerce recommendations. ScaleOut Software's recently announced cloud service provides a powerful, scalable platform for hosting real-time digital twins and performing aggregate analytics with an easy-to-use UI. We invite you to check it out.
The post Real-Time Digital Twins Simplify Code in Streaming Applications appeared first on ScaleOut Software.
]]>The post ScaleOut to Exhibit at Internet of Things World appeared first on ScaleOut Software.
]]>ScaleOut's digital twin technology takes the digital twin concept well beyond its original application within product life-cycle management. Now digital twins can be used for stateful stream-processing, enabling deeper introspection than previously possible while simplifying the design process. Real-time digital twins enable automatic event correlation by data source, dynamic state tracking, pluggable analytics algorithms, aggregate analysis in real time, transparent scaling, and more. Object-oriented design techniques make them accessible to mainstream Java, C#, and JavaScript developers.
ScaleOut Software offers a full product suite for developing real-time digital twins, including the ScaleOut Digital Twin Builder software toolkit and the ScaleOut StreamServer® in-memory computing platform.
The post ScaleOut to Exhibit at Internet of Things World appeared first on ScaleOut Software.
]]>The post Use Parallel Analysis – Not Parallel Query – for Fast Data Access and Scalable Computing Power appeared first on ScaleOut Software.
]]>To help ensure fast data access and scalability, IMDGs usually employ a straightforward key/value storage model. This model works well for storing large, object-oriented collections of business-logic state, such as the examples listed above. Each object is stored in the grid as a serialized version of the application’s in-memory counterpart and is accessed with a unique key defined within a namespace (as shown in the diagram below). (Namespaces typically hold objects of a single, language-defined type.) In contrast to relational, graph-oriented, and other more complex storage models, key/value stores usually deliver faster data access because key lookups can be quickly completed with low overhead.
Application developers often deploy IMDGs as a distributed cache that sits between an application and its database; the IMDG offloads ephemeral data from the database. For example, it can be used to host short-lived business logic state used to prepare transactions. Offloading the database boosts performance, reduces bottlenecks, and lowers costs.
When used as a cache, developers often view an IMDG with a database mindset and access grid data using traditional, database-oriented techniques, such as SQL query, instead of key-based lookup. For example, an object describing a customer might be retrieved by querying the customer’s last name instead of performing a key-based lookup of the customer’s unique account number. That’s where performance problems can begin.
When used in an IMDG, a typical query seeks to access a set of objects matching specific properties. Just as a relational database queries a table by attributes, an IMDG queries a distributed, in-memory, object collection by matching class-based properties. In both cases, many results may match a query and be returned to the requesting client. When used sparingly, IMDG queries work quite well. However, when applications rely on query as the primary access model, access throughput can be seriously degraded, and overall application performance can suffer in two ways.
First, unlike key-based access, which is directed to a specific grid server to retrieve an object, a query requires the participation of all grid servers. This enables the IMDG to find all matching objects, which potentially reside on multiple servers within the distributed store. So a query incurs O(N) overhead on a cluster of N grid servers, while a key lookup requires only O(1). Since an IMDG typically has many clients simultaneously making access requests, the combined overhead for many parallel queries can quickly grow. For this reason, query should be avoided when a key lookup will suffice.
A bigger problem with query is that when it matches many results, a large amount of data may need to be returned over the network to the requesting client for processing, as illustrated below. This can quickly saturate the network (and bog down the client). When you consider that an IMDG can easily host terabytes of data distributed across several servers, it's not surprising to see a single query return tens of megabytes (MB) of results. A gigabit network can move a peak of only about 128 MB/sec (and delays can start increasing at about half that), so large queries can (and often do) overload the network. And after a query returns the requested objects to the client for processing, the client must then wade through them all, potentially creating another bottleneck on the client.
What’s interesting about this dilemma is that the IMDG’s apparent weakness is actually its key strength. It’s all in how the application looks at the problem to be solved. Instead of querying the grid, what if we just moved this work from the client into the grid and performed it there? This would enable the application to avoid bottlenecks and harness the IMDG’s scalable computing power to boost performance.
The technique is called data-parallel computing, and many IMDGs (like ScaleOut StateServer Pro®) provide APIs that make it easy to use in languages like Java, C#, and C++. In its simplest form, the application ships off a class-defined method (call it “Eval”) to execute in the grid along with a query specification, and the IMDG distributes the work across all of its servers, querying each server locally and then running the application’s method on the selected objects. The application optionally can define a second method (call it “Merge”) to combine the results and return them back to the client. Running a method in the grid can be compared to executing a stored procedure in a database. The following diagram illustrates this concept:
Using data-parallel computing instead of parallel query gives an application two big wins. First, moving the code to the data dramatically reduces the amount of data transferred over the network since the results of the computation are usually dramatically smaller than the size of the original query results. Second, the grid’s scalable computing power reduces execution time while avoiding a bottleneck in the client. This scales the application’s throughput as the size of the workload increases.
We have seen this computing model’s utility in countless applications. Consider, for example, a hedge fund storing portfolios of stocks in an IMDG as objects, where each portfolio tracks a given market sector (high tech, energy, healthcare, etc.). When the stock exchange’s ticker feed updates a stock price, the hedge fund needs to evaluate all corresponding portfolios to see if rebalancing is needed. The obvious way to implement this is to query the grid for all portfolios containing the updated stock and then analyze them in the client. However, this requires large amounts of data to cross the network and creates lots of work for the client. Instead, the client can simply kick off a data-parallel computation in the grid on all portfolios that contain the stock and let the grid perform this work quickly (and scalably). Using this technique, a hedge fund was able to see the time for rebalancing drop from several minutes to less than half a second.
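The following plain-Java sketch shows the shape of this "Eval"/"Merge" pattern for the portfolio example; the Portfolio interface and method signatures are hypothetical, and the real IMDG APIs differ, but the structure conveys how the per-object work and the combining step are split:

```java
// Illustrative shape of the Eval/Merge pattern described above (not a real IMDG API).
import java.util.List;

interface Portfolio {
    boolean needsRebalancing(String symbol, double newPrice);  // hypothetical business logic
}

public class RebalanceCheck {
    // "Eval" runs inside the grid on each portfolio selected by the query.
    public static int eval(Portfolio p, String updatedSymbol, double newPrice) {
        return p.needsRebalancing(updatedSymbol, newPrice) ? 1 : 0;
    }

    // "Merge" combines partial results from each grid server into one answer for the client.
    public static int merge(int left, int right) {
        return left + right;
    }

    // What one server would do locally over its selected objects before merging across servers.
    public static int countNeedingRebalance(List<Portfolio> selected, String symbol, double price) {
        int result = 0;
        for (Portfolio p : selected) {
            result = merge(result, eval(p, symbol, price));
        }
        return result;
    }
}
```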
Although it often masquerades as a distributed cache, an IMDG actually is a scalable, in-memory computing platform – not that different from a parallel supercomputer running on commodity hardware. With a small change in mindset, developers easily can harness its computing power to eliminate bottlenecks and reap big dividends in performance.
The post Use Parallel Analysis – Not Parallel Query – for Fast Data Access and Scalable Computing Power appeared first on ScaleOut Software.
]]>The post The Power of an In-Memory Data Grid for Hosting Digital Twin Models appeared first on ScaleOut Software.
]]>The secret to the power of the digital twin model is its focus on organizing real-time data by the data source to which it corresponds. What this means is that all event messages from each data source are correlated into a time-ordered collection, which is associated with both dynamic state information and historical knowledge of that data source. This gives the stream-processing application a rich context for analyzing event messages and determining what actions need to be taken in real time.
For example, consider an application which tracks a rental car fleet to look for drivers who are lost or driving recklessly. This application can use the digital twin model to correlate real-time telemetry (e.g., location, speed) for each car in the fleet and combine that with data about the driver’s contract, safety record with the rental car company, and possibly driving history. Instead of just examining the latest incoming events, the application now has much more information instantly available to judge when and whether to signal an alert.
When implementing a stateful stream-processing application using the digital twin model, an object-oriented approach offers obvious leverage in managing the data and event processing associated with this model. For each type of data source, the developer can create a data type (object “class”) which describes the event collection, real-time state data, and analysis code to be executed when a new event message arrives. Instances of this data type then can be created for each unique data source as its digital twin for stream-processing. Here is a depiction of a digital twin object and its associated methods:
This object-oriented approach makes an in-memory data grid (IMDG) with integrated in-memory computing (e.g., ScaleOut StreamServer) an excellent stream-processing platform for real-time digital twin models. IMDGs have a long history dating back almost two decades. They were originally created as middleware software to host large populations of fast-changing data objects, such as e-commerce session state and shopping carts, in memory instead of database servers, thereby speeding up access. An IMDG implements a software-based, key-value store of serialized objects that spans a cluster of commodity servers (or cloud instances). Its architecture provides cost-effective scalability and high availability, while hiding the complexity of distributed in-memory storage from the applications that use it. It also can take full advantage of a cluster's computing power to run application code within the IMDG — where the data lives — to maximize performance and avoid network bottlenecks.
Using an example from financial services, the following diagram illustrates an application’s logical view of an IMDG as a collection of objects (stock sectors in this case) and its physical implementation as in-memory storage spanning a cluster of servers:
It should now be clear why IMDGs offer a great way to host digital twin models and perform stateful stream-processing. An IMDG can store instances of a digital twin model and scale its storage capacity and event-processing throughput by adding servers to hold many thousands of instances as needed for all the data sources. The grid's key-value storage model automatically correlates incoming event messages with the corresponding digital twin instance based on the data source's identifying key. When the grid delivers an event to its digital twin, it then runs the model's application code for event ingestion, analysis, and alerting in real time.
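As a rough illustration of this key-based correlation, the sketch below uses a plain ConcurrentHashMap as a stand-in for the grid's distributed key-value store; the class names are hypothetical and the real platform handles this routing transparently across servers:

```java
// Simplified stand-in: route each event to the digital twin keyed by its data source.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class TwinDispatcher {
    private final ConcurrentMap<String, CarTwin> twins = new ConcurrentHashMap<>();

    // Each event carries the identifying key of its data source (here, the car ID).
    public void onEvent(String carId, double speedMph, double limitMph) {
        // computeIfAbsent mimics the grid creating a twin instance on first contact with a key.
        twins.computeIfAbsent(carId, CarTwin::new).processTelemetry(speedMph, limitMph);
    }
}

class CarTwin {
    private int excessiveSpeedCount = 0;

    CarTwin(String carId) {}  // key supplied by computeIfAbsent

    void processTelemetry(double speedMph, double limitMph) {
        if (speedMph > limitMph) excessiveSpeedCount++;  // analysis result kept in twin state
    }
}
```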
As an example, the following diagram illustrates the use of an IMDG to host digital twin objects for rental cars. The arrows indicate that the IMDG directs events from each car to its corresponding digital twin for event-processing.
As part of event-processing, digital twins can create alerts for human attention and/or feedback directed at the corresponding data source. In addition, the collection of digital twin objects stored in the IMDG can be queried or analyzed using data-parallel techniques (e.g., MapReduce) to extract important aggregate patterns and trends. For example, the rental car application could alert managers when a driver repeatedly exceeds the speed limit according to criteria specific to the driver’s age and driving history. It also could allow a manager to query the status of a specific car or compute the maximum excessive speeding for all cars in a specified region. These data flows are illustrated in the following diagram:
Because they are hosted in memory, digital twin models can react very quickly to incoming events, and the IMDG can scale by adding servers to keep event processing times fast even when the number of instances (objects) and/or event rates become very large. Although the in-memory state of a digital twin is restricted to the event and state data needed for real-time processing, the application can reference historical data from external database servers to broaden its context, as shown below. For example, the rental car application could access driving history only when incoming telemetry indicates a need for it. It also could store past events in a database for archival purposes.
What makes an IMDG an excellent fit for stateful stream-processing is its ability to transparently host both the state information and application code within a fast, highly scalable, in-memory computing platform and then automatically direct incoming events to their respective digital twin instances within the grid for processing. These two key capabilities, namely correlating events by data sources and analyzing these events within the context of the digital twin’s real-time state information, give applications powerful new tools and create an exciting new way to think about stateful stream-processing.
The post The Power of an In-Memory Data Grid for Hosting Digital Twin Models appeared first on ScaleOut Software.
]]>The post Video: ScaleOut Software Joins Forces with IBM POWER8* Systems appeared first on ScaleOut Software.
]]>ScaleOut Software joins forces with IBM to help POWER8 system users achieve unprecedented speeds for live, fast-changing, big data analysis in Hadoop environments.
In the following IBM Systems ISV testimonial video, ScaleOut Software CEO and Founder, Dr. William Bain, describes the benefits of optimizing ScaleOut Software’s in-memory computing technology on the IBM POWER8 platform: POWER8 and ScaleOut Software: In-memory computing for operational intelligence.
https://www.youtube.com/watch?v=qSWx5GY18j4
The alliance with IBM is a marriage between two best-in-class, high-performance hardware and software technologies optimized for low latency and real-time execution. The POWER8 system offers unique capabilities in its hardware and a highly optimized Java execution environment. ScaleOut Software’s in-memory computing technology brings techniques from parallel supercomputing and big data to live applications, and we have optimized our software to take advantage of the POWER8’s breakthrough architecture.
The combination of ScaleOut hServer®, an in-memory MapReduce execution platform, and IBM POWER8 equips organizations with the ability to track and analyze constantly changing data, empowering them to make more informed decisions in the moment. For example, benchmark testing with a financial services application showed that ScaleOut hServer, on an IBM POWER8 cluster, can track 20M positions, analyze strategies, and alert traders in under 3.4 seconds, in contrast to the 15 minutes or more that conventional techniques require. ScaleOut hServer can track real-time changes to investment strategies throughout the trading day and intelligently identify imbalances on a portfolio-by-portfolio basis.
Unlike other in-memory computing and data-parallel execution engines, ScaleOut hServer provides three distinct advantages. First, ScaleOut hServer's in-memory data grid holds live, fast-changing data that can be simultaneously updated and analyzed at in-memory speeds. Second, ScaleOut hServer reduces the overhead associated with standard Hadoop distributions by avoiding unnecessary batch scheduling, disk access, data shuffling, and mandatory key sorting. Finally, ScaleOut hServer on IBM POWER8 allows IT departments to run MapReduce jobs with zero code changes, streamlining development and accelerating time to results.
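For reference, a job written against the ordinary Hadoop MapReduce API, such as the classic word count below, is the kind of code meant by "zero code changes"; nothing in it is specific to ScaleOut hServer, and whether a given job benefits from the in-memory engine depends on the workload:

```java
// Standard Hadoop MapReduce word count; unchanged jobs like this are the target of the claim above.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```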
This powerful combination represents the future of in-memory computing, giving customers the ability to analyze live, fast-changing data as it fluctuates in real time. With ScaleOut hServer running on the POWER8 system, customers now can run in-memory MapReduce applications faster than ever and for the first time also on live, fast-changing data.
* IBM and POWER8 are registered trademarks of IBM Corporation.
The post Video: ScaleOut Software Joins Forces with IBM POWER8* Systems appeared first on ScaleOut Software.
]]>