Executive Summary

  • Edge computing and IoT-based distributed architectures differ from architectures built on orchestration frameworks designed to run within a data center.
  • Edge computing and IoT architectures are intended for dedicated devices deployed over a wide geography.
  • As such, the redundancy and replication techniques used for systems hosted in data centers do not apply.
  • To address this difference, architects need to rethink redundancy and replication within the edge computing paradigm.

Replication and redundancy have been key components of computing for a long time, going back to the heyday of the mainframe. Back then, if a mainframe lost power, everything stopped. Organizations addressed this risk by keeping generators and backup power supplies on hand. If power from the main grid failed, the generators took over, and operations continued without interruption.

Mainframes also stored data exclusively on or within the machine. Thus, if the storage mechanism failed, data was lost. So companies backed up the data to tape, an early form of data replication.

When personal computers first appeared, they, too, used the same redundancy and replication techniques as mainframes. A user kept an uninterruptible power supply close by in case of power failure, and data was replicated to tape or floppy disk.

Things changed when networking PCs together made distributed computing possible. This was particularly evident in database technology: companies networked a number of computers together, with one computer hosting the database server and others acting as file servers that stored the data the database used.

Eventually, database technology matured to the point where the database was smart enough to replicate data among a variety of machines. The technology progressed even further: multiple databases running the same processing logic were placed behind a load balancer, a traffic cop, if you will. The load balancer routed incoming traffic among the various redundant database servers, so that no single server was overloaded.
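To make the traffic-cop metaphor concrete, here is a minimal sketch of round-robin load balancing in Python. The server names are hypothetical placeholders, and a production balancer would also track server health, but the routing idea is the same.

```python
import itertools

# Hypothetical pool of redundant database servers sitting behind the balancer.
SERVERS = ["db-replica-1", "db-replica-2", "db-replica-3"]

# itertools.cycle rotates through the pool endlessly: round-robin routing.
_rotation = itertools.cycle(SERVERS)

def route_request(query: str) -> str:
    """Pick the next server in the rotation and send the query its way."""
    server = next(_rotation)
    print(f"routing {query!r} to {server}")
    return server

# Each incoming request lands on a different replica, so no single
# server absorbs all of the load.
for q in ["SELECT 1", "SELECT 2", "SELECT 3", "SELECT 4"]:
    route_request(q)
```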

Redundancy and replication have withstood the test of time. Both are used extensively today, most notably with applications that are hosted in a data center and accessed over the internet. Yet, as popular as replication and redundancy are, they are not without limits. These limitations become particularly apparent when working with edge computing and the Internet of Things.

Distributed systems at the edge are not the same as distributed systems hosted within a data center. Edge computing is a new approach to machine distribution, one that requires those designing distributed applications for the edge to reconceptualize replication and redundancy.

The purpose of this article is to examine new ways to think about replication and redundancy as they relate to distributed edge computing. The place to start is understanding the essential difference between the traditional approach to distributed computing, which focuses on the data center, and distributed computing on edge devices and the Internet of Things.

Typical redundancy in a distributed system in a data center

A typical approach to distributed computing is to use a pattern in which replicas of a particular algorithm are represented by a service layer. The service then becomes one of many services, each representing a different algorithm, that are accessed via some sort of gateway mechanism. Each service has load balancing capabilities that ensure that no one instance of its underlying algorithm is overloaded. (See Figure 1.)

Figure 1: A typical pattern in distributed architecture in which redundancy ensures availability and efficient performance

Kubernetes uses this type of distribution pattern, as does Docker Swarm.

The benefit of this pattern is that using redundant algorithms ensures resilience. If one of the instances goes down, other identical instances of the algorithm are still available to provide computing logic. And, if automatic replication is in force, when an instance goes down, the replication mechanisms can try to resurrect it. If the instance can’t be reinstated, the replication mechanism will create a new one to take its place. Replication of this type is used by Kubernetes with its Deployment resource.
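The replication behavior described above can be sketched as a control loop. The following Python illustrates the reconciliation concept behind a Kubernetes Deployment; it is not Kubernetes' actual implementation, and the instance-management helpers simply simulate a pool of replicas in memory.

```python
import itertools
import random
import time

DESIRED_REPLICAS = 3
_ids = itertools.count(1)
_instances: set[str] = set()

def list_healthy_instances() -> set[str]:
    """Simulated health check: each cycle, an instance may randomly fail."""
    global _instances
    _instances = {i for i in _instances if random.random() > 0.2}
    return _instances

def start_instance() -> str:
    """Simulate spinning up a fresh replacement instance."""
    instance_id = f"instance-{next(_ids)}"
    _instances.add(instance_id)
    return instance_id

def reconcile() -> None:
    """Compare the desired state with the observed state and repair the gap."""
    healthy = list_healthy_instances()
    for _ in range(DESIRED_REPLICAS - len(healthy)):
        # A failed instance is not repaired in place; a replacement is created.
        print(f"replacing failed instance with {start_instance()}")

# The control loop runs continuously, converging on the desired replica count.
for _ in range(10):  # bounded here so the sketch terminates
    reconcile()
    time.sleep(0.1)
```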

As powerful as this type of architecture is, it’s not magical. A lot of work needs to go into getting and keeping an architecture of this type up and running. First and foremost, the various components that make up the system need to know a good deal about each other. At the logical level, a service needs to know about its algorithms, and the gateway mechanism needs to know about the services it’s supporting. At the physical level, services and algorithms reside on separate machines; therefore, access between and among machines needs to be granted accordingly.

This can become an arduous task. Imagine an architecture like the one shown below in Figure 2. The service lives on one machine, and each instance of its algorithm lives on a distinct machine. Should one machine go down, another needs to replace it. In the old days, this meant that someone actually had to go down to the data center, physically install a new machine, and then add it to the network.

Figure 2: Replication can be very hard to support at the hardware level, particularly when a new machine needs to be added to the system.

Of course, modern distributed technologies have evolved to the point where machine replacement means nothing more than spinning up a virtual machine on a host computer and then adding that VM to the network. However, while automation will do the work, the laws of time and space still exist. It takes time to spin up the VM, and that new VM needs to be added to the network and made available to the application.

Fortunately, orchestration technologies for Linux containers, most notably Kubernetes, have significantly reduced the risk of large-scale failure, even at the hardware level. However, while this type of pattern works well within the physical confines of a data center or among many data centers, systems that rely on redundancy and replication experience significant limitations when it comes to edge computing.

The limits of redundancy and replication in edge architecture

The essential idea of edge computing is that a remote device has the ability to execute predefined computational logic and also to communicate with other devices to do work. One of the more common examples of edge computing is the red-light traffic camera.

A municipality places a camera at an intersection controlled by a traffic light. When a motor vehicle runs a red light, the camera has the intelligence to detect the violation and take a photo of the offender. The device can then send the photo of the offending vehicle, along with some metadata describing the time of the violation, to another computer that acts as a data collector. The collector can either process the photo and metadata on its own or pass it all on to other intelligence that can do the analysis. (See Figure 3, below.)

Figure 3: Red-light traffic cameras are a commonplace example of edge computing.
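To make the camera-to-collector hand-off concrete, here is a sketch of how a camera might report a violation. The collector URL and field names are invented for illustration, and the example assumes the widely used third-party requests library is available.

```python
import datetime
import io
import requests  # third-party HTTP client (pip install requests)

# Hypothetical collector endpoint; a real deployment would use its own address.
COLLECTOR_URL = "http://collector.example.local/violations"

def report_violation(photo: bytes, intersection: str) -> None:
    """Send the violation photo plus descriptive metadata to the collector."""
    metadata = {
        "intersection": intersection,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # Multipart upload: the photo travels as a file part, the metadata as form fields.
    response = requests.post(
        COLLECTOR_URL,
        files={"photo": ("violation.jpg", io.BytesIO(photo), "image/jpeg")},
        data=metadata,
        timeout=10,
    )
    response.raise_for_status()  # surface collector-side failures to the caller

# Example call with stand-in image bytes:
# report_violation(b"...jpeg bytes...", "Main St. and 6th Ave")
```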

What distinguishes the red-light traffic camera as an edge device is that it has intelligence. Unlike a closed-circuit television system, in which the camera does nothing more than transmit an ongoing video signal back to a monitor in another location, a red-light traffic camera understands some of what it sees and uses that understanding to make law enforcement decisions. There is no human evaluating the video transmission; computational intelligence does it all. Cameras are distributed across the city, and each camera has the ability to communicate back to a central collector. Thus, you can think of a red-light camera system as a distributed architecture.

But, while a red-light camera system is indeed a type of distributed architecture, it does have a significant shortcoming. Such a system is incapable of supporting automated redundancy and replication.

Think about it.

Should the red-light camera at the corner of Main St. and 6th Ave go offline, the ability to monitor traffic at that intersection goes away with it. No red-light violations will be reported by that device until a technician goes out into the field and repairs the camera.

So then, given the inherent limitation of this type of distributed architecture, how do we create traffic camera systems that have redundancy built in? The easiest solution is to put a number of traffic cameras at each intersection but make only one operational. If the operational camera goes offline, intelligence back at the controller will notice that the first camera is not working and turn on the backup to take its place. (See Figure 4.)

Figure 4: Edge devices in the real world require real world redundancy.
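A minimal sketch of that failover logic, in Python, might look like the following. The camera ids, the probe, and the activation command are all hypothetical stand-ins for whatever protocol a real controller would use.

```python
import time

# Hypothetical camera ids for one intersection: one active, one standby.
PRIMARY = "camera-main-6th-a"
BACKUP = "camera-main-6th-b"

def is_online(camera_id: str) -> bool:
    """Hypothetical probe: ping the camera and report whether it responds."""
    return camera_id == BACKUP  # simulate the primary going dark

def activate(camera_id: str) -> None:
    """Hypothetical command that switches a standby camera on."""
    print(f"activating {camera_id}")

def monitor_intersection() -> None:
    """Controller-side loop: watch the active camera and fail over to the backup."""
    active = PRIMARY
    while active != BACKUP:
        if not is_online(active):
            # The primary stopped responding; bring the physical backup online.
            activate(BACKUP)
            active = BACKUP
        time.sleep(1)  # poll interval; a real system would tune this

    print(f"{active} is now the operational camera")

monitor_intersection()
```

In a real deployment, the controller would also dispatch a technician to repair the failed primary, so the intersection regains its standby unit.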

Having backup devices on hand is a typical way of doing redundancy in the real world. Hospitals are designed with generators that provide electricity in the event of a power failure from the public grid. Practically all industrial-strength data centers use backup power generators too.

The notion of physical backup is not confined to electricity. Professional rock bands travel with an extra set of amplifiers to ensure that if an amplifier malfunctions on stage, a replacement is readily available to plug in. Guitarists usually have a backup guitar on hand in case a string breaks. As they say, the show must go on, even in the world of edge computing.

Edge computing vs the data center

The most important thing to understand about edge architectures is that they are different from architectures that are intended for devices in a data center.

These days, most devices in a data center are virtualized in terms of computing resources and networking. Because of their virtual nature, they can be replaced easily through automation. Kubernetes can easily redirect traffic away from a failing piece of hardware. And if the alternative hardware becomes overworked, modern provisioning software can automatically detect available hardware and spin up a new VM accordingly. Then Kubernetes can take over and create the virtual assets needed to keep things going.

Of course, things can go very wrong quickly when a data center goes offline or a network wire gets cut by accident. However, while these cases can be catastrophic, they are rare. More often than not, failures occur among virtual devices.

Edge devices, on the other hand, are real, not virtual, and a whole class of edge devices is mobile: robots, tractors, forklifts, and delivery trucks, for example. Thus, the techniques commonly used for data center replication do not apply. Replication at the edge is very much about the physical device and the geography in which it operates. For example, how do you replicate the intelligence in a cell phone performing some mission-critical operation on an oil rig in the middle of the North Sea? How do you provide redundancy for a robotic tractor tilling an irrigated field in a remote area of Sub-Saharan Africa? Even at the consumer level, the Internet-enabled refrigerator in my house is in my house! If it fails, my only recourse is to go across the street and use my neighbor’s, and that only works if I have very generous neighbors.

The essential question becomes, how does a company implement redundancy and replication in edge architecture?

When designing edge architecture and architectures for IoT, it is essential to remember that these devices exist as physical entities in the world and need to be accommodated as such. There is no virtual magic to be had. If you want to build redundancy into your edge architecture, as shown in the red-light traffic camera example above, it needs to be done on the physical plane.

You need to plan for backup devices that are readily available in terms of time and physical space. This means having physical backups on hand, whether the device is a cell phone or a forklift. Yes, this approach is a bit old school, but the solution is nonetheless valid. Bringing one-size-fits-all virtualization thinking to real assets in the real world won’t work. When it comes to edge architectures, the devil is in the device. The takeaway is simple: have a physical backup on hand.

