When disaster strikes, why do our cell phones go on the blink? Whether it is full-scale catastrophes such as Japan's earthquake and tsunami or the flooding in Mumbai every monsoon season, cell phone networks often seem to jam when users need them the most.

While we may write this off as just one of those things we have to live with, telecom behemoth AT&T takes network failure quite seriously. And it probably has no choice.

On any given day, its network powers over a billion devices, services 100 million mobile phone users and zips a mind-boggling 30 petabytes of data around the world. Not to mention the fact that every Fortune 1000 company is an AT&T client.

Making sure those customers stay connected at all times is the raison d'être of AT&T's Global Network Operations Centre (G-NOC for short), a facility tucked away in sylvan surroundings at Bedminster, New Jersey.

Command and control

Walking into the G-NOC is much like being air-dropped onto a Star Wars movie set. At the heart of the facility is a 250-foot-wide wall devoted entirely to LCD screens. The maps, line graphs and scrolling tickers on the 141 screens provide a live update on the central nervous system of AT&T, the vast wire-line and wireless network that criss-crosses the globe.

On these screens, 130 staffers keep a 24/7 vigil on the ebb and flow of network traffic. Steve Moser, head of the visitor program at the facility, claims that G-NOC is the largest command and control centre of its kind in the world. What’s the point of all this state-of-the-art equipment? After all, many Indian telcos manage quite well by outsourcing both their passive infrastructure and network management to third parties. Those who do own their cell sites manage disasters through power backups or mobile units.

An eye on ‘incidents’

AT&T's credo, though, is that prevention is better than cure. By spotting disruptions in the network before they snowball into major problems, the company hopes to anticipate disasters and prepare for them.

The ‘incidents’ that G-NOC looks to get in front of could range from a surge in text messages due to an American Idol episode on television, to a major disruption in lines caused by a tornado, flood or an earthquake in a remote corner of the world. By detecting problems while they are still manageable, AT&T hopes to adjust traffic or deploy emergency response so that network disruption is minimised.

The round-the-clock surveillance of the network serves as an early warning system for impending disaster. On March 11, 2011, engineers at the G-NOC first noticed something amiss when inbound call traffic into Tokyo began to back up, jamming phone lines. Within minutes, the console that monitors the company's global undersea cable network detected malfunctions in the cables in and around Japan.

The team then quickly matched these events against seismic alerts from the US Geological Survey (which had begun to flow in) and launched an emergency response. Only then did the first reports of the disaster begin to flash on the CNN news screen.

Cues such as these give AT&T's Network Disaster Recovery (NDR) team a head-start in setting into motion a practised emergency response. Once a problem is detected, traffic flowing into the disaster site is immediately re-routed using proprietary automated tools.

Specific warnings are then disseminated to the company's offices, its network partners, corporate clients around the world and emergency services. Stand-by equipment is despatched to the location to restore temporary connectivity.

‘Shaping’ traffic

Where network congestion is severe, and it often is with natural disasters, the G-NOC deliberately steps in to ‘shape’ the traffic according to its requirements. Moser notes that when a calamity strikes, inbound traffic into the area often multiplies, denying essential connectivity to users within the affected area. This is often what makes your mobile phone network jam during those crucial minutes.

G-NOC’s priority on such occasions is to make the maximum outbound capacity available to users in the disaster zone, so that they may reach out for help. To achieve this, it may deliberately introduce additional latency into inbound calls or even restrict inbound traffic, in order to keep connections up and running for the affected people.

To demonstrate this, Moser pulls up a map of the United States as it looked shortly after Hurricane Katrina made landfall in New Orleans in August 2005. The normally clear screen shows thousands of bright lines converging on New Orleans, a sign of a network overloaded by anxious people trying to reach out to the city's residents.
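The outbound-first priority described in this section can be illustrated with a minimal sketch. It is not AT&T's tooling, which the company describes only as proprietary; the function name, the single shared capacity figure and the call counts are all assumptions made for illustration.

```python
def allocate_capacity(capacity, outbound_demand, inbound_demand):
    """Split the available call capacity, serving outbound calls first.

    Hypothetical model: one pool of capacity (in simultaneous calls) is
    shared by both directions. Outbound calls from the disaster zone are
    satisfied first; inbound calls get only what is left over.
    """
    outbound = min(outbound_demand, capacity)
    inbound = min(inbound_demand, capacity - outbound)
    return outbound, inbound


# Illustrative numbers only: 1,000 call slots, 300 people calling out,
# 5,000 people trying to call in.
outbound, inbound = allocate_capacity(1000, 300, 5000)
print(outbound, inbound)
```

In this toy model everyone inside the zone gets a line, while most inbound callers are held back until capacity frees up.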

Network on the move

In the case of natural disasters, efforts to re-route traffic also need to be supplemented by providing communications equipment to the affected area. If physical infrastructure in the area is wiped out, as it was during the 9/11 attacks on the World Trade Centre, the priority is to establish outbound communications access as quickly as possible for use by emergency services such as hospitals or the police department.

After homing in on a temporary site close to the affected location, AT&T deploys its fleet of technology and support trailers, emergency communication vehicles and satellite COLTs (Cells on Light Trucks) to ensure connectivity. These function as a stand-by arrangement until the network is up and running once again.

Plotting the traffic

Of course, it isn't just to deal with cataclysmic events that AT&T is sinking $20 billion into its network infrastructure every year. It is also to deal with smaller but more frequent problems that afflict its network.

With data traffic growing by leaps and bounds, hacking attempts pose a rising threat to security. The AT&T network deals with over 1 million hacking attempts each day.

To spot these attempts, the company maps patterns in voice, text and data traffic over long periods to arrive at the expected levels of activity on a weekly, daily and even hourly basis. “Human behaviour is very predictable”, says Moser. He is referring to the repetitive patterns in voice, text and data traffic that play out every day.

G-NOC then uses these patterns as a baseline against which it tracks actual voice and text traffic through the day, so that anything out of the ordinary stands out. Any unusual spike, usually signalling a hacking attempt, prompts the network to isolate and re-route the traffic to a different site, where it is ‘scrubbed’ and filtered for malicious content.

This mapping exercise also helps AT&T build in a buffer for the predictable surges in traffic that can burden the network. These could range from a popular television show that requires viewers to text their votes, to the thousands of New Year text messages that zip across phone lines at the stroke of midnight every December 31.
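The idea of comparing live traffic with an expected hourly level can be sketched in a few lines. This is an illustrative assumption about how such a check might work, not a description of AT&T's actual system; the function names, the hour-of-week bucketing and the three-standard-deviation threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """Build expected traffic per hour from past observations.

    history: iterable of (hour_of_week, volume) pairs (hypothetical format).
    Returns {hour_of_week: (average volume, standard deviation)}.
    """
    buckets = defaultdict(list)
    for hour, volume in history:
        buckets[hour].append(volume)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in buckets.items()}


def is_unusual(baseline, hour, volume, threshold=3.0):
    """Flag a spike: volume far above what is normal for this hour."""
    expected, spread = baseline[hour]
    if spread == 0:
        # No variation in the history, so any increase counts as a spike.
        return volume > expected
    return (volume - expected) / spread > threshold


# Illustrative data: four past weeks of traffic for the same hour.
baseline = build_baseline([(20, 100), (20, 110), (20, 90), (20, 100)])
print(is_unusual(baseline, 20, 105))  # within the normal range
print(is_unusual(baseline, 20, 400))  # well above the normal range
```

A predictable surge, such as the New Year's Eve rush, would already be part of the history for that hour and so would not be flagged, whereas an unexpected spike would.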

(The author visited AT&T facilities as the company's guest)
