The big data revolution

18 August 2014
Construction of the MeerKAT antenna. Image courtesy of SKA South Africa.

Big data - large and complex data sets that are difficult to process using traditional computational hardware and software - is a popular term that has gained ground in the last decade, not only in everyday life, but also in research. UCT is taking the lead in creating the framework to allow African researchers to get to grips with the increasing volume, velocity and variety of data, and turn it into actionable knowledge.

The modern world is experiencing a data deluge: "The data flow is so fast that the total accumulation of the past two years - a zettabyte - dwarfs the prior record of human civilisation," wrote Harvard Magazine in March 2014. No-one is immune - least of all researchers, who now work with large data sets that are increasing exponentially in size, complexity and velocity. While data-intensive research cuts across the disciplines - including engineering, computer science, archival science, humanities, economics and finance - the sciences frequently find themselves at the frontier of attempting to solve big data problems.

In 2001, Gartner Inc, a leading information and technology research and advisory company, defined big data as: "High-volume, -velocity and -variety information assets that require cost-effective, innovative forms of information processing for enhanced insight and decision-making."

The more data available, the bigger the potential for more accurate analyses. And with more accurate information available, decisions can be made with increasing confidence, while better decision-making leads to greater efficiencies, cost reductions and reduced risks across societies.

But this will only happen if the capacity exists to interpret that data correctly. Getting from data to knowledge is not a simple equation, and the need to support data-intensive research at universities is therefore acute. Increasingly, researchers are battling to move data sets between collaborator sites, test parameters to optimise analysis of large data sets, and facilitate access to these sets by international research communities. In the last 12 months, UCT researchers have more than doubled their use of central high-performance computer (HPC) facilities for data-intensive research. Demand for research storage is growing substantially each year, and specialist support and analysts are in constant demand.

"Without the right support, UCT researchers risk diverting time and resources into the development and maintenance of potentially sub-standard or inefficient solutions, or just generally taking much more time to do meaningful analysis," says Professor Danie Visser, deputy vice-chancellor with responsibility for research. "There is increased potential for valuable data or intellectual property to be lost, stolen or corrupted, and for significant interruptions to research activity. A centralised effort is needed to provide a mechanism for researchers to learn from one another and develop institutional best practice."

It is partly for this reason that UCT is establishing an eResearch Centre. eResearch is focused on how information and communications technologies help researchers to collect, process, store, manage, find, share and re-use information. In line with moves at other leading international research institutions, the centre will provide integrated support to help researchers do work that is high-impact, locally relevant and internationally competitive. One of its important roles will be in managing large data sets.

The university could go further, however, building on the eResearch Centre to take the intellectual lead in establishing expertise and driving innovation in data science. There are a number of global and national developments that the university can harness.

An accelerating challenge

One such opportunity lies with the European Organisation for Nuclear Research (CERN), home of the Large Hadron Collider (LHC), says Dr Andrew Hamilton, a lecturer in the Department of Physics and researcher in high-energy particle physics at CERN. The 27-kilometre LHC is the world's largest particle accelerator, and it is one of the most important tools available to high-energy physicists in their bid to investigate the fundamental particles that make up the universe. The UCT-CERN Research Centre is part of two experiments running at the LHC: the ATLAS experiment, which explores the fundamental particles of the standard model of particle physics, and ALICE, which is aimed at understanding the quark-gluon plasma.

The nature of its work means that the LHC has been grappling with the problem of big data sets for 20 years. Many of the particles it investigates (like the famous Higgs boson) need to be created by colliding protons at what is close to the speed of light. Because the particles are very rare, tens of millions of collision 'events' are produced per second, which need to be captured and read by a detector. If researchers were to read out every single event, they would need around 40 terabytes per second, which is way beyond the confines of current technology. Using high-speed filtering called 'triggers', researchers can get this down to hundreds of megabytes per second.
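To put those figures in perspective, the reduction the triggers achieve can be worked out in a few lines of Python. This is a rough sketch only: the article gives the filtered rate as "hundreds of megabytes per second", so the 400 megabytes per second used here is an illustrative assumption, and decimal units are used throughout.

```python
# Decimal storage units, as in the table at the end of the article.
MEGABYTE = 10**6
TERABYTE = 10**12

# Rate needed to read out every collision event (figure from the article).
raw_rate = 40 * TERABYTE

# Rate after trigger filtering. ASSUMPTION: 400 MB/s stands in for the
# article's "hundreds of megabytes per second".
filtered_rate = 400 * MEGABYTE

# How many times smaller the filtered stream is than the raw stream.
reduction_factor = raw_rate // filtered_rate

print(f"Triggers reduce the data rate by a factor of about {reduction_factor:,}")
```

Under that assumption, the triggers keep roughly one byte in every 100 000 that the detector could produce.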

While this may already sound like a tall order, it is only part of the big data challenge faced by the LHC. In order to define expectations (so that researchers know what they expect to see), the entire detector is digitally simulated in excruciatingly fine detail, and a computer algorithm is written to simulate the billions of events that might produce a rare particle like the Higgs boson. All of this data then needs to be stored.

Factor into this the collaborations involved (just one of the LHC's seven experiments has 1 000 members belonging to 116 institutions, based in 33 countries), and the scale of the challenge is evident.

The LHC has created a solution to this very big data problem - the Worldwide LHC Computing Grid (WLCG), a form of cloud computing that stores and processes information. It currently has 150 centres in 40 countries and operates through a tiered structure that varies in the funding model, the storage space and the number of cores required.

Creating the infrastructure required to play in this big data league sounds expensive; however, money is not the greatest challenge for UCT, according to Dr Hamilton. "We need the people trained in high-performance computing to operate and administer these facilities," he says.

Which is where the WLCG, "one of the largest research computing networks in the world", comes in. "If UCT wants to be a leader in big data research, we need to demonstrate that we can operate a big data centre that participates on the global scale. The WLCG gives us the opportunity to both contribute to one of the largest research computing networks on the planet, and learn from their expertise at the same time."

Complex climate models

The challenge to provide trained people connects all big data projects, whatever their divergent hardware requirements. For instance, acquiring and retaining skilled researchers is the greatest strategic challenge in climate research, according to Professor Bruce Hewitson, director of the Climate System Analysis Group.

This is partly because of the complexity of the big data research challenge in the field, particularly around climate modelling. "There are about 40 mainstream global climate models," says Hewitson. "All produce comparable resolution results, but no model is perfect. Some uncertainty arises from model error - how it's coded up. Some is from the inherent variability in the climate system. So even in one category of modelling, we don't have a clear analytical approach."

Add in regional models and other impact models, and you have a major challenge in distilling information so that it can be analysed and produce defensible regional information. It is the single leading challenge in climate modelling, according to Hewitson, and there has been no substantive advance on it in the last decade.

Researchers acquire skills in this field only by working in an applied environment - learning on the job. However, the academic job market for graduates is minimal, and this is compounded by the likelihood of losing highly skilled people to the private sector or research institutions overseas, where there is more opportunity and higher pay.

"We also have too many single points of failure," says Hewitson. "We have world-class capacity, but if one individual leaves, a whole thread of activity collapses."

However, it is in this world-class capacity that Hewitson finds cause for optimism: "UCT climate research punches above its weight internationally. We have the largest single concentration of capacity on the continent, and are leading research from the perspective of a developing nation."

UCT's climate research also has unique cross-disciplinary skills, and this gives the university a competitive advantage globally. "We have a chance to take a research lead on these global research challenges," says Hewitson, "and to contribute to major policy development in Africa."

In order to take this lead, we will have to find ways to face up to the big data challenges. Part of the answer might lie in the stars.

Big data from the sky

The research field that is wrestling with the biggest data of all is astronomy, and Africa's biggest science project, the MeerKAT Large Surveys, has put UCT at the coalface of this challenge. The MeerKAT radio telescope is a precursor to the Square Kilometre Array (SKA) telescope. Led by four UCT astronomers, MeerKAT is expected to gather up to one petabyte (1 000 terabytes) of data per year, so advanced tools will have to be developed to process, analyse and store the data.

There is nothing small about the SKA - it is the biggest science project ever carried out on African soil. When completed, it will be the world's largest radio telescope, located in Africa and Australia, but shared by astronomers around the globe. Each MeerKAT antenna, which will be incorporated into the mid-frequency component of SKA Phase 1 when that instrument is constructed, stands 19.5 metres tall and weighs 42 tons. The complete MeerKAT array will have 64 receptors - antennas with receivers, digitisers and other electronics. Connected by 170 kilometres of underground fibre optic cable, the 64 receptors will operate as a single, highly sensitive astronomical instrument, controlled and monitored remotely from the MeerKAT control room in Cape Town. When fully operational, the MeerKAT will generate enough data from the antennas to fill about four-and-a-half million standard 4.7-gigabyte DVDs in a day.
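The DVD comparison can be translated back into storage units with some simple arithmetic. The sketch below takes the article's figures at face value (four-and-a-half million DVDs a day, 4.7 gigabytes per DVD) and assumes decimal units; the variable names are purely illustrative.

```python
# Figures quoted in the article.
DVD_BYTES = 4_700_000_000      # one standard DVD: 4.7 gigabytes (decimal)
dvds_per_day = 4_500_000       # "about four-and-a-half million" DVDs a day

SECONDS_PER_DAY = 24 * 60 * 60

# Total data generated by the antennas in one day, in bytes.
bytes_per_day = dvds_per_day * DVD_BYTES

# The same quantity expressed in petabytes per day and gigabytes per second.
petabytes_per_day = bytes_per_day / 10**15
gigabytes_per_second = bytes_per_day / SECONDS_PER_DAY / 10**9

print(f"{petabytes_per_day:.2f} petabytes per day")
print(f"{gigabytes_per_second:.0f} gigabytes per second")
```

On these figures the antennas would produce a little over 21 petabytes a day, or roughly 245 gigabytes every second.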

An eminent role player in this field has been attracted from Canada: Professor Russ Taylor, one of the founding international SKA project scientists, joined the university early in 2014 as joint UCT/University of the Western Cape SKA Research Chair and will coordinate a big data vision for radio astronomy in South Africa. "The global SKA big science data world is coming to South Africa this decade," says Taylor, adding that it is probably one of the two projects in the world driving a big data revolution in astronomy.

A research revolution

In addition to his own research with SKA, Taylor will also be working to put South Africa - and Africa - on the map in terms of data capacity. "Global research leadership requires that we have the capacity to extract information from big data," he says, and there is likely to be a limited window of opportunity to establish national leadership in data-intensive research and a global presence in this emerging field.

While big data is by its very nature a massive challenge to the university, it is also a driver of the transformation of science, and by extension, a driver of global change; and UCT is already part of the revolution.

"We are not trying to break into a field where we are absent - we are already there," says Visser. "If we grasp the opportunity to take leadership in this area, we can really make a difference in the country and to science around the world: solving Africa's issues, but also making Africa part of the global solutions."

1 bit = a binary digit (0 or 1)
8 bits = 1 byte (the average sentence equals 100 bytes)
1 kilobyte = 1 000* bytes (1 kilobyte would be equal to a short paragraph)
1 megabyte = 1 000 kilobytes (1 megabyte could hold a 200-page book)
1 gigabyte = 1 000 megabytes (a standard DVD holds 4.7 gigabytes)
1 terabyte = 1 000 gigabytes (1 terabyte could hold 1 000 copies of the Encyclopaedia Britannica)
1 petabyte = 1 000 terabytes (about 213 000 standard 4.7-gigabyte DVDs)

*This is the decimal interpretation most commonly used for disk storage; in a binary interpretation, used for processor or virtual storage, the value is 1 024 for each.
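For readers who want to check the conversions, the table and its footnote can be reproduced in a short Python sketch. The function name and layout are illustrative only; the factors of 1 000 and 1 024 are the decimal and binary interpretations described above.

```python
DECIMAL = 1000   # factor used for disk storage
BINARY = 1024    # factor used for processor or virtual storage


def unit_in_bytes(steps, factor=DECIMAL):
    """Bytes in a unit that is `steps` factors above a byte.

    steps=1 is a kilobyte, 2 a megabyte, 3 a gigabyte,
    4 a terabyte and 5 a petabyte.
    """
    return factor ** steps


petabyte_decimal = unit_in_bytes(5)          # 1 000 ** 5 bytes
petabyte_binary = unit_in_bytes(5, BINARY)   # 1 024 ** 5 bytes

# A standard DVD holds 4.7 gigabytes (decimal).
dvd_bytes = 47 * unit_in_bytes(3) // 10

dvds_per_petabyte = petabyte_decimal / dvd_bytes

print(f"Decimal petabyte: {petabyte_decimal:,} bytes")
print(f"Binary petabyte:  {petabyte_binary:,} bytes")
print(f"DVDs per decimal petabyte: {dvds_per_petabyte:,.0f}")
```

The gap between the two interpretations grows with each step up the table: at the petabyte level the binary value is about 12.6 percent larger than the decimal one.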

Story by Tracy Gilpin, Carolyn Newton and Jane Notten.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.
