In 1972, I watched Gene Cernan and Harrison Schmitt walk on the moon (the third astronaut of the Apollo 17 crew, Ron Evans, remained in orbit around the moon in the command module). It’s a lasting memory, as you’d expect for a boy with nerdish tendencies.
There were incredible advances and achievements on all fronts of technology that made that mission possible. However, it was a comment from a former colleague that prompted my interest in one particular piece of Apollo equipment. I had a new role that involved a compute farm with thousands of CPUs, prompting him to observe that ‘man got to the moon with little more than a pocket calculator’. It’s an exaggeration, but not a large one.
Designed at MIT, the most powerful version of the Apollo Guidance Computer (AGC) had a 1MHz 16-bit processor, 4KB of RAM (yes, 4096 bytes) and 72KB of ROM. The memory was actual core – magnetic metal rings threaded with wires. A direct comparison with a modern computer is hard to make. The AGC could execute roughly 85,000 instructions per second, with support for integers but not floating point. By comparison, the hardware I bought from eBay to run my Coursera homework cost £65, has 2GB of RAM and can achieve 110 gigaflops at double precision. That’s half a million times the amount of RAM and roughly 1.3 million times the processing power.
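The back-of-the-envelope ratios above can be checked in a few lines of Python, using the figures as quoted (bearing in mind that comparing flops against integer instructions per second is an apples-to-oranges comparison, since the AGC had no floating point at all):

```python
# Figures as quoted in the text above.
agc_ram_bytes = 4096            # 4KB of magnetic core memory
agc_ips = 85_000                # roughly 85,000 instructions per second

modern_ram_bytes = 2 * 10**9    # 2GB of RAM
modern_flops = 110 * 10**9      # 110 gigaflops at double precision

ram_ratio = modern_ram_bytes / agc_ram_bytes
speed_ratio = modern_flops / agc_ips

print(f"RAM ratio:   {ram_ratio:,.0f}x")    # ~488,000x, i.e. "half a million"
print(f"Speed ratio: {speed_ratio:,.0f}x")  # ~1,294,000x, i.e. "1.3 million"
```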
Given that you can rent a server from Amazon that is capable of 4.5 teraflops for $0.65 per hour (or a thousand of them in a cluster if you need to), what are we doing with the incredible power of modern hardware?
‘Big Data’ refers to the capture, storage and analysis of datasets that are so large and complex that they cannot be analysed by traditional means. The problem is defined along three axes known as the three Vs, namely Volume, Variety and Velocity. In other words, processing a vast amount of disparate, possibly unstructured data, arriving at varying rates.
As with many new or emerging technologies, there is a lot of hype surrounding Big Data. Forbes ran a story on the ‘eerie accuracy’ of Target’s marketing, allegedly identifying a pregnant teen before her own father knew she was pregnant and sending her coupons for baby items. They even report that Target include a random selection of additional coupons with their targeted set, so as to hide the accuracy of their forensic marketing.
Another example is that of Google Flu Trends. A paper in Nature described how Google used the ‘found data’ in search terms to predict the spread of influenza in North America in near real-time, with 97% accuracy compared to CDC data, but without the normal 1-2 week lag associated with traditional data collection and analysis.
No wonder Harvard Business Review declared that Data Scientist would be the sexiest job of the 21st century.
Let’s not get carried away.
What is not reported by Forbes is how many people are identified incorrectly by Target.
People who believe the hype feel let down when it turns out not to be the whole truth. When it was discovered that Google Flu Trends had overestimated the spread of flu by 50% in 2013, the failure was widely reported and commented on:
- “Don’t be blinded by Big Data” – Michael Healey, Information Week
- “Google Flu Trends’ Failure Shows Good Data > Big Data” – Kaiser Fung, Harvard Business Review
- “Big data: are we making a big mistake?” – Tim Harford, FT Magazine
- “Google Flu Trends is no longer good at predicting flu, scientists find” – Charles Arthur, The Guardian
One story went as far as saying that Google had been wrong since 2011, although the data made public by Google seems to contradict this.
What does the data show?
99% of pilots who crash their planes have brushed their teeth that morning. However, no one would believe that a pilot with dirty teeth is safer – correlation does not imply causation. Whilst the CDC is tracking actual flu data, Google is tracking correlations between search terms and the spread of the disease. It might even be that Google itself has changed the data: the search page suggests common search terms, and in doing so makes them more common. When I type ‘Have I’ into the search box, ‘Have I got Flu’ is the second highest suggested search.
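A toy simulation shows how easily a spurious correlation arises. The series below are invented for illustration (not Google’s data): any two quantities that both trend upward over time will correlate strongly, whether or not one has anything to do with the other.

```python
# Toy illustration of a spurious correlation: two unrelated series that
# both trend upward over time will show a strong Pearson correlation.
import random

random.seed(0)
weeks = range(52)

# Hypothetical flu cases rise through the year...
flu_cases = [100 + 10 * w + random.gauss(0, 30) for w in weeks]
# ...and so does the volume of some entirely unrelated search term.
cat_gif_searches = [5000 + 400 * w + random.gauss(0, 800) for w in weeks]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Strong correlation, despite there being no causal link at all.
print(f"correlation: {pearson(flu_cases, cat_gif_searches):.2f}")
```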
Not all of the recent stories on Google Flu were negative, however. The conclusion seems to be that we should understand how and when to use these techniques, and how to use their results together with the other tools in our arsenal.
Will our technology resources continue to increase?
Technology has moved on, but we continue to create data at an accelerating rate. Estimates indicate that the rate of data creation is increasing annually by 40%. To put this into context, 92% of the world’s data was created in the last two years.
Steven Wright once said, “You can’t have everything. Where would you put it?” The demand for disk storage will soon outstrip supply: the world simply does not have enough fabrication capacity to meet future demand. By 2020 we will be creating 10 zettabytes of data annually, but manufacturing only 4 zettabytes of new hard disk storage.
Admittedly, much of this data is cat GIFs and Facebook statuses, but that doesn’t make it less of a problem. The demand is there; you may think your data is more important, but that won’t get you the storage.
What problems do banks face in the big data space?
Although technology plays a part, big data is not solely an IT problem.
- Data silos
It’s often said that we need to break down silos in order to leverage technology. The same is true of data. Big data projects rely on a variety of data in many different forms. That data must be easily accessible to all, which is not a common situation in banking.
The tendency of banks to ‘build it yourself’ isn’t restricted to software, and banks have constraints that other businesses don’t. Consequently, they have been slow to adopt cloud approaches to infrastructure and are behind the curve in flexibility, speed of deployment and cost.
As acknowledged by Jamie Dimon, firms like Google and Facebook may provide competition to the banks. Even anonymised, the richness of data provided by all types of banking transactions (but particularly credit and debit cards) would greatly enrich models based on found data.
- Lack of expertise
Data scientists are in high demand. Data science might be a sexy job, but being a bank data scientist is less so. The cool kids want to work at Google.
I have no doubt that banks will solve difficult problems using big data, from the detection of fraud to modelling sentiment, but a pocket calculator didn’t put a man on the moon – other men did. They may have needed a calculator on the way, but they needed the vision to begin with, and the commitment to make it happen.
In some ways, given the explosion in computing power we’ve already experienced, imagining what we might achieve next is a challenge in itself.