Sverre Jarp, chief technology officer at Cern, has the task of ensuring that the particle physics laboratory can access and analyse the 30 petabytes (that is 31,457,280GB) of data that the Large Hadron Collider produces annually.
To put that into context, one year’s worth of Large Hadron Collider (LHC) data is stored on 83,000 physical disks. The collider has 150 million sensors delivering data 40 million times per second. One gigabyte of data is sent per second.
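The figures above can be cross-checked with some quick arithmetic. The sketch below is only a back-of-the-envelope check, not anything Cern publishes: it assumes binary units (1PB = 1,024 x 1,024GB), which is what the 31,457,280GB figure implies, and it assumes the one-gigabyte-per-second stream runs without pause for a full 365-day year, which is a simplification.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Assumption: binary units, so 1 PB = 1024 * 1024 GB.
GB_PER_PB = 1024 * 1024

annual_pb = 30        # petabytes of LHC data per year (from the article)
disks = 83_000        # physical disks holding one year of data (from the article)
rate_gb_per_s = 1     # gigabytes sent per second (from the article)

# 30 PB expressed in gigabytes.
annual_gb = annual_pb * GB_PER_PB

# Average share of one year's data held by each disk.
gb_per_disk = annual_gb / disks

# Assumption: the 1 GB/s stream runs non-stop for a 365-day year.
seconds_per_year = 365 * 24 * 60 * 60
streamed_gb = rate_gb_per_s * seconds_per_year
streamed_pb = streamed_gb / GB_PER_PB

print(f"30 PB in GB:          {annual_gb:,}")
print(f"Average per disk:     {gb_per_disk:.0f} GB")
print(f"1 GB/s over one year: {streamed_pb:.2f} PB")
```

Under those assumptions, a steady one gigabyte per second adds up to roughly 30 petabytes over a year, so the two figures are consistent with each other. Spread across 83,000 disks, that is a few hundred gigabytes per disk on average.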
Cern’s task, says Jarp, is the equivalent of searching for one person across a thousand planets.
Speaking at the Big Data Innovation Summit in
He said: "Big data is very much à la mode at the moment. We are bridging a lot of what we have been doing in science with what’s going on in enterprise and in the commercial world."
While brands might be asking what people will be doing with their Saturday evenings or how consumers are interacting with screens, Cern’s questions are a little different, said Jarp.
"Our questions are about who are we, and where we live. Even today, 95% of stuff in the universe is still unknown. We ask: what is in that 95%?
"The Large Hadron Collider allows us to probe into these questions, discovering new knowledge that will help move society forward. We’re not proving time travel, but, hey, why not?"
While marketers and agencies are quick to talk about the data deluge they face, Cern’s challenge leaves these seemingly unassailable problems in the shadows.
The 27km tunnel that houses the LHC is the world’s largest refrigerator, as well as the hottest spot in the galaxy (100,000 times hotter than the sun). And it is all happening 100 metres underground.
But big data is relative. When Jarp joined Cern in the 1970s, big data was about megabytes. "Today, your kids wouldn’t even accept that size on a [memory] stick. But at the time 50-100MB was a lot of data and we felt we were in ‘big data’ as it was complex data handling," he said.
Since then it has gone from gigabytes to terabytes – and now it is in the transition from petabytes to the exabyte world.
Jarp said: "This is a huge organisational [shift in] approach. You can’t just go and buy 1,000 times more disk drives and storage disks. It takes a lot of careful planning over decades."
With so many companies swamped in often meaningless data, is Cern any different? Simply put, Jarp explained that Cern needs the huge volumes of data ready to analyse as it is searching for the rarest of events.
"It’s like looking for gold; most of what you dig up is just dirt, but you have to filter through it quickly. Maybe it’s the same in some of your businesses. You have a lot of data but it is only the glimmer that you are after.
"A lot of our unstructured data is only saying that something has happened. We have to put it in a structured format so that we can then take it through into analysis. The better it is classified, the easier it is to go after it.
"One problem we have is that if there was some new physics, and we didn’t even have the theory behind it, we couldn’t write the algorithm for filtering it. This would mean we could end up dropping valuable data – and it would be dropped forever."
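The risk Jarp describes can be illustrated with a toy filter. This is a minimal sketch and not Cern's software: the `Event` record, the signature labels and the `KNOWN_SIGNATURES` set are all hypothetical, invented here to show how a filter written from existing theory discards anything it was not told to look for.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A structured record of one event (hypothetical layout)."""
    event_id: int
    signature: str  # hypothetical label assigned when raw data is structured


# Hypothetical signatures that existing theory tells the filter to keep.
KNOWN_SIGNATURES = {"two_photons", "four_leptons"}


def keep(event: Event) -> bool:
    """Keep an event only if its signature is one we already know to look for."""
    return event.signature in KNOWN_SIGNATURES


def filter_events(events):
    """Split events into those kept for analysis and those dropped for good."""
    kept, dropped = [], []
    for event in events:
        (kept if keep(event) else dropped).append(event)
    return kept, dropped


events = [
    Event(1, "background"),
    Event(2, "two_photons"),
    Event(3, "unmodelled"),  # stands in for physics no theory has predicted
    Event(4, "background"),
]

kept, dropped = filter_events(events)
print("kept:   ", [e.event_id for e in kept])
print("dropped:", [e.event_id for e in dropped])
```

In this sketch the "unmodelled" event lands in the dropped pile alongside ordinary background, because no rule exists that would recognise it. That is the gap Jarp points to: an algorithm can only keep what someone already knew to ask for.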
Cern has thousands of people across the world looking at its data. The situation is a form of chaos, admits Jarp.
"You don’t know who is going to do what or when they are going to start. It could be an American professor with his PhD student up in the middle of the night. This is very much big data; you have to be prepared for the worst. It’s a bit like a bus service in
"We are in the position where we never have enough storage or computing capacity. But we try to have an orderly approach to it so that people don’t just find chaos."
Keep, archive or delete?
No one will ever say "get rid of data", or that "data is obsolete", but if it is rarely being used, said Jarp, then you should "archive it further and further away from capacity infrastructure".
He said: "Sometimes the best thing is to forget about the data and not to go back. We keep data alive for one or two decades. But with the data from our Large Electron-Positron Collider (which pre-dates the LHC), few people ever go back to that data today."
With a career of several decades at Cern under his belt, Jarp is quite clearly passionate about the lab’s work. But "being in big data" is an extremely serious business, he warns.
"We are able to get results out as soon as we have the data. This is what big data analysis is all about, extracting the value from data immediately. We have open source tools; anyone can use them, which help us find the nuggets in the tsunami of data we face.
"If you are into big data, then it must be taken very seriously. Our people at Cern come in very early and leave late at night. Even with petabytes of data, missing bits matter. You have to be in control; you can’t let Mother Nature trample over your data in a random fashion."
Sverre Jarp's top tips for big data handling
The organisational structure for handling big data is vital. You can’t just say big data is à la mode, I’ll buy a couple of cabinets of disks and get going. This would very easily lead to a world they call big headaches.
You often see selfish behaviour in users of the data. That is what we have seen in our community. You need agreements in place, so that – like society – everyone has to obey certain rules. If they don’t, then your big data problem can get even bigger.
You need the right corporate culture. We had a decade of preparing through simulated data – not just inside the computer centre, but across the worldwide grid we have built.
Always ask yourself, are you extracting the maximum value out of your data? For us, we’re heading for exabytes – there is no time to rest.