Matt Strassler [April 2, 2012]
We now have many details of what went wrong at OPERA, the experiment which produced an anomalous result showing neutrinos arriving earlier than expected, widely interpreted in the press as a violation of Einstein’s dictum that nothing can go faster than the universal speed limit, the speed at which light travels. No one in the scientific community is surprised that the result was due to an error of some type. Most scientists, including myself, weighed the chances of the experiment being correct as extremely small; though most people kept an open mind, I don’t recall ever having a conversation with a serious scientist who thought it was likely to be correct. The main reason was that it was very difficult to imagine any way that it could be correct and yet be consistent with many other classes of experiments that confirmed Einstein’s equations for relativity, or even be self-consistent; see for example this article, and this one.
But the process of finding a mistake in an experiment is itself often an interesting and instructive story. And this one is no exception. In this article, I’ll walk you through the process by which OPERA found, diagnosed, and removed their mistakes.
[I would like to thank commenters Eric Shumard, Titus and A.K. for their help in making this article much better than it otherwise would have been.]
First, let me state the obvious disclaimer: at the moment of writing, my information is partial. I’ll try to indicate where I don’t know things.
Second, let me say that I will withhold judgment (until the end) as to the human elements of OPERA. Let’s start by just looking dispassionately at the information.
Discovery and Diagnosis of Potential Problems
The best place to start is with a slide from a talk by Maximiliano Sioli, from Bologna, given on behalf of OPERA, at the March 28th mini-workshop on neutrino timing held at the Gran Sasso laboratory, where OPERA is based. [Note that OPERA is not, despite reports in the press, a CERN experiment; CERN only provides the neutrinos. All the known mistakes made at OPERA were in equipment installed inside OPERA's location at Gran Sasso, well outside CERN's purview.] This slide, shown in Figure 1, gives the full timeline.
What you learn from this is that between September and November, after the original OPERA result (what I call OPERA-1) was released publicly and the media frenzy began, OPERA personnel did what they should do: they made many studies of their work to see if they could find mistakes. But their focus was on higher-level questions: had they done the data analysis correctly? Had they missed anything in their calculations involving special and general relativity such that their expectation for the neutrino arrival time was off? [There is also a statement about "Delayed double cosmic muon events"; this is an attempt to cross-check the result, described in the first comment below, but I'll skip this because it is in the end irrelevant.] Apparently the lower-level questions — is every wire functioning? have we double-checked every single part of the experiment? — were not at the top of the agenda. That may be because they got many questions from the outside world after their initial presentation, and outsiders would only be able to suggest higher-level problems to check; no one from outside would have ever been able to suggest that perhaps they hadn’t screwed an optical fiber in correctly!
Then came what I call OPERA-2 (which OPERA calls “bunched beam”, or “BB”) where they measured the neutrino speeds in a different way, using very short neutrino pulses. As I explained here, OPERA-2 is a much better way to do the measurement; I would largely disregard the original OPERA-1 and focus on OPERA-2, in my own thinking, because it is so much clearer how to think about the result. The fact that the results from OPERA-1 and OPERA-2 together (the two red numbers marked δt in Figure 1) agreed, to within the measured uncertainties, indicated that the data analysis techniques were not at fault, and also made it unlikely that any intermittent equipment problem was responsible.
Only in December did they start to look at the equipment that actually allowed them to carry out the timing measurement. I don’t know why. One might have thought this was the first priority, but perhaps some people within OPERA felt that any problem in the timing would have shown up through some other cross-check that they’d already made? Or perhaps this was one part of the experiment in which they were particularly confident (apparently overconfident)? This is one of the main puzzles at the time of writing.
Right away, a major problem showed up. On December 6 – 8, measurements took place of the time interval between
- the moment when a signal (a laser pulse) is sent from the lab’s GPS timing equipment somewhere on the earth’s surface, across 8.3 kilometers down into the underground lab and to OPERA itself, where the laser pulse is converted (in a special device, which I’ll just call the box) to an electronic signal for use by the OPERA Master Clock, and
- the moment when the Master Clock sends a timing pulse to synchronize all of OPERA’s many computers and other devices.
The problem is shown in Figure 2, taken from a slide by G. Sirri, also of OPERA and from Bologna. Most disturbing is that this way of measuring the timing had not been repeated since 2007! (Why not?! This is another major puzzle at the time of writing. Presumably they checked the timing using some other methods which for some reason didn’t reveal the problem…?)
- The measurements in 2006 and 2007 showed a time interval of about 41000 nanoseconds [billionths of a second], but
- The measurement on December 6-8, 2011 showed a time interval of about 41075 nanoseconds.
This change was of the right type to potentially cause an apparent early arrival time for the neutrinos, and of roughly the right size to explain OPERA’s measurement. This must have generated immediate and considerable alarm among those within OPERA who made the neutrino speed measurement. But the discovery of a potential problem is not the same as the unambiguous determination of a problem, so investigation continued.
Sometime over the next few days, efforts to track down the problem led people to discover that the fiber carrying the laser pulse to OPERA’s converter had not been screwed correctly into the box. This is shown in Figure 3. Despite reports in the press, this is not a “loose wire”. A copper wire that isn’t tightly connected to an electrical lead can cause an electrical device to behave erratically, because electrical current will sometimes flow and sometimes not. But the optical fiber isn’t what most people think of as a wire — it carries light, not electrical current; and it wasn’t loose, it just wasn’t screwed in all the way. That’s relevant, as we’ll see.
Those within OPERA who were studying this problem found that when they screwed the optical fiber in tightly, the time interval went right back to 41000 nanoseconds (Figure 2 again). So they knew then, on December 13th (ironically, the same day that there was all the excitement about the search for the Higgs particle at the Large Hadron Collider over at CERN, 730 kilometers away), that the fiber not being screwed in right had the potential to shift their results towards an early-arrival time for neutrinos by many tens of nanoseconds.
But at around the same time (when? I’m not sure…) another problem appeared. They detected some kind of timing drift. For technical reasons, OPERA took data in 0.6 second chunks, and cross-checks of measurements suggested that the timing at the end of a chunk was not calibrated properly relative to the beginning of a chunk. So this added confusion to the situation. The drift would also have affected their measurements, though possibly in the other direction, causing neutrinos to apparently arrive later.
So they had two problems, a shift from a fiber connector, and a drift from somewhere not yet identified, and the questions they had to answer in mid-December, closely paraphrasing Sioli’s talk at the mini-workshop, were
- What were the sources of the fiber time delay and of the drift effect?
- How long had these two problems been present? Long enough to affect the OPERA measurements?
Somewhere over the ensuing two months (including what must have been quite an unpleasant holiday) they managed to untangle the puzzle.
First, what was causing the two effects?
The reason an unscrewed fiber can cause a time delay is the following (and this was guessed almost precisely by Eric Shumard, one of the commenters on this site, shortly after OPERA’s problems became public.) It is shown in Figure 4. The timing system works by sending a laser pulse, of considerable length and high intensity, at prescribed intervals (once every thousandth of a second), down the fiber. The start of one of those pulses is shown in yellow at the top of Figure 4. That pulse then enters the box. Rapidly (but note Eric Shumard’s comment on this post), over about 100 nanoseconds, that pulse generates an electrical voltage (shown in blue) inside the box (I’m skipping some electronics details) and when the voltage reaches 5 Volts, it causes the Master Clock to register the timing pulse and fire off a signal (shown in pink-purple) to the rest of the OPERA experiment. But when the fiber isn’t screwed in right, an effect such as that shown at the bottom of Figure 4 results; not as much light as expected enters the box, and this slows the rate at which the electrical voltage builds up, delaying the point at which 5 Volts is reached, and therefore delaying the Master Clock timing pulse. The whole effect is several tens of nanoseconds, depending on how improperly the fiber is screwed in. This is the effect that generated OPERA’s apparently early neutrino arrivals.
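A toy model can make the threshold argument concrete. The sketch below assumes, purely for illustration, that the voltage rises linearly and that its rate of rise is proportional to the fraction of light entering the box. The real electronics are more complicated, and the function names and the 100-nanosecond nominal rise time are illustrative assumptions, not OPERA's actual circuit parameters.

```python
THRESHOLD_VOLTS = 5.0     # trigger level quoted in the text
NOMINAL_RISE_NS = 100.0   # ASSUMED time to reach threshold with full light


def trigger_time_ns(light_fraction: float) -> float:
    """Time at which a linear voltage ramp crosses the threshold.

    Assumption: the ramp slope scales linearly with the fraction of light
    entering the box (1.0 = fiber fully screwed in). Toy model only.
    """
    if not 0.0 < light_fraction <= 1.0:
        raise ValueError("light_fraction must be in (0, 1]")
    nominal_slope = THRESHOLD_VOLTS / NOMINAL_RISE_NS  # volts per nanosecond
    return THRESHOLD_VOLTS / (nominal_slope * light_fraction)


def trigger_delay_ns(light_fraction: float) -> float:
    """Extra delay of the Master Clock pulse relative to a good connection."""
    return trigger_time_ns(light_fraction) - trigger_time_ns(1.0)
```

In this toy model, losing half the light doubles the time needed to reach 5 Volts, so a delay of several tens of nanoseconds requires only a partial loss of light, not a complete interruption of the signal.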
What about the drift? It turns out the Master Clock itself was not properly calibrated. After it fired with the laser pulse, at the start of each 0.6 second data chunk, it then drifted slightly during the next 0.6 seconds, by a total of 74 nanoseconds. Then it would be re-synchronized 0.6 seconds later (albeit incorrectly, due to the improper fiber connection) by one of the laser pulses coming down the fiber. On average, its drift would have an effect of 37 nanoseconds, but it would be worse at some times and better at others during the 0.6 second chunk of data. This effect would make the neutrinos appear to arrive late, but turns out to be insufficient to cancel the effect of the fiber.
OPERA’s Scientifically Sound, Falsifiable Model for What Went Wrong
Now — and this is where it becomes a scientific story of its own — the OPERA folks, thinking like good scientists, developed a clear and precise hypothesis for what was going on. If these two problems were the full explanation for what was going wrong at OPERA, then they should have had the effect shown in Figure 6 on all of OPERA’s timing measurements — for neutrinos and for anything else that happened to pass through their experiment.
When the fiber was screwed in correctly, but the Master Clock was drifting, neutrinos (or anything else) that arrived at the start of the 0.6-second chunk of data should have been properly timed, but those that arrived later should have had timing off by an amount that grows linearly in time across the data chunk, reaching 74 nanoseconds apparent-late-arrival for those that arrive at the end of the 0.6-second data chunk, or an average of 37 nanoseconds late-arrival across the data chunk. This is shown as the green line.
When the fiber was not screwed in correctly, then the Master Clock drift would have had the same effect as before, but all of the timing should have been shifted earlier by 73 nanoseconds. That is shown as the red line.
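The two lines of the model can be written down as a single function. This is a minimal sketch of the hypothesis as described above, using the 74-nanosecond drift and 73-nanosecond shift quoted in the text; the function and constant names are my own, and the sign convention (positive means apparently late, negative means apparently early) is a choice made for illustration.

```python
CHUNK_SECONDS = 0.6      # length of one data chunk
DRIFT_AT_END_NS = 74.0   # clock drift accumulated by the end of a chunk
FIBER_SHIFT_NS = 73.0    # early shift when the fiber is badly connected


def apparent_offset_ns(t_in_chunk_s: float, fiber_ok: bool) -> float:
    """Apparent arrival-time error at a given time within a data chunk.

    Positive values mean 'appears late', negative values 'appears early'.
    """
    if not 0.0 <= t_in_chunk_s <= CHUNK_SECONDS:
        raise ValueError("time must lie within the data chunk")
    # Drift grows linearly from 0 at the start of the chunk to 74 ns at the end.
    drift = DRIFT_AT_END_NS * t_in_chunk_s / CHUNK_SECONDS
    # A badly connected fiber shifts everything earlier by a constant amount.
    return drift if fiber_ok else drift - FIBER_SHIFT_NS
```

With the fiber connected properly, `apparent_offset_ns(0.3, True)` gives the 37-nanosecond average lateness at mid-chunk (the green line); with the fiber connected badly, the same line is simply moved down by 73 nanoseconds (the red line).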
Now they had a clear, falsifiable hypothesis to check. But how to check it?
Clearly it is essential to know how long the fiber had been improperly screwed in. They were apparently able to find a photograph of the apparatus from October 13th — I don’t know when they found that photo — which by chance shows (Figure 7) that indeed the fiber was screwed in wrong and apparently by the same amount as in the photo from December 8th (Figure 3). This fact alone invalidates the initial result from OPERA-2; the result cannot be trusted once this is known.
Unfortunately they could not find earlier photos to clarify if and when the fiber’s connection had been unscrewed during the previous years, when OPERA-1 was running. And to test their hypothesis, they really needed to know when this problem started.
How the Model was Tested
Lacking that, they managed to get the same information by performing a crucial check comparing timing at OPERA and at the nearby LVD experiment using cosmic-ray muons that pass through both detectors. [You should read that article now before continuing, if you haven't already.] In Figure 2 of that article you will see that this cosmic-ray muon measurement revealed a 73ish nanosecond shift in the timing between OPERA and LVD that started in mid-2008 and ended when they screwed the fiber back in at the end of 2011. (Another puzzle: it is still not known why they didn’t consider and/or carry out this check before going public with their result last September.) Obviously that implies that the fiber was unscrewed and then not screwed back properly sometime in the middle of 2008, before OPERA-1 began. I will now show you that armed with that information, the OPERA folks could test the hypothesis shown in Figure 6 in detail.
First, using the method I described in the above-mentioned article, they compared the timing of the cosmic-ray muons that passed through both OPERA and LVD during the periods (before mid-2008 and after Dec 13th, 2011) when the fiber was screwed in properly (normal condition [NC]) and found the data (green dots in Figure 8) agreed with the prediction of their model, the green line in Figures 6 and 8.
Then, they studied the muons that arrived when the fiber was screwed in badly (anomalous condition [AC], from mid-2008 through December 13th, 2011) and they saw that those muons (red dots in Figure 8) are indeed shifted down to the red line predicted by their model.
Having confirmed the model using cosmic-ray muons, they had now shown somewhat convincingly that they understood what was wrong with their timing and when it was wrong. But as good scientists, they weren’t going to stop there.
What they did next was to check whether their neutrino measurements were consistent with what they’d learned, independently, from the cosmic-ray muons. They had to account for the fact that neutrinos in OPERA-1 and OPERA-2 were not evenly distributed across their 0.6 second chunks of data. For technical reasons having to do with how CERN delivers neutrinos in the direction of OPERA, these neutrinos tended to arrive on the early side of the chunk for OPERA-1, and on the very early side of the chunk for OPERA-2. Based on this, they could now predict, given their model for the timing problems, what they should have measured for the arrival times of the neutrinos in OPERA-1 and OPERA-2; these predictions are shown as two red dots in Figure 6. The actual data, giving an apparent early arrival of 60 nanoseconds, are shown in Figure 8: the black circle marked “Standard ν runs” for OPERA-1, and the open circle marked “Bunched ν runs” for OPERA-2, and their locations agree with the red dots in Figure 6.
You see that it is a bit of an unpleasant and unfortunate accident that both versions of OPERA came out with the same result. Had CERN’s neutrino beam spread OPERA-1’s neutrinos out more evenly across each data chunk, the time delay measured at OPERA-1 would have been perhaps half as big as that of OPERA-2. They, and we, would have known already in November that something was wrong. And moreover, if the problem with the fiber’s connection had occurred, say, between 2009 and 2010 instead of in 2008, in the middle of OPERA-1, that too would have made OPERA-1 different from OPERA-2; in fact the change would already have been noticed in OPERA-1’s data. And finally, it would appear that a statistical fluke made the two results a bit closer than they would have been expected to be, given this model for the problems; OPERA-2 lies just a little bit above the prediction, and OPERA-1 a little below it. So OPERA had two and a half strokes of very bad luck… not unusual when an experiment goes wrong.
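The "perhaps half as big" estimate follows directly from the model. The sketch below is a back-of-envelope illustration under the linear-drift model with the badly connected fiber; the function name is hypothetical, and the arrival distributions are invented examples, not OPERA's actual neutrino timing data.

```python
DRIFT_AT_END_NS = 74.0   # clock drift accumulated by the end of a chunk
FIBER_SHIFT_NS = 73.0    # early shift from the badly connected fiber


def mean_offset_ns(arrival_fractions):
    """Mean apparent offset (negative = early) with the fiber badly connected.

    Each entry is the arrival time as a fraction of the data chunk:
    0.0 is the start of the chunk, 1.0 is the end.
    """
    fractions = list(arrival_fractions)
    if not fractions:
        raise ValueError("need at least one arrival")
    if any(f < 0.0 or f > 1.0 for f in fractions):
        raise ValueError("fractions must lie in [0, 1]")
    mean_fraction = sum(fractions) / len(fractions)
    return DRIFT_AT_END_NS * mean_fraction - FIBER_SHIFT_NS


# Hypothetical case: arrivals spread evenly across the whole chunk.
evenly_spread = [(i + 0.5) / 1000 for i in range(1000)]
# Hypothetical case: every arrival right at the start of the chunk.
all_at_start = [0.0]
```

For evenly spread arrivals the drift cancels about half of the fiber shift, leaving an apparent early arrival of roughly 36 nanoseconds, while arrivals concentrated at the start of the chunk keep nearly the full 73 nanoseconds. The measured 60 nanoseconds sits between these two extremes, as expected for neutrinos arriving on the early side of the chunk.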
But now, finally, their solid scientific detective work pays off, and they can walk off the stage with their heads as high as any experiment that’s been through a catastrophe possibly could. With their model working well (and they added another small check of the model which I’ve omitted for brevity), they can now work backwards and remove the effect of the improperly connected fiber, and of the clock drift, from the OPERA-2 measurement. They aren’t quite done with this yet, but they have a preliminary result, and it says that the difference between the corrected arrival times and the predicted arrival times is -1.7 nanoseconds with a statistical uncertainty of at least 3.7 nanoseconds, experimentally consistent with zero. Note, though, that this preliminary result contains an incomplete estimate of the uncertainties, and so the overall uncertainty will go up in the final result. In short, OPERA-2’s neutrino arrival time is now apparently consistent with Einstein’s prediction, though they still have more checks to do before they can say this with the confidence they want. So when this revised result is complete and appears in a preprint document, it is very likely OPERA will agree with the ICARUS experiment’s result from a couple of weeks ago, though within OPERA they must be very frustrated indeed that ICARUS used OPERA’s hard work (in setting up the unprecedentedly accurate timing and distance measurements from CERN to the Gran Sasso lab) to beat them to the finish line.
A Few Personal Comments
Note added: I recommend reading Antonio Ereditato’s letter about his resignation as leader of OPERA; it gives his point of view in some detail, and puts some perspective on these comments.
The OPERA team, as a whole, has done what any good scientific team would do: they’ve worked hard to make the best possible analysis of their experiment, and fix the problems they’ve found using a careful, clever, creative, and scientifically convincing technique. For this they deserve kudos and respect.
But that said, their speed measurement crashed and burned in a very distressing fashion. Any good experimentalist will tell you that when you think you might have made a radical discovery, one thing you need to do is make a list of all the elements of your experimental apparatus that could affect your measurement and check them all, every single one, in multiple ways. Clearly, you check all the wires and connectors in more than one way, one by one, especially those on which the whole experiment depends, such as those going into and out of the experiment’s master clock. And if your measurement is based partly on a precision timing measurement, you want to have made abundant checks that you don’t have a problem in your timing equipment. It is more than a little mystifying that OPERA’s neutrino-speed experts did not have a complete yearly check of their timing calibration, or at least a final thorough check after OPERA-1 was over before announcing a result. How could a big problem that arose in 2008 go undetected until the end of 2011, after the experimental result was announced? Maybe there was a good reason that isn’t obvious, but right now the OPERA folks who were in charge of this part of the experiment (and I remind you that the neutrino speed measurement was a small part of the OPERA program, so most of the OPERA people were not directly involved) have some serious explaining to do.
There is also the question of why the discovery of a problem in mid-December was not revealed (possibly under pressure from some direction, though the details remain murky) until mid-February. I do see one reason why there may have been such a delay. [However, the statement from Ereditato suggests that the OPERA/LVD collaboration only occurred quite recently, and casts doubt on the rest of this comment; instead it suggests that there was some other internal concern until February, but it isn't clear what it was.] Perhaps not everyone in OPERA was convinced that the fiber and the clock were the source of the main mistakes (after all, there was a complicated combination of a shift and a drift in opposite directions), and as we’ve seen, a crucial way to check the story’s consistency was to carry out the comparison of the timing at OPERA and at LVD using the cosmic-ray muons that enter both experiments. There are only about a dozen of these muons per month, so in order to gather enough of a statistical sample after the fiber was properly screwed in so that they could be sure the problems had been fixed, they needed to take data for much more than just a couple of weeks.
Nevertheless, while it is not unethical to want to study a problem in detail before talking about it — it reflects wanting not to create additional confusion by saying something that itself turns out to be wrong, and also reflects wanting to preserve one’s reputation by briefly (!) not revealing a mistake until it can be fully diagnosed and corrected for — in this particular case questions have to be raised, because quite a few other scientists were working hard on future efforts to confirm or refute OPERA’s measurement. The fact that internal questions about OPERA’s results were not revealed for two months may have wasted the time and money of a number of scientists who would have chosen to do something else had they known about this. In my mind this may have crossed a line — not a line of professional ethics, perhaps, but certainly a line of professional judgment. Of course, there may be issues that the OPERA people were dealing with of which I am not aware. But I think we are owed an explanation, even if it is an embarrassing one for some individuals.
Overall, this story currently appears to be a classic case of how not to handle a possibly sensational result. What are the lessons?
- You do not go full-speed ahead with a big public presentation and a press conference if you haven’t, in fact, done all the basic internal checks of your equipment.
- You don’t say “we have a timing anomaly (oh, and by the way, if it is not a mistake you can interpret it as a violation of Einstein’s relativity)” because the only thing people outside physics will hear is “violation of Einstein’s relativity”; you have to bend way over backwards to call it a “timing anomaly” again and again and again, much louder and more assertively than the OPERA people managed to do it.
- Instead of having a big public talk at CERN (thereby dragging other people into your mess) when you announce your result, you should announce the anomaly quietly, at a little mini-workshop, and save your big public talk (and any damage to your reputation, and everyone else’s, the guilty and the innocent) until you confirm it with a second measurement. [Granted, at OPERA that second measurement probably would have been OPERA-2, which did in fact confirm OPERA-1, so in this case we would still have had some of the same hullabaloo; but it would have been much easier to forgive the OPERA leadership in that case.]
- Or if you do decide to have a big public talk when you announce your initial result, you don’t then reveal that you’ve found and fixed your errors at a quiet little mini-workshop at Gran Sasso. That’s like newspapers printing the corrections to their erroneous front-page stories at the bottom of page seven (which happened on a regular basis amid the often dreadful mainstream-media reporting on this story.) I hope there will at least be a big public talk when the updated OPERA-2 result becomes official.
Scientific mistakes are forgivable; at the forefront of knowledge, where new techniques are being tried out for the first time, mistakes are going to happen. Some mistakes are worse than others, but even bad ones are going to happen to good people sometimes. The issue here is not the scientific errors themselves, but the bad judgment about how to handle a potentially sensational but probably wrong scientific result. It is impossible to say whether or not the leaders of OPERA, who damaged their own reputations — along with those of their experiment and their collaborators, of a couple of supporting laboratories, of the research field of particle physics and perhaps of science as a whole — will pay a long-term price. But for now, their resignations from their leadership positions in OPERA (though not from the OPERA experiment) do not seem inappropriate.