A while back, Mark talked about some of the professions in the computer industry. Unfortunately, I can’t seem to track down those articles. Instead of waiting for him to talk about the one I am part of, I thought I would do it myself.
On the train back from Paris, where I spent the day in an intensive troubleshooting session for one of our customers, I am reminiscing about the work I achieved today and the good decision I made, which helped us identify a potential cause of the suspend/resume issue that we were investigating.
First of all, this time I was not the one with the knowledge: Colin King was. I was there to assist and help in identifying why the laptop was not resuming from suspend. Since the customer plans to deploy 2000 of this specific model, it is better to identify and fix this kind of issue beforehand.
So we went in, him somewhere in England, me in Paris, both in a constant IRC chat, shooting ideas back and forth. Well, mostly him sending ideas and me shooting back the results and observations, things that I was seeing, whether they seemed important or trivial to me. In those situations, where you are part of a chain and not necessarily the one with the most knowledge of the issue being investigated, it is very important to avoid judging the importance of the information that you provide.
Too often, the dreaded “oh, I thought that this wasn’t important so I didn’t bother to mention it” spell just brings tons of bad magic to the investigation effort. I learned this the hard way, having too often been at the other end of the chain, trying to make sense of an investigation with the help of a more junior support colleague or a customer who, too many times, has more important things to do.
This time, one of the most important pieces of information came about this way:
(12:08:29) caribou: it’s about lunch time, I may step out to grab a sandwich in the meantime
(12:12:08) cking: sure
(12:12:16) caribou: fyi, it doesn’t completely powers off
(12:12:30) cking: AH
(12:12:45) cking: this means it’s definitely a problem with the power management on the southbridge
(12:12:51) caribou: after the “shutdown”, the screen stays on with [28. ….] Power down.
(12:13:19) cking: if it can’t S5 then we can do a S3 either, that’s the same power management register on the southbridge
By simply indicating that, after invoking the ‘shutdown’ command, the laptop did not power off but stayed on in some limbo, I brought to Colin’s attention a situation that was trivial to me but very important to him. I did not know that the suspend/resume functionality and the power-off functionality use the same mechanism, the same power management register, and that if one did not work, the other couldn’t work either!
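For readers who want to poke at the same area on their own Linux machine, here is a minimal sketch in Python. It is not what we ran during the session; it only reads the standard `/sys/power/state` file to list the sleep states the kernel advertises. The helper names (`parse_sleep_states`, `describe`) are made up for this illustration, and the label-to-ACPI mapping is the usual one, which can differ on some systems.

```python
from pathlib import Path

# Standard Linux sysfs file listing the sleep states the kernel can enter.
POWER_STATE_FILE = Path("/sys/power/state")

# Usual mapping from kernel labels to ACPI sleep states (an assumption:
# "mem" normally means S3, but some systems map it to suspend-to-idle).
LABEL_TO_ACPI = {"standby": "S1", "mem": "S3", "disk": "S4"}


def parse_sleep_states(raw):
    """Split the whitespace-separated content of /sys/power/state."""
    return raw.split()


def describe(states):
    """Annotate each kernel label with its usual ACPI state, if any."""
    return [
        "{} ({})".format(s, LABEL_TO_ACPI.get(s, "no direct ACPI equivalent"))
        for s in states
    ]


if __name__ == "__main__":
    if POWER_STATE_FILE.exists():
        states = parse_sleep_states(POWER_STATE_FILE.read_text())
        print("Advertised sleep states:", ", ".join(describe(states)))
    else:
        print("No /sys/power/state here; is this a Linux system?")
```

Keep in mind that this only shows what the kernel believes is available. As this session showed, a state can be advertised and still fail on the hardware side, so the output is a starting point and not a diagnosis.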
From that point on, we were able to target the whole power management subsystem. After a few more tests on the operating system side, cking became convinced that it was a hardware issue. One easy way to find out is to swap the operating system and see if the problem persists. So there I went, installing Windows 7 on this laptop, then the graphics driver to get the suspend functionality working, so that I could finally test the same functionality on Windows 7.
One other important thing highlighted by cking was to make sure that Windows 7 was also using ACPI and not the older APM, which would have rendered our test useless. A few minutes with the support engineer’s best friend, the web search engine, pointed to many articles on Microsoft’s website talking about ACPI and Windows 7. That was sufficient to confirm that it was also using ACPI.
So I went ahead and triggered the “Suspend” functionality on Windows 7. The laptop’s screen went blank, but the power button kept blinking with a few more LEDs turned on, which was the main symptom on Ubuntu as well. This, and the fact that shutting down Windows 7 left the laptop in the same kind of half-powered-down limbo, was enough to convince us that we were facing a hardware problem and not some bug in our operating system. The customer still has to find out whether this is systematic on that model of Sandy Bridge laptop or only happens on this single unit. But now they know where to look, especially since their backup plan was to deploy those laptops with Windows 7 instead of Ubuntu.
I wanted to write about today’s session because I think that it shows the kind of investigative work that is expected from a support engineer. Most of the time, we are not there to fix bugs, but to clearly identify them and to find ways to reproduce them as reliably as possible. Our job is to provide the data that will be necessary to identify the flaw and fix it. Sometimes, the first action is to find workarounds so the user can continue his work while we look for a more definitive solution. Some of us will even go further and suggest fixes to the developers or at least help them in fixing those bugs.
A support engineer needs to be investigative, to keep an open mind, to avoid the mistake of being convinced that he already knows the answer before starting to look at the data. He must be able to ask for help, to be humble about his knowledge and respect the knowledge of others. And often, it is best to take a step back and look at a problem from different angles. Sometimes the solution is easier to find from a few steps back.
I have been doing this job for almost fifteen years and I still get a kick out of identifying bugs, helping users with their issues, and trying to make their computing experience as easy as it can be. I know that many of my colleagues feel the same about it. I also think that a support engineer is not just some sysadmin who doesn’t have the skills required to become a developer. I’ve seen a few too many devs digging in the source code, looking for answers to a problem, when simply opening the log files and searching for known issues was enough to identify and fix the problem.
And don’t get me wrong, I admire the skills of the developers and I’m still hoping to get more and more involved in work similar to theirs, but I also have great respect for all those support engineers who go after the next problem and find a fix for it.