Pride before the stall

A couple of readers sent me the link to this story:

It remains the mystery at the heart of Boeing Co.’s 737 Max crisis: how a company renowned for meticulous design made seemingly basic software mistakes leading to a pair of deadly crashes. Longtime Boeing engineers say the effort was complicated by a push to outsource work to lower-paid contractors.

The Max software — plagued by issues that could keep the planes grounded months longer after U.S. regulators this week revealed a new flaw — was developed at a time Boeing was laying off experienced engineers and pressing suppliers to cut costs.

Increasingly, the iconic American planemaker and its subcontractors have relied on temporary workers making as little as $9 an hour to develop and test software, often from countries lacking a deep background in aerospace — notably India.

In May I wrote a post about how I thought the cause of the Boeing 737 Max crashes was as much organisational failure as technical fault. The Devil’s Kitchen reposted it on Facebook and someone appeared in his comments:

His response is pretty much what you’d expect from an engineer: appeal to his own authority and then dive head first into the details of his own area of expertise thus missing the broader point. Nobody is saying that diversity policies at Boeing caused the technical failure. Instead, I am talking about organisational failure whereby a company which is prioritising diversity to the extent they have 42 councils devoted to the issue is likely to lose focus on other areas. This in turn leads to poor practices being adopted and standards dropping in certain places, which combine to deliver less than optimal outcomes. Sometimes this manifests itself in higher staff turnover, other times a less efficient production process, others lower revenues, and so on. In such an environment, the risk of a problem going undiscovered or kludged increases, and the more complex the organisation the greater that risk. This is organisational failure leading to technical failure.

One of the biggest hazards a complex organisation faces is losing control of its supply chain. This is what I think happened in the Miami bridge collapse: there were a lot of subcontractors and nobody seemed to be in overall charge. It’s not that the engineers were necessarily bad, it’s just the management didn’t seem to have an idea what was going on, let alone were in control. So getting back to the story of the Indian software engineers on $9 per hour, I doubt the root cause of the 737 Max malfunction was some ill-trained programmer tapping in the wrong code because he was distracted by the cricket. But what it does tell you is that Boeing appears to have lost control over its supply chain to the degree that practices are popping up in it which probably shouldn’t be there. This is organisational failure, and although it is not caused by having the top management so focused on diversity, the fact that the top management is focused on diversity is a reasonable indication of organisational failure. Which can be denied – right up until planes start dropping out of the sky.


37 thoughts on “Pride before the stall”

  1. Pretty much all components are supplied by the lowest bidder, yet still most things, apart from Windows, don’t crash. The hourly rate of the software bods is irrelevant.
    Disasters almost never happen from a single cause. It takes multiple cock-ups at the thingummy / human interface to really screw things up.
    It is the boss's job to make sure things are done right. Boeing, like BP with Macondo, may have been let down by its subcontractors. Tough. It was their job to integrate the components, systems and humans so that any cock-up could be contained.

  2. There was an excellent article on this that seemed to explain things pretty well, which annoyingly I can’t find at the moment. The gist of it was something like this:

    1. Boeing is working on a new plane (i.e. not the 737)
    2. Airbus look like they will win a big order (from Delta I think)
    3. Boeing panic and need to get a plane to compete with Airbus. They don’t have time, so they decide to upgrade the 737.
    4. The necessary engines are too big for clearance, so they have to push them forward.
    5. This makes it *inherently* unstable – if the nose pitches up, the design is such that it will become catastrophic rather than returning to equilibrium.
    6. Hence the software – if it detects the pitch up it is programmed to pitch down.
    7. There are two detectors on either side. The software relied on only one to judge pitch, rather than deciding after getting both inputs.
    8. If that input is false, it can think the plane is pitching dangerously up and *automatically* pitch down. Hence the two crashes as the pilots kept trying to bring the nose up as the autopilot kept pushing it down.

    There is other stuff, like you mention here, and the fact that the pilots were seemingly not told about this (Boeing needed to pretend this was essentially the ‘same plane’ so no new expensive certification or pilot training was needed).
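    The single-sensor failure mode described in the list above can be sketched in a few lines of Python. This is a deliberately simplified illustration only; the function names, threshold and units are invented for the sketch, not Boeing's actual implementation:

    ```python
    def mcas_command(aoa_left_deg, aoa_right_deg, threshold_deg=12.0):
        """Simplified illustration of the reported design flaw: the system
        trusts a single angle-of-attack (AOA) vane instead of cross-checking
        both available sensors."""
        aoa = aoa_left_deg  # only one vane is read; the other is ignored
        if aoa > threshold_deg:
            return "trim_nose_down"  # automatic stabiliser input
        return "no_action"

    # A faulty left vane reading 25 degrees triggers nose-down trim even
    # though the right vane shows the aircraft flying normally.
    print(mcas_command(25.0, 2.5))   # trim_nose_down
    print(mcas_command(2.5, 2.5))    # no_action
    ```

    The point of the sketch is that `aoa_right_deg` never influences the decision, so a single failed vane is enough to command repeated nose-down trim.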

  3. If an obsessive focus on diversity and wokeness distracts companies from their core business, Google Search will soon be unable to locate pornography and a Gillette razor is going to take someone’s face off.

  4. like BP with Macondo, may have been let down by its subcontractors.

    As far as I could tell, a good chunk of Macondo was caused by company men with no relevant experience on deepwater rigs arrogantly telling their experienced, competent subcontractors what to do because they work for a big oil company, don’t you know? An attitude which unfortunately prevails across the supermajors.

    Of course, it didn’t help that the shear rams closed on a tool joint. Or rather, didn’t.

  5. There was an excellent article on this that seemed to explain things pretty well, which annoyingly I can’t find at the moment.

    This one?

  6. Software development can be outsourced/offshored very effectively. Companies do this without serious issues all the time.

    Of course, the added distance/timezone/contractual elements mean the client needs to be exceptionally diligent when defining their requirements and testing the product when delivered.

    If software is at the root of these disasters, the reason will be somewhere in that last paragraph.

  7. Is he right when he says ‘all modern aircraft rely on software for stability’? I thought the issue with the 737 Max was that the engines being moved forward meant the centre of gravity was shifted forward, and it would not fly level unassisted, it always wanted to nosedive? That it had become fundamentally unstable in the air, rather like military jets, which are indeed only kept in the air by computer wizardry? Is this true of all passenger aircraft nowadays or is the 737 Max the outlier here?

  8. In my not at all humble opinion, this post would be perfectly valid and not miss a beat if the words “pompous twit” were omitted.

  9. In my not at all humble opinion, this post would be perfectly valid and not miss a beat if the words “pompous twit” were omitted.

    Fair point, and I was thinking about not including that. Hey, I’m allowed to be grumpy sometimes! Edited.

  10. @Jim

    Actually the other way round. The engine nacelles generate lift, particularly under high power – putting them so far forward tends to make the aircraft pitch nose up, particularly on power. You could thus end up with the plane approaching a stall, the pilot opening the throttle to gain airspeed which induces a nose up attitude, pulling it into a stall. MCAS was intended to counter this behaviour by detecting the nose up state and moving the stabiliser to push the nose down.

    Although “unstable” in this way, a crew aware of the aircraft’s propensity to pitch up should be able to fly it perfectly fine without MCAS (i.e. it isn’t dynamically unstable in a way that becomes uncontrollable without the computer – it just potentially requires some control inputs which don’t feel very natural). However, the certification requirements require something (natural or computer controlled) to automatically counter such tendencies.

  11. From what I can see, it looks like part of the problem was under-trained 3rd World pilots — which is perhaps a more understandable version of the “Diversity” cancer. But Boeing knew to whom it was selling aircraft, and should accordingly have been even more careful about the need for their aircraft to be fail-safe.

    If Boeing had been an old-fashioned Japanese company, the CEO would have resigned (if not committed seppuku). If Boeing CEO Dennis Muilenburg were an honorable man, he would have worked his way down the chain firing everyone who bit his tongue about this potential problem and putting entire relevant departments on zero bonus for this year — and then resigned without any golden parachute. If …

    After all, the whole point of all those “Diversity” councils is that there will be consequences for anyone who is not seen as being sufficiently pro-homosexual or whatever. If Muilenburg made it clear that incompetence or covering up problems would be punished even more severely than lack of lip-service to “Diversity”, the company would be back on the right path.

    But old Dennis did not resign. And the useless Board of Directors did not fire his accountable ass. Any company would be much healthier if Directors were required to hold at least 25% of their personal total net worth in stock of that company — giving them real skin in the game.

  12. Re BP’s Macondo failure — we all understand that real world disasters are usually caused by the concatenation of a number of individually low-probability events, which is why major disasters are fortunately quite rare. Many things have to go wrong in sequence before the situation ends in disaster. Interesting question is — what was the first failure on the chain?

    A number of Brits, defending BP because its name used to be British Petroleum, immediately jumped on the US manufacturer of the Blow-Out Preventer. But the initial cause was earlier — and more elementary.

    One of the standard issues faced on wells under construction is the possibility of a blow-out. Gas enters the well at the bottom, and rises up the well, expanding as it goes. This event probably happens somewhere in the world almost every day, and drillers train constantly on how to deal with the problem.

    Working in the driller’s favor, the gas enters the well miles underground, and it can take several hours for this gas to reach the surface, giving the driller time to respond. Also working in the driller’s favor, the gas in the wellbore displaces mud from the well, resulting in the immediate impossible-to-ignore phenomenon of the mud pits on the rig overflowing. So the driller has a clear warning that a bad situation is developing, and enough time to correct the situation through a number of different appropriate actions (of which the Blow Out Preventer is only one).

    So what did BP do wrong in Macondo which resulted in them missing this clear warning? They were behind schedule and over budget on the well. To save a few hours before the planned imminent rig move, BP took the high-risk decision to start unloading mud from the mud pits before the well was secure. Since they were pumping mud off the rig onto a supply vessel without keeping careful track of volumes, they did not realize that the volume of mud in the pits was growing.

    It was an entirely avoidable accident, caused (as Tim suggests) by arrogant BP staff. They were presumably trying to get brownie points within their company by unwisely shaving costs.

  13. Oh that’s the John Band, the know-it-all Sunday socialist marketing consultant who emigrated to Australia. He used to write sneery know-it-all comments on TW’s site. There’s one I especially recall and treasure from just after the bank collapse: his expert analysis of Nat West’s accounts showed him that, after a bit of pain and reorganisation, there was a thundering good bank in there.

    He didn’t warn us, however, that after more than 10 years they still haven’t managed to locate it.

    His other expertise includes train and railway design and management, everything to do with airlines and travel, social housing… etc. etc. His twitter account used to be a stream of arrogance that was extraordinarily funny for people like me. Maybe I should try to find it again. He used to have a couple of mates who were only too ready to play pontiff and congratulate each other on their infinite sagacity.

  14. His handle was John B, but somebody with rather more realistic views started posting under that name. So there he is with the Yorkshire ranter and d-squared, effortlessly solving all problems.

  15. His handle was John B, but somebody with rather more realistic views started posting under that name.

    Yeah, I think he had a blog for a while, and I guessed it was the same chap from TW’s blog. Same name, same attitude.

  16. Jim: “Is he right when he says ‘all modern aircraft rely on software for stability’?”

    I cannot comment on all, or even any current aircraft. But as far back as the Airbus A310 (and of course Concorde before that) fuel transfers to the tailplane/rear are used for CG control.

    For Concorde, this was essential for the supersonic phase, as the CG moves a lot. For the Airbus, this is to reduce fuel burn by reducing the (negative) lift provided by the tailplane/stabiliser: i.e. instead of the tail ‘pushing downwards’ you move a bit of weight aft to do the same. Less lift, less drag, less fuel burnt.
    This CG shift reduces the plane’s stability to such an extent that the move is only done in the air, after the turbulent ground air is left. In Airbus A310, the move is started, calculated and stopped by software running on a 6802 micro.
    So the moment the plane takes off, the computer starts destabilising it, and software written by me tells it when to stop. Enjoy your flight!

    The commercial argument is so strong I expect all subsequent airliners now do this. If they do, then yes, they depend on software to retain stability, quite apart from any fly-by-wire stick-fixed instability considerations.

  17. @Gavin: While there have been questions about Lion Air and the Indonesian regulator in the past, Lion Air succeeded in getting themselves off the EU blacklist, so they can’t have been too bad. Plus, a different crew cleared the problem the previous day, so standards are clearly not uniformly poor there.

    Ethiopian, on the other hand, is a different matter: They operate one of the largest and best airline training organisations in the world (I visited them once, they’ve an impressive setup) and for me, there is no question about their competence. As far as I can tell, the accident pilot (effectively working alone, since the 1st officer was very inexperienced) only made one small mistake: He let the airspeed get too high because he was distracted by the aircraft’s other fuckology. That meant he couldn’t re-trim the aircraft to a nose-up attitude due to lack of physical strength. But to get the power assistance back, he had to turn the MCAS back on. Talk about a rock and a hard place.

    Also, as somebody else said, it’s not the $9/hr programmers who are likely to have fucked up: It’s the specification that decides how many AOA sensors to use, and the specification is Boeing’s responsibility. They might have outsourced it, I don’t know, but they’re responsible. The specs I see where I work are insanely detailed, and crucially, experienced people who are not working under threat of being fired for giving bad news are there to verify that every line has been implemented properly.

    The above are personal opinions only, may be incorrect and have nothing to do with my employer.

  18. There is a great deal wrong here.

    I’ll start with this:

    This makes it *inherently* unstable – if the nose pitches up, the design is such that it will become catastrophic rather than returning to equilibrium.

    No. This is completely, utterly, wrong.

    Because the nacelles are both larger and further forward than on the previous model, in a high angle of attack situation, the nacelles shift the center of lift slightly more forward than had previously been the case.

    That does not make the airplane unstable, it just means that in a high AOA situation, the MAX would require more nose down control force to get the same response as the previous models. MCAS was introduced to make the flying characteristics the same.

    There are two detectors on either side. The software relied on only one to judge pitch, rather than deciding after getting both inputs.

    I’m not an expert in the certification process — a subject that essentially all reporting on the MAX has completely ignored — but one aspect of it is trading off complexity against reliability. Certainly, there is no cost advantage to be had by relying on only one AOA vane.

    There is a very short and easy non-normal procedure for un-commanded pitch trim — regardless of cause. My guess is the designers decided that the extremity of a high AOA condition required the most robust implementation possible.

    … the fact that the pilots were seemingly not told about this.

    I recently did a rough page count of my flight manuals. Something around 1000 pages.

    There is always tension between keeping the manuals to a manageable size — particularly after flight management systems were added to the pile — and not excluding necessary information.

    Pilots are divided on this — about as many think Boeing should have included it as don’t. I’m on the latter side.

    I thought the issue with the 737 Max was that the engines being moved forward meant the centre of gravity was shifted forward, and it would not fly level unassisted … Is this true of all passenger aircraft nowadays or is the 737 Max the outlier here?

    This is completely wrong.

    The engine nacelles generate lift, particularly under high power – putting them so far forward tends to make the aircraft pitch nose up, particularly on power. You could thus end up with the plane approaching a stall, the pilot opening the throttle to gain airspeed which induces a nose up attitude, pulling it into a stall. MCAS was intended to counter this behaviour by detecting the nose up state and moving the stabiliser to push the nose down.

    This characteristic is true of all airplanes with wing mounted engines. The MAX differs only slightly in degree.

    From what I can see, it looks like part of the problem was under-trained 3rd World pilots …

    BINGO!

    As far as I can tell, the accident pilot (effectively working alone, since the 1st officer was very inexperienced) only made one small mistake: He let the airspeed get too high because he was distracted by the aircraft’s other fuckology.

    That isn’t a small mistake, it is a huge mistake. It is an epic mistake.

    They failed to get the autothrust system out of target thrust mode into a vertical mode. Depending on the departure procedure, this should have happened at either 1000 or 1500 feet above the airport, before retracting the flaps and slats. Which is also before MCAS is armed.

    That failure meant the airplane was going to go as fast as the engines could make happen.

    Assume that at 200 knots, the nose down force on the yoke would have been 80 pounds.

    At 400 knots, that force quadruples, to 320 pounds.
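    That arithmetic follows from aerodynamic control forces scaling with dynamic pressure, i.e. with the square of airspeed. As a quick sanity check of the numbers in the comment (the 80 lb figure is the commenter's assumption, not a measured value):

    ```python
    def yoke_force(force_ref_lb, speed_ref_kt, speed_kt):
        """Control force scales with dynamic pressure, i.e. with the
        square of airspeed (constant altitude and air density assumed)."""
        return force_ref_lb * (speed_kt / speed_ref_kt) ** 2

    # Doubling airspeed from 200 to 400 knots quadruples the force.
    print(yoke_force(80, 200, 400))  # 320.0
    ```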

    Step one in any emergency: maintain aircraft control.

    See also Air France 447

  19. @Jeff: Some good points, but I have three disagreements with you:

    – Boeing themselves categorised an MCAS failure as Hazardous (instead of the next level up: Catastrophic, which in hindsight may have been more appropriate), and so it should not have been relying on a single sensor input. The failure probability for a single sensor is just too high, and AOA vanes lead a particularly hard life.
    – You can argue that the pilots don’t need to know about MCAS (though I disagree – 737 pilots are used to having absolute control and changing that philosophy surely deserves a mention), but the Line Maintenance engineers who need to troubleshoot the aircraft sure as hell DO need to know about it, and it was missing from their manuals too. To me, that looks more like the manufacturer forgot to put the info in, or maybe that they didn’t want to…
    – We’ll see when the final reports come out, but I think you’re way off base about poor quality pilots. That poor guy had massive workload going on with intense stress. A single mistake (and ok, I accept your obviously expert input that it’s a big one) leading to an overspeed shouldn’t be a death sentence. It should cause an alarm, an opportunity to correct, and a bollocking from the base Captain when he gets back, but the system shouldn’t make (AOA sensor failure) + (one airspeed fuck up) = (death for all on board). I mean, that’s gotta be one of the shortest “Swiss cheese” failure models ever.
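    The single-sensor objection in the first point can be made concrete. A minimal cross-check, of the sort argued here to be appropriate for a Hazardous-category system, might look like this (an illustrative sketch only; the disagreement threshold and names are invented, and real avionics voting logic is far more involved):

    ```python
    def validated_aoa(aoa_left_deg, aoa_right_deg, max_disagree_deg=5.0):
        """Return a usable AOA value only if both vanes agree;
        otherwise flag the data invalid so automatic trim is inhibited."""
        if abs(aoa_left_deg - aoa_right_deg) > max_disagree_deg:
            return None  # disagreement: disable automatic trim, alert crew
        return (aoa_left_deg + aoa_right_deg) / 2.0

    print(validated_aoa(2.0, 3.0))    # 2.5 (sensors agree)
    print(validated_aoa(25.0, 3.0))   # None (one vane has failed)
    ```

    With only two sensors you cannot tell which one failed, but you can at least detect the disagreement and hand control back to the crew rather than acting on bad data.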

    I haven’t worked with the Ethiopian pilots, but I have worked with their engineers, and they are well on top of things. They deal with 787s, which have occasionally been troublesome and are very complex, and they have amazing training facilities, so the idea that this is some two-bit shithole operation just doesn’t match the reality I’ve experienced at all.

  20. Yeah, that guy’s response bugged me. I think you nailed it in regard to the organization taking its eye off the ball. There’s soooo much of this, especially in government and I greatly fear in our collective militaries as well.

  21. There’s no single cause of failure in this. Just like there was no single cause of failure for Grenfell, or the Miami bridge or the Genoa bridge or anything else.

    This is why everyone who wants to be an expert can point to some bit – the pilot, the MCAS s/w, the AOA sensor, the engine position etc. – point out that that bit was bad and be partially correct. And also point out correctly that some other bit was perfectly safe because other people didn’t have a problem with it.

    In this case I think most of the fault can be laid at the feet of decisions made for sales/marketing reasons rather than pure engineering ones, because they set the scene for the various compromises that eventually led to the MCAS doing stuff that the pilot did not anticipate. However, no company in the world can build things without paying attention to sales/marketing, because if they ignore it they’ll go bust. In places like the Soviet Union, where technically sales/marketing wasn’t a thing, it actually was – just renamed to political toadying and bureaucratic brown-nosing rather than explicit sales/marketing with glossy brochures, PDFs and powerpoints.

    In this case it seems clear that the primary cause – the fundamental root of the problem – was a deliberate effort to make this look like (and be, as much as possible) an incremental evolution of an old design, rather than biting the bullet and admitting that maybe it would be better to start from scratch.

    The root cause of that decision appears to have been that time to market for a brand new design would have been many years longer, and a significant part of that delay was due to the fact that government regulators would insist on far longer and more in-depth testing of a new design vs an “incremental improvement”.

  22. That’s an interesting take on it there, Francis. I haven’t followed this closely but I get the gist of what you are saying.

    So in your scenario, what then is the solution:

    1. Reduce approval turnaround time; or

    2. Introduce a reduced-liability aircraft, i.e. if you fly on this you need to know that it isn’t over the line (not sure about those beneath, but it might be a solution); or a combination of 1 and 2; or

    3. Self-regulate, since regulation with all of its bloated largeness also failed here.

  23. @Bardon: I see this argument frequently online – regulation failed in one instance, so let’s just scrap it (thereby ignoring all the good it has done). But do you really think they would have done a brand-new design any better?

    I don’t. As evidence, I refer you to the 787 battery fires. And the fact that it is clear that the FAA wasn’t enforcing much of anything on the 737max, so I don’t see how you can blame regulations…

  24. @HF

    I don’t have a position on this one, I was picking up on Francis’s summation and using that as a basis, and threw up some suggested solutions to it.

    Fully appreciate that it was an off-the-cuff option list and interested to see which way he would go. What about you – which way would you move forward based on Francis’s position?

    I have worked on the regulation side of the fence, when I was a young budding engineer with a fairly large state highway authority and made it into their coveted Bridge Branch. I was fortunate enough to work with some undoubted captains of industry – some experts don’t chase the big bucks and are content plying their trade diligently and to the max in the lower-paid regulatory sector, and I have utmost respect for these consummate professionals. Not forgetting that the civil infrastructure sector, whilst high public risk, is orders of magnitude less risky and complex than the aircraft sector. That system transitioned somewhat to self-regulation, whereby the contractors were responsible for quality assurance and the like. We (and I) were very skeptical of this, but it has proven to be a better and more efficient system than state regulation. When this changing of the guard was taking place we had weekly roadshows to Chinese contingents being introduced to this new way of contract surveillance – this was in the mid-nineties and well before China became the power that it is now.

    But I do think that, at the highest level and in most cases, the state should not be involved with these types of activities; it was never intended that the state perform these types of roles. It is very difficult for a large organisation to have such a diverse range of offerings as the state does, particularly specialized ones, and remain focused and the leader in each.

    Independence is important, so maybe the regulator is a private company as opposed to “self” regulation. But again, if the market did go to self regulation then the market would force these organisations to regulate themselves properly, or they would suddenly be out of business at the least.

  25. The regulation thing is a fair point.
    What is its purpose? To protect the public and enforce some level of safety (for example).
    So the same level of safety should apply to any newly delivered aircraft, regardless of whether it is an upgrade or a new design.
    The current system admits that ‘upgraded old designs’ are not as safe as new designs, and openly encourages manufacturers to game the system.

    I recall one of the Airbus vs Boeing rows was about just this: the decision time needed for a take-off abort decision around V1. Boeing claimed their plane was an upgrade, so needed only the 1950s decision time of 1 second. Airbus, as a new design, needed the new period: 2 or 3 seconds (I forget). That makes a huge difference to the payload and range for taking off on any real runway.

    Boeing came unstuck on this one too: a very famous incident at Gatwick came within a whisker of a major disaster when a maximally loaded Boeing had just this decision time tested. They started dumping fuel while still on the runway and cleared Crawley hill (a housing estate) by not very much. An Airbus would have been quite OK, as the extra time was enforced on it.
    In this case, no one died. But the message was clear.

    Whatever the rules, people will adjust to maximise their benefit under those rules. So they had better be relevant. I think this recent case shows how divergent the FAA regulations have become from their purpose.

    And it isn’t just planes… VW?

  26. [HibernoFrog]: Boeing themselves categorised an MCAS failure as Hazardous (instead of the next level up: Catastrophic, which in hindsight may have been more appropriate), and so it should not have been relying on a single sensor input. The failure probability for a single sensor is just too high, and AOA vanes lead a particularly hard life.

    Hazardous is the correct category.

    The distinction with Catastrophic is important. A catastrophic failure is one from which recovery is either impossible or extremely difficult, and for which there is no alternative system or procedure.

    An engine driven generator failure is in the Catastrophic category, because its loss will very likely lead to the loss of the airplane unless there is another independent source of power.

    In contrast, the Hazardous category is for those systems that, without aircrew intervention, will create a situation that has an elevated likelihood of losing the aircraft. There are lots of items in the Hazardous category. For example, the Captain’s radar altimeter is Hazardous, because some failure modes can lead to the loss of the aircraft absent aircrew recognition and correct response using a procedure that all qualified pilots can accomplish without error. Google [turkish airlines 737 Amsterdam crash].

    Which is why MCAS is, and will remain, Hazardous, along with every other element of the primary pitch trim system. Otherwise, every airliner ever built must be immediately grounded.

    The non-normal procedure in response to uncommanded pitch trim is simple, and is essentially the same for all airliners (absent Airbus after the A300):

    1. Counter the uncommanded input using the yoke mounted pitch trim switch.
    2. If that does not correct the problem, Primary Pitch Trim switches — cutout

    The Lion Air and Ethiopian crews failed to properly diagnose the situation and take the appropriate corrective action.

    That is the bottom line. They crashed completely flyable airplanes.


  27. [HibernoFrog]: You can argue that the pilots don’t need to know about MCAS (though I disagree – 737 pilots are used to having absolute control and changing that philosophy surely deserves a mention).

    The argument against including MCAS is based on:

    1. No cockpit indications
    2. No switches or controls affecting MCAS
    3. MCAS does not introduce any new procedures.

    I am completely persuaded. Back in the day, pilots were required to memorize all manner of facts that were completely inconsequential to operating the airplane. For instance, I don’t need to know the normal operating pressure range of the hydraulic system, because it will throw a warning if pressure goes outside that range.

    So either the argument against including MCAS holds, or we will go straight back to the bad old days of memorizing stables full of useless horseshit.

    As for 737 pilots being used to getting complete control, MAX is no different than its predecessors:

    1. Autopilot Disengage Switch — Depress
    2. Autothrottle Arm Switch — Off
    3. Primary Pitch Trim — Cutout

    That would take less than half the time to do than it did to type.

    Line Maintenance engineers who need to troubleshoot the aircraft sure as hell DO need to know about it, and it was missing from their manuals too.

    I’m not a mechanic, so I’m treading on thin ice here, but I don’t think that is true.

    If there is a problem with MCAS itself, it will throw status messages. Response: replace MCAS unit. (If such a thing even exists as a physical line replaceable unit.)

    If there is an aircraft problem outside MCAS, then knowing about MCAS is no help.

    Lion Air maintenance released an aircraft that was not flight worthy, and the existence of MCAS documentation wouldn’t have changed that in the slightest.

    We’ll see when the final reports come out, but I think you’re way off base about poor quality pilots. That poor guy was dealing with a massive workload under intense stress.

    These crews crashed completely flyable airplanes for reasons well beyond MCAS. They did not fly at an appropriate speed for the out-of-trim condition — slowest clean wing maneuvering speed, around 210 knots. The out-of-trim loads on the yoke increase in proportion to the square of the airspeed.
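    The square-law claim above can be sketched in a couple of lines; the speeds are illustrative, not taken from either accident report:

    ```python
    # Hypothetical illustration: out-of-trim yoke load scales with dynamic
    # pressure, i.e. roughly with the square of airspeed.
    def relative_trim_load(speed_kts: float, ref_speed_kts: float = 210.0) -> float:
        """Out-of-trim yoke load relative to the load at the reference speed."""
        return (speed_kts / ref_speed_kts) ** 2

    print(relative_trim_load(210))            # 1.0 by definition
    print(round(relative_trim_load(350), 2))  # 2.78: letting speed build to
                                              # 350 kt nearly triples the pull
    ```

    So a crew that holds the recommended ~210 knots faces roughly a third of the yoke force of one that lets the airspeed run away to 350.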

    The Lion Air crew never called for the non-normal checklist. Given how intuitive it is, there is a real question whether the crew had any knowledge of what the Pitch Trim Cutout switches were for.

    The Ethiopian crew, for whom ignorance should have been no excuse, killed themselves because they failed to fly the airplane, thereby turning an easily managed problem into a crater.

  28. @Bardon: On re-reading, I see that I indeed misunderstood. You raise some interesting ideas for regulation, but the level of safety already demanded by the public is going to be very hard to achieve with any system other than the one we’ve already got (maybe in the long term it could be done, but there is no public appetite to get there – and I certainly don’t blame them). I would say that the answer is the same as it is for a lot of public policy discussions: an unsexy compromise. Keep the current system (which, when it is applied properly, is incredibly safe) but maybe open the door a little more to new ideas…

    @Jeff: Again, clearly you know your stuff, but I can’t help but dispute some points:

    – Recovery clearly was impossible with a failed MCAS. Now, I know you’re going to say that with the correct application of procedure it is recoverable, and I believe you. However, what we have here are two pilots deemed competent by a training system deemed competent by the FAA and EASA (and very specifically so, in the case of Lion Air, who had to earn this distinction directly with the FAA and EASA, independently of their own aviation regulator). There are many thousands of pilots like them out there. Maybe they fall short of the standards you hold to be correct and true, BUT the system deemed them competent. The airplane needs to be designed to accommodate that prevailing standard, particularly the ability of these pilots in a stressful, task-saturation environment.
    I can’t dispute your knowledge of the correct piloting procedures, but if it was that easy and that routine to counter this problem, it is hard to believe that a very experienced pilot at a well-regarded training airline like Ethiopian would fail. There HAS to be more to it than that…

    Also, the entire fleet has been grounded, so I think it’s fair to say that the entire world’s regulators feel similarly.

    Out of interest, on which aircraft is the engine-driven generator CAT? That seems very odd to me: the electrical system is one of the most heavily redundant systems on the aircraft…

    I don’t think you can lump the MCAS in with the pitch-trim system. The function of MCAS (i.e. a very basic kind of fly-by-wire system) is far more analogous to (one of the functions of) the Flight Management System on an Airbus, and while I don’t know for sure, I’d bet a sizeable sum that it’s categorised CAT.

    NB: Your font size comes out uniform on my side.

    I think you make a good case that pilots didn’t need to be informed about MCAS, so you have changed my mind… but on one condition: provided that it had been designed to a failure rate similar to the rest of the primary flight controls (and like I said, I think there’s a reasonable argument that this machine IS part of the primary flight controls).

    The function of all the systems (without exception) is described in the maintenance manual as a matter of course, so the absence of MCAS is an omission at the least.
    For the line maintenance engineer, it is very often not enough to simply respond to the fault message by replacing the box (in fact, I think you are doing your own engineers a disservice if you think their job is so easy!), so the systems are described in enough detail that the engineer can troubleshoot intelligently. Without any knowledge of the MCAS system, there’s no reason for the engineer to make a connection between an uncommanded nose-down and a failing AOA sensor (unless the 737 has some kind of automatic nose-down in the event of a stall?), and given the lack of mention of MCAS anywhere else, I assume that the troubleshooting manual didn’t give this as a possibility either, so I’m not surprised the fault wasn’t found. Some “difference training” could have really helped here.
    Also, since you can’t usually replicate an AOA fault on the ground, it’s normal to run a system test, and to send the plane back out if it checks out OK to see if the problem is cleared. The aircraft design has to allow for this.

    Something we do agree on (I think it was you anyway): I don’t think that Boeing should be demonised for improving the flying characteristics of their aircraft through software. Fly-by-wire does this all the time on other aircraft without problem…

  29. [HibernoFrog:] Recovery clearly was impossible with a failed MCAS. Now, I know you’re going to say that with the correct application of procedure it is recoverable, and I believe you. However, what we have here are two pilots deemed competent by a training system deemed competent by the FAA …

    Track down the final report for Air France 447, the one that went down in the middle of the Atlantic. Incompetent pilots crashed a perfectly flyable airplane. Towards the end of the report, the investigation noted that the pilot flying was an ab initio pilot (hired at Air France after getting a basic flying qualification), and that ab initio pilots are seriously overrepresented in mishaps.

    Pilots at a well regarded airline like Air France failed, and there really was nothing more to it than that.

    The other thing to note is that deeming pilots competent, and them actually being competent, are two different things. Certainly that was the case at Air France.

    After all, MCAS is really nothing more than a particular cause, among others, of uncommanded pitch trim. If pilots can’t be expected to handle MCAS, then they can’t be expected to handle any instance of uncommanded pitch trim, can they?

    And that really is the problem that needs addressing: the inability to diagnose an obvious situation and react appropriately.

    The airplane needs to be designed to accommodate that prevailing standard, particularly the ability of these pilots in a stressful, task-saturation environment.

    If that is the standard, then the only answer is to ground all airliners now, and forever.

    This puts the finger on what I find so mystifying.

    Also, the entire fleet has been grounded, so I think it’s fair to say that the entire world’s regulators feel similarly.

    Given the nature of public opinion, particularly when informed by lazy and incompetent journalists (yes, I know, I repeat myself), regulators had no alternative.

    But if I am right, then the inability of the pilots to correctly identify and appropriately respond will figure largely, as will the glaring inability of the Ethiopian crew to maintain aircraft control.

    I don’t think you can lump the MCAS in with the pitch-trim system.

    MCAS provides inputs to the primary pitch trim system, just as the autopilot, yoke mounted trim switch, and Mach compensation do.

    I think you make a good case that pilots didn’t need to be informed about MCAS, so you have changed my mind… but on one condition: provided that it had been designed to a failure rate similar to the rest of the primary flight controls

    The one thing that can be said about MCAS is that it adds another point of failure to the pitch trim system. So there will inevitably be an increase in uncommanded pitch trim events.

    But given that the mean time between failures of AOA vanes is very high (I’m asserting that based primarily on experience; I couldn’t find an actual value), MCAS would have a failure rate commensurate with the rest of the primary flight controls.

    It is worth keeping in mind that the Lion Air aircraft was not airworthy. There is no designing airplanes to deal with that.

    Not having any knowledge of the MCAS system, there’s no reason for the engineer to make a connection between an uncommanded nose-down and a failing AOA sensor (unless the 737 has some kind of automatic nose-down in the event of a stall?) and given the lack of mention of MCAS anywhere else, I assume that the troubleshooting manual didn’t give this as a possibility either, so I’m not surprised the fault wasn’t found.

    True, if the uncommanded pitch trim was the only symptom. However, the MCAS response was due to anomalous stall indications — stick shaker, stall warning horns, etc — that point straight at bad AOA. Assume an airplane without MCAS: what would the mechanics do? Now add MCAS: in what way should their response change?

    Also, since you can’t usually replicate an AOA fault on the ground …

    I’m not a mechanic, but given the hammer-like simplicity of an AOA vane, I don’t see why proper function can’t be shown on the ground. Check for play, rotate the vane through its range, look at output values.

    The next step would be to check connectors and continuity.

    As far as MCAS itself goes — if it is a separate LRU — line mechanics won’t troubleshoot it; their role is to R&R (remove and replace).

  30. Track down the final report for Air France 447, the one that went down in the middle of the Atlantic. Incompetent pilots crashed a perfectly flyable airplane.

    Well yes, they were French. Having worked with Frenchmen in senior positions, I’d sum up their prevailing attitude as follows:

    I know more than anyone, which is why I’m in this position and they’re not. I do not need manuals, procedures, or standards because I am smart enough to make decisions on the fly thus delivering the optimum outcome every time. I do not need to refer to the experience of others because 1) if they knew more than me I’d not be in charge and 2) this situation is unique, and therefore there is nothing to be learned from experience.

    So it’s hardly surprising that when the SOPs and autopilot were screaming at the pilots to do X, they decided they knew best and did Y.

  31. [Tim Newman:] So it’s hardly surprising that when the SOPs and autopilot were screaming at the pilots to do X, they decided they knew best and did Y

    It is far worse than that. Faced with the loss of airspeed data at 35,000 feet, the pilot flying’s response was to pull the airplane into a 10° climb.

    At 35,000 feet the limited excess energy available means any climb angle of more than a degree or so will result in airspeed loss.
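    A rough sketch of that energy argument, under the simplifying assumption that there is no excess thrust at altitude (thrust ≈ drag), so any climb is paid for entirely in airspeed:

    ```python
    import math

    G = 9.81            # gravitational acceleration, m/s^2
    MS_TO_KT = 1.94384  # metres per second to knots

    def airspeed_loss_kt_per_s(climb_angle_deg: float) -> float:
        """Approximate airspeed bleed rate for a given climb angle,
        assuming zero excess thrust: deceleration = g * sin(angle)."""
        return G * math.sin(math.radians(climb_angle_deg)) * MS_TO_KT

    print(round(airspeed_loss_kt_per_s(1.0), 2))   # ~0.33 kt/s: barely noticeable
    print(round(airspeed_loss_kt_per_s(10.0), 2))  # ~3.31 kt/s: roughly 100 kt
                                                   # gone in half a minute
    ```

    The numbers are first-order only, but they show why a 10º pull-up at cruise altitude bleeds energy so fast.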

    That is incompetence on stilts.

    The non-flying pilot completely failed to monitor what the flying pilot was doing.

    More incompetence.

    Result? Putting the airplane into an aft-stick stall.

    When the airspeed went away, had the pilots instead decided to leave the cockpit for a few minutes and chat up the flight attendants, everyone would have lived.

    The report highlighted poor training, and lack of flying skills among ab initio pilots.

  32. That is incompetence on stilts.

    It’s driven by an arrogance and utter lack of humility that’s driven into them as early as primary school: it is simply inconceivable that they do not know what’s best.

    I’ve sat in a meeting where the world’s foremost specialists in a certain construction technique arrived to give a presentation to a bunch of French management who were ostensibly interested in their services. The meeting descended into a session where the Frenchmen told the experts how they could probably do their job better, taking the approach of “You probably haven’t considered this trivially obvious alternative; I will proceed to tell you why I think you’d be better doing it this way.” The Frenchman leading the charge had zero experience in this technical area, absolutely none whatsoever.

  33. [Tim Newman:] It’s driven by an arrogance and utter lack of humility that’s driven into them as early as primary school: it is simply inconceivable that they do not know what’s best.

    I’ve spent a fair amount of time in France, so I sympathize.

    But if that is your conclusion, then you would never see any need to modify Air France’s training or operational philosophy, which puts too much emphasis on operating the airplane instead of being able to fly it.

    Which is why I keep banging on about this: IMHO, focusing on MCAS is getting the problem backside forward.

    MCAS added a new failure path, which increased the likelihood of uncommanded pitch trim in proportion to the failure rate of the AOA sensing system. That failure rate is extremely low: absent maintenance malpractice, or externally induced physical damage to the vane, it is near as dammit to zero. Even so, it could be 100 times that of the pre-existing system, yet still remain well within existing certification criteria.

    The question that should be focused on is this: how is it that two crews crashed completely flyable airplanes, particularly the Ethiopian crew?

    If everyone’s answer is “MCAS what’s done it”, then no one is going to be asking that very obvious question. Fail to answer it, and we are waiting for any number of other simple failures to kill people.

    Compare and contrast with the SWA crew that had an engine let go on them at altitude.

  34. @Jeff:

    “Track down the final report for Air France 447, the one that went down in the middle of the Atlantic. Incompetent pilots crashed a perfectly flyable airplane”
    Absolutely agreed, but in AF447 the crew gave the erroneous control inputs while the aircraft systems were shouting at them to stop (notably the stall warning). This is markedly different to the Ethiopian crash, where the aircraft was creating the control inputs, increasing pilot stress/workload.
    But your comments prodded me to question myself, and I found an article on flightglobal.com reviewing the flight recorder data: The 1st officer first suggested disengaging pitch-trim and the captain quickly agreed. So they both obviously knew what to do. The next MCAS “pitch down” command was ignored by the aircraft, but the one after was not, leading to the fatal result. The question is whether the crew re-engaged the pitch-trim and why, or if not, what caused the sudden re-establishment of MCAS’ control?
    It could be as I suspected earlier: the crew allowed airspeed to rise and were physically incapable of turning the trim wheel. I would then still stand by my position that this is a bad design: letting the airspeed get away while trying to deal with a stressful situation shouldn’t be an insta-kill… but I have no good counter-argument to your point that ANY nose-down pitch trim failure would have created this same situation. So pending the official reports, you have strongly convinced me to reevaluate my opinion that the MCAS design is fundamentally flawed, but it just transfers the problem: a trim failure is an insta-kill with ONE screw-up by the flight crew… This seems an incredibly unforgiving environment for an airliner… You will no doubt counter that the 737 has an excellent safety record up to now: I cheerfully concede that point 🙂 Let’s see what the reports recommend.

    LOL at “the hammer-like simplicity of an AOA vane” 🙂

    “I don’t see why proper function can’t be shown on the ground”
    Poor communication on my part: You can of course test the AOA vane on the ground, but my point was that you’d have no reason to do so in response to an uncommanded nose-down unless you were aware of the MCAS function (or you had a fault specifically pointing to the AOA vane, in which case I imagine the engineers would have replaced it). Now, if there were erroneous stall warnings in that event, then I take your point that this should absolutely point to an AOA vane.

    “MCAS provides inputs to the primary pitch trim system, just as the autopilot, yoke mounted trim switch, and Mach compensation do”
    It may be acting via the pitch trim, but I’m absolutely convinced that the MCAS should be considered as a form of fly-by-wire and should have been designed accordingly, and a single sensor input is absolutely not going to cut it: fly-by-wire requires triple redundancy, with the three systems installed in at least two different physical locations, combined with the redundant systems being designed by two entirely separate teams. I’m not saying ALL of that is necessary, since MCAS is only one part of a fly-by-wire system, but a single-sensor input is bullshit in this context. I wonder how many inputs the Mach compensation uses, and how it reacts to a disagreement… “better”, I suspect. But again, I accept your argument that the pilots are supposed to just turn it off and retain control of the aircraft, but again, that’s pending the final reports on why this was not the case and again with the proviso that this seems incredibly unforgiving…

    “the only answer is to ground all airliners now, and forever”
    Or the 737’s systems could be simply forced to respond to pilots’ commands, not their own internal desires and logics (This isn’t an Airbus, after all).

    “that failure rate is extremely low … it is also near as dammit to zero”
    This is very unlikely to be true unless you are only looking at an individual aircraft or the lifetime of an individual pilot. Physically moving parts on the aircraft have failure rates around 1 per 5,000-100,000 hours. If you have, say, 5000 737s in-service (probably a low guess) flying 10 hours per day (also probably low), that’s 50,000 flight hours per day. So if AOA vanes are somewhere in the middle of the range, then that’s one failure per day just on the 737. This is why I keep arguing that the design has to be more forgiving: When you KNOW you’re going to throw this failure at at least one random pilot on any given day, every day of the year, what’s the probability that one of them is going to f*ck it up?
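    The fleet-level arithmetic in the paragraph above can be made explicit; the fleet size, utilization, and MTBF range are the guesses stated there, not published data:

    ```python
    # Guessed inputs from the paragraph above: ~5000 737s flying ~10 h/day,
    # with a moving-part MTBF somewhere between 5,000 and 100,000 hours.
    def expected_failures_per_day(fleet_size: int, hours_per_day: float,
                                  mtbf_hours: float) -> float:
        """Expected number of failures per day across the whole fleet."""
        return fleet_size * hours_per_day / mtbf_hours

    for mtbf in (5_000, 50_000, 100_000):
        rate = expected_failures_per_day(5000, 10, mtbf)
        print(f"MTBF {mtbf:>7} h -> {rate:.1f} failures/day")
    # Even at the optimistic end, the fleet sees a failure of that part
    # every other day, though any single pilot may never see one.
    ```

    This is the key reconciliation: a failure rate that looks like “near as dammit to zero” over one pilot’s career can still mean a failure somewhere in the fleet almost daily.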

    @Tim: I’ve heard of this kind of behaviour from many other people in France, but it seems to be mercifully absent from my little corner of the aerospace world. Even the most self-confident French managers I encounter are willing to give the technical experts the chance to change the manager’s mind, and they will spin on a dime if the technical argument is convincing. I think it’s the pervasive emphasis on safety – our people have all seen that arrogance can kill, and nobody wants to be the one who does so. Strange thing is, you can properly kill people (and large chunks of environment) in the oil industry too, so I wonder why the difference.

    I think I’ll probably stop responding to this conversation now, due to the time it’s taking, but thanks for your time Jeff, it has been a very, very educational experience for this aero engineer.

  35. HibernoFrog, even though you are done, you do deserve a response.

    Absolutely agreed, but in AF447 the crew gave the erroneous control inputs while the aircraft systems were shouting at them to stop (notably the stall warning). This is markedly different to the Ethiopian crash, where the aircraft was creating the control inputs, increasing pilot stress/workload.

    Not quite right. For mysterious reasons, AOA (and therefore stall warning) is inhibited below 80 kts. If I was in charge, weight off wheels would enable AOA. (Airplanes can be airborne at zero knots.)

    The AF447 crew simply did not know how to fly the plane. The first step in any unusual situation is to maintain aircraft control. Pulling the stick back to the aft stop at 35,000 feet is as wrong as wrong can be. Had he maintained the existing pitch attitude, we’d have never heard of AF447.

    Yes, in the MAX crashes the aircraft was creating control inputs. That’s a bad thing. Why did it take so long to make the bad thing stop? I strongly suspect inadequate systems knowledge: the pilots simply had no idea how to take complete control of the plane.

    The 1st officer first suggested disengaging pitch-trim and the captain quickly agreed. So they both obviously knew what to do. The next MCAS “pitch down” command was ignored by the aircraft, but the one after was not, leading to the fatal result.

    It took a long time to reach that point, and by the time they did, the airspeed was truly excessive. That is what killed them.

    The question is whether the crew re-engaged the pitch-trim and why, or if not, what caused the sudden re-establishment of MCAS’ control?

    I have read that they did, which enabled MCAS commands to reach the pitch trim motor. Unfortunately, by that point, their failure to control airspeed had rendered the situation unrecoverable.

    The crew allowed airspeed to rise and were physically incapable of turning the trim wheel. I would then still stand by my position that this is a bad design: Letting the airspeed get away while trying to deal with a stressful situation shouldn’t be an insta-kill…

    Clearly that is desirable. However, it is impossible to build an airplane that doesn’t need to be flown properly, and no design can save crews who can’t exercise basic airmanship — just like AF447. I mentioned a Turkish Airlines 737 crash short of the runway at Amsterdam. It was another case of failure to fly the airplane, coupled with a single-point radar altimeter failure.

    Which can happen to every CAT III capable airplane out there.

    So pending the official reports, you have strongly convinced me to reevaluate my opinion that the MCAS design is fundamentally flawed …

    I happen to agree that the MCAS design was ill considered. If I was to guess the cause of the next airliner crash, it would be a botched missed approach. My airline has had four near misses in the last two years, and we aren’t alone. Remember the Dubai 777?

    When that goes south, one likely outcome is a very nose high low airspeed situation. MCAS was designed to get the nose down stat, and to do it in the most reliable way possible.

    If I was the designer, MCAS would get to ignore the first countering pitch trim input, but not the second.

    But in their defense, MCAS was designed consistent with existing risk criteria. Low probability of a false stall warning, and an easy solution with a proven backup.

    You can of course test the AOA vane on the ground, but my point was that you’d have no reason to do so in response to an uncommanded nose-down unless you were aware of the MCAS function

    I think there is a misunderstanding here. If MCAS commanded nose down pitch trim in the absence of stall AOA, that would be an MCAS malfunction. But in both crashes, MCAS was working perfectly, it was AOA that was gooned up.

    The Lion Air crash happened after a crew successfully handled bad AOA induced MCAS inputs. Their maintenance completely bollocksed the repair, putting an un-airworthy aircraft back in service.

    MCAS should be considered as a form of fly by wire and should have been designed accordingly, and a single sensor input is absolutely not going to cut it:

    There is a tradeoff here between complexity and reliability. Also, there are plenty of flight control related systems that are single sensor dependent. If I’m the pilot flying, I select the left autopilot, which takes all its data from the left air data systems, #1 inertial reference unit, and the left flight management computer.

    [Jeff:] the only answer is to ground all airliners now, and forever

    [HF:] Or the 737’s systems could be simply forced to respond to pilots’ commands, not their own internal desires and logics

    My point was that if the requirement is that a single error can’t doom a plane, then we must ground all planes.

    [Jeff:] that failure rate is extremely low … it is also near as dammit to zero

    [HF:] This is very unlikely to be true unless you are only looking at an individual aircraft or the lifetime of an individual pilot.

    I have tried to find MTBF values for vane-type AOA devices and failed. However, in forty years I’ve never had one. Moreover, I’ve never reviewed the aircraft maintenance log from any plane before flight and seen an AOA writeup. Nor have any of my pilot friends ever mentioned such a thing in bar talk.

    That is why my guess is that AOA device failures are vanishingly rare.

    However, if AOA failures across the 737 fleet happen at one a day, then there must have been many 737 MAX aircraft that successfully dealt with AOA failures.

    Thanks for your time.
