Hacknot - Skepticism
Part of Hacknot: Essays on Software Development
The Skeptical Software Development Manifesto1
“Argumentation cannot suffice for the discovery of new work, since the subtlety of Nature is greater many times than the subtlety of argument.” – Francis Bacon
The over-enthusiastic and often uncritical adoption of XP and Agile tenets by many in the software development community is worrying.
It is worrying because it attests to the willingness of many developers to accept claims made on the basis of argument and rhetoric alone. It is worrying because an over-eagerness to accept technical and methodological claims opens the door to hype, advertising and wishful thinking becoming the guiding forces in our occupation. It is worrying because it highlights the professional gulf existing between software engineering and other branches of engineering and science, where claims to discovery or invention must be accompanied by empirical and independently verifiable experiment in order to gain acceptance.
Without skepticism and genuine challenge, we may forfeit the ability to increase our domain’s body of knowledge in a rational and verifiable way; instead becoming a group of fashion followers, darting from one popular trend to another.
What is needed is a renewed sense of skepticism towards the claims our colleagues make to improved practice or technology. To that end, and to lend a little balance to the war of assertion initiated by the Agile Manifesto2, I would like to posit the following alternative.
The Skeptical Software Development Manifesto
We are always interested in claims to the invention of better ways of developing software. However we consider that claimants carry the burden of proving the validity of their claims. We value:
- Predictability over novelty
- Empirical evidence over anecdotal evidence
- Facts and data over rhetoric and philosophy
That is, while there is value in the items on the right, we value the items on the left more.
Our skepticism is piqued by claims and rhetoric exhibiting any of the following characteristics:
- An imprecision that does not permit further scrutiny or enquiry
- The mischaracterization of doubt as fear or cynicism
- Logical and rhetorical fallacies such as those listed below:3
Argumentum Ad Hominem
Reference to the parties to an argument rather than the arguments themselves.
Appeal To Ignorance
The claim that whatever has not been proved false must be true, and vice versa.
Special Pleading
A claim to privileged knowledge such as “you don’t understand”, “I just know it to be true” and “if you tried it, you’d know it was true.”4
Observational Selection
Drawing attention to those observations which support an argument and ignoring those that counter it.
Begging The Question
Supporting an argument with reasons whose validity requires the argument to be true.
Doubtful Evidence
The use of false, unreasonable or unverifiable evidence.
False Generalization
The unwarranted generalization from an individual case to a general case; often resulting from there being no attempt to isolate causative factors in the individual case.
Straw-Man Argument
The deliberate distortion of an argument to facilitate its rebuttal.
Argument From Popularity
Reasoning that the popularity of a view is indicative of its truth. e.g. “everybody’s doing it, so there must be something to it.”
Post Hoc Argument
Reasoning of the form “B happened after A, so A caused B”. i.e. confusing correlation and causation.
False Dilemma
Imposing an unnecessary restriction on the number of choices available. e.g. “either you’re with us or you’re against us.”
Arguments From Authority
Arguments of the form “Socrates said it is true, and Socrates is a great man, therefore it must be true”.
We are especially cautious when evaluating claims made by parties who sell goods or services associated with the technology or method that is the subject of the claim.
Principles Behind The Skeptical Software Development Manifesto
We follow these principles:
- Propositions that are not testable and not falsifiable are not worth much.
- Our highest priority is to satisfy the customer by adopting those working practices which give us the highest chance of successful software delivery.
- We recognize that changing requirements incur a cost in their accommodation, and that claims to the contrary are unproven. We are obliged to apprise both ourselves and the customer of the realistic size of that cost.
- It is our responsibility to identify the degree/frequency of customer involvement required to achieve success, and to inform our customer of this. Our customer has things to do other than help us write their software, so we will make as efficient use of their time as we are able.
- We recognize that controlled experimentation in the software development domain is difficult, as is achieving isolation of variables, but that is no excuse for not pursuing the most rigorous examination of claims that we can, or for excusing claimants from the burden of supporting their claims.
- Quantification is good. What is vague and qualitative is open to many interpretations.
Basic Critical Thinking for Software Developers5
Vague Propositions
A term is called “vague” if it has a clear meaning but not a clearly demarcated scope. Many arguments on Usenet groups and forums stem from the combatants having different interpretations of a vaguely stated proposition. To avoid this sort of misunderstanding, before exploring the truth of a given proposition either rhetorically or empirically, you should first state that proposition as precisely as possible.
Consider this proposition: P(1): Pair Programming works.
If I were to voice that proposition on the Yahoo XP group6, I would expect it to receive enthusiastic endorsement. I would also expect no one to point out that this proposition is non-falsifiable.
It is non-falsifiable because the terms “pair programming” and “works” are so vague. There are an infinite number of scenarios that I could legitimately label “pair programming”, and an infinite number of definitions of what it means for that practice to “work.” Any specific argument or evidence you might advance to disprove P(1) will imply a particular set of definitions for these terms, which I can counter by referencing a different set of definitions – thereby preserving P(1).
A vast number of arguments about software development techniques are no more than heated and pointless exchanges fueled by imprecisely stated propositions. There is little to be gained by discussing or investigating a non-falsifiable proposition such as P(1). We need to formulate the proposition more precisely before it becomes worthy of serious consideration.
Let’s begin by rewording P(1) to clarify what we mean by “works”: P(2): Pair Programming results in better code.
Now at least we know we’re talking about code as being the primary determinant of whether pair programming works. However P(2) is now implicitly relative, which is another common source of vagueness. An implicitly relative statement makes a comparison with something without specifying what that something is. Specifically, it proposes that pair programming produces better code, but better code than what?
Let’s try again: P(3): Pair Programming produces better code than that produced by individuals programming alone.
P(3) is now explicitly relative, but still so vague as to be non-falsifiable. We have not specified what attribute(s) distinguish one piece of code as being “better” than another.
Suppose we think of defect density as being the measure of programmatic worth: P(4): Pair programming produces code with a lower defect density than that produced by individuals programming alone.
Now that we’ve cleared up what we mean by the word “works” in P(1), let’s address another common source of vagueness – quantifiers. A quantifier is a term like “all”, “some”, “most” or “always”. We tend to use quantifiers very casually in conversation and frequently omit them altogether. There is no explicit quantifier in P(4), so we do not know whether the claimant is proposing that the benefits of pair programming are always manifest, occasionally manifest, or just more often than not.
The quantifier chosen governs the strength of the resulting proposition. If the proposition is intended as a hard generalization (one that applies without exceptions), then a quantifier like “always” or “never” is applicable. If the proposition is intended as a soft generalization, then a quantifier like “usually” or “mostly” may be appropriate.
Suppose P(4) was actually intended as a soft generalization: P(5): Pair programming usually produces code with a lower defect density than that produced by individuals programming alone.
P(5) nearly sounds like it could be used as a hypothesis in an empirical investigation. However the term “pair programming” is still rather vague. If we don’t clarify it, we might conduct an experiment that finds the defect density of pair programmed code to be higher than that produced by individuals programming alone, only to find that advocates of pair programming dismiss our experimental method as not being real pair programming. In other words, the definition of the term “pair programming” can be changed on an ad hoc basis to effectively render P(5) non-falsifiable.
“Pair programming” is a vague term because it carries so many secondary connotations. The primary connotations of the term are clear enough: two programmers, a shared computer, one typing while the other advises. But when we talk of pair programming we tend to assume other things that are not amongst the primary connotations. These secondary connotations need to be made explicit for the proposition to become falsifiable. To the claimant, the term “pair programming” may have the following secondary connotations:
- The pair partners contribute more or less equally, with neither one dominating the activity
- The pair partners get along with each other i.e. there is a minimum of unproductive conflict.
- The benefits of pair programming are always manifest, but to a degree that may vary with the experience and ability of the particular individuals.
To augment P(5) with all of these secondary connotations will make for a very wordy statement. At some point we have to consider what level of detail is appropriate for the context in which we are voicing the proposition.
Non-Falsifiable Propositions
Why should we seek to refine a proposition to the point that it becomes falsifiable? Because a proposition that cannot be tested empirically, and thereby determined true or false, is beyond the scrutiny of rational thought and examination. This is precisely why such propositions are often at the heart of irrational, pseudo-scientific and metaphysical beliefs.
I contend that such beliefs have no place in the software engineering domain because they inhibit the establishment of a shared body of knowledge – one of the core features of a true profession. Instead, they promote a miscellany of personal beliefs and superstitions. In such circumstances, we cannot reliably interpret the experiences of other practitioners because their belief systems color their perception of their own experiences to an unknown extent. Our body of knowledge degrades into a collective cry of “says who?”.
Here are a few examples of non-falsifiable propositions that many would consider incredible:
- There is a long-necked marine animal living in Loch Ness.
- The aliens have landed and walk amongst us perfectly disguised as humans.
- Some people can detect the presence of water under the ground through use of a forked stick.
Try as you might, you will never prove any of these propositions false. No matter how many times you fail to find any evidence in support of these propositions, it remains true that “absence of evidence is not evidence of absence.” If we are willing to entertain non-falsifiable propositions such as these, then we admit the possibility of some very fanciful notions indeed.
Here are a few examples of non-falsifiable propositions that many would consider credible:
- Open source software is more reliable than commercial software
- Agile techniques are the future of software development
- OO programming is better than structured programming.
These three propositions are, as they stand, just as worthless as the three propositions preceding them. The subject areas they deal with may well be fruitful areas of investigation, but you will only be able to make progress in your investigations if you refine these propositions into more specific and thereby falsifiable statements.
Engage Brain Before Engaging Flame Thrower
Vagueness and non-falsifiable propositions are the call to arms of technical holy wars. When faced with a proposition that seems set to ignite the passions of the zealots, a useful defusing technique is to identify the non-falsifiable proposition and then seek to refine it to the point of being falsifiable. Often the resulting falsifiable proposition is not nearly as exciting or controversial as the original one, and zealots will call off the war due to lack of interest. Also, the very act of argument reconstruction can be informative for all parties to the dispute. For example:
- Zealot
- Real programmers use Emacs
- Skeptic
- How do you define a “real programmer?”
- Zealot
- A real programmer is someone who is highly skilled in writing code.
- Skeptic
- So what you’re claiming is “people who are highly skilled in writing code use Emacs”?
- Zealot
- Correct.
- Skeptic
- Are you claiming that such people always use Emacs?
- Zealot
- Well, maybe not all the time, but if they have the choice they’ll use Emacs.
- Skeptic
- In other words, they prefer to use Emacs over other text editors?
- Zealot
- Yep.
- Skeptic
- So your claim is really “people who are highly skilled in writing code prefer Emacs over other text editors?”
- Zealot
- Fair enough.
- Skeptic
- Are you claiming that all highly skilled coders prefer Emacs, or could there be some highly skilled coders that prefer other text editors?
- Zealot
- I guess there might be a few weird ones who use something else, but they’d be a minority.
- Skeptic
- So your claim is really “Most people who are highly skilled in writing code prefer Emacs over other text editors?”
- Zealot
- Yep.
- Skeptic
- Leaving aside the issue of how you define “highly skilled”, what evidence do you have to support your proposition?
- Zealot
- Oh come on – everyone knows it’s true.
- Skeptic
- I don’t know it’s true, so clearly not everyone knows it’s true.
- Zealot
- Alright – I’m talking here about the programmers that I’ve worked with.
- Skeptic
- So are you saying that most of the highly-skilled programmers you’ve worked with preferred Emacs, or that they shared your belief that most highly-skilled programmers prefer Emacs?
- Zealot
- I’m talking about the editor they used, not their beliefs.
- Skeptic
- So your claim is really “Of the people I’ve worked with, those who were highly skilled in writing code preferred to use Emacs over other text editors”.
- Zealot
- Yes! That’s what I’m saying, for goodness sake!
- Skeptic
- Not quite as dramatic as “real programmers use Emacs”, is it?
You may find that it is not possible to get your opponent to formulate a specific proposition. They may simply refuse to commit to any specific claim at all. This reaction is common amongst charlatans and con men. They only speak in abstract and inscrutable terms (sometimes of their own invention), always keeping their claims vague enough to deny disproof. They discourage scrutiny of their claims, preferring to cast their vagueness as being mysterious and evidence of some deep, unspoken wisdom. If they cannot provide you with a direct answer to the question “What would it take to prove you wrong?” then you know you are dealing with a non-falsifiable proposition, and your best option may simply be to walk away.
Summary
Before engaging in any debate or investigation, ensure that the proposition being considered is at least conceivably falsifiable. A common feature of non-falsifiable propositions is vagueness.
Such propositions can be refined by:
- Defining any broad or novel terminology in the proposition
- Making implicit quantifiers explicit
- Making implicitly relative statements explicitly relative
- Making both primary and secondary connotations of the terminology explicit
Anecdotal Evidence and Other Fairy Tales7
As software developers we place a lot of emphasis upon our own experiences. This is natural enough, given that we have no agreed upon body of knowledge to which we might turn to resolve disputes or inform our opinions. Nor do we have the benefit of empirical investigation and experiment to serve as the ultimate arbiter of truth, as is the case for the sciences and other branches of engineering - in part because of the infancy of Empirical Software Engineering as a field of study; in part because of the difficulty of conducting controlled experiments in our domain.
Therefore much of the time we are forced to base our conclusions about the competing technologies and practices of software development upon our own (often limited) experiences and whatever extrapolations from those experiences we feel are justified. An unfortunate consequence is that personal opinion and ill-founded conjecture are allowed to masquerade as unbiased observation and reasoned inference.
So absolute is our belief in our ability to infer the truth from experience that we are frequently told that personal experience is the primary type of evidence that we should be seeking. For example, it is a frequent retort of the XP/AM crowd that one is not entitled to comment on the utility of XP/AM practices unless one has had first hand experience of them. Only then are you considered suitably qualified to make comment on the costs and benefits of the practice - otherwise “you haven’t even tried it.”
Such reasoning always makes me smile, for two reasons:
- It contains the logical fallacy called an “appeal to privileged knowledge”. This is the claim that through experience one will realize some truth that forbids a priori description.
- If a trial is not conducted under carefully controlled conditions, it is very likely you will achieve nothing more than a confirmation of your own preconceptions and biases.
This post is concerned with the second point. It goes to the capacity humans have to let their personal needs, prior expectations, attitudes, prejudices and biases unwittingly influence the outcomes of technology and methodology evaluations – both researchers and subjects. There are a number of statistical and psychological effects whose influence must be eliminated, or at least ameliorated, before one can draw valid deductions from human experiences. Some of these effects are briefly described in the table below. Conclusions drawn from anecdotal evidence are frequently invalid precisely because the evidence has been gathered under circumstances in which no such efforts have been made.
Observational Bias
When a researcher allows their own biases to color their interpretation of experimental results. Selective observation is a common type of observational bias in which the researcher only acknowledges those results which are consistent with their pre-formulated hypothesis.
Population Bias
When experimental subjects are chosen nonrandomly and the resulting population exhibits some unanticipated characteristic that is an artifact of the selection process, which influences the outcome of an experiment in which they participate.
The Hawthorne Effect
Describes the tendency for subjects to behave uncharacteristically under experimental conditions where they know they are being watched. Typically this means the subjects improve their performance in some task, in an attempt (deliberate or otherwise) to favorably influence the outcome of the experiment.
The Placebo Effect
Describes the tendency of strong expectations, particularly among highly suggestible subjects, to bring about the outcome expected through purely psychological means.
Logical Fallacies
Conclusions drawn from anecdotal evidence often exhibit one or more of the following deductive errors:
Post Hoc Ergo Propter Hoc
Meaning “after this, therefore because of this”. When events A and B are observed in rapid succession, the post hoc fallacy is the incorrect conclusion that A has caused B. It may be that A and B are correlated, but not necessarily in a causal manner.
Ignoring Rival Causes
To disregard alternative explanations for a particular effect, instead focusing only upon a favorite hypothesis of the researcher. It is common to look for a simple cause of an event when it is really the result of a combination of many contributory causes.
Hasty Generalization
The unwarranted extrapolation from limited experimentation into a broader context.
Examples
The following scenarios demonstrate how easily one or more of the above factors can invalidate the conclusions that we reach based on our own experience - thereby reducing the credibility of those experiences when later offered as anecdotal evidence in support of a conclusion.
The Linux Enthusiast
Chris is a Linux enthusiast. On his home PC he uses Linux exclusively, and often spends hours happily toying with device drivers and kernel patches in an effort to get new pieces of hardware working with his machine. In his work as a software developer he is frequently forced to use Microsoft Windows, which he has a very low opinion of. He is prone to waxing lyrical on the unreliability and insecurity of Windows, and the evil corporate tactics of Microsoft. Whenever he experiences a Blue Screen of Death on his work machine, his cubicle neighbors know that once the cursing subsides they are in for another of his speeches about the massive productivity hit that Windows imposes on the corporate developer. When surfing the web during his lunch hours, if he should come across a reference to Linux being used successfully as an alternative to Windows, then he will print out the article and file it away for future reference. He is confident that it is only a matter of time before Linux replaces Windows on the desktop, both in business and at home.
Analysis: Chris exhibits observational bias in a few ways. The hours he spends getting his Linux machine to recognize a new piece of hardware are enjoyable to him, and so he chooses not to observe that the same outcome might be achieved on a Windows system in a minute, thanks to plug-and-play. When he gets a BSOD, he chooses to observe its negative effect on his productivity while he waits for a reboot, but chooses to disregard the productivity cost of his subsequent anti-Microsoft pontifications. When surfing the web, he selectively observes those stories which are pro-Linux and/or anti-Microsoft in nature. Indeed, the media is complicit in this practice, because such stories make good press. There may be many more occasions in which Linux was unsuccessful in usurping Windows, but they are unremarkable and unlikely to attract media coverage. His confidence in Linux’s ultimate victory based upon his selective observations is a very hasty generalization.
The XP Proponent
Ryan and his team have been reading a lot about XP recently and are keen to try it out on one of their own projects. They have had difficulty getting permission to do so from their management, who are troubled by some aspects of XP such as pair programming and the informal approach to documentation. Through constant badgering, Ryan finally gets permission to use XP on a new project. But he is warned by his management that they will be watching the project’s progress very carefully and reserve the right to switch the project over to the company’s standard methodology if they think XP is not working out. Overjoyed, Ryan’s team begins the new project under XP. They work like demons for the next six months, doing everything in their power to make the project a success. At the end of that time, the project delivers a high quality first release into the hands of a few carefully chosen customers. Feedback from these customers is unanimously positive. Management is suitably impressed. Ryan and his team breathe a sigh of relief.
Analysis: The participants are a self-selected group of enthusiasts, which is an obvious source of population bias. It could be that they have an above-average level of ability in their work, and a commensurately higher level of enthusiasm and dedication - which drives them to try new approaches like XP. Their project’s success may be partly or entirely attributable to these greater capabilities they already had. Knowing they are being closely evaluated by management and have put their necks on the line by trying XP despite management’s concerns, they are also victims of the Hawthorne Effect. They are very motivated to succeed, because they perceive potential adverse consequences for themselves individually if they should fail. If Ryan’s team or their management attributes the project’s success to XP itself, then they are guilty of ignoring the rival causes just described. It may be that they succeeded despite XP, rather than because of it.
The Revolutionary
Seymour thinks there is something wrong with the way university computing students are taught to program. He feels there is insufficient exposure to the sorts of problems and working conditions they will encounter when they finish their degrees. He strongly believes that students would become better programmers and better employees if there were a greater emphasis upon group programming assignments in the academic environment. This would enable them to develop the skills necessary to function effectively in a team, which is the context in which they will spend most of their working lives. To demonstrate the effectiveness of the group approach, he asks for some volunteers from his third year software engineering class to participate in an experiment. Rather than do the normal lab work for their course, which is focused on assignments to be completed by the individual, they will do different labs designed to be undertaken in groups of four or five. These labs will be conducted by Seymour himself. About 30 students volunteer to take part. At the end of the semester, these students sit the same exams as the other students. Their average mark is 82% while the average mark of the other students is 71%. Seymour feels vindicated and the volunteer students are pleased to have taken part in a landmark experiment in the history of computing education.
Analysis: Here is a case of population bias that any competent researcher would be ashamed of. The volunteer group is self-selected, and so may be biased toward those students that are both more interested and more capable. Poorly performing, uninterested students would be unlikely to volunteer. The Hawthorne Effect comes into play due to the extra focus that Seymour places upon his volunteer group. They may receive extra attention and instruction as part of their labs, which may be enough in itself to improve their final grades. Additionally, knowing they are part of a select group, at some level they will be motivated to please the researcher and demonstrate that they have performed well in their role as “lab rats.” Their superior performance in the final exam may be a result of these confounding factors, and have nothing to do with the difference between individual and group instruction. It would certainly be a hasty generalization to conclude that their better exam results will translate into better performance in the workforce.
Conclusion
I hope this post will give you pause for thought when you next conduct a technology trial, and when you are next evaluating anecdotal evidence supplied to you by friends and colleagues. Because personal experiences are particularly vivid, we often tend to over-value them. From there, we can easily make unwarranted generalizations and overlook the confounding effect of our own preconceptions and biases.
In particular, next time one of the XP/AM crowd voices the familiar retort of “How could you know? You haven’t even tried it” - bear in mind that in the absence of quantification and controlled experimental technique, they don’t know either.
Function Points: Numerology for Software Developers8
“Where else can one get such a marvelous return in conjecture from such a modest investment of fact?” – Mark Twain
Numerology is the study of the occult meanings of numbers and their influence on human life1. Numerologists specialize in finding numeric relationships between otherwise disparate figures, and attributing to them some greater significance.
For instance, some claim that by adding up the component numbers in your birth date, together with the numeric equivalent of your name (where A=1, B=2 etc) then a figure is derived that, if properly interpreted, can yield insight into your personality.9
Others consider that the reoccurrence of the number 19 in Islamic texts is evidence of their authorship by a higher being10. The Koran has 114 (6 x 19) chapters and 6346 verses (19 x 334) and 329,156 (19 x 17,324) letters. The word “Allah” appears 2,698 (19 x 142) times. The sum of the verse numbers that mention Allah is 118,123 (19 x 6,217).
Pyramids are a favorite topic for numerologists, and there are dozens of “meaningful” numeric relationships to be found in their dimensions. For instance, the base perimeter of the Great Pyramid of Cheops is 36,515 inches – 100 times the number of days in the solar year. And so on.
We can laugh at such desperate searches for meaning, but before we laugh too hard we should consider that software development has its own brand of numerology, which we have given the grand name of Function Point Analysis (FPA).
Overview Of Function Points
FPs were proposed in 1979 as a way of finding the size of a piece of software given only its functional specification. It was intended that the FP count of an application would be independent of the technology, people and methods eventually used to implement the application, focusing as it did upon the functionality the application provided to the user. Broadly speaking, basic FPs are calculated by following these steps:
- Divide a functional view of the system into components.
- Classify each component as being one of five types – external input, external output, external inquiry, internal logical file or external interface file.
- Classify the complexity of each component as low, average or high. The rules for performing this classification vary by component type.
- For each type of component, multiply the number of components of that type by a numeric equivalent of the complexity e.g. low = 3, average = 4, high = 6. The numeric equivalents that apply vary by component type.
- Sum the results of step 4 across all five component types. The total is a figure called the Unadjusted Function Point count (UFP). You can then multiply the UFP by a Value Adjustment Factor (VAF), which is based on consideration of 14 general system characteristics, to yield the final Function Point count.
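To make the arithmetic concrete, here is a minimal sketch of such a calculation in Python. It is an illustration only, not the full IFPUG procedure: the single weight table stands in for the five per-type tables, the component list is invented, and the VAF formula used is the standard published one (0.65 plus 0.01 times the sum of the 14 characteristic scores, each rated 0 to 5).

```python
# Simplified sketch of a Function Point calculation. Real FP counting uses a
# separate low/average/high weight table for each of the five component
# types; here one illustrative table (low=3, average=4, high=6) stands in
# for all of them.

WEIGHTS = {"low": 3, "average": 4, "high": 6}

def unadjusted_fp(components):
    """components: iterable of (component_type, complexity) pairs."""
    return sum(WEIGHTS[complexity] for _, complexity in components)

def value_adjustment_factor(gsc_scores):
    """gsc_scores: the 14 general system characteristics, each rated 0-5.
    VAF = 0.65 + 0.01 * (their sum), giving a range of 0.65 to 1.35."""
    assert len(gsc_scores) == 14
    return 0.65 + 0.01 * sum(gsc_scores)

# Every input below - the component boundaries, the complexity classes and
# the characteristic scores - is a subjective judgment.
components = [
    ("external input", "low"),         # contributes 3
    ("external output", "average"),    # contributes 4
    ("internal logical file", "high"), # contributes 6
]
ufp = unadjusted_fp(components)               # 13
fp = ufp * value_adjustment_factor([3] * 14)  # 13 * 1.07 = 13.91
print(fp)
```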
I won’t bore you with the excruciating specifics of the component calculations. The above gives you some idea of the nature of FP counting and its reliance upon subjective judgments. Specifically, the placement of component boundaries and the values chosen for the many weighting factors and characteristics are all determined on a subjective basis. Some of that subjectivity has been embodied in the standardized FP counting rules that are issued by the International Function Point Users Group (IFPUG).11
So lacking have FPs been found that there has been a steady stream of proposed improvements and alternatives to them since 1979. But none of these have challenged the basic FP ethos of modeling functional size as a weighted sum of arbitrarily selected attributes. They simply change the number and definition of those attributes, and the means by which they are mangled together into a final figure. The basic chronology of the FP family tree has been:
- 1979: Function Points (Albrecht)
- 1986: Feature Points (Jones)
- 1988: Mark II Function Points (Symons)
- 1989: Data Points (Sneed)
- 1991: 3-D Function Points (Boeing)
- 1994: Object Points (Sneed)
- 1997: Full Function Points (St. Pierre et al.)
- 1999: COSMIC Full Function Points (IFPUG)
To understand why the FP and its many variants are fundamentally flawed, it is first necessary to understand the difference between measuring and rating.
Measurement Vs. Rating
To measure an attribute of something is to assign numbers to it on an objective and empirical basis, so that the relationships between the numbers preserve any intuitive notions and empirical observations about that attribute.12
For example, the metric meter is a measure, which implies:
- 4 meters is twice as long as 2 meters, because 4 is twice 2
- The difference between 9 and 10 meters is the same as the difference between 1 and 2 meters, because 10-9 = 2-1
- If you moved 4 meters in 2 seconds (at constant velocity) then you moved 2 meters in the first second and 2 meters in the last second.
- If two different people measure the same length to the nearest meter, they will get the same number.
To rate an attribute of something is to assign numbers to it on a subjective and intuitive basis. The relationships between the numbers do not preserve the intuitive and empirical observations about the attribute. In contrast to the above example, consider the rating out of 10 that a reviewer gives a movie:
- A movie that gets a 4 is not twice as good as a movie that gets a 2.
- The difference between movies that get 9 and 10 is not the same as the difference between movies that get 1 and 2.
- A 2 hour movie that gets a 6 did not rate 3 for the first hour and 3 for the second hour.
- Two different people rating the same movie may award different ratings.
To clarify, suppose a reviewer expresses their assessment of a movie in words rather than numbers. Instead of rating a movie from 1 to 10, they rate it from “abysmal” to “magnificent”. We might be tempted to think a movie that gets an 8 is twice as good as a movie that gets a 4, but we would surely not conclude that “very good” is twice as good as “disappointing”. We can express a rating using any symbols we want, but just because we choose numbers for our symbols does not mean that we confer the properties of those numbers upon the attribute we are rating.
In summary:
- A measurement is objective and can be manipulated mathematically.
- A rating is subjective and cannot be manipulated mathematically.
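A small worked example makes the point (the numbers are invented). A rating conveys only an ordering, so any order-preserving relabelling of the scale is an equally valid rating; yet arithmetic performed on the labels gives different answers under different relabellings:

```python
# Two rating scales for the same ordered verdicts. Both preserve the order
# abysmal < disappointing < very good < magnificent, so as ratings they
# carry exactly the same information.
scale_1 = {"abysmal": 1, "disappointing": 4, "very good": 8, "magnificent": 10}
scale_2 = {"abysmal": 1, "disappointing": 2, "very good": 3, "magnificent": 4}

reviews = ["disappointing", "very good"]

print(sum(scale_1[r] for r in reviews) / len(reviews))  # 6.0
print(sum(scale_2[r] for r in reviews) / len(reviews))  # 2.5
# The "average rating" depends on an arbitrary choice of labels, and so
# tells us nothing. Meters admit no such relabelling: a length of 4 m is
# fixed by empirical comparison against a standard, not by convention.
```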
Function Points Are A Rating, Not A Measurement
From the above, it is clear that FPs are a rating and not a measurement, due to the subjective manner in which they are derived. Hence, they cannot be manipulated mathematically. And yet the software literature is rife with examples of researchers attempting to do just that. Many researchers and reviewers continue to ignore the fundamental implications of the non-mathematical nature of the FP13, such as:
- You cannot measure productivity using FPs – If a team completes an application of 250 FP in 10 weeks, their productivity is not 25 FP/week. The figure “25” has no meaning. Similarly, a given team need not take 50% longer to write an 1800 FP application than they will a 1200 FP application.
- You cannot compare FP counts numerically – An application of 1000 FP is not twice as big, complex or functional as an application of 500 FP. The first application is not “twice” the second in any meaningful sense.
- You cannot compare FPs from disparate sources – The subjectivity of FP analysis makes it sensitive to contextual variations in application domain, technology, organization and counting method.
Given such limitations, there are very few valid uses of an application’s FP count. If the FP counts of two applications differ markedly, and their contexts are sufficiently similar, then you may be justified in saying that one is functionally bigger than the other, but not by how much. The notion that FPs can participate in mathematical calculations, and thereby be used for scheduling, effort and productivity measures, is without theoretical or empirical basis.
Why Are Function Points So Popular?
Although their use may have declined in recent years, Function Points are still quite popular. There are several factors which might account for their continued usage, despite their essential invalidity:
- The fact that other organizations use FPs is enough to encourage some to follow suit. However, we should be aware that an argument from popularity has no logical basis. There are many beliefs that are both widely held and false. The popularity of FPs may only be indicative of how desperately the industry would like there to be a single measure of functional size that can be calculated at the specification stage. It certainly would be desirable for such a measure to exist, but we cannot wish such a metric into existence, no matter how many others have the same wish.
- Some researchers claim to have validated function points (in their original form, or some later variant thereof). However, if you examine the details of these experiments, what you will find is pseudo-science, ignorance of basic measurement theory and statistics, and much evidence of “fishing for results.” There is a lot of fitting of models to historical data, but not a lot of using those models to predict future data. This is not so surprising, for the general standard of experimentation in software is very poor, as Fenton observes. Altman makes an observation14 about the legion of errors that occur in medical experimentation that could apply equally well to software development:
- “The main reason for the plethora of statistical errors is that the majority of statistical analyses are performed by people with an inadequate understanding of statistical methods. They are then peer reviewed by people who are generally no more knowledgeable.”
- Hope springs eternal. Rather than concede that efforts to embody functional size in a single number are misguided, it is consoling to think that FPs are “nearly there”, just a few more tweaks away from being useful. Hence the many FP variants that have sprung up.
- FP enthusiasts selectively quote the “research” that is in their favor, and ignore the rest. For example, the variance between FP counts determined by different analysts is often quoted as “plus or minus 11 percent.”15 However other sources16 have reported worse figures, such as a 30% variation within an organization, rising to more than 30% across organizations.
- Some choose to dismiss the theoretical invalidities of FPs as irrelevant to their practical worth. Their excuses may have some appeal to the average developer, but don’t withstand scrutiny. Examples of such excuses are:
- As long as FPs work, who cares what basis they have or don’t have? - The problem is that in general, FPs don’t work. Even FP adherents will admit to the numerous shortcomings of FPs, and the need to constrain large numbers of contextual factors when applying them. Witness the various mutations of FP that have arisen, each attempting to address some subset of the numerous failings of FPs.
- It doesn’t matter if you’re wrong, as long as you’re wrong consistently17 – Unfortunately, unless you know why you’re wrong, you have no way of knowing if you are indeed being consistently wrong. FPs are sensitive to a great many contextual factors. Unless you know what they are and the precise way they affect the resulting FP count, you have no way of knowing the extent to which your results have been influenced by those factors, let alone whether that influence has been consistent.
Function Point’s True Believers
FPs have attracted their own league of True Believers – like many technical schools whose tenets, lacking an empirical basis, can only be defended by the emotional invective of their adherents. I encountered one such adherent recently in David Anderson, author of “Agile Project Management.” Anderson made some rather pompous observations18 on his blog as to how surprising it was that people should express disbelief regarding his claims to 5 and 10-fold increases in productivity using TDD, AM, FDD and (insert favorite acronym here). I replied that their incredulity might stem from the boldness of his claims or the means by which he collected his data, rather than an inherently obstreperous attitude. He indicated his productivity data was expressed in FPs per unit time! I tried explaining to him that FPs cannot be used to measure productivity, because not all FPs are created equal, as explained above. He wasn’t interested. That discussion has now been deleted from his blog. He also denied me permission to reproduce that portion of it which occurred in email.
Such is the attitude I typically encounter when dealing with self-styled gurus and experts. There is much talk of science and data, but as soon as you express doubt regarding their claims, there is a quick resort to insult and posture. Ironic, given that doubt and criticism are the basic mechanisms that give science the credibility that such charlatans seek to cloak themselves in.
Why Must Functional Size Be A Single Number?
The appeal, and hence the popularity, of FPs is their reduction of the complex notion of software functional size to a single number. The simplicity is attractive. But what basis is there for believing that such a single-figure expression of functional size is even possible?
Consider this analogy. When you walk into a clothing store, you characterize your size using several different measures. One figure for shirt size, another for trouser size, another for shoe size and another for hat size. What if, by way of misguided reductionism, we were to try and concoct a single measure of clothing size and call it Clothing Points. We could develop all sorts of rules and regulations for counting Clothing Points, including weighting factors accounting for age, diet, race, gender, disease and so on. We might even find that if we sufficiently controlled the influence of external factors, given the limited variations of the human form, we might eventually be able to find some limited context in which Clothing Points were a semi-reasonable assessment of the size of all items of clothing. We could then walk into a clothing store and say “My size is 187 Clothing Points” and get a size 187 shirt, size 187 trousers, size 187 shoes and size 187 hat. The items might even fit, although we would likely sacrifice some comfort for the expediency and convenience of having reduced four dimensions down to a single dimensionless number.
The search for a grand unified “measure” of functional size may be just as foolhardy as the quest for uni-metric clothing.
Conclusion
The continued use and acceptance of Function Point Analysis in software development should be a source of acute embarrassment to us all. It is a prime example of muddle-headed, pseudo-scientific thinking that has persisted only because of the general ignorance of measurement theory and valid experimental methodology that exists in the development community. We need to stop fabricating and embellishing arbitrary sets of counting rules. In doing so, we are treating these formulae as if they were incantations whose magic can only manifest when precisely the correct wording has been discovered, but whose inner workings must forever remain a mystery. Rather, we need to go back to basics and work towards understanding the fundamental technical dimensions that contribute to the many and varied notions of an application’s functional size. How can we hope to measure something when we can’t even precisely define what that something is? Empiricism holds some promise as a means to improve software development practices, but the pseudo-empiricism of Function Point Analysis is little more than numerological voodoo.
Programming and the Scientific Method19
In 1985 Peter Naur wrote a rather cryptic piece entitled Programming as Theory Building20 in which he drew an analogy between software development and the scientific method. Since then, other authors have attempted to co-opt this analogy as a means of enhancing the perceived credibility of particular programming practices. This post aims to explain the analogy between the scientific method and programming, and to explore the limitations of that analogy.
The Scientific Method
There is no canonical representation of the scientific method. Different sources will explain it in different ways, but they are all referring to the same logical process. For the purposes of this discussion, I will adopt a simplified definition of the scientific method, considering it to be comprised of the following activities repeated in a cyclic manner:
- Model – Form a simplified model of a system by drawing general conclusions from existing data.
- Predict – Use the simplified model to make a specific prediction about how the system will behave when subject to particular conditions.
- Test – Test the prediction by conducting an experiment.
If the test confirms our prediction, we return to step 2 and make a new prediction based upon the same model. Otherwise, we return to step 1 and revise our model so that it accounts for the results of our most recent test (and all preceding tests).
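Rendered as control flow, the cycle looks something like the following sketch. The callables are placeholders to be supplied by whatever investigation is at hand; only the loop structure is taken from the description above.

```python
# Schematic model/predict/test loop. build_model, next_prediction and
# run_experiment are placeholders for whatever the investigation supplies.

def investigate(observations, build_model, next_prediction, run_experiment,
                rounds=10):
    model = build_model(observations)                   # step 1: model
    for _ in range(rounds):
        conditions, predicted = next_prediction(model)  # step 2: predict
        actual = run_experiment(conditions)             # step 3: test
        observations.append((conditions, actual))
        if actual != predicted:
            # The prediction failed: revise the model so that it accounts
            # for *all* observations so far, not merely the latest one.
            model = build_model(observations)
        # If the prediction held, keep the model and predict again.
    return model
```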
More formal descriptions of the scientific method often include the following terms:
- Hypothesis
- A testable statement accounting for a set of observations. It is equivalent to the model in the above description.
- Theory
- A well supported and well tested hypothesis or set of hypotheses.
- Fact
- A conclusion confirmed to such an extent that it would be reasonable to offer provisional agreement.21
An Example Of The Scientific Method
Suppose you are given a sealed black box that has only three external features – two toggle switches marked A and B, and a small lamp. By playing around with the switches you notice that certain combinations of switch positions result in the lamp lighting up. Your task is to use the scientific method to develop a theory of how the box operates. In other words, to create a model which can account for the observed behavior of the box.
Round 1
- Model
- Casual observation suggests that the switches and lamp are connected in circuit with an internal power source. Let’s suppose that this is the case, and that the two toggle switches are wired in series.
- Predict
- If our model is accurate, then we should find that turning both switches on causes the lamp to light up.
- Test
- We get the box, turn both switches on and find that the lamp does indeed light up. Our model has been partially verified. But there are other predictions we can make based upon it.
Round 2
- Model
- As in experiment 1.
- Predict
- If our model is accurate, then we should find that turning switch A off and switch B on causes the lamp to go out.
- Test
- We get the box, turn switch A off and switch B on and find that the lamp actually lights up. Our prediction was incorrect, therefore our model is wrong.
Round 3
- Model
- Now we need to rework our model so that it correctly accounts for all our observations thus far. Then we can use it as a basis for further prediction. Suppose the box were wired with the two toggle switches in parallel. That would account for our observations from rounds 1 and 2. Let’s make that our new model.
- Predict
- If this new model is accurate, then we should find that turning switch A on and switch B off causes the lamp to light up.
- Test
- We get the box, turn switch A on and switch B off and find that the lamp actually goes off. Our prediction was incorrect; therefore our new model is wrong.
Round 4
- Model
- Once again, we need to reformulate our model so that it correctly accounts for all of our existing observations. After some thought, we realize that if the box were wired so that only switch B affected the lamp, with switch A out of the circuit entirely, then this would account for all of our existing observations, as well as giving us a new prediction to test.
- Predict
- If this latest hypothesis is true, then we should find that turning switch A off and switch B off causes the lamp to go out.
- Test
- We get the box, turn switch A off and switch B off and observe that the lamp does indeed go out. Our prediction was correct, and our model is consistent with our observations from all four experiments.
You can see why the scientific method is sometimes described as being very inefficient – there is a lot of trial and error involved. But it’s important to note that it’s not random trial and error. If we just made random predictions and then tested them through experiment, all we would end up with is a disjoint set of cause/effect observations. We would have no way of using them to predict how the system would behave under situations that we hadn’t already observed. Instead, we choose our predictions deliberately, guided by the intent of testing a particular aspect of the model currently being considered. In this way, each experiment either goes some way toward confirming the model, or confuting it.
Note that all we can ever have is a model of the system. We make no pretense to know the truth about the system in any absolute sense. Our model is simply useful, at least until new observations are made that our model can’t account for. Then we must change it to accommodate the new observations. This is why all knowledge in science (even that referred to as fact) is actually provisional and continually open to challenge.
A Programming Example
The following example demonstrates how software development is similar to the scientific method.
The task is to develop an application which models the behavior of the black box in the above example. The software will present a simple GUI with two toggle buttons marked A and B, and an icon which can adopt the appearance of a lamp turned on or off. The lamp icon should appear to be turned on as if the lamp were a real lamp connected to an internal power source, and the toggle buttons were toggle switches, with switch B in circuit with the lamp, and switch A out of circuit.
The table below compares the activities in the scientific method with their programming counterparts. Keep these analogs in mind as you read through the following example.

| Scientific Method | Programming |
| --- | --- |
| **Model** – Form a simplified model of a system by drawing general conclusions from existing data. | Developing a mental model of how the software works. |
| **Predict** – Use the simplified model to make a specific prediction about how the system will behave when subject to particular conditions. | Taking a particular case of interaction with that model, and predicting how the software will respond. |
| **Test** – Test the prediction by conducting an experiment. | Subjecting software to a test and getting a result. |
Round 1
- Model
- Unlike experimentation, we begin by assuming our model is correct. It is created from our requirements definition and states “The lamp icon should appear to be turned on as if the lamp were a real lamp connected to an internal power source, and the toggle buttons were toggle switches, with switch B in circuit with the lamp, and switch A out of circuit.”
- Predict
- If the software is behaving correctly, toggling both buttons on should result in the lamp icon going on.
- Test
- We run the software, toggling the buttons A and B on, and observe that the lamp icon does indeed come on. So far our hypothesis has been confirmed; which is to say, the software behaves as the requirements say it should. But there are other behaviors specified by the requirements.
Round 2
- Model
- As per round 1
- Predict
- If the software is behaving correctly, then toggling button A off and button B on will cause the lamp icon to go on.
- Test
- We run the software, toggle button A off and button B on, and find that the lamp icon actually turns off. Our prediction was incorrect; therefore our software is not behaving as per its requirements. Instead of adjusting our model to suit the software, we adjust the software to suit the model, i.e. we debug it. In the software world we can change the “reality” we are observing to behave however we want, unlike the real world, where we have to adjust our model to fit an invariant reality. Once the software behaves in a manner consistent with the above prediction, we repeat our test from round 1 (i.e. regression test) to confirm that the prediction made there still holds, i.e. that we haven’t “broken” the software reality.
Round 3
- Model
- As per round 1.
- Predict
- If the software is behaving correctly, then toggling button A on and button B off should cause the lamp icon to turn off.
- Test
- We run the software, toggle button A on and button B off and find that the lamp icon actually turns on. Our prediction was incorrect; therefore our software is in error. Once again we debug the software until it behaves in a manner consistent with the above prediction. Then we regression test by repeating the tests from rounds 1 and 2.
Round 4
- Model
- As per round 1.
- Predict
- If the software is behaving correctly, then toggling buttons A and B off should cause the lamp icon to turn off.
- Test
- We run the software, toggle buttons A and B off and find that the lamp icon does indeed turn off. Our prediction was correct; therefore the software is behaving as per its requirements.
Notice the critical difference between programming and experimentation. In experimentation, reality is held invariant and we adjust our model until the two are consistent. In programming, the model is held invariant and we adjust our reality (the software) until the two are consistent.
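Collapsed into code, the model and the four predictions above amount to something like this minimal sketch (the GUI layer is omitted and the function name is invented); note how the asserts double as the regression suite that rounds 2 and 3 require us to keep re-running:

```python
# The requirement: the lamp tracks switch B alone; switch A is out of circuit.
def lamp_is_on(switch_a: bool, switch_b: bool) -> bool:
    return switch_b  # switch_a deliberately ignored

# Predictions from rounds 1-4, kept together so that after each debugging
# step the whole suite can be re-run as a regression test.
assert lamp_is_on(True, True)        # round 1: A on, B on  -> lamp on
assert lamp_is_on(False, True)       # round 2: A off, B on -> lamp on
assert not lamp_is_on(True, False)   # round 3: A on, B off -> lamp off
assert not lamp_is_on(False, False)  # round 4: A off, B off -> lamp off
```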
Limits Of The Analogy
Rote performance of the model/predict/test cycle does not mean that one is doing science, or even that one’s activities are science-like. There are critical attributes of the way these activities are carried out that must be met before the results have scientific validity. Two of these are objectivity and reproducibility. Some authors have taken the analogy between scientific method and programming too far by neglecting these attributes.
McCay3 contends that pair programming is analogous to the peer review process that scientific results undergo before being published. The reviewers of a scientific paper are chosen so that they are entirely independent of the material being reviewed, and can perform an objective review. They must have no vested interest in the material itself, and no relationship to the researcher or anyone else involved in the conduct of the experiment. To this end, scientific peer reviews are often conducted anonymously. Clearly this independence is missing in pair programming. Both parties have been intimately involved in the production of the material being reviewed, and as a coauthor each has a clear personal investment in it. They have participated in the thought processes that lead to the code being developed, and so can no longer analyze the material in an intellectually independent manner.
Mugridge22 contends that the continuous running of a suite of regression tests is equivalent to the concept of scientific reproducibility. But here again, the independence is missing. A single researcher arriving at a particular result is not enough for those results to be considered credible by the scientific community. Independent researchers must successfully replicate these results, as a way of confirming that they weren’t just a chance occurrence, or an unintentional byproduct of situational factors. But running regression tests does not provide such confirmation, because each run of the regression tests is conducted under exactly the same circumstances as the preceding ones. The same tests are executed in the same environment over and over again, so there is no independence between one execution and the next. Thus the confirming effect of scientific reproducibility is lost.
Both Mugridge and McCay try and equate the XP maxim “do the simplest thing that could possibly work” (DTSTTCPW) with Occam’s Razor. Occam’s razor is a principle applied to hypothesis selection that says “Other things being equal, the best hypothesis is the simplest one, that is, the one that makes the fewest assumptions.” Because the scientific hypothesis is analogous to the system metaphor in XP, the XP equivalent of Occam’s Razor would be “Other things being equal, the best system metaphor is the simplest one, that is, the one that makes the fewest assumptions.” However XPers often invoke DTSTTCPW with regard to implementation decisions, not choice of metaphor. Indeed, the metaphor is one of the least used of XP practices.23
Additionally, the “other things being equal” part of Occam’s Razor is vital, and it is neglected in XP’s DTSTTCPW slogan. We evaluate competing hypotheses with respect to the criteria of adequacy24, which provide a basis for assessing how well each hypothesis increases our understanding. The criteria include testability, fruitfulness, simplicity and scope. Note that simplicity is only one of the factors to consider. The scope of a hypothesis refers to its explanatory power: how much of reality it can explain and predict. We prefer a hypothesis of broader scope, because it accounts for more natural phenomena. In a programming context, suppose we have two competing models of a piece of software’s operation. One is more complex than the other, but the more complex one also has greater scope. Which one is better? It’s a subjective decision, but it should be clear that considering simplicity alone is a naive basis for hypothesis selection.
Conclusion
OK, so there are parallels between the scientific method and programming. Aside from the intellectual interest, what value is there in recognizing these parallels?
Naur claims that the theory of a piece of software corresponds to the model that the programmer builds up in their head of how it works. Such a theory might say “The software is like a box with two toggle buttons and a lamp”, or “The software is like an assembly line with pieces being added on as the item proceeds”. Perhaps multiple metaphors are used. Once a programmer has a theory (model) of the software in their head, they can talk about and explain its behavior to others. When they make changes to the code, they do so in a way that is consistent with the theory and therefore “fits in” with the existing code base well. A programmer not guided by such a theory is liable to make modifications and extensions to the code that appear to be “tacked on” as an afterthought, and not consistent with the design philosophy of the existing code base. I believe there is some validity in this notion.
Cockburn then extends this by claiming that this theory is what is essential to communicate (in documentation or otherwise) from one generation of programmers to the next: “What should you put into the documentation? That which helps the next programmer build an adequate theory of the program”. He also sees this as validation of the “System Metaphor” practice from XP. Perhaps so, but I think there is only limited utility in identifying what has to be communicated. The real problem is identifying how to communicate; how to persist that knowledge in a robust form, and transfer it from one programmer to another as new programmers arrive on a project and old ones leave.
From Tulip Mania to Dot Com Mania25
“Those who cannot remember the past are condemned to repeat it.” – George Santayana
Those of us working in IT tend to think of ourselves as being modern, savvy and much more advanced than our forebears. This conviction is often accompanied by a certain degree of hubris, and a somewhat derisive attitude towards older technologies and practitioners. You’ve probably encountered this ageist bias in your own workplace, or even displayed it yourself. Older members of our profession are viewed as outdated and irrelevant. Older programming languages such as C and FORTRAN are viewed as inherently inferior to those more recently introduced, such as Java and C#. Contempt for that which has come before us is as commonplace as the fascination with novelty and invention that breeds it.
In our struggle to stay abreast of the rapid rate of change in our industry, our focus is so intensely upon the present and immediate future that we neglect the lessons of the past. We excuse our parochialism by kidding ourselves that the pace of technological change makes any comparison with the past all but irrelevant anyway. But here lies a serious error in thinking – for although technology changes rapidly, people do not. Throughout history, for example, large groups of people have repeatedly succumbed to mass panics, group delusions and popular myths. Notable events are:
- The Martian Panic of 1938, in which many Americans became convinced that a radio broadcast of H.G. Wells’ The War of the Worlds was a news broadcast of an actual Martian invasion, leading some to flee their homes to escape the alien terror.26
- The Roswell Flying Saucer crash of 1947, a myth sustained by many even today.
- The widespread belief in Satanic Ritual Abuse of children in America in the 1970s and 1980s.
- The Witch Mania of the 15th-17th centuries on multiple continents, exemplified by the Salem witch trials of 1692.
- The Face on Mars myth of 1976.
It is easy to dismiss such phenomena as unique to their times, the like of which could never be experienced by modern, technology-aware, scientifically informed people such as ourselves. But we view our modern world with old brains. Psychologically, we have the same predilections and foibles as the witch-hunters and alchemists of centuries past. We still experience greed, we still feel a need to belong to a group, and we can still sustain false and irrational beliefs if we see others doing the same.
To illustrate our continuing susceptibility to irrational group behaviors, consider the Tulip Mania of the 1630s, which exhibits striking parallels with the dot-com mania that would follow it more than 350 years later.
Tulip Mania
The collecting of tulips began as a fashion amongst the wealthy in Holland and Germany in the late 16th century27. The popularity of the flower spread to England in 1600, and filtered down from the upper class to the middle class. By 1635 the mania had reached its peak amongst the Dutch, and preposterous sums were being paid for bulbs of the rarer varieties. A single bulb of the variety Admiral Liefken sold for 4400 florins, and a Semper Augustus for 5500 florins, at a time when a sheep cost 10 florins; a single bulb, in other words, was worth as much as 550 sheep.
In 1636 the demand for rare tulips became so great that regular marts for their sale were established on the Stock Exchange of Amsterdam. At this time, speculation in tulip bulbs appeared, and those fortunate enough to buy low and sell high quickly grew rich. Seeing their friends and colleagues profiting from the tulip mania, ordinary citizens began converting their property into cash and investing in bulbs. All were convinced that Europe’s current infatuation with tulips would continue unabated for the foreseeable future and that vast wealth awaited those who could satiate the frenzied demands that were sure to come from the rest of Europe.
But the more prudent began to see that this artificial price inflation could not be sustained for much longer. As confidence dropped, so too did the market price of tulips – never to rise again. Those caught with significant investments in bulbs were ruined, and Holland’s economy suffered a blow from which it took many years to recover.
There are obvious similarities with the dot-com boom – the artificial escalation of value, the widening scope of investors, the confusion of popularity with substance, the progression from investor over-confidence to widely held belief, and finally, the sudden deflation of value prompted by the growing awareness of the role that non-financial factors were playing in the trend.
Conclusion
It has always been the province of recent generations to view the mistakes of earlier generations with a contempt derived from the assumption that they are somehow immune to such follies. Those of us who are more technology-aware than most are particularly prone to this. And yet even the geekiest techno-junkie can fall prey to the same psychological and sociological traps that have plagued our species for centuries. Indeed, far from immunizing us against metaphysical thinking, it seems that the sheer success of science has led many to deliberately pursue “alternative” beliefs as a way of restoring some feeling of mystery and wonder to their lives. A 1990 Gallup poll of 1236 adult Americans found that 52% believed in astrology, 46% in ESP and 19% in witches.28 The result is that superstition and technology are both coexistent and symbiotic. As software developers, we need to heed the lessons of the mass manias of the past, acknowledge that we are still psychologically vulnerable to them today, and guard against their re-emergence by making a deliberate effort to think critically about the trends, fashions and hype that so predominate in our industry.
- First published 19 Oct 2003 at http://www.hacknot.info/hacknot/action/showEntry?eid=30
- The Demon-Haunted World, C. Sagan and A. Druyan, Ballantine Books, 1996
- How To Win An Argument, 2nd Edition, M. Gilbert, Wiley, 1996
- First published 18 Jan 2004 at http://www.hacknot.info/hacknot/action/showEntry?eid=45
- First published 22 Mar 2004 at http://www.hacknot.info/hacknot/action/showEntry?eid=49
- First published 28 Jun 2004 at http://www.hacknot.info/hacknot/action/showEntry?eid=59
- The Skeptic’s Dictionary, R. Carroll, Wiley and Sons, 2003. http://www.skepdic.com/
- Did Adam and Eve Have Navels?, M. Gardner, W.W. Norton and Company, 2000
- Software Measurement: A Necessary Scientific Basis, N. Fenton, IEEE Trans. Software Eng., Vol. 20, No. 3, 1994
- The Problem with Function Points, B. Kitchenham, IEEE Software, March/April 1997
- Statistical Guidelines for Contributors to Medical Journals, Altman, Gore, Gardner, Pocock, British Medical Journal, Vol. 286, 1983
- Why We Should Use Function Points, S. Furey, IEEE Software, March/April 1997
- Comparison of Function Point Counting Techniques, D.R. Jeffery, G. Low, M. Barnes, IEEE Trans. Software Eng., Vol. 19, No. 5, 1993
- Measurement and Estimation, Burris
- http://www.agilemanagement.net/Articles/Weblog/WorldClassVelocity.html
- First published 21 Aug 2004 at http://www.hacknot.info/hacknot/action/showEntry?eid=64
- Programming as Theory Building, Peter Naur
- Why People Believe Weird Things, Michael Shermer
- if (extremeProgramming.equals(scientificMethod)), Larry McCay
- Agile and Iterative Development, C. Larman
- How to Think About Weird Things, 3rd edition, T. Schick and L. Vaughn, McGraw Hill, 2002
- First published 5 Jun 2004 at http://www.hacknot.info/hacknot/action/showEntry?eid=56
- Hoaxes, Myths and Manias, R. Bartholomew and B. Radford, Prometheus Books, 2003
- Extraordinary Popular Delusions and The Madness of Crowds, Charles Mackay, Wordsworth Editions, 1995
- Why People Believe Weird Things, Michael Shermer, Henry Holt and Company, 2002