2008-03-19

On the word "bumblebee"

The story of the word bumblebee is curious, but (contra Mr. Burns of The Simpsons) it certainly doesn't lead back to a form like bumbled bee, in the way that ice cream leads back to iced cream, or the American form skim milk descends from the form skimmed milk still current elsewhere. The bee part is transparent, and there is a Middle English verb bomb(e)len, meaning to make a humming sound, presumably of imitative origin. So there you are.

However, it's clear that the older form was humble-bee, where hum(b)le is an intensive of hum, which is also presumably of imitative origin. Whether bumblebee is a new coinage based on bombelen, or whether it is an alteration of humble-bee by dissimilation, or a mixture of both, it's impossible to say.

But when we look in Pokorny's etymological dictionary of Indo-European for hum, we see it under the root kem²-, as expected by Grimm's Law, and with Lithuanian reflexes in k- and Slavic ones in ch- that also refer to humming noises and bees. That certainly does not sound imitative to me -- the sharp sound of [k] is nothing like a bee hum, which has no beginning and no end. So in the end the obvious imitative nature of bumblebee leads to a riddle wrapped in a mystery inside an enigma.

And there remains at least one dangling oddity: Pokorny also lists an Old Indic reflex (his abbreviation "Ai." stands for altindisch, that is, Sanskrit) meaning "yak". Yaks grunt (as the Linnaean name Bos grunniens indicates), they don't hum, and what is Old Indic doing with an inherited word for "yak" anyhow? English, like most modern languages, has borrowed its word from Tibetan.

2008-03-03

Elements or attributes?

Here's my contribution to the "elements vs. attributes" debate:

General points:

  1. Attributes are more restrictive than elements, and all designs have some elements, so an all-element design is simplest -- which is not the same as best.

  2. In a tree-style data model, elements are typically represented internally as nodes, which use more memory than the strings used to represent attributes. Sometimes the nodes belong to different application-specific classes, and in many languages the classes themselves also take up memory.

  3. When streaming, elements are processed one at a time (possibly even piece by piece, depending on the XML parser you are using), whereas all the attributes of an element and their values are reported at once, which costs memory, particularly if some attribute values are very long.

  4. Both element content and attribute values need to be escaped, so escaping should not be a consideration in the design.

  5. In some programming languages and libraries, processing elements is easier; in others, processing attributes is easier. Beware of using ease of processing as a criterion. In particular, XSLT can handle either with equal facility.

  6. If a piece of data should usually be shown to the user, use an element; if not, use an attribute. (This rule is often violated for one reason or another.)

  7. If you are extending an existing schema, do things by analogy to how things are done in that schema.

  8. Sensible schema languages, meaning RELAX NG, treat elements and attributes symmetrically. Older and cruder schema languages tend to have better support for elements.
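
To make general points 3 and 4 concrete, here is a minimal sketch in Python using the standard library's xml.sax. The Recorder class and the sample document are made up for illustration; other parsers and languages will differ in detail.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class Recorder(xml.sax.ContentHandler):
    """Hypothetical handler that just logs the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Every attribute of the element is delivered in this one call.
        self.events.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        # Element content may be delivered in several chunks.
        self.events.append(("chars", content))


doc = b'<item sku="A&amp;B" note="x &lt; y">A&amp;B</item>'
recorder = Recorder()
xml.sax.parseString(doc, recorder)

start_event = recorder.events[0]
text = "".join(e[1] for e in recorder.events if e[0] == "chars")

# Writing the same data back out: both positions need escaping.
as_content = escape("A&B")
as_attribute = quoteattr("x < y")
```

The start event carries both attribute values whole, while the text has to be reassembled from however many chunks the parser chose to deliver. Going the other way, escape and quoteattr do essentially the same job, which is why escaping shouldn't tip the design either way.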

Using elements:

  1. If something might appear more than once in a data model, use an element rather than introducing attributes with names like part1, part2, part3 ....

  2. If order matters between two pieces of data, use elements for them: attributes are inherently unordered.

  3. If a piece of data has, or might have, its own substructure, put it in an element: getting substructure into an attribute is always messy. Similarly, if the data is a constituent part of some larger piece of data, put it in an element.

  4. An exception to the previous rule: multiple whitespace-separated tokens can safely be put in an attribute. In principle, the separator can be anything, but schema-language validators are currently only able to handle whitespace, so it's best to stick with that.

  5. If a piece of data extends across multiple lines, use an element: XML parsers will change newlines in attribute values into spaces.

  6. If a piece of data is in a natural language, put it in an element so you can use the xml:lang attribute to label the language being used. Some kinds of natural-language text, like Japanese, also require annotations that are conventionally represented using child elements; right-to-left languages like Hebrew and Arabic may similarly require child elements to manage bidirectionality properly.
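
Rules 4 and 5 in this group can be seen directly with a small sketch in Python's xml.etree.ElementTree. The poem document and its attribute names are invented for the example.

```python
import xml.etree.ElementTree as ET

# The same two-line string appears once in an attribute and once in an element.
doc = """<poem title="line one
line two" tags="draft  short
haiku"><text>line one
line two</text></poem>"""

root = ET.fromstring(doc)

# The parser normalizes the attribute value: the newline becomes a space.
attr_title = root.get("title")

# Element content keeps its newline.
elem_text = root.find("text").text

# Whitespace-separated tokens survive normalization and split cleanly.
tags = root.get("tags").split()
```

Here attr_title comes back as a single line, elem_text still has its line break, and tags is a list of three tokens regardless of how the whitespace between them was written.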

Using attributes:

  1. If the data is a code from an enumeration, code list, or controlled vocabulary, put it in an attribute if possible. For example, language tags, currency codes, medical diagnostic codes, etc. are best handled as attributes.

  2. If a piece of data is really metadata on some other piece of data (for example, representing a class or role that the main data serves, or specifying a method of processing it), put it in an attribute if possible.

  3. In particular, if a piece of data is an ID (either a label or a reference to a label elsewhere in the document) for some other piece of data, put the identifying piece in an attribute. When it's a label, use the name xml:id for the attribute.

  4. Hypertext references (hrefs) are conventionally put in attributes.

  5. If a piece of data is applicable to an element and any descendant elements unless it is overridden in some of them, it is conventional to put it in an attribute. Well-known examples are xml:lang, xml:space, xml:base, and namespace declarations.

  6. If terseness is really the most important thing, use attributes, but consider gzip compression instead -- it works very well on documents with highly repetitive structures.
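
Rule 5 describes inheritance down the tree. As a rough sketch, assuming Python's xml.etree.ElementTree (which has no parent pointers, so the inherited value is passed down explicitly), resolving the effective xml:lang of each element might look like this. The function effective_langs and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_langs(elem, inherited=None):
    """Yield (tag, language) pairs, letting descendants inherit or override."""
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_langs(child, lang)


doc = """<book xml:lang="en">
  <title>Bees</title>
  <quote xml:lang="fr"><p>Bourdon</p></quote>
  <note>Hum</note>
</book>"""

langs = list(effective_langs(ET.fromstring(doc)))
```

The title and note elements pick up "en" from the root, while quote overrides it with "fr" and passes that on to its own child. Doing the same with a child element instead of an attribute would mean inventing a convention for which child carries the setting and how far it reaches.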

Michael Kay says:

Beginners always ask this question.
Those with a little experience express their opinions passionately.
Experts tell you there is no right answer.

I say:

Newbies always ask:
     "Elements or attributes?
Which will serve me best?"
     Those who know roar like lions;
     Wise hackers smile like tigers.
          --a tanka, or extended haiku

Final words:

Break any or all of these rules rather than create a crude, arbitrary, disgusting mess of a design if that's what following them slavishly would give you. In particular, random mixtures of attributes and child elements are hard to follow and hard to use, though it often makes good sense to use both when the data clearly fall into two different groups such as simple/complex or metadata/data.