Feed aggregator
The Trouble with Tribbles: Why Dumb Data Multiplies Like Bunnies
I’ve been reading a lot of sci-fi lately, mixed with books about data, which got me thinking about how dumb data is, what the costs of stupid data are, and how smarter data could lower our information costs: for access, for development, and for maintenance.
Why Data are like Tribbles
In a classic original Star Trek episode, The Trouble with Tribbles, a gerbil-like organism is reproducing. (You can see 3 minutes of it here, courtesy of CBS.) Tribbles aren’t that smart (cute, but not smart), and they don’t live long, but they multiply and cause problems. While our data doesn’t multiply at the same rate, I’m more surprised than ever at just how many copies of the data there are inside the enterprise, and it certainly lives a long time. Backups, sure, those make sense. Data warehouses, arguably a necessary evil. Let’s back those up. Data marts and operational data stores, well, ok, since we’re going in this direction, and there’s processing that just has to be done on our data to run a real-time enterprise. Our own personal copies? We have to personalize the data, massage it a little more. In Excel. And its presentation, too. In PowerPoint. Let’s back those up too.
The Marginal Cost of a Copy? It’s not zero.
We’re on a slippery slope. We each can argue the absolute necessity of the different pieces. Heck, our jobs depend on some of those pieces! Business speed matters. Disk is cheap. The marginal cost of a digital copy is close to zero. But to be absolutely clear, the marginal cost of a digital copy is greater than zero. There are costs, some direct, some hidden, but definitely real costs, and they may be greater than we suspect. It’s not just disk, it’s that the copies we make are each less likely to reflect the ground truth of the data. It’s the increasing ambiguity of the data, and our inability to find what we should be using in the first place, that keeps cost mounting. The data isn’t just more difficult to find because there are copies, it’s more difficult to use…because there are copies.
An Echo of a Reflection of a Shadow
Ever made a copy of a copy of a copy? It’s not supposed to matter in the digital world, but in fact, it does. We combine, we aggregate, we summarize. We necessarily have to gloss over some detail. We lose that detail. We lose the ability to track back to the source. We use the corporate copy. We use the departmental copy. What provenance there is is so difficult to navigate that it does not get used. We use our best (read “most easily accessible”) copy of the data and go with that. Is that good enough? It has to be, because we’ve got deadlines to meet, a business to run. It was good enough last time.
What if I could guarantee I could get to the data, correctly each time? What if every part, every product, every person had its own identity, and we were guaranteed to be able to find it? What if, when I found the data, I also automatically found every piece of data it was known to be related to? And their identities? Would we consider the data “smart” then? I sure would.
If every piece of meaningful data had its own identity to begin with, I might not have to copy it so many times. When I did copy it, I’d know what I was copying from and why, and I’d be able to verify that at any time. Data would become a lot less ambiguous. Developers would spend less time finding the data, and more time using their creativity to solve problems. So would information users. Simply providing the links to and from data, while not a panacea, makes the data more useful, keeping it alive, rather than dead in the application. There are some simple principles we can use to make these possibilities realities, and they don’t require a lot of money or elapsed time. They require a change in our thinking about data, our applications, and how we can add value. If we knew, for instance, that our knowledge about a particular thing would be captured and available, guaranteed, then we would be more likely to take the time to capture our knowledge. These simple principles help us do these things: store data once in a place where we know we can get it every time, capture knowledge about that data in a consistent way, and lower the costs for information access and maintenance.
A Brief History of Classification
The earliest known means of classifying an object and keeping it in order are girginakku. These are ancient Mesopotamian clay tablets that were attached to scrolls and tablets and used to identify the contents. Examples approximately 5,300 years old can be found in the British Museum.
Girginakku at Glencairn. These clay tablets were used for many purposes, including cataloging.
The famous Library of Alexandria in Egypt housed one of the earliest forms of library catalog in the third century BCE. The library reportedly housed more than 120,000 scrolls which were stored in bins categorized by subject. Each of these bins was labeled, and the labels were indexed in Pinakes. The taxonomy of subjects was devised by Callimachus, the second recorded librarian at Alexandria. He created a system with 11 main categories: six genres and five kinds of prose (six categories for non-fiction, five for fiction). These were rhetoric, law, history, medicine, mathematics, natural science, epic, tragedy, comedy, lyric poetry and miscellaneous. The influences of this system are still seen today in such systems as the Dewey Decimal Classification system.
Beginning in the 8th century CE, the Islamic library at Baghdad, the House of Wisdom, began collecting books in earnest. The knowledge of papermaking had been acquired from Chinese prisoners and books proliferated. This is akin to the explosion of digital information we see today. These books were organized into genres, categories and sub-categories to make them easier to manage until the library was destroyed by a Mongol invasion in the mid 13th century.
The Leiden University Library, The Netherlands, created the first printed institutional library catalog shortly after it opened in the late 16th century. The book was titled Nomenclator, and was a list of all authors whose books - in manuscript or print - were available in the library. The Library continued on the leading edge until the 20th century: it was among the first to use cards for its catalog and in 1969 began work on an automated system which was bought by OCLC in 2000. OCLC maintains WorldCat, the Worldwide Catalog, a machine system for libraries large and small, private and public, worldwide.
In 1735 Carolus Linnaeus published his Systema Naturæ, more commonly known as the Linnaean or Animal Kingdom taxonomy. Most of us are familiar with this system from grade school biology - there are three kingdoms (animals, plants, minerals) which are divided into classes, orders, genera and species. This is purely hierarchical in nature, and while it is capable of greater things, it is used mostly by non-biologists as an information placement tool - akin to navigation taxonomies today. When you speak to people about taxonomy, this is often what they think of, and it is very useful to have some examples of similarity and differentiation at the ready to explain how your own taxonomy relates.
Nearly a century and a half later, in 1876, Melvil Dewey created the Dewey Decimal system, which organizes artifacts by subject into 10 main categories. This system took hold quickly in the public and school libraries in the United States. The Library of Congress created its first dictionary catalog a couple of decades later in 1898, the Library of Congress Subject Headings. This is the basis for cataloging and classifying all of the works that are in or are sent to the Library of Record in the USA. These catalog entries are the basis for a fee-based service which generates income for the LoC. It charges other libraries for copies of its catalog cards so that the subscribing library doesn’t have to do the cataloging work itself.
In the middle of the 20th century an Indian mathematician and scholar by the name of Ranganathan created Colon Classification, a system still in use in Indian libraries today. He posited that everything could be organized under 5 key facets, combined appropriately for the resource: Personality, Matter, Energy, Space, and Time. Each of these facets has a controlled value entered which is obtained from a taxonomy or thesaurus. The delimiter between the facets is a colon, and they are always entered in the PMEST order. This type of faceted taxonomy is a more practical solution for cataloging items in a digital world. Rather than having to have a list of 10,000 items, one can have 4 lists of 10 items, which is much easier to manage: 10 x 10 x 10 x 10 combinations still cover 10,000 possibilities, with only 40 terms to maintain. This is NOT a rule - it is an example. Each application has its own business requirements.
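To make the arithmetic concrete, here is a minimal Python sketch of a faceted label. It is an illustration only: the four facet names and the term codes are invented for the example and are not Ranganathan's actual schedules. It assumes each facet has its own small controlled list and that labels are always written in a fixed facet order, joined by colons.

```python
from itertools import product

# Hypothetical facets and term codes, invented for illustration.
# Four facets with ten controlled terms each, as in the example above.
FACET_ORDER = ["subject", "material", "location", "time"]
VOCAB = {
    facet: [f"{facet[0].upper()}{i}" for i in range(10)]
    for facet in FACET_ORDER
}


def classify(**terms):
    """Build a colon-delimited label, always in FACET_ORDER."""
    parts = []
    for facet in FACET_ORDER:
        term = terms[facet]
        # Only terms from the facet's controlled list are accepted.
        if term not in VOCAB[facet]:
            raise ValueError(f"{term!r} is not a controlled term for {facet}")
        parts.append(term)
    return ":".join(parts)


# Terms a human has to maintain versus distinct labels they can express.
terms_to_maintain = sum(len(terms) for terms in VOCAB.values())
possible_labels = sum(1 for _ in product(*VOCAB.values()))

# Keyword order does not matter; the output order is fixed by FACET_ORDER.
label = classify(time="T3", subject="S1", location="L7", material="M0")
print(terms_to_maintain, possible_labels, label)
```

The point of the design is that the vocabulary grows by addition (one more term in one list) while the expressive range grows by multiplication.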
Taxonomies in the enterprise reach back further than one thinks, but became known to researchers in 1858 when the NY Times began its index to the newspaper. It became such a valuable tool that publishers began indexing books and periodicals and publishing the results; H. W. Wilson is a great publisher of indexes. The Readers’ Guide to Periodical Literature is one that most school students are introduced to. Database providers and large academic/scholarly/professional publishers added this capability early on as well. Proquest/Gale/Cengage, Dialog, Factiva, Reuters, IEEE, ACM all have indexes. Large government organizations also have indexes organized by subject taxonomies or thesauri: NASA, DTIC, NIH, BLS, CIA, NAICS, SEC.
Taxonomies for the enterprise and the web as we know them today began as experiments in search improvements in the 1990s. Yahoo’s first release and Open Directory were clearly a librarian-like effort to organize the then small web. Those categorization structures were re-created within the realm of Natural Language Processing - math with letters. Pattern matching is the basis for much of what occurs in these systems for rules based categorization. In simplest terms, a rule which tags a piece of content with a term from the taxonomy is an if-then statement.
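As a rough illustration of that if-then idea, here is a small Python sketch. The taxonomy terms and patterns are hypothetical, made up for this example; real categorization engines use far richer rules. Each rule simply says: if this pattern matches the text, then apply this taxonomy term.

```python
import re

# Hypothetical rules: (taxonomy term, pattern). Invented for illustration.
RULES = [
    ("Cataloging", re.compile(r"\bcatalog(s|ing|ue)?\b", re.IGNORECASE)),
    ("Enterprise Search", re.compile(r"\b(enterprise search|search appliance)\b", re.IGNORECASE)),
    ("Ontologies", re.compile(r"\bontolog(y|ies)\b", re.IGNORECASE)),
]


def tag(text):
    """Return every taxonomy term whose rule fires on the text."""
    # IF the pattern matches THEN the content gets the term.
    return [term for term, pattern in RULES if pattern.search(text)]


tags = tag("The library printed its first catalog and later built an ontology.")
print(tags)
```

Pattern matching like this is easy to audit, since every tag can be traced to the rule that produced it, but it knows nothing about context, which is where the next paragraph picks up.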
Efforts are underway to transform semantic systems from known item or NLP derived labeling into systems capable of contextual understanding. Ontologies are the means by which much of this effort will be accomplished in the short term. An ontology is more advanced than a taxonomy as it can contain self-defined relationships beyond that of parent-child. It can also be used to infer data and reason over information. The World Wide Web Consortium is one of the key leaders in efforts for standards in this space, as a semantic space is what Tim Berners-Lee had in mind for the web from the beginning.
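For readers who like to see things in code, here is a toy Python sketch of what "infer data" can mean. It is not how any particular ontology language or reasoner works; the relationship names (is_a, created, created_by) and the two rules are assumptions made up for the example. Facts are stored as subject, predicate, object triples, and new facts are derived from a transitive parent-child relationship and from a declared inverse relationship.

```python
# Stated facts as (subject, predicate, object). Vocabulary is hypothetical.
TRIPLES = {
    ("Dewey Decimal Classification", "is_a", "Classification Scheme"),
    ("Classification Scheme", "is_a", "Knowledge Organization System"),
    ("Melvil Dewey", "created", "Dewey Decimal Classification"),
}

# A relationship beyond parent-child, with its declared inverse.
INVERSE = {"created": "created_by"}


def infer(triples):
    """Apply two simple rules until no new facts appear."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in facts:
            # Rule 1: every 'created' fact implies a 'created_by' fact.
            if p in INVERSE:
                new.add((o, INVERSE[p], s))
            # Rule 2: is_a is transitive.
            if p == "is_a":
                for s2, p2, o2 in facts:
                    if p2 == "is_a" and s2 == o:
                        new.add((s, "is_a", o2))
        if not new <= facts:
            facts |= new
            changed = True
    return facts


inferred = infer(TRIPLES) - TRIPLES
for fact in sorted(inferred):
    print(fact)
```

Neither derived fact was typed in by anyone; both follow from the stated facts plus the rules, which is the small-scale version of reasoning over information.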
Basic Material Repair
This is a re-working of a manual I created as the terminal assignment for a course I took in graduate school: “Basic Materials Repair” offered at Simmons College (Spring, 1999). Prof. Sheila Intner coordinated the class, and Todd Pattison of the Northeast Document Conservation Center (NEDCC) in Andover, MA provided lessons in repairs, resources, tools and a great deal of fun!
I finally “re-discovered” my course materials in 2011 and set about re-creating the text and images electronically, in a format usable by today’s computers. I am also incorporating the suggestions made by Prof. Intner which would have made my ’97’ a full ‘100!’ I hope the formatting updates make this easier to use. It is my intention to take this piece by piece and add photographic images and video.
I am releasing this work under a Creative Commons license. I do so in the hopes that others may add their knowledge, experience and questions to make it a better resource - I am far from being a true expert in conservation! In a nutshell, you may use this work for non-commercial purposes - i.e., you can’t make any money from it - as long as you give me credit for what I’ve created and release your new version with the same license. I credit Mr. Pattison for sharing his knowledge and expertise with my fellow students and me; all I ask is that you do the same. Thank you!
Basic Material Repair by Christine Manuel Connors is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
The Continuum
Many people ask what the difference between a taxonomy and ontology is. This slide is meant to address that, and more. Inspired by work by such folks as Margie Hlava, Mills Davis, Dave McComb and others, it's my take on the spectrum of complexity and power available from various means of organizing and classifying information.
~Christine
Continuum (slide presentation by Christine Connors)
Less is More
I'm doing some file cleanup and stumbled across a copy of a "CM Briefing" from several years ago - CMb 2005-13, entitled "More Users=Simpler CMS." It was written by James Robertson of Step Two Designs, an Australian consultancy with a specialty in intranets. I've known James for years; he has a solid background in intranet design, content management, user-centered design and knowledge management.
I'm writing this quick post because this briefing opens with:
In many projects, the plan is to deploy a new content management system (CMS) across the whole organisation. In these organisation-wide deployments, an assumption is made that a “big” CMS will be needed to meet the “enterprise” needs. In practice, a better rule is that the more users that will be accessing the CMS, the simpler (and more usable) the system should be.
YES! Less is more, even in the world of linked data. For years we've seen attempts at building very large, very complicated ontologies, taxonomies and metadata schemas for public use. The big ones have their place, but only for the right reasons and in fewer scenarios. What we've seen gain adoption on a larger scale are some relatively simple frameworks: Dublin Core & FOAF; more recently Open Graph Protocol and Schema.org.
Are there times when a large ontology is needed? Absolutely. Do you need one to get started? Heck no. Start small and simple.
First determine what you need: a simple schema with small controlled vocabularies? A lightweight ontology? That will depend on your goals for publishing data and the kinds of questions you want users to be able to ask of your data.
Next decide on the smallest number of elements you need to get the important data modeled. For example, an Address Record. You need a Street, Building Location, City, State and Zip Code (in the U.S.). Having a controlled vocabulary for the States will make your life much simpler. That's it; you're good to go. Move on to the next data problem.
Finally, encode in a way that will allow it to grow, integrate with other data sets, be usable in many applications and have reasonable maintenance requirements.
Keep it simple, until you need more.
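Here is one way the Address Record example might look in practice, as a minimal Python sketch rather than a recommended design. The class and function names, the three-state vocabulary, and the sample address are all hypothetical. The property names in the output (streetAddress, addressLocality, addressRegion, postalCode, addressCountry) come from the Schema.org PostalAddress type; folding Building Location into streetAddress is my assumption, since the example above treats it as a separate element.

```python
import json
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    """Controlled vocabulary for states (truncated here for brevity)."""
    MA = "Massachusetts"
    NY = "New York"
    PA = "Pennsylvania"


def parse_state(code):
    """Accept only codes that exist in the controlled vocabulary."""
    try:
        return State[code.strip().upper()]
    except KeyError:
        raise ValueError(f"{code!r} is not a known state code") from None


@dataclass(frozen=True)
class Address:
    street: str
    building_location: str
    city: str
    state: State
    zip_code: str

    def to_jsonld(self):
        """Encode with Schema.org PostalAddress so other tools can read it."""
        return {
            "@context": "https://schema.org",
            "@type": "PostalAddress",
            # Assumption: building location is folded into streetAddress.
            "streetAddress": f"{self.street}, {self.building_location}",
            "addressLocality": self.city,
            "addressRegion": self.state.name,
            "postalCode": self.zip_code,
            "addressCountry": "US",
        }


# Sample values are invented for illustration.
address = Address("1 Example Street", "Suite 100", "Springfield",
                  parse_state(" ma "), "01101")
doc = address.to_jsonld()
print(json.dumps(doc, indent=2))
```

Five elements and one small vocabulary are enough to get started, and because the output uses a shared public schema, it can grow or be merged with other data sets later without a rewrite.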
Semantic Link Podcast, Episode 11
In this month's episode of the Semantic Link podcast, we talk with R. Guha about Schema.org.
Guha provided us with clean answers about the current and future state of schema.org as he knows it. There is still a great deal of work to be done, decisions to be made and goals to be set.
I asked about support for the Enterprise. I was curious to know if support for the schemas will be baked into the algorithms of the Google Search Appliance, Google Mini, Custom Search, FAST, and other search solutions from players such as Autonomy. I do not believe it is a top priority at the moment, but Guha did indicate plans for that at Google and to discuss those options with other vendors. This is a tricky one - balancing 'turnkey' support with the ability to customize for each organization. Large organizations with dedicated enterprise search teams may wish to manage it themselves (as I did when I was at Raytheon) but smaller companies would no doubt appreciate simply being able to "turn it on." I hope that the requirements gathering is as open as the schema and vocabulary development. It is not unusual for a company to ask its best clients for input and feedback on new products; I would like to see these search providers go one step further - ask current and future clients about their needs!
I hope you enjoy this podcast - press play and then send us your questions!

