CIDOC Relational Data Model
A Guide

by Patricia Ann Reed
April 1995

Copyright (C) 1994-1995, International Documentation Committee of the International Council of Museums (CIDOC)

The CIDOC Data Model may be reproduced and shared without restriction as long as this copyright notice is retained, except that it may not be licensed or sold for profit as a portion of any software product, and it may not be included in or distributed with commercial products or otherwise distributed by commercial concerns to their clients or customers without the written permission of the Chair of CIDOC's Working Group for the Development and Distribution of the CIDOC Data Model.

This model was developed by volunteer contributors as a public service, and is furnished without warranty of any kind. Neither the International Council of Museums, nor its International Documentation Committee, nor the individual authors, nor any other institution or individual that has contributed to its development and documentation warrant this model in any way.

__________________________________________________________________

Table of Contents

Introduction
I. Purpose of a Relational Data Model
II. Logical Data Model - What It Is, What It Isn't
    A. Metadata
    B. Principles for Creating Metadata
    C. Data Model and Database Schema
    D. Logical Data Groups (LDGs) and Logical Data Elements (LDEs)
III. Standards for Defining and Naming Logical Data
    A. Defining Logical Data
    B. Naming Logical Data
    C. Adapting Standards to Local Environments
IV. Data Dictionary Reports

__________________________________________________________________

INTRODUCTION

The CIDOC Data Model Working Group is creating a relational data model as a prerequisite to recommending a relational data structure for the interchange of museum information worldwide. Advances in database technology and processing offer opportunities for using information flexibly and efficiently when data is organized and stored in relational structures.
This guide is for those who want a better understanding of relational data modeling - its purpose, its nature, and the standards used in creating the CIDOC model. The examples used are found in the CIDOC model reports.

A relational data model defines what the data is rather than how it is used, because data is used in multiple applications to serve multiple functions. For example, data is collected about Object, not Object-on-loan or Object-being-photographed or Object-acquired-from-donor. Loan, photograph, and acquire are functional contexts - the settings in which Object information is used. In relational technology, each automated function uses the same Object data.

This is a sea change in thinking for many museum professionals responsible for the management of their collections. If data was automated in the past, it was stored in flat file structures where duplicating the data was the only way to automate multiple functions or activities. Today's technologies, supported by a well-defined relational data model, offer better solutions.

I. Purpose of a Relational Data Model

Data is the raw material from which information is produced, and it can be stored on disk, on tape, or in a file drawer (or in a brain!). Information is data processed and presented in meaningful form and context. Data is collected, modeled, and documented to serve functions. In other words, data must support what is done and provide the information needed to perform daily tasks and plan for the future.

Data separated into its smallest discrete parts and defined precisely can be organized in a structure which achieves the following objectives:

* Eliminate logical data redundancy, thereby reducing physical data redundancy.
* Ensure consistency of logical data names and definitions within and across systems and disciplines.
* Enable multiple use of physical databases.
* Enable greater flexibility of data usage.
* Enhance the capability to deliver decision support information.
* Provide data structures which enable data interchange across systems and disciplines.

It is the last objective which is the goal of the CIDOC Data Model Working Group.

II. Logical Data Model - What It Is, What It Isn't

At the highest level of abstraction, there are five big entities which can be defined and documented:

    People
    Places
    Things
    Events
    Concepts

These five entities and the relationships among them can document anything in the entire spectrum of human (or inhuman) experience. This highest-level model is sometimes called a Conceptual Data Model. It contains major entities, broadly defined and without attributes or details.

The task of a Logical Data Model is to particularize the Conceptual Data Model entities and relate them to each other, creating a data structure which supports the intellectual and physical worlds in which work is done.

A logical data model does not contain real data. Rather, it contains the infrastructure into which real data fits. This section describes the infrastructure and distinguishes it from the physical database structure.

A. Metadata

Data in a relational data model is called metadata, i.e., data about data. Metadata provides

* a commonly understood body of data which can be used in multiple applications and
* common data structures which users from diverse process areas can populate with unique data values.

B. Principles for Creating Metadata

When defining metadata, the following principles apply:

* Logical data is defined in the abstract and without redundancy.
* Logical data is defined independent of, and outside the context of, functions, processes, and automated applications.
* Logical data is defined by users from diverse functional areas who need the same logical data.
* Logical data element names are consistent and meaningful; they are created according to naming standards. (See Section III, Standards for Defining and Naming Logical Data.)
* Composite data is broken down into its smallest meaningful parts, each of which is defined separately.

C. Data Model and Database Schema

The logical data model contains the characteristics of real data, whereas a physical database contains real data. The following comparison characterizes the differences between metadata in a relational data model and data descriptions (also called data schema or record layouts) for the contents of a physical database.

Relational Data Model:

* Logical, abstract in nature.
* Contains metadata, i.e., data about data.
* Contains information about the attributes of data entities and the logical relationships among them.
* Stable, reusable product; logical data definitions seldom change; relationships among data entities seldom change.
* Logical data is defined and documented independent of, and outside the context of, functions, processes, and automated applications.
* Logical data is defined without redundancy.
* Composite data is broken down and logically defined at the level of the smallest meaningful part.

Physical Database:

* Physical in nature.
* Contains real data.
* Contains a body of data facts which are instances, or occurrences, of logical data entities.
* Technologies change; over time, changes in hardware and software force migrations to new information systems implementations.
* Physical data is stored and used in the context of one or more automated or manual processes to satisfy a functional need.

D. Logical Data Groups (LDGs) and Logical Data Elements (LDEs)

The logical data model contains information about two levels of data: Logical Data Group (LDG) and Logical Data Element (LDE). In this discussion, the terms "LDG" and "Element" are used. LDGs are groups of Elements. Elements are the discrete pieces of data which describe and define entities.

1. LDGs

LDGs are logical groups of data which define and describe entities.
They can be equated roughly to a physical data record, database schema, or relational table. In the CIDOC model, LDGs are designated as primary, repetition, recursion, type, or intersection in the "LDG TYPE" category.

A primary entity is something which is important to an organization's work, in this case museum work. There are two questions to ask in determining whether an entity is primary: "Can it stand alone, or is it merely an attribute?" and "If it can stand alone, do we want to define its attributes and document it as a separate entity?"

Some primary entities originally were thought to be attributes of another entity. These former attributes became primary entities because they were not intrinsic to the entity itself, and because users wanted to keep detailed information about them. An example is STYLE, which originally was considered an attribute of OBJECT. However, STYLE is not dependent on OBJECT for its existence - it can stand alone, has attributes of its own, and users want to describe it in more detail. New technologies make possible this discrete separation of entities.

Primary entities in the current CIDOC model are ALPHABET, AWARD, CALENDAR, CLASSIFICATION, COLOR, CONCEPT, EVENT, LANGUAGE, MATERIAL, METHOD, OBJECT, OCCUPATION, OPUS, PEOPLE-GROUP, PEOPLE-PERSON, PLACE, ROLE, STYLE, and TIME-SPAN.

A repetition entity is created when an attribute can occur more than one time for any given occurrence of an entity. An example is OBJECT MARK LDG. MARK is an attribute of OBJECT. Because more than one mark may appear on any given OBJECT, MARK is removed from the OBJECT LDG and becomes a repetition entity. OBJECT MARK LDG has its own repetition entity called OBJECT MARK TRANSCRIPTION LDG because there can be more than one TRANSCRIPTION for any given MARK. OBJECT MARK TRANSCRIPTION LDG has its own repetition entity called OBJECT MARK TRSCRPTN TRANSLN LDG because there can be more than one TRANSLATION of any given TRANSCRIPTION.
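The chain of repetition entities described above can be sketched as one physical implementation among many. The following is a minimal illustration using Python's built-in sqlite3 module, not part of the CIDOC model itself. Underscores replace the spaces in the CIDOC names; OBJECT OCC IDN and OBJECT MARK OCC IDN come from this guide, while the other key columns, the TXT columns, and all data values are hypothetical and chosen only for the example.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE OBJECT (
        OBJECT_OCC_IDN INTEGER PRIMARY KEY
    );
    -- Repetition entity: many MARKs per OBJECT.
    CREATE TABLE OBJECT_MARK (
        OBJECT_MARK_OCC_IDN INTEGER PRIMARY KEY,
        OBJECT_OCC_IDN INTEGER NOT NULL REFERENCES OBJECT (OBJECT_OCC_IDN),
        OBJECT_MARK_TXT TEXT                   -- hypothetical Element
    );
    -- Repetition entity: many TRANSCRIPTIONs per MARK.
    CREATE TABLE OBJECT_MARK_TRANSCRIPTION (
        OBJECT_MARK_TRANSCRIPTION_OCC_IDN INTEGER PRIMARY KEY,  -- hypothetical
        OBJECT_MARK_OCC_IDN INTEGER NOT NULL
            REFERENCES OBJECT_MARK (OBJECT_MARK_OCC_IDN),
        OBJECT_MARK_TRANSCRIPTION_TXT TEXT     -- hypothetical Element
    );
    -- Repetition entity: many TRANSLATIONs per TRANSCRIPTION.
    CREATE TABLE OBJECT_MARK_TRSCRPTN_TRANSLN (
        OBJECT_MARK_TRSCRPTN_TRANSLN_OCC_IDN INTEGER PRIMARY KEY,  -- hypothetical
        OBJECT_MARK_TRANSCRIPTION_OCC_IDN INTEGER NOT NULL
            REFERENCES OBJECT_MARK_TRANSCRIPTION (OBJECT_MARK_TRANSCRIPTION_OCC_IDN),
        OBJECT_MARK_TRSCRPTN_TRANSLN_TXT TEXT  -- hypothetical Element
    );
    """
)

# One OBJECT with two MARKs; the first MARK has two TRANSCRIPTIONs,
# and the first TRANSCRIPTION has two TRANSLATIONs (all values invented).
conn.execute("INSERT INTO OBJECT VALUES (1)")
conn.executemany(
    "INSERT INTO OBJECT_MARK VALUES (?, ?, ?)",
    [(10, 1, "stamp on base"), (11, 1, "signature on rim")],
)
conn.executemany(
    "INSERT INTO OBJECT_MARK_TRANSCRIPTION VALUES (?, ?, ?)",
    [(100, 10, "first reading"), (101, 10, "second reading")],
)
conn.executemany(
    "INSERT INTO OBJECT_MARK_TRSCRPTN_TRANSLN VALUES (?, ?, ?)",
    [(1000, 100, "English translation"), (1001, 100, "French translation")],
)

# Count the MARKs recorded for OBJECT 1.
mark_count = conn.execute(
    "SELECT COUNT(*) FROM OBJECT_MARK WHERE OBJECT_OCC_IDN = ?", (1,)
).fetchone()[0]

# Walk the chain from OBJECT down to TRANSLATION.
translation_count = conn.execute(
    """
    SELECT COUNT(*)
    FROM OBJECT_MARK_TRSCRPTN_TRANSLN AS t
    JOIN OBJECT_MARK_TRANSCRIPTION AS s
      ON s.OBJECT_MARK_TRANSCRIPTION_OCC_IDN = t.OBJECT_MARK_TRANSCRIPTION_OCC_IDN
    JOIN OBJECT_MARK AS m
      ON m.OBJECT_MARK_OCC_IDN = s.OBJECT_MARK_OCC_IDN
    WHERE m.OBJECT_OCC_IDN = ?
    """,
    (1,),
).fetchone()[0]

print(mark_count, translation_count)
```

Adding a third MARK, or a further TRANSLATION, requires only a new row in the appropriate table; no new data element has to be defined.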
A recursion entity is an entity which is related to itself. It is indicated by the term "RELATED" in the LDG name. PEOPLE RELATED LDG is an example of a recursion entity, where two instances of PEOPLE LDG are associated. In PEOPLE RELATED LDG, there are two occurrences of the Elements PEOPLE OCC IDN and ROLE OCC IDN which represent either two persons, two groups of persons, or a person and a group of persons; an Element called PEOPLE PEOPLE RELATIONSHIP NAM which documents the nature of the association between the two PEOPLE; and Elements documenting the time during which the relationship occurred.

An intersection entity is created by linking together two or more primary, repetition, or type entities. Intersection entities are indicated in the CIDOC model by an ampersand (&). An example is OBJECT & EVENT LDG, where an OBJECT is associated with an EVENT. The intersection entity contains Elements which document the association of the OBJECT and the EVENT, i.e., the relationship between them and the time during which the relationship occurred.

A type entity is a subset of a primary entity. It has special attributes which set it apart from the larger entity.

2. Elements

Although "Element" and "attribute" sometimes are used interchangeably, in the context of this document there is a difference: "Element" is a data fact logically defined and contained within an LDG. "Attribute" is an intrinsic characteristic of an entity. Elements define the attributes of entities, answering the question "What is it?" They can be equated roughly to the data fields in a flat file or the columns in a relational table.

Elements comprise the contents of LDGs. An Element is dependent on an entity - it cannot exist apart from it. In the CIDOC Model, for example, "OBJECT LDG" contains the Elements "OBJECT OCC IDN", "OBJECT CNT", and "OBJECT MEDIUM SUPPORT DISPLAY," which describe OBJECT and cannot exist apart from OBJECT.
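The intersection entity described earlier in this section can be illustrated with a small sketch, again using Python's sqlite3 module as one possible physical implementation. The table OBJECT_EVENT stands in for OBJECT & EVENT LDG with the ampersand omitted. OBJECT OCC IDN and EVENT OCC IDN are taken from this guide; EVENT_NAM, OBJECT_EVENT_RELATIONSHIP_NAM, OBJECT_EVENT_BEGIN_TME, and all data values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE OBJECT (
        OBJECT_OCC_IDN INTEGER PRIMARY KEY
    );
    CREATE TABLE EVENT (
        EVENT_OCC_IDN INTEGER PRIMARY KEY,
        EVENT_NAM TEXT                       -- hypothetical Element
    );
    -- Intersection entity linking OBJECT and EVENT.
    CREATE TABLE OBJECT_EVENT (
        OBJECT_OCC_IDN INTEGER NOT NULL REFERENCES OBJECT (OBJECT_OCC_IDN),
        EVENT_OCC_IDN INTEGER NOT NULL REFERENCES EVENT (EVENT_OCC_IDN),
        OBJECT_EVENT_RELATIONSHIP_NAM TEXT,  -- hypothetical Element
        OBJECT_EVENT_BEGIN_TME TEXT,         -- hypothetical Element
        PRIMARY KEY (OBJECT_OCC_IDN, EVENT_OCC_IDN)
    );
    """
)

# The OBJECT is stored once and each EVENT is stored once (invented data).
conn.execute("INSERT INTO OBJECT VALUES (1)")
conn.executemany(
    "INSERT INTO EVENT VALUES (?, ?)",
    [(1, "acquisition"), (2, "cataloguing"), (3, "loan")],
)

# The links carry the relationship and the time it began.
conn.executemany(
    "INSERT INTO OBJECT_EVENT VALUES (?, ?, ?, ?)",
    [
        (1, 1, "acquired in", "19940115"),
        (1, 2, "catalogued in", "19940201"),
        (1, 3, "loaned in", "19950301"),
    ],
)

# List every EVENT in which OBJECT 1 participated, oldest first.
rows = conn.execute(
    """
    SELECT e.EVENT_NAM
    FROM OBJECT_EVENT AS oe
    JOIN EVENT AS e ON e.EVENT_OCC_IDN = oe.EVENT_OCC_IDN
    WHERE oe.OBJECT_OCC_IDN = ?
    ORDER BY oe.OBJECT_EVENT_BEGIN_TME
    """,
    (1,),
).fetchall()
event_names = [row[0] for row in rows]
print(event_names)
```

A recursion entity such as PEOPLE RELATED LDG could be sketched the same way, with both foreign keys pointing at the same table.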
Elements defining many of the attributes of entities are documented in repetition LDGs. For example, MARK is an attribute of OBJECT, although no Elements describing MARK appear in the OBJECT LDG. The Elements describing MARK appear in the repetition entity OBJECT MARK LDG because there can be more than one MARK for any given OBJECT.

III. Standards for Defining and Naming Logical Data

Using standards to define and name LDGs and Elements assures consistency and reliability in metadata retrieval and usage. These standards are for logical, not physical, data. Standards do not preclude the use of traditional, familiar data names in data entry screens, forms, reports, and the like.

A. Defining Logical Data

*** Standard: Logical data is defined without reference to and outside the context of process, function, or physical information system.

Relational:
    OBJECT & EVENT LDG
    OBJECT & EVENT LDG
    OBJECT & EVENT LDG

Non-Relational:
    OBJECT LOANED
    OBJECT ACQUIRED
    OBJECT CATALOGUED

In the non-relational example above, the words LOANED, ACQUIRED, and CATALOGUED describe the context in which an OBJECT was used, and they do not describe intrinsically the OBJECT itself. They are EVENTs in which an OBJECT participated. In the relational example, the OBJECT is stored once in an information system, each EVENT is stored once, and OBJECTs and EVENTs are linked together when appropriate.

*** Standard: Differences between data elements and data values are resolved.

Relational:
    PEOPLE PERSON LDG
    ROLE LDG

Non-Relational:
    CALLIGRAPHER
    PAINTER
    PRINTER
    DONOR

The non-relational examples above are typical of data defined in a flat-file OBJECT record. In the non-relational examples four pieces of data are defined as roles, and each will be populated with a person's name. Conceivably, the same person's name could populate all four of the non-relational data definitions. In addition, that same person may be logically related to additional objects.
Relational modeling and technology solve both these anomalies by separating a person from a role he or she plays and creating a data group for each. Once information about a person is stored in a database, it can be linked to many roles related to the same object, and it can be linked to many different objects. Another benefit occurs when a new ROLE is desired: instead of defining a new piece of data, one need only add a new data value to the ROLE database.

*** Standard: An Element appears in one, and only one, LDG. The exception is a foreign key, which may appear in multiple intersection LDGs.

Relational:
    OBJECT LDG
    OBJECT MARK LDG

Non-Relational:
    MARK1
    MARK2
    SIGNATURE

This example was taken from a flat-file OBJECT record. These three data elements appeared in every OBJECT record, whether they were populated or not. Accepting that SIGNATURE is a kind of MARK, there are three MARK data elements in the flat-file OBJECT record. By removing the MARKs from the OBJECT record and creating a Repetition Entity called OBJECT MARK LDG, it is now possible to document an unlimited number of MARKs without defining additional data elements. Data elements within the OBJECT MARK LDG describe the MARK fully, eliminating the need for the SIGNATURE data element in the flat-file structure.

B. Naming Logical Data

Data dictionary names reflect the abstract, process-independent nature of a relational data model. The following standards for naming logical data impose a structure which facilitates understanding a complex set of data requirements.

*** Standard: Nouns are used in singular form.

Relational:
    OBJECT LDG
    EVENT ACTION LDG
    OBJECT MARK LDG

Non-Relational:
    OBJECTS LDG
    EVENT ACTIONS LDG
    OBJECT MARKS LDG

*** Standard: Logical data names are ordered by facet, or segment, according to the following formula:

    PRIMEWORD MODIFIER(S) CLASSWORD/SUFFIX

The facets are separated by a space. CLASSWORD applies only to Elements, and SUFFIX applies to LDGs.
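As a rough illustration of the naming formula, the sketch below splits a logical name into its facets in Python. It is a simplification made for this example, not a tool supplied with the CIDOC model: it assumes the PRIMEWORD is always the first single word, it uses the CLASSWORDs and the LDG suffix listed later in this guide, and it does not handle intersection names containing an ampersand. The function name parse_logical_name and the LogicalName type are hypothetical.

```python
from typing import NamedTuple, Tuple

# CLASSWORDs and the LDG suffix as listed in this guide.
CLASSWORDS = {"AMT", "CDE", "CNT", "FLG", "IDN", "NAM", "TME", "TXT"}
LDG_SUFFIX = "LDG"


class LogicalName(NamedTuple):
    primeword: str
    modifiers: Tuple[str, ...]
    last_facet: str  # the CLASSWORD or the SUFFIX
    kind: str        # "Element" or "LDG"


def parse_logical_name(name: str) -> LogicalName:
    """Split a space-separated logical name into its facets."""
    facets = name.split()
    if "&" in facets:
        raise ValueError("intersection names are not handled in this sketch")
    if len(facets) < 2:
        raise ValueError("a name needs a PRIMEWORD and a CLASSWORD or SUFFIX")
    last = facets[-1]
    if last == LDG_SUFFIX:
        kind = "LDG"
    elif last in CLASSWORDS:
        kind = "Element"
    else:
        raise ValueError(f"{last!r} is neither a CLASSWORD nor the LDG suffix")
    # Everything between the first and last facet is a MODIFIER.
    return LogicalName(facets[0], tuple(facets[1:-1]), last, kind)


element = parse_logical_name("OBJECT MARK OCC IDN")
group = parse_logical_name("OBJECT MARK TRANSCRIPTION LDG")
print(element)
print(group)
```

A name written out of order, such as LDG OBJECT, is rejected because its last facet is neither a CLASSWORD nor the suffix.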
The purpose of using CLASSWORD and SUFFIX is to indicate at a glance what kind of dictionary entry one sees. The dictionary can be expanded to document other kinds of information such as Users, Applications, Systems, and Modules, for which one might choose suffixes of USE, APP, SYS, and MOD.

Following are standards for each facet of a logical name:

*** Standard: PRIMEWORD represents the name of a primary entity to which an LDG or Element belongs. It must be the first facet in a name.

Relational:
    OBJECT LDG
    OBJECT CONDITION NAM
    OBJECT MEASURE LDG
    OBJECT MARK OCC IDN

Non-Relational:
    LDG OBJECT
    NAME CONDITION OBJECT
    MEASURE OBJECT LDG
    IDN OCC OBJECT MARK

*** Standard: MODIFIER qualifies and further defines an LDG or an Element emanating from a major entity. Ordering of multiple modifiers is left to right from general to specific.

Examples:
    OBJECT LDG
    OBJECT MARK LDG
    OBJECT MARK TRANSCRIPTION LDG
    OBJECT MARK TRSCRPTN TRANSLN LDG

(TRANSCRIPTION and TRANSLATION are abbreviated in the last example because of software length constraints.)

In the above example the placement of modifiers is left to right from general to specific. OBJECT MARK LDG indicates that MARK is an attribute of OBJECT; OBJECT MARK TRANSCRIPTION LDG indicates that TRANSCRIPTION is an attribute of a MARK on an OBJECT; and OBJECT MARK TRSCRPTN TRANSLN LDG indicates that TRANSLATION is an attribute of a TRANSCRIPTION of a MARK on an OBJECT. The LDGs above are examples of the Repetition Entity.

*** Standard: The key identifier of an LDG is indicated by an Element containing the standard modifier "OCC". The modifier "OCC" immediately precedes the Element CLASSWORD "IDN" (see CLASSWORDs below). Key Identifier in this context is defined as the unique identifier by which a computer recognizes a unique occurrence of a data group. The identifier may be machine-generated to guarantee uniqueness.
Examples:
    EVENT OCC IDN
    CLASSIFICATION TERM OCC IDN
    PLACE ADDRESS OCC IDN

*** Standard: CLASSWORD defines the intrinsic or inherent nature of an Element. It is the last facet of an Element name.

The following CLASSWORDs are mutually exclusive categories which define the nature of an Element and answer the question "What is it?"

* AMT Amount (numeric)
  Indicates a monetary amount. (How much?)

* CDE Code (alphanumeric)
  Predefined values which represent specific names or terms and are formulated by the systematic use of symbols, letters, or numbers. Ex: Codes for country names, i.e., UK is a code for the United Kingdom, FR for France, etc. Codes may be standard, universal, or specific to a local system. Multiple code sets may exist for the same entity, as is the case for country names.

* CNT Count (numeric)
  Indicates a non-monetary numeric quantity or accumulation. (How many?)

* FLG Flag (alphanumeric)
  Indicates a binary state or condition where only two opposite values are possible, and where the values have no function other than to indicate a described state or condition. (YES or NO, ON or OFF, IS or IS NOT)

* IDN Identifier (alphanumeric)
  Non-coded data which identifies an entity; not necessarily unique. (Ex: Museum catalog number, donor catalog number, exhibition catalog number, specimen tag number, and employee number cannot be guaranteed to be unique within a database.)

* NAM Name (alphanumeric)
  Alphanumeric data which documents an appellation, or name, given to a person or organization, place, thing, event, or concept. May be a single word or a short phrase; different in nature from "TXT".

* TME Time (alphanumeric)
  Identifies a duration or period of time, including dates, or a specific instant in which something occurs. (When?)
  Format is standard ISO (International Organization for Standardization) format: YYYYMMDDHHMMSS.SS

      YYYY  year
      MM    month
      DD    day
      HH    hour
      MM    minute
      SS    second
      .SS   tenths, hundredths of second

* TXT Text (alphanumeric)
  Textual data which is imprecisely defined, has an unpredictable structure, and does not fit into one of the above classifications. Typically consists of notes, remarks, descriptions, and comments.

The following examples illustrate how CLASSWORD is used in naming a data element:

Relational:
    OBJECT PART CNT
    CALENDAR NAM
    CONCEPT APPELLATION NAM
    PLACE ADDRESS BUILDING IDN

Non-Relational:
    NUMBER OF OBJECT PARTS
    NAME OF CALENDAR
    NAME GIVEN TO CONCEPT
    BUILDING NUMBER

*** Standard: The standard SUFFIX for LDGs is "LDG".

Examples:
    OBJECT LDG
    OBJECT MARK LDG

*** Standard: The ampersand - "&" - is the standard character for documenting the linking of one LDG with another, indicating relationships among entities.

Examples:
    OBJECT & EVENT LDG
    OBJECT NOTE & PEOPLE PERSON LDG
    OBJECT & PEOPLE & ROLE LDG

*** Standard: Each facet in a logical data name is spelled in full. Abbreviations are used when needed to accommodate the 32-character length limit imposed by the current software which documents the model. If abbreviations are necessary, begin with the MODIFIER facets, from specific to general (right to left), when possible. CLASSWORD and SUFFIX are not abbreviated.

C. Adapting Standards to Local Environments

When reviewing the standards in this document, keep several considerations in mind, especially if information will be stored in a commercial software package such as a data dictionary or a CASE (computer assisted software engineering) tool. A few of these considerations are listed below:

* Some software does not permit spaces to be used between facets of a name; a dash or underscore may be required.
  Examples:
      OBJECT & EVENT LDG
      OBJECT-&-EVENT-LDG
      OBJECT_&_EVENT_LDG

* The software which produced the CIDOC Data Model documentation accommodates use of the ampersand (&) to link one LDG to another. Other software products preclude the use of special characters. Another single character may be substituted, or the linking character may be omitted altogether.

  Examples:
      OBJECT & EVENT LDG
      OBJECT A EVENT LDG
      OBJECT N EVENT LDG
      OBJECT EVENT LDG

* Some software packages allow only upper case or only mixed case alphabetic characters in a dictionary name, while others allow a choice of upper case, lower case, mixed case, and special characters including spaces.
* A dictionary name may be limited in length to a specific number of characters. The software used in the accompanying reports allows a maximum of 32 characters, thus forcing abbreviations in complex names. The abbreviations are predetermined to assure consistency.
* Become familiar with all the features of a software package before setting standards for its use.
* If multiple software packages are used, consider compatibility.

IV. Data Dictionary Reports

The term data dictionary is used to describe 1) a repository for the definition of logical metadata and 2) a DBMS-specific description of a schema, or record layout, for storing physical data. It is the first definition which documents the CIDOC data model.

There are three reports comprising the documentation package: LIST OF ENTITIES BY TYPE, ENTITY CONTENTS REPORT, and USED-BY DIRECTLY.

The LIST OF ENTITIES alphabetically lists first the Elements and then the LDGs.

The ENTITY CONTENTS REPORT contains a full description of Elements and LDGs, entries appearing together in alphabetical order. The VALUES attribute (field) in an Element entry is intended to further define logically the Element by providing examples of real data values which might appear in a physical implementation.
The CONTAINS attribute (field) in an LDG entry lists the Elements which comprise the LDG. Other fields are self-explanatory.

The USED-BY DIRECTLY report lists alphabetically each Element along with the LDGs in which it is found.

* Pat Reed - Smithsonian Institution, OIT, A&I 2310, MRC 433 *
* Ph:(202)357-4059 Fax:(202)786-2687 Email:preed@sivm.si.edu *