Discussion: more about keys

Let’s look again at the relation scheme diagram for Customers and Orders.

Customer-order 
        relation scheme diagram

Other views of this diagram: Large image - Data dictionary (text)

Remember that a super key is any set of attributes whose values, taken together, uniquely identify each row of a table—and that a primary key is the specific super key set of attributes that we picked to serve as the unique identifier for rows of this table. We are showing these attributes in the scheme diagram for convenience, understanding that keys are a property of the table (relation), not of the scheme.

Candidate keys

Before picking the pk, we need to identify any candidate key or keys that we can find for this table. A ck is a minimal super key; “minimal” means that if you take away any one attribute from the set, it is no longer a super key. If you add one attribute to the set, it is no longer minimal (but it’s still a super key). The word “candidate” simply means that this set of attributes could be used as the primary key, since we don’t want any more attributes in the pk than are necessary.

Whether a set of attributes constitutes a ck or not depends entirely on the data in the table—not just on whatever data happens to be in the table at the moment, but on any set of data that could realistically be in this table over the life of the database. Does the Customers pk {cFirstName, cLastName, cPhone} meet this test?

• A father and son with the same first and last names might be living together and have the same phone number. Adding the street address or the zip code wouldn’t help to distinguish them. On the other hand, we could assume that there is some other way by which the three attribute values can be made unique, perhaps with a middle name or initial in the cFirstName field. (U.S. father-son Presidents George Bush and George W. Bush are distinguished in this way.) With this assumption, our ck (and pk) works correctly.

• The attribute set {cFirstName, cLastName, cStreet} might also be a candidate key for the Customers table, if we make the same assumption about first and last names. With more than one ck, we try to pick the one that is most descriptive of the individual and/or least likely to change over time. There’s not much difference in this example.

Every table must have at least one candidate key; if you can’t find one, your design isn’t finished. Former heavyweight boxing champion George Foreman is said to have named every one of his sons “George Foreman.” If they were all customers of ours, living at the same address with the same phone number, we would need at least one more descriptive attribute—probably the birth date—to distinguish between them.

PK size might matter

You’re probably thinking that it’s a real nuisance and waste of space to copy all three of the Customer pk attributes to make the fk in Orders. If so, you’re right. Remember that we purposely designed the Customers table without considering its association with Orders. Now that it’s the parent in a one-to-many association, we have to ask if the pk is small enough to be copied into the child table. In some tables, it will be— so we’re done. In this one, it isn’t—so we’ll have to make up a pk that is small enough. There are two types of “made up” primary keys:

• A surrogate PK is a single, small attribute (such as a number) that has no descriptive value—it doesn’t tell us anything about the real-world individual. Most ID numbers are like this. Surrogate keys are created for the convenience of the database designer (only). They are a nuisance for database users, and should normally be hidden by the user interface of a database system.

• A substitute PK is a single, small attribute (such as an abbreviation) that has at least some descriptive value. Examples of substitute keys include the two-letter postal codes for the states of the United States and the three-letter codes for worldwide airports. Substitute keys are also created for the convenience of the database designer. They are frequently still a nuisance for database users, although perhaps less so than surrogate PKs.

Do not automatically add “ID numbers” (surrogate keys) or substitute keys to a table until you are sure that:

• There is at least one candidate key (before the surrogate is added),

• the table is a parent in at least one association, and

• there is no candidate key small enough for its values to be copied many times into the child table.

These rules apply to surrogate and substitute keys that you (and your co-workers) add to your own tables. However, you might find that a class type already has an attribute that appears to be a surrogate or substitute key, but has been defined by someone else—usually a standards-setting organization or a government agency. We call this attribute an external key. In the external organization’s database, there is a candidate key for it, whether or not you have access to it or include its value(s) in your own database. You may use it as a descriptive attribute in both UML and the relation scheme diagram. Like other descriptive attributes, it might or might not become part of a candidate key in your database. We will encounter many of these (such as the zip code, UPC, and ISBN) in our examples.

• There is one special case of an external key that requires careful handling in your database design: the United States social security number (SSN). Originally intended for use only to identify social security participants, it has now become so over-used as an identifier that access to it poses risks of serious damage to individuals, even including identity theft. Please do not ever use the SSN in your database unless you are required to do so by law (for example, to file tax information). Even then, do not use it as a primary key that would be viewable to everyone who can access your database.

Revising the relation scheme

All of what we have done here applies to the relation scheme and tables only—the UML class diagram doesn’t change (unless we have to add descriptive attributes). Our revised scheme, with surrogate pk customer ID, looks like this:

Customer-order 
        relation scheme diagram

Other views of this diagram: Large image - Data dictionary (text)

In the Orders scheme, the custID fk still represents the ck attributes from Customers—it tells us which customer placed the order. In the child table, it can be used as if it were a descriptive attribute in a candidate key or primary key, as shown here.

The Customers and Orders tables can be joined exactly as before, only this time the sole join attribute (intersection of the schemes) is the custID:

Customers
custidcfirstnameclastnamecphonecstreetczipcode
1234TomJewett714-555-121210200 Slater92708
5678AlvaroMonge562-333-41412145 Main90840
9012WayneDick562-777-30301250 Bellflower90840

Orders
custidorderdatesoldby
56782003-07-14Patrick
90122003-07-14Patrick
56782003-07-18Kathleen
56782003-07-20Kathleen

Customers joined to Orders
custidcfirstnameclastnamecphonecstreetczipcodeorderdatesoldby
5678AlvaroMonge562-333-41412145 Main908402003-07-14Patrick
9012WayneDick562-777-30301250 Bellflower908402003-07-14Patrick
5678AlvaroMonge562-333-41412145 Main908402003-07-18Kathleen
5678AlvaroMonge562-333-41412145 Main908402003-07-20Kathleen