mlcastle

  

“Common” PBCore Errors

PBCore is an XML metadata standard for public broadcasters. It is designed to store metadata about multimedia objects (files, tapes, and so on), and to enable broadcasters, producers, and others to exchange that metadata.

Working with Dave Rice of the WNET digital archive, I’ve been developing a tool (an early demo of which is online) to perform basic CRUD and import/export on PBCore data.

While there have been a few users of PBCore as an internal schema for storing metadata, cases where PBCore has lived up to its promise and actually been used to exchange metadata have been relatively rare to date.

However, as this starts to change, it is becoming more important that users adhere strictly to the PBCore specification. It can be difficult or impossible for users to exchange data if the recipient’s tool can’t read the output of the sender’s.

Not covered here

First, I want to be clear about a couple of things that this post is not about.

This post is not about errors, or things I would have done differently, in the PBCore standard itself. There are one or two of the former, and a bit more of the latter, but the standard has been released, and we’re better off using it than not.

This post is also not about content errors that people may make in their PBCore entries. If you create a PBCore record which claims that “Joe Smith” is a genre rather than a contributor, or which has a URL in the formatLocation field that can’t be resolved, then that record may not be useful to your recipient. Even so, their tool should be able to read in the record and display this incorrect data to the user.

Instead, we’re going to focus on errors which may prevent the importing tool from being able to do the right thing—or in some cases, anything at all—with the exported PBCore record.

“Common” errors

Note the scare quotes around the word “common”. This is because there have been (so far) so few instances of actual exchange of PBCore records that an error that I’ve seen only once can quite possibly be considered “common”.

I’ve simplified some of the examples and left out names and identifying details. The point is not to embarrass or criticize any of the PBCore pioneers who have begun to exchange data with each other but instead to highlight interoperability errors so future users won’t make them. However, all of these errors have actually been seen “in the wild.”

Invalid XML

PBCore is a dialect of XML. XML is a relatively simple format, so it can be tempting for exporters to write their own XML generators. One batch of PBCore records we received was apparently generated by a tool which forgot to properly escape ampersands in some fields, so we’d get records like this:

<contributor>Alice & Bob Productions, Inc.</contributor>

instead of

<contributor>Alice &amp; Bob Productions, Inc.</contributor>

The former cannot be read by standard XML parsers, so it can be difficult or impossible for an importing tool to do anything with an XML file containing such a mistake.
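To see what “cannot be read” means in practice, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The helper name try_parse is made up for this illustration; it is not part of any PBCore tool.

```python
import xml.etree.ElementTree as ET

bad = "<contributor>Alice & Bob Productions, Inc.</contributor>"
good = "<contributor>Alice &amp; Bob Productions, Inc.</contributor>"

def try_parse(xml_text):
    """Return the element's text, or None if the XML is not well-formed.

    try_parse is a hypothetical helper for this example only.
    """
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        # A bare ampersand makes the whole document unreadable.
        return None

bad_result = try_parse(bad)    # None: the parser rejects the record
good_result = try_parse(good)  # "Alice & Bob Productions, Inc."
```

Note that the parser does not skip just the bad field. It rejects the entire document, which is why one unescaped character can block a whole batch.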

The best way to avoid this class of error and ensure that your XML is “well-formed” is to avoid rolling your own XML generator. Instead, use something like XmlBuilder for Ruby, Jakarta ECS for Java, or your favorite XML library in your favorite programming language. Or, if your source data is some other dialect of XML, you may be able to use XSLT to do what you want.
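As one illustration of letting a library do the escaping, here is a small sketch in Python using the standard xml.etree.ElementTree module (my choice for the example; any real XML library should behave similarly). The function name contributor_xml is invented for this sketch.

```python
import xml.etree.ElementTree as ET

def contributor_xml(name):
    """Serialize a contributor element; the library escapes special characters.

    contributor_xml is a hypothetical helper, not part of any PBCore tool.
    """
    elem = ET.Element("contributor")
    elem.text = name  # pass the raw string; do not escape it by hand
    return ET.tostring(elem, encoding="unicode")

xml_out = contributor_xml("Alice & Bob Productions, Inc.")
# xml_out == '<contributor>Alice &amp; Bob Productions, Inc.</contributor>'
```

The point of the design is that the calling code never handles escaping at all, so there is nothing to forget.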

Null elements

This problem actually occurs in a number of the example files on the PBCore website!

According to the PBCore specification, the pbcoreContributor tag “may appear as many times as necessary, but when it does appear, the contributor tag [must] appear inside it. The contributor role is optional.” This means that it is invalid to write

<pbcoreContributor/>

If you have the pbcoreContributor tag, you must provide content for it:

<pbcoreContributor>
  <contributor>Carol</contributor> <!-- contributor is required -->
  <contributorRole>Panelist</contributorRole> <!-- contributorRole is optional -->
</pbcoreContributor>

If, on the other hand, you have no information about the contributors to a work, just omit the pbcoreContributor tag from your output file.

This applies, of course, not just to pbcoreContributor, but to almost all of the element classes in a PBCore document.
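In an export script, the simplest way to follow this rule is to create the container only once you know you have something to put in it. The sketch below assumes Python's xml.etree.ElementTree; the add_contributor helper and the "record" root element are placeholders for illustration, not names from the PBCore specification.

```python
import xml.etree.ElementTree as ET

def add_contributor(parent, name, role=None):
    """Append a pbcoreContributor only when a contributor name is present.

    add_contributor is a hypothetical helper for this example.
    """
    if not name or not name.strip():
        return None  # nothing known: omit the container entirely
    wrapper = ET.SubElement(parent, "pbcoreContributor")
    ET.SubElement(wrapper, "contributor").text = name.strip()  # required
    if role and role.strip():
        # contributorRole is optional, so it is also skipped when blank
        ET.SubElement(wrapper, "contributorRole").text = role.strip()
    return wrapper

record = ET.Element("record")  # placeholder root, not the real PBCore root
add_contributor(record, "Carol", "Panelist")
add_contributor(record, "", None)  # adds nothing
```

After these two calls the record holds exactly one pbcoreContributor, and no empty tag is ever written.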

Non-standard picklists

This is a bit different from the other problems, because files which exhibit this behavior should not break PBCore-compliant parsers. Nonetheless, interoperability could be impeded in more subtle ways.

For many PBCore fields, the standard recommends “picklists” of suggested values. As an example, the following are the recommended values for the formatColors element:

  • B&W
  • Grayscale
  • Color
  • B&W with grayscale sequences
  • B&W with color sequences
  • Grayscale with B&W sequences
  • Grayscale with color sequences
  • Color with B&W sequences
  • Color with grayscale sequences
  • Other

This means that it is probably not a good idea to make a record like this:

<pbcoreInstantiation>
  ...
  <formatColors>Black and white</formatColors>
</pbcoreInstantiation>

The following would be preferred:

<pbcoreInstantiation>
  ...
  <formatColors>B&amp;W</formatColors>
</pbcoreInstantiation>

But, if you truly have a need which is not met by any of the suggested values, go ahead and do something which makes sense:

<pbcoreInstantiation>
  ...
  <formatColors>Red-cyan 3D anaglyph</formatColors>
</pbcoreInstantiation>
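One way an exporter might apply this is to map local spellings onto the suggested values where a match is obvious, and pass everything else through unchanged. This is only a sketch: the list is the formatColors picklist quoted above, while the ALIASES table and the normalize_format_colors function are invented for illustration and would need to reflect your own data.

```python
# Suggested values for formatColors, as listed above.
FORMAT_COLORS = [
    "B&W", "Grayscale", "Color",
    "B&W with grayscale sequences", "B&W with color sequences",
    "Grayscale with B&W sequences", "Grayscale with color sequences",
    "Color with B&W sequences", "Color with grayscale sequences",
    "Other",
]

# Hypothetical local spellings mapped to picklist values (an assumption).
ALIASES = {"black and white": "B&W", "colour": "Color"}

def normalize_format_colors(value):
    """Return the picklist spelling when one clearly matches, else the input."""
    cleaned = value.strip()
    by_lowercase = {v.lower(): v for v in FORMAT_COLORS}
    key = cleaned.lower()
    if key in by_lowercase:
        return by_lowercase[key]
    if key in ALIASES:
        return ALIASES[key]
    return cleaned  # a genuinely new value is allowed through

normalize_format_colors("Black and white")       # "B&W"
normalize_format_colors("Red-cyan 3D anaglyph")  # unchanged
```

The returned string is still raw text, so the ampersand in "B&W" should be escaped by whatever XML library writes the record.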

Strange concatenation

Related to the above problem, it seems that some users will concatenate multiple values into one audienceLevel element:

<pbcoreAudienceLevel>
  <audienceLevel>Intermediate (grades 7-9)|High School (grades 10-12)</audienceLevel>
</pbcoreAudienceLevel>

This says that there is one audience level with a very complicated name. It’s better to create a separate pbcoreAudienceLevel element, containing a single audienceLevel, for each audience level:

<pbcoreAudienceLevel>
  <audienceLevel>Intermediate (grades 7-9)</audienceLevel>
</pbcoreAudienceLevel>
<pbcoreAudienceLevel>
  <audienceLevel>High School (grades 10-12)</audienceLevel>
</pbcoreAudienceLevel>
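If your source database stores several audience levels in one delimited field, the export step can split them before writing any XML. The following sketch assumes Python's xml.etree.ElementTree and a pipe separator like the one in the example above; add_audience_levels and the "record" root are hypothetical names for illustration.

```python
import xml.etree.ElementTree as ET

def add_audience_levels(parent, raw, separator="|"):
    """Write one pbcoreAudienceLevel per value found in a delimited string.

    add_audience_levels is a hypothetical helper for this example.
    """
    levels = [part.strip() for part in raw.split(separator) if part.strip()]
    for level in levels:
        wrapper = ET.SubElement(parent, "pbcoreAudienceLevel")
        ET.SubElement(wrapper, "audienceLevel").text = level
    return levels

record = ET.Element("record")  # placeholder root, not the real PBCore root
levels = add_audience_levels(
    record, "Intermediate (grades 7-9)|High School (grades 10-12)"
)
# levels == ["Intermediate (grades 7-9)", "High School (grades 10-12)"]
```

An empty or blank input produces no elements at all, which also avoids the null-element problem described earlier.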

Towards solutions

I think that, as people become more familiar with PBCore and more commonly exchange PBCore data, we will start to see fewer errors like these. The best solution to these sorts of problems may simply be a robust PBCore environment where data exchange happens often enough that users notice failures.

Another useful step will be the creation of some sort of PBCore validator along the lines of those that the W3C makes available for HTML, CSS, and RSS feeds. It’s possible that this could come out of the Rails database tool I’m building; watch this space for news if it does.
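To give a feel for what even a very small checker could catch, here is a rough sketch in Python that looks for three of the problems described in this post: XML that is not well-formed, empty elements, and pipe-delimited values. It is not a real validator and does not check records against the PBCore schema; the function name lint_pbcore and its messages are invented for this example, and it ignores XML namespaces for simplicity.

```python
import xml.etree.ElementTree as ET

def lint_pbcore(xml_text):
    """Return a list of human-readable warnings for one record.

    lint_pbcore is a hypothetical sketch, not a schema validator.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]
    problems = []
    for elem in root.iter():
        has_text = bool(elem.text and elem.text.strip())
        if len(elem) == 0 and not has_text:
            # No child elements and no text: a "null element".
            problems.append(f"empty element: <{elem.tag}>")
        elif has_text and "|" in elem.text:
            problems.append(f"possible concatenated values in <{elem.tag}>")
    return problems

lint_pbcore("<record><pbcoreContributor/></record>")
# ["empty element: <pbcoreContributor>"]
```

A check like the pipe test will produce false positives for values that legitimately contain that character, so its output is best treated as a prompt for a human to look, not as a verdict.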

Posted on Saturday the 13th of December 2008 at 11:37 PM
