The term metadata literally means ‘data about data’. Metadata provide additional information about a certain file, such as its author, creation date, possible copyright restrictions or the application used to create the file. This page describes how metadata can be used in PDF files. The content is geared towards the graphic arts industry but may be practical for other types of PDF usage as well.
How to view the metadata in a PDF file
To view metadata in a PDF document, open it with Adobe Reader or Adobe Acrobat and select ‘Properties’ in the File menu. The screen capture below shows the Additional Metadata window in Adobe Acrobat DC. Adobe Reader does not have this additional window.
Applications geared towards managing libraries of data can show metadata. Adobe Bridge, for example, allows you to browse through folders containing PDF files and check basic metadata such as the author, description, and copyright of PDF files. Theoretically, operating systems should also be able to do this, but while an operating system like Windows 7 is great at showing picture-related metadata (such as the resolution, bit depth, keywords, etc.) or music-related metadata (such as the artist, album, and genre), it fails to do so for PDF files.
Professional content management systems can not just display metadata but also allow for extensive searches based on the keywords or description field.
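Outside these applications, document-level metadata can sometimes be inspected with a few lines of code. The Python sketch below is only an illustration, not a full PDF parser: it assumes the file carries an XMP packet (the XML-based metadata format described further down this page) that is stored uncompressed and unencrypted, and it simply scans the raw bytes for it. The function names `find_xmp_packets` and `dublin_core_values` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# XMP packets are wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>
# processing instructions so that tools can locate them by scanning raw bytes.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

DC = "http://purl.org/dc/elements/1.1/"              # Dublin Core namespace
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # RDF namespace


def find_xmp_packets(data: bytes) -> list[bytes]:
    """Return the XML body of every uncompressed XMP packet found in data."""
    return [match.group(1).strip() for match in PACKET_RE.finditer(data)]


def dublin_core_values(packet: bytes, field: str) -> list[str]:
    """Collect the rdf:li values of one Dublin Core field, e.g. 'creator'."""
    root = ET.fromstring(packet)
    values = []
    for element in root.iter(f"{{{DC}}}{field}"):
        values.extend(li.text or "" for li in element.iter(f"{{{RDF}}}li"))
    return values
```

With a real file, something like `find_xmp_packets(open('document.pdf', 'rb').read())` would return the packets found (the filename is hypothetical). An empty list does not prove that the file has no metadata: the packet may be compressed, or the data may only exist in the Info dictionary.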
How to add or edit metadata
Many content creation applications, such as Microsoft Word, Adobe InDesign or Adobe Photoshop, allow users to define metadata for their files. In InDesign, for instance, you can use the ‘File Info’ menu option to define metadata such as the document title, its description, the author, keywords and copyright-related information. Such information is embedded in PDF metadata fields when the layout is exported to PDF.
PDF editing tools, such as Adobe Acrobat Professional, allow you to add metadata or edit them. For very specific types of metadata, a plug-in might be available to facilitate data entry or provide users with clear guidelines and choices for entering data. Tools like ExifTool allow you to extract or embed the metadata.
How to remove metadata
Metadata add value to a file but there may be circumstances where you want to remove them. This is sometimes a requirement for legal reasons or done because of security or privacy concerns.
- If you have Adobe Reader select File > Properties which brings up the Document Properties window. This shows the most important metadata fields which you can delete by hand.
- To remove metadata in individual files, you can also use the PDF Optimizer option in Adobe Acrobat. In Acrobat 9 Professional select Advanced > PDF Optimizer. In the window that pops up, select the Discard User Data option to the left and enable the Discard document information and metadata checkbox to the right. If you need to clean dozens or hundreds of files, you can do so using the batch function of Acrobat Professional: select Advanced > Document Processing > Batch Processing. Click on New Sequence and name the new sequence (don’t worry about the Select sequence of commands box, just click on the Output Options button at the bottom). In the Output options window, activate the PDF Optimizer option and click on Settings, then edit the Optimizer settings as desired and name the settings. When you are back at the Batch Sequences window, run the sequence you just created, choose your files and let Acrobat do its thing.
- If you have the Enfocus Pitstop plug-in for Acrobat, it includes an action for removing metadata. The Callas pdfAutoOptimizer tool has a similar function.
- There are command line tools to batch clean PDF files as well as companies that offer this type of service for a fee. Google is your friend.
How metadata is stored in PDF files
There are several mechanisms available within PDF files to add metadata:
- The Info Dictionary (or info dict) has been included in PDF since version 1.0. It contains general information about a PDF file using a set of document info entries, simple pairs of data that consist of a key and a matching value. From PDF 1.1 onwards, these are the eight default keys that can optionally be filled in:
  - Author – who created the document
  - CreationDate – the date and time when the document was originally created
  - Creator – the originating application or library
  - Producer – the product that created the PDF. In the early days of PDF people would use a Creator application like Microsoft Word to write a document, print it to a PostScript file and then the Producer would be Acrobat Distiller, the application that converted the PostScript file to a PDF. Nowadays Creator and Producer are often the same or one field is left blank.
  - Subject – what the document is about
  - Title – the title of the document
  - Keywords – keywords can be comma separated
  - ModDate – the latest modification date and time
The values must be text; no other types of data are allowed. Applications can add their own sets of data to the info dictionary.
- Since PDF 1.4 (2001) a second and more elaborate mechanism, called metadata streams, is available to embed metadata in PDF files. A metadata stream can be associated with the overall document or it can apply to a single object within the file, such as a font or image. Its structure is based on XMP (Extensible Metadata Platform), a technology Adobe developed for embedding metadata into files. XMP can also be used in other file formats, such as JPG or SVG, and is an ISO standard (ISO 16684-1). Like the info dictionary, an XMP packet can contain a simple list of name-value pairs. The data can, however, also be nested and a namespace can be used to standardize its structure. Since the XMP data is embedded in a stream, it can be compressed to reduce the file size.
- Additional ways of embedding metadata are the PieceInfo Dictionary (used by Illustrator and Photoshop for application-specific data when you save a file as a PDF), Object Data (or User Properties) and Measurement Properties. Adobe Acrobat allows you to name or label pages with a meaningful description. Such page labels are metadata on the page level, since they can indicate which pages belong to the sports section of a magazine or are part of the index of a book.
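To make the difference between the first two mechanisms more concrete, here is a rough Python sketch that writes the same few fields both ways: once as Info dictionary entries in PDF syntax and once as a minimal XMP packet using Dublin Core properties. It only produces text fragments; it does not write a valid PDF file. The helper names (`pdf_date`, `pdf_text_string`, `info_dictionary`, `xmp_packet`) are invented for this example, and the choice of `dc:title`, `dc:creator` and `dc:subject` is an assumption about which XMP properties to fill.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def pdf_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as a PDF date, D:YYYYMMDDHHmmSS+HH'mm'."""
    offset = moment.utcoffset()
    if offset is None:
        raise ValueError("an explicit time zone is needed")
    minutes = int(offset.total_seconds()) // 60
    sign = "+" if minutes >= 0 else "-"
    hours, mins = divmod(abs(minutes), 60)
    return f"D:{moment:%Y%m%d%H%M%S}{sign}{hours:02d}'{mins:02d}'"


def pdf_text_string(value: str) -> str:
    """Serialize text as a PDF literal string, or as UTF-16BE hex if not plain ASCII."""
    if all(32 <= ord(ch) < 127 for ch in value):
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return f"({escaped})"
    # Other text is written as a hex string: byte order mark FEFF, then UTF-16BE.
    return "<FEFF" + value.encode("utf-16-be").hex().upper() + ">"


def info_dictionary(fields: dict[str, str]) -> str:
    """Render key/value pairs in PDF dictionary syntax (keys assumed to be simple names)."""
    entries = " ".join(f"/{key} {pdf_text_string(val)}" for key, val in fields.items())
    return f"<< {entries} >>"


def xmp_packet(title: str, authors: list[str], keywords: list[str]) -> str:
    """Build a minimal XMP packet carrying three Dublin Core properties."""
    def q(prefix: str, name: str) -> str:
        return f"{{{NS[prefix]}}}{name}"

    root = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(root, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative; "x-default" marks the fallback text.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list (Seq), dc:subject an unordered one (Bag).
    for prop, container, items in (("creator", "Seq", authors),
                                   ("subject", "Bag", keywords)):
        holder = ET.SubElement(ET.SubElement(desc, q("dc", prop)), q("rdf", container))
        for item in items:
            ET.SubElement(holder, q("rdf", "li")).text = item

    body = ET.tostring(root, encoding="unicode")
    return ('<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
            f"{body}\n"
            '<?xpacket end="w"?>')
```

In the XMP packet every author is a separate list item, so a name such as ‘Doe, Jane’ needs no separator character, whereas the Info dictionary only offers a single text value for Author.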
PDF metadata standards
There are a number of standards for enriching PDF files with metadata. Below is a short summary:
- There are PDF substandards such as PDF/X and PDF/A that require the use of specific metadata. In a PDF/X-1a file, for example, there has to be a metadata field that describes whether the PDF file has been trapped or not.
- The GWG ad ticket provides a standardized way to include advertisement metadata into a PDF file using XMP.
- Certified PDF is a proprietary mechanism for embedding metadata about preflighting – whether a PDF file intended to be printed by a commercial printer or newspaper has been properly checked for the presence of all fonts, images with a sufficient resolution, etc.
- The GWG processing steps specification is fairly new and meant to standardize the way production information for the printing industry can be embedded in PDF files. This is done using both additional objects and metadata. By standardizing the way information about die cutting, embossing, varnishing, etc. is included in a PDF, it will become easier for brands, design agencies, converters and printers to collaborate and automate production.
- The German ZUGFeRD standard for electronic invoices is an interesting twist on embedding additional data. A PDF invoice that is ZUGFeRD-compliant includes limited metadata in the XMP document metadata (e.g. the document type must be set to ‘INVOICE’) while the actual invoicing data are contained in an XML file that is embedded within the PDF.
The filename is metadata as well
The easiest way to add information about a PDF to the file is by giving it a proper filename. A name like ‘SmartGuide_12_p057-096_v3.pdf’ tells a recipient much more about what the file is about than ‘pages_part2_nextupdate.pdf’ does.
- Add the name of the publication and possibly the edition to the filename.
- Add a revision number (e.g. ‘v3’) if there will be multiple updates of a file.
- If a file contains part of the pages of a publication add at least the initial folio to the filename. That allows people to easily sort files in the right order. Use 2 or 3 digits for the page number (e.g. ‘009’ instead of just ‘9’).
- Do not use characters that are not supported in other operating systems or that have a special meaning in some applications: * < > [ ] = + ” \ / , . : ; ? % # $ | & •. Do not use a space as the first or last character of the filename.
- Don’t make the filename too long. Once you go beyond 50 characters or so people may not notice the full information or the filename may get clipped in browser windows or applications.
- Many prepress workflow systems can automatically insert files into a job based on a specific naming convention. This speeds up the processing of the job and can avoid costly mistakes. Consult with your printer – they may have guidelines for submitting files.
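The naming guidelines above are easy to check automatically. The Python sketch below is one possible way to do that: `build_filename` assembles a name in the style of the ‘SmartGuide’ example and `filename_problems` lists violations. Both function names, the 50-character constant and the exact name pattern are assumptions for illustration; a printer’s own convention would take precedence.

```python
# Characters the guidelines advise against, plus the straight double quote.
FORBIDDEN = set('*<>[]=+"”\\/,.:;?%#$|&•')
MAX_LENGTH = 50  # rough limit from the guidelines, not a hard rule


def build_filename(publication: str, edition: int,
                   first_page: int, last_page: int, version: int) -> str:
    """Assemble a name such as 'SmartGuide_12_p057-096_v3.pdf'."""
    # Page numbers are zero-padded to 3 digits so that files sort correctly.
    return f"{publication}_{edition}_p{first_page:03d}-{last_page:03d}_v{version}.pdf"


def filename_problems(filename: str) -> list[str]:
    """Return a list of guideline violations; an empty list means no issues found."""
    problems = []
    # Only the part before '.pdf' is checked, since the extension needs its dot.
    stem = filename[:-4] if filename.lower().endswith(".pdf") else filename
    bad = sorted(set(stem) & FORBIDDEN)
    if bad:
        problems.append("avoid these characters: " + " ".join(bad))
    if stem != stem.strip(" "):
        problems.append("no space at the start or end")
    if len(filename) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    return problems
```

A workflow script could call `filename_problems` on every incoming file and reject or flag the ones that return a non-empty list.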
Other sources of information
Chapter 12 of Leonard Rosenthol’s book ‘Developing with PDF‘ is dedicated to metadata in PDF files. More information on XMP can be found on Wikipedia and the Adobe site. The B4print forum has a pretty good thread about removing metadata from which I picked up some useful information.
3 November 2017
11 Comments »
I work as a librarian and I'm responsible for the thesis department of my university library. Theses are deposited and archived as PDF files. Some of them are accessible online, some are not.
I was advised to systematically fill the "Title" and "Author" metadata for every PDF deposited in our electronic repository.
My question has to do with the "Author" metadata. Is there a recommended structure for it ?
As it is I can think of many different alternatives :
As a librarian, my intuition was to choose the Lastname, Firstname format (used in DC author/creator element, for example), but I realized that whenever I saved the PDF file and then displayed the metadata part (via the Alt D shortcut), it would show "Lastname, Firstname" instead of Lastname, Firstname (for an unknown reason Adobe Pro adds " "). After some tests I came to the conclusion that the comma was the source of the problem. So I changed my mind and adopted the Lastname Firstname format but I am not sure that this is the best option.
So if I may ask you a couple of questions, here they are :
- What format do you recommend for the Author metadata (if any) ?
- Why is a comma generating " " ?
- Is there a specific separator character that I must use when I need to fill two different names (if a given thesis has been written by two different persons) ? (this question also applies to the "Keywords" metadata)
Thanks for your patience and for your time,