Saturday, July 17, 2010

Organizing a Digital Library

A Method and System for Organizing and Storing Digital Books on a Computer Hard Drive Using the Library of Congress Classification System
or How to Find Your Books.

As more and more researchers are obtaining and storing more and more of their documents in digital form on their computers, the question arises as to how to organize all these documents. This post addresses the issue of storing digital books on a standard computer hard drive. In devising this system I had several goals:

  • Documents should be stored as files in a regular file system instead of in a proprietary database. This is so links and shortcuts can be created to point to the documents from just about any software that allows linking to files.
  • Consistent/persistent location so links to and between files don't get broken.
  • Easy to follow rules for naming folders and files.
  • Relatively easy to find what you have without using a database program.
  • File name and folder structure based on Library of Congress (LOC) classification system so no one has to make up their own classification system.
  • The path to the file should read almost exactly as if you are reading the actual call number to make it quick and easy to find your books.

I am intentionally avoiding technical "Library Science" terminology as much as possible within this document so it will be accessible to regular people who may not be familiar with Library of Congress terms. Besides, I am not sure I have all of them exactly correct anyway. I am still learning myself.

(If you want to skip right to the core of the system read the File Naming Scheme and Folder Naming Scheme sections.)

LOC Basics

There are a few things about the Library of Congress (LOC) classification system that it will be helpful to know in order to better understand this system. If you are already somewhat familiar with LOC call numbers you can skip down to the part about cutters, which most people are not familiar with.

The coded number you see on the spine of library books is called the "Call Number." The Library of Congress classification system is the coding system used to organize books in almost all university libraries in the United States. Most public and K-12 school libraries in the U.S. use the Dewey Decimal System. If you have never used the LOC system then you can learn more about the call numbers at one of these web sites:

These sites have pictures and diagrams which I will not duplicate here. A call number is divided up into multiple different parts. Many of the above tutorials explain the call number as a sequence of "lines" but call numbers are not always listed as they appear on the spine of a book. They are often listed all on one line. You have to learn to recognize where one "part" ends and another begins.

The first part of a call number consists of one or two letters. Yes, it is often on the same line as a set of numbers, with no space between them, but it is still considered a separate part. These one or two letters indicate the first one or two levels in the hierarchical tree of the classification system. The first letter in the call number divides the entire classification system into 21 major categories (often called "Classes"). You can see those classes in the Library of Congress Classification Outline here: http://www.loc.gov/catdir/cpso/lcco/. These classes can be thought of as the first set of branches in the hierarchical tree of the system. I will refer to them as the "first level" of the tree.

If there is only one letter, that means the book is either a general overview of that primary class or it contains so many of the different subclasses that it was impossible to classify it as any one of them in particular. If there is no second letter then the first numeric part (discussed below) will represent the second level of the hierarchical tree. If there is a second letter then that letter subdivides the primary class into different subclasses. There could be anywhere from just a few to almost 26 different subclasses. (They skip some letters to avoid confusion.) This second letter, if it exists, represents the second level of the hierarchical tree. To see this, go to the above link and click on any of the listed classes.

The second part of the call number consists of all the numbers up to the end of the line (on the spine), a period, or a space, whichever comes first. This second part even further subdivides the sub-classes into perhaps thousands of different sub-sub-classes. (Don't worry, I am not going to make you start keeping track of the number of "sub-sub-sub-subs.") This second part can consist of from one to four digits. The important thing to note is that this is the only part that is sorted as if it were a whole number. This is discussed further when I describe the actual system for creating the folder structure.

Now things start to get a little complicated due to the fact that all the parts after this are optional and some of the parts are divided from the other parts in different ways. I don't know exactly why they designed it this way. I could certainly have thought of a more consistent system but I wasn't there. Fortunately, we aren't actually classifying the books. All we have to do is read the number. For the most part, any time you see a period, a space, or an end of line from this point in the call number forward, that starts a new "part" of the call number. There are exceptions but you will see they aren't relevant to this system.

The next (optional) part consists of just a set of numbers. If it exists, it will have a period between it and the first set of numbers. It is usually listed right on the same line as the first two parts (the one or two letters and up to four digits).  This part is sorted as if it were the decimal part of the number right before it and even further subdivides whatever narrow category is indicated by the first number. If you thought 21 primary classes times an average of 12 subclasses, times perhaps 9999 more subdivisions - or 2,519,748 categories - were enough, you would be wrong. There are a heck of a lot of books out there.

In fact, there is yet another possible level of further sub-divisions. If it exists, it will consist of a single letter followed by one or more digits. It will likely have a period before it and a space or another period after it. This type of letter-number combination is called a "cutter" after the librarian who invented them way back when they were working all this stuff out. There are two tricky things to this part:

  1. Unlike the first two parts, which were letters followed directly by numbers, this letter-number combination is just one part. The numbers are inseparable from the letter.
  2. There may be from zero to two parts that look like this. If there is only one letter-number combination (or "cutter") then that cutter is used to sort all the different books about a very specific topic by author. I call it the "author cutter" but I don't know if librarians use that term. Since we are not going to put each different book by each different author in a separate sub-folder all by itself, we are only concerned if there are two of these cutters. If there are two, then the first of the two is the one we use in this system.

Any codes in the call number after this point are used to indicate edition numbers, year of publication, and other things to differentiate different versions of essentially the same book. We do not need to be concerned with these when we are organizing the folder structure to store our files.
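
The reading rules above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not an official LOC parser: it expects the call number written on one line with capital letters, it follows this post's description of one or two class letters and up to four digits, and it lumps anything after the cutters together as "extras". The function name split_call_number and the sample call number are made up for the example.

```python
import re

def split_call_number(call_number):
    """Split a one-line call number into the parts described in this post."""
    text = call_number.strip()
    # One or two letters, one to four digits, then an optional ".digits" part.
    match = re.match(r"([A-Z]{1,2})(\d{1,4})(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"does not look like a call number: {call_number!r}")
    letters, whole_number, decimal_part = match.groups()

    # Everything after that is separated by periods or spaces.
    tokens = [t for t in re.split(r"[.\s]+", text[match.end():]) if t]

    # Up to two leading tokens shaped like "letter + digits" are cutters.
    cutters = []
    while tokens and len(cutters) < 2 and re.fullmatch(r"[A-Z]\d+", tokens[0]):
        cutters.append(tokens.pop(0))

    return {
        "letters": letters,
        "whole_number": whole_number,
        "decimal_part": decimal_part,  # None when there is no decimal part
        # Only when there are two cutters is the first one a subject level.
        "subject_cutter": cutters[0] if len(cutters) == 2 else None,
        "author_cutter": cutters[-1] if cutters else None,
        "extras": tokens,  # year, edition, and so on
    }

parts = split_call_number("QA76.76.O63 S55 2005")
print(parts)
```

With a call number that has only one cutter, such as the hypothetical "Z699.S55 1990", the sketch reports no subject cutter and treats S55 as the author cutter, which matches the rule in item 2 above.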

I know, I know. The whole Library of Congress call number system seems very odd and complicated. It is almost as if the guys who invented this stuff were intentionally trying to make it difficult to use. I can only assume they meant well. But let me assure you, most librarians are really nice. And, as you will see below, it is relatively easy to use this bizarre numbering system to create file names and a folder structure in which to store all your books.

File Naming:

Name the file before creating the folder for it. This makes it easier to see where to place the file without needing to write down the call number.

File Naming Limitations:

  • Maximum total filename length is 64 characters including the extension and the period (full-stop, dot) before it. This is a limitation of the popular Joliet CD file system. Longer file names will not burn to many CDs.
  • On some computers the extension is not shown. However, those characters still count towards the total. It is best to set your operating system to always show you the file extensions.  Look in your computer's help files for instructions on how to do it for your particular operating system.
  • Limit the characters used to any of the following:
    • ${}^[]`=,;`abcdefghijklmnopqrstuvwxyz._-0123456789
    • Personally, I just use abcdefghijklmnopqrstuvwxyz._-0123456789 and spaces, leaving out all the other punctuation.
  • Make sure there is not a space as the last character.
  • See http://en.wikipedia.org/wiki/Filename  for more information.
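
These limits are easy to check automatically. The following Python sketch tests a candidate name against the 64-character limit, the shorter character list, and the trailing-space rule. The function name filename_problems and the sample names are invented for illustration, and the sketch assumes that letters are acceptable in either upper or lower case, since call numbers are normally written in capitals.

```python
import string

MAX_LENGTH = 64  # Joliet limit quoted above; extension and period included

# Assumption: letters in either case, digits, period, underscore, dash, space.
ALLOWED = set(string.ascii_letters + string.digits + "._- ")

def filename_problems(filename):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if len(filename) > MAX_LENGTH:
        problems.append(f"{len(filename)} characters, limit is {MAX_LENGTH}")
    if filename.endswith(" "):
        problems.append("ends with a space")
    bad = sorted(set(filename) - ALLOWED)
    if bad:
        problems.append("characters outside the short list: " + " ".join(bad))
    return problems

good = "QA76.76.O63.S55 2005 - operating systems - smith.pdf"
bad = "QA76.76.O63.S55 2005 - operating systems: an intro.pdf "
print(filename_problems(good))
print(filename_problems(bad))
```

The check only reports problems; it does not rename anything, so the final decision about the name stays with you.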

Many might suggest avoiding spaces altogether. However, all modern operating systems can handle spaces in filenames just fine. Any software that cannot handle spaces should just be fixed, as many people use spaces quite often. Besides, replacing all the spaces with underscores is a real pain and it makes the filename harder to read.

Because most of the information you need for creating the file name is within the .PDF file itself and it is impossible to rename the file while it is open, I usually build up the name in a temporary text file by copying and pasting from the .PDF file. Then I close the .PDF file and rename it by cutting and pasting from the text file to the file-rename dialog.

File Naming Scheme:

  • The first part of the name is the call number. Go ahead and use the whole call number in the file name even though we won't use it for the folder structure. That will ensure that your books are sorted the same way they would be in the library.
  • The first set of letters in the call number (one or two) along with the first set of numbers should be written all as one unit with no spaces or punctuation between them and then followed by a period.
  • All subsequent parts should be separated by periods except for the year (if it exists) which should have a space before it.
    • I know the MARC standards do not call for periods between every part. Spaces or end-of-lines are sometimes used. However, it is impossible to tell what is required in any specific instance without access to the official LOC schedules, so it is easiest to remain consistent by simply putting a period between each of these hierarchical levels.
    • There is a space before the year rather than a period simply because there is always a space before the year so it is easy to be consistent with this one.
  • Separate the call number from the rest of the file name with a dash (hyphen).
  • After the call number, use as much of the title and author as you can to fill in the rest of the 64 characters, not forgetting to include the extension.
  • Replace any colons in the name with dashes.
  • Make sure you choose the name thoughtfully because you should never change it after this.
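
As a rough sketch, the scheme above could be automated like this in Python. The function name build_file_name, its parameters, and the sample book are hypothetical. The sketch also assumes a space on each side of the dash that separates the call number from the title, and the same between title and author, which the rules above do not spell out.

```python
MAX_LENGTH = 64  # total length, including the extension and its period

def build_file_name(class_and_number, other_parts, year, title, author,
                    extension=".pdf"):
    """Assemble a file name from call number parts, title, and author."""
    # "QA76" plus ["76", "O63", "S55"] becomes "QA76.76.O63.S55"
    call_number = ".".join([class_and_number, *other_parts])
    if year:
        call_number += " " + year  # a space, not a period, before the year

    separator = " - "  # assumption: spaces around the dash
    description = f"{title}{separator}{author}".replace(":", "-")

    # Use whatever room is left after the call number and the extension.
    room = MAX_LENGTH - len(call_number) - len(separator) - len(extension)
    if room < 1:
        raise ValueError("call number leaves no room for a title")
    description = description[:room].rstrip()  # no space before the extension

    return f"{call_number}{separator}{description}{extension}"

name = build_file_name("QA76", ["76", "O63", "S55"], "2005",
                       "Operating Systems: A Primer", "Smith")
print(name)  # QA76.76.O63.S55 2005 - Operating Systems- A Primer - Smith.pdf
```

A long title is simply cut off at the limit, so the author may be dropped entirely. If the author matters more to you than the end of the title, shorten the title by hand before building the name.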

Folder Naming

Folder Notes:

Create a single "library" folder within which you will create all the subsequent folders discussed below. I just called mine "Library" and placed it in the root of one of my hard drives. But you can put yours anywhere. Within the "Library" folder create another folder called "Books". This is to separate your books from the periodical articles you will be storing in the "Periodical" folder, which will be discussed in a separate post. The folders in "Library/Books" are named so that they match the subject hierarchy of the LOC call number system. Only use the part of the call number that pertains to the classification of the book's subject when creating this folder structure. Therefore, do not use the last (or only) cutter (the author cutter) or anything after it, as that part only pertains to differentiating between different books written about the same subject.

Each major level in the subject hierarchy of the call number is placed in its own level in the file-system hierarchy. A "level" is a set of folders (directories) all within the same "parent" folder. "The next level" or "another level" is equivalent to the sub-folders which reside in the current folder. Saying a part of a call number is "a level" means that a folder should be created with its name the same as that part of the call number.

Essentially, if you take the call number part of the file name you just created, put a slash after the first letter, another slash after the second letter (if it exists), replace the periods with slashes, but skip the last cutter and everything after it, you will get the required path. Do not include any periods as part of the folder name. Folders should only be created as they are needed to correctly place a file. It is not necessary to create an entire folder structure to match the complete Library of Congress classification system. This way, if a folder exists on your hard drive, you will know there is something in it. There will be no empty folders.

Special notes about the first numeric part of the call number:

The Library of Congress classification system actually incorporates a great deal of additional hierarchy within this one single number. One range of numbers can indicate one sub-topic while another range of numbers can represent a different sub-topic. And even one of those ranges can be further subdivided to indicate finer and finer levels of sub-topics. To see an example, go to http://www.loc.gov/catdir/cpso/lcco/, click on any of the classes, then click on any of the subclasses. Unfortunately, the way those ranges are assigned is completely different for each different subject and the only way to know how they are subdivided is to purchase an extremely large set of books called "Schedules" which are very expensive. When I first started designing this system I tried to follow this hierarchy by creating sub-folders named for the beginning of each range. This created two problems:

  • Every time I wanted to create a new set of sub-folders I had to go to the library and ask to look at one of their "Schedule" books. Not the most convenient proposition. Other than a very cursory outline of the first few levels of the hierarchy, there is no information about the LOC schedules available online for free.
  • I ended up with a long string of sub-folders like this "Q\A\71\75\76\75" for a book about computer software with a call number of QA76.75… due to the way the LOC organized the hierarchy within the QA subclass. As you can see the path name does not match the call number and it is difficult to tell which part of the path is which without looking at the call number itself.

Therefore, I decided to simply use that entire number (after the first letters and before the first period) as a folder name as in "Q\A\76\75".

Yes, this does consolidate all of the hierarchy information contained in that number into a single level of the folder hierarchy and then start the subdividing of the subjects again after that first period. Yes, this also throws away some information about the LOC classification structure while increasing the number of folders within the third level of the folder hierarchy (second level if there is only one letter in the first part). However, it also limits the number of folder levels in the file system to no more than five and makes the folder structure much more predictable. Only a librarian who has also memorized the entire Library of Congress classification schedules will ever notice. As far as regular people are concerned, all the books of the same subject will still wind up in the same folder, and that is all that really matters.

A special note about sorting:

One of the most frustrating things for people first learning about LOC call numbers is that part of the number is sorted one way and the rest of the number is sorted another way. The first numeric part is sorted as if it is a whole number. In other words 7 is sorted before 101 because 7 is less than 101. The entire number is considered as a whole. This may make sense to many people but computers have traditionally sorted filenames one character at a time. 101.pdf would come before 7.pdf because 1 is less than 7 and the computer only considers one character at a time. It is just like sorting words alphabetically; you only look at one character at a time until you have gone far enough to establish which word comes first.

Unfortunately, many people expected computers to sort all numbers in file names as if they were whole numbers so many modern operating systems added the ability to sort files that way and some of them have now even set that as the default sorting method. Well, that just makes a mess of the rest of the LOC call number in our file and folder names because all the other numeric parts of the call number are supposed to be sorted alphabetically. Naturally, there is no way to tell the operating system to sort part of the file name one way and another part another way. So we have to use a little trick: We tell the operating system to sort the files alphabetically and then add leading zeros to the folder name for the first numeric part of the folder name. This causes it to be sorted properly even when sorting alphabetically.

For instructions on how to set Windows XP and up to always sort alphabetically see http://support.microsoft.com/kb/319827. Mac and Linux users will have to look this up on their own. If someone informs me of a good set of instructions, I will provide links to them.

Because some areas of the LOC classification system use a maximum of three digits for this first numeric part and some areas use four digits, and it is impossible to tell which without consulting those expensive "Schedule" books, it is best to just always use leading zeros to make the folder name used for this part of the call number four characters long. So in my example above "Q\A\76\75" would become "Q\A\0076\75". There is no need to add leading zeros to any of the other levels of the folder system. They are supposed to be sorted alphabetically anyway. There is also no need to add the leading zeros to the file name because the only files that will be in the folder with them are others with that part of the file name exactly the same anyway.
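
A short Python illustration of the sorting problem and the leading-zero trick follows. The folder names are arbitrary sample values, and Python's built-in sorted() stands in for an operating system that sorts names one character at a time.

```python
first_numbers = ["7", "101", "76", "1001"]

# Plain character-by-character sorting puts these in the wrong order.
print(sorted(first_numbers))       # ['1001', '101', '7', '76']

# Padding to four characters makes alphabetical order match numeric order.
padded = [n.zfill(4) for n in first_numbers]
print(sorted(padded))              # ['0007', '0076', '0101', '1001']

# The part after the first period is read like a decimal fraction, so
# alphabetical order is already correct there: .125 < .75 < .8
decimal_parts = ["8", "75", "125"]
print(sorted(decimal_parts))       # ['125', '75', '8']
```

The last example shows why only the first numeric part needs padding: the other numeric parts already come out in the right order when sorted alphabetically.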

Folder Naming Scheme:

  • The first level is the first letter of the call number. Create this folder in the "Library\Books" folder if it does not already exist. Yes, it will be a folder with a name consisting of only one letter.
  • The second level is the second letter of the call number, if it exists. If not, then everything below gets shifted up a level. There is usually a second letter.
  • The next level is the first numerical part of the call number (before the first period, space, or end of line). Use leading zeros to make this four characters long.
  • If there is a number-only part after the first period then that is another level.
  • If and only if there are two parts composed of a letter followed by numbers (cutters), then the first of those is yet another level.
  • The last cutter (the only one if there is only one) and everything after it is ignored as far as the folder names are concerned. Those parts are only used to differentiate between different authors and/or versions of books within a very, very specific subject.
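
The folder rules above can be expressed as a small Python function. This is a sketch under assumptions rather than a finished tool: the name folder_for is invented, the call number is expected in the period-separated form used for the file names (capital letters, a space before the year), and pathlib.PureWindowsPath is used only so the result prints with backslashes like the examples in this post. It builds the path but does not create any folders.

```python
import re
from pathlib import PureWindowsPath

def folder_for(call_number, root=PureWindowsPath("Library", "Books")):
    """Return the folder in which a book with this call number belongs."""
    text = call_number.strip()
    match = re.match(r"([A-Z]{1,2})(\d{1,4})(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"does not look like a call number: {call_number!r}")
    letters, whole_number, decimal_part = match.groups()

    # One folder per letter, then the first number padded to four characters.
    levels = list(letters) + [whole_number.zfill(4)]
    if decimal_part is not None:
        levels.append(decimal_part)

    # Collect up to two cutters from what follows.
    tokens = [t for t in re.split(r"[.\s]+", text[match.end():]) if t]
    cutters = []
    for token in tokens:
        if len(cutters) < 2 and re.fullmatch(r"[A-Z]\d+", token):
            cutters.append(token)
        else:
            break

    # Only when there are two cutters does the first become a folder level.
    if len(cutters) == 2:
        levels.append(cutters[0])
    return root.joinpath(*levels)

print(folder_for("QA76.75.S55 2005"))      # Library\Books\Q\A\0076\75
print(folder_for("QA76.76.O63.S55 2005"))  # Library\Books\Q\A\0076\76\O63
print(folder_for("Z699.S55 1990"))         # Library\Books\Z\0699
```

The three sample call numbers are hypothetical. They cover a single cutter, two cutters, and a class with only one letter, where everything shifts up a level.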

Using this system:

  • Once you have the files named properly and stored in their respective folders it is relatively easy to use this system, if you use the resources you have available on your computer.
  • Do not try to memorize the entire LOC call number system. The sections you use the most will become familiar as you go. An added benefit will be that you will now know exactly where to go in the real library to find books in your interests.
  • Make use of the search features built into almost all modern operating systems. Set the search software to index the "Library" folder. When looking specifically for a book on your computer start the search at the "Library" folder. If you know the first parts of the LOC call number for the book you are looking for then you can start your search there.
  • You can now create links and shortcuts to these files from within various documents or note taking systems on your computer.
  • I am not totally familiar with cloud-based storage but you should be able to put your whole library folder on some cloud-based system so you could access it from anywhere. (Remember, though, sharing that library with others may be a major copyright violation depending on the licenses of the books you have in your digital library.)
  • Once you have placed a file in its appropriate folder, do not move it. Any links or shortcuts you have created to that file will be broken.
  • Do not think you can make things a little bit simpler by consolidating all the books in one major subject area into one parent folder rather than in their appropriate separate sub-sub-sub-folders. You will regret it in the end. It may look strange to have only one file buried four or five folders deep, but go ahead and do that. Remember, books are like rabbits, they keep multiplying. You will soon have so many books in that one general folder that you will feel the need to subdivide them by putting them where they should have been in the first place. But then you will break all the links and shortcuts to those files. Don't say I didn't warn you.

Working on Multiple Computers:

Many people have more than one computer: a desktop for working at home and a laptop for, well, everywhere else. If you do, then you probably want to be able to do much the same work in much the same way on both your desktop and laptop. I will discuss synchronizing notes and other documents between multiple computers in more detail in a separate post. However, there are a few things you should keep in mind as far as this system is concerned:

  • Your library folder should be located on the same path on all your computers. If the library - and the books inside it - are not in the same place on all your computers then links created on one computer will not work on another. For Windows users this means if that folder is at D:\Library on your desktop then it should be on D:\Library on your laptop. If you don't have a D: drive on your laptop then either create a separate partition to be the D: drive on your laptop or put your library folder on your C: drive on your desktop. (If you don't know how to create a partition then either look it up or get a technician to do it for you. There are dozens of tutorials available.) Mac and Linux users will have to look up the particular tricks necessary for their operating systems on their own.
  • You don't necessarily need to copy your entire library onto your laptop. You could just copy the books you work with most often and leave all the others on the desktop. Just make sure every book you copy is on the exact same path on both computers. Also remember, if you have a link or shortcut to a book in some document that you use on the laptop, but you don't copy the book itself to your laptop, then the link to that book will naturally not work. This should not be much of a problem as you will probably not often be following links to books you aren't actually working with at the time.
  • Some people like to make comments and do highlighting directly in their .PDF files if they have the appropriate software. (Yet another topic I will discuss in another post.) If you do this then there is the possibility of having one set of comments on one computer and a different set on a different computer. Remember, the book on your desktop is a completely separate file from the book on your laptop and the comments and highlighting are stored within the .PDF file. In order to avoid this problem you should use some kind of synchronization software to keep your books synchronized between all your computers. I will discuss synchronization much more thoroughly in another post but essentially, synchronization software keeps track of which files have been modified and copies the most recent versions over the older versions. This can be a problem in itself if you edit both versions before synchronizing. As most synchronizing software works by merely overwriting an old file with a newer one, any edits made to the older file are lost. This is the same problem you have if you try to keep any document in two different places and edit them separately. To avoid this, you will have to make sure you synchronize often, preferably every time you bring the computers together. Also make absolutely sure that if you modify the book on one computer, you synchronize before modifying that same book on another computer. There are currently many programs for synchronizing files between computers. I use SureSync from www.softwarepursuits.com. It has a bit of a learning curve but it is very powerful, flexible, and fast. As I said, I will discuss synchronization of files and data between computers much more thoroughly in a separate post.

Conclusion

This is just the first in a long series of posts I expect to make about all the tools and tricks I use to make doing academic research much, much faster and easier. This was perhaps not the most important topic to choose for my first post but it is the one I felt like writing about now. As I am not a professional blogger and I am not trying to attract readers who will look at advertisements, I get to write what I want when I want. I write these merely because I want to share what I have figured out with other people who may be able to use them to do better work or simply have more time to relax. Hopefully, this post will help you clean up and organize what may have become a large mixed up collection of book files. If you have any questions or have what you think is a better system, post a comment. I love to help people with this kind of stuff and I am always glad to see what other people are doing that may be better than my own systems.

Please Note: I have not considered how to handle any of the eReader books that one can download for eReader devices like the Kindle or Nook. I will look into these and post more information in another post later. One thing I do know is that one should not depend on the notes one can make on these devices. They appear to only be stored "in the cloud" and could go away at any time. In my opinion, no serious researcher should rely on a system that could go away or be made unavailable at any time.


The content of this post is Copyright © 2010 by Grant Sheridan Robertson. Anyone is allowed to use this system but you are not allowed to reprint or republish this document without my permission.
