Friday, February 27, 2015

Reference Requests to Knowledge Base?

I've come across dozens of binders containing reference requests and responses from the last 15 years or so. On one hand, this is great information to have: it helps me learn more about my new institution and prevents duplicate research if the answer has already been found. On the other hand, keeping it all in binders makes it unsearchable and unwieldy, and it takes up a lot of room.

At first, I thought, "There is no way that I will be able to use these efficiently and quickly," and I started to go through them with the intention of ditching them. But the more I saw of my predecessor's painstaking effort to organize these, the more I realized that maybe it's not a terrible idea to keep them somehow as a knowledge base. So, I turned to the Twitterverse.

I had some responses, but no real models of how it's been done before. I was leaning towards OCRing them and making them available for full-text search somewhere. One suggestion was to OCR them and convert to Excel, which would have been awesome, but the papers I had weren't in any kind of consistent format and often had handwritten annotations and notes.

I decided to look at a variety of software to serve as a storage pot for OCR'ed PDFs. I could batch scan very quickly and, with Adobe Pro, have everything scanned, OCR'ed, and extracted page by page in no time. My reasoning for extracting page by page was that some requests were 5 pages long and some were only 1, and putting the pages back together into the right requests would take too long. I might as well make every page a separate file and number them in the order they appeared chronologically in the binder.
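The page-per-file numbering could be sketched like this in Python. This is only an illustration, not what Adobe Pro does on export: the function name `page_filenames`, the `2004` binder label, and the `_p0001` naming pattern are all hypothetical. The one real idea is zero-padding the page number so that sorting by filename matches the order of the pages in the binder.

```python
def page_filenames(binder_label, page_count, width=4):
    """Build one PDF filename per scanned page, in binder order.

    binder_label and the naming pattern are hypothetical; the zero-padded
    page number keeps alphabetical order equal to binder order.
    """
    return [
        f"{binder_label}_p{page:0{width}d}.pdf"
        for page in range(1, page_count + 1)
    ]


# Example: the 30-page sample, labelled with a made-up binder year.
names = page_filenames("2004", 30)
print(names[0])   # first page in the binder
print(names[-1])  # last page in the binder
```

Without the padding, a file for page 10 would sort ahead of page 2 in most file browsers, which would scramble the chronological order.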

I scanned a 30-page sample and extracted each page as a separate file.  I then used Adobe Pro and OCR'ed the 30 page sample and left the separate pages non-OCR'ed to act as a control (and to see if any of the software would magically OCR for me...).

I tested out Evernote, Google Drive, Mendeley, Zotero, and just plain Windows Explorer.  I based my evaluations on the following criteria: Ease of import, Ease of export, Ease of full text search, Portability, Ease of collaboration, and General gut-feeling.  In the end, Evernote won by a hair, followed by Google Drive. I thought I would post some of my thoughts about each tool here.

Searching in Evernote Desktop for Mac

The searching interface is really great in Evernote. It searches instantly as you type and it highlights the search term in the full PDF previews. It is easy to import the PDFs and just as easy to export the PDFs in bulk (right click and choose "Save Attachments"). There is also a lot of room to grow if I want to expand this in the future. For example, I can move PDFs to different notebooks, add notes to the attachment, annotate the actual PDF file, add tags to help categorize by subject, etc. You have the option to search full text across all of your documents or just within a specific notebook if you'd like.

One inconvenience is that you can't add multiple files as multiple notes. Instead, if you do a select-all and drag to import them, it adds them as multiple attachments to one single note. It's not a huge deal, just a little unintuitive. Another inconvenience is that the web version of Evernote doesn't allow previewing of PDFs, nor will it export the files. The desktop version is always syncing, so you can at least access your files anywhere with the web version, but for my purposes, I'll most likely need to install the desktop version on the student worker computers and my computer in order to search the files robustly.

Google Drive
Google Drive is at its best when it comes to collaboration and multiple users. In addition, my institution is a Google Apps for Education campus, which means that we automatically have access to these tools. The storage capacity is ample and it's great that I can access my docs from anywhere (web, desktop, mobile, etc.). Import is as easy as drag-and-drop and export is as easy as "Download as .zip." Google Drive was the only tool that allowed you to "convert" (aka OCR) at the point of upload, so you could technically skip the Adobe Pro step. But it didn't really work well, as described below.

Google Drive doesn't allow for you to target a specific folder for full text search, resulting in bad results.
This may or may not be a shock to my fellow librarians and archivists, but Google's full-text search feature is not very good. There is no way to search the full text of files in just one specific folder, which means that my queries will come back with lots of irrelevant files. If only I could tell it to search only within my "Old Reference Requests > 2004" folder! The other major drawback is that if I want Google to OCR a file for me, it turns the file into some weird Google doc with an image of the original PDF on the first page and then the gibberish OCR text below it. It didn't seem to be working very well, so I gave up on it. I ended up just sticking with OCRing the file in Adobe Pro and then uploading to Google Drive.
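For what it's worth, the folder-scoped search I was wishing for is simple to sketch in Python. This is not something Google Drive offers and not something I tested. It assumes the OCR text of each page has been saved as a plain `.txt` file next to the PDF, which is an extra step beyond what I did, and the function name `search_folder` is made up for illustration.

```python
from pathlib import Path


def search_folder(folder, term):
    """Return the .txt files under `folder` whose text contains `term`.

    Assumption: OCR text has been exported to plain-text files. The match
    is a simple case-insensitive substring test, not a ranked search.
    """
    needle = term.lower()
    hits = []
    for path in Path(folder).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            hits.append(path)
    return sorted(hits)
```

Calling `search_folder("Old Reference Requests/2004", "yearbook")` would only look inside that one folder and its subfolders (the search term here is just an example), so results from other years never show up.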

Runners up: 
Ultimately, both Mendeley and Zotero are more citation managers than document managers.  But, being a librarian, I thought I would try them both to see how they would do (not well, unfortunately).  I also thought I would look at Windows Explorer to see how that might work on its own.  

Screen shot of Mendeley search interface
Mendeley has a nice search result interface that highlights the search terms within the PDF, like Evernote. However, it really won't export anything as PDF (export options are limited to bibliographic formats) and the web version of the software doesn't even offer full-text search. Additionally, the software was a little irritating as it kept asking me to "claim my publications" and "select popular papers" to read. Also, for some reason it tried to title each file using the OCR'ed text instead of the filename. Unfortunately, it couldn't really understand the OCR, so the title turned into gibberish.

I ended up testing the Zotero standalone version for this purpose. It was super fast and easy to import files, but it didn't allow for any full text searching without adding a third party plugin.  It seemed like doing too much, so I just gave up.

Windows Explorer was a headache. I have had good luck using Mac Finder to search the full text of PDF files, and it is pretty user friendly, so maybe it's just my bias going into it. (By the way, I can't wait until next fiscal year when I can finally transition back to a Mac!) Anyway, I ran into several problems before I could even test Windows Explorer for managing the PDF files. First, the search just didn't work. I found an answer to that problem here, where I realized that I actually needed to download a missing iFilter here. Then I had to rebuild the index, because while the iFilter would fix searching for PDFs going forward, it would not fix existing files retroactively. The rebuild took so long that I had to let it run overnight. I came back the next morning to see that the index was rebuilt and I could successfully find the file! However, the PDF wouldn't preview, which meant that I had to fix a registry key according to these instructions. Finally, after all of that, I could test whether the search was a good fit.

Turns out, it wasn't a bad fit. There were robust searching options and the large preview pane was great. However, it didn't highlight the search terms like Evernote or Mendeley. Also, it isn't really portable: the files are still on my local drive, and even if I were to put them on a shared drive, I would likely have to go through all those frustrating steps above for every computer I wanted to search from. Granted, the steps above are something that I should do anyway, but still, it was just too much work.

Speaking of too much work, I also tried to conceptualize manually rekeying all 13+ years' worth of reference requests into a standard Excel sheet with subjects, dates, etc. While it makes me *feel* better to turn unstructured data into structured data, it likely won't be worth the time investment. Honestly, I'm not sure how much these will really be used, but at least they're there if we need them.

Next steps, aka the best laid plans:
I'll have a student scan the ref requests through our mega machine (at least, that's what I call it) and drop them in the network drive. I'll OCR all of the files in Adobe Pro and extract them by page. I'll upload them all into Evernote, organized by year. Voilà: open for searching. I will also likely shred/recycle the old forms once they've been scanned. I don't need them taking up the shelf space.

Eventually, if we have time to go through them, I'd like to have student workers add subject titles and tags to the notes.  I'll also want to start adding my current/recent reference requests to make it a more comprehensive knowledge base.

We may find that we should upgrade to Evernote Premium ($45/year). The free version has a small monthly data allowance (60 MB), versus 4 GB with Premium. I've done some heavy testing and I've been using Evernote for some other note taking, so I've already used 16.6 MB within just a week.
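A quick back-of-the-envelope projection shows why the upgrade is on my mind. This sketch assumes a 30-day month and that I keep uploading at the same rate as my first week, which probably overstates things since that week included heavy testing and unrelated note taking. The function name is made up for illustration.

```python
FREE_MONTHLY_MB = 60.0        # free-tier monthly allowance quoted above
PREMIUM_MONTHLY_MB = 4096.0   # 4 GB Premium allotment, taking 1 GB = 1024 MB


def projected_monthly_mb(used_mb, days_elapsed, days_in_month=30):
    """Scale usage so far to a full month, assuming a constant daily rate."""
    return used_mb / days_elapsed * days_in_month


projection = projected_monthly_mb(16.6, 7)
print(f"{projection:.1f} MB per month")   # prints "71.1 MB per month"
print(projection > FREE_MONTHLY_MB)       # prints "True"
```

At that pace I would go over the free allowance before the month is out, while staying far below the Premium allotment.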

Apparently Premium also has a "Search in PDFs" feature. I am guessing that means it will perform its own OCR, but I cannot find any information that explains it further. The larger allowance and the PDF search are the only two Premium features that really appeal to me, but if we find that using this for reference is a good thing, then it might be worth it.
