
Transcription from handwriting

Prior to offering transcription of scanned printed text as a service, I built up some experience with Project Gutenberg. I didn’t really have any experience of transcription from handwritten texts, but offered the chance to try, I gave it a go. Some thoughts on the process…

With scanned printing, there are really three stages: OCR (Optical Character Recognition), then proofreading, then final formatting. OCR happens by computer, and is pretty fast – it just requires the right software. Proofreading – making sure the text has been captured accurately, getting rid of spurious bits and pieces – is slower. The final formatting, once you have a good text, is a quick pass through. All of these stages are really present for handwritten texts too, but the OCR has to be done by eye! (If there’s software that can reliably read handwriting, I’d like to know about it!)

Proofreading is quite slow – especially when there are things like names of people or places that may have been obvious to the writer, but aren’t so obvious without their mental context. It helps to have some sort of overview of the whole document, as the same names may crop up elsewhere.

The final format will depend on what is to be done with the document, but once the text is in place, it’s easy enough for it to be bashed into any required page or file format. The task of initially transcribing from handwriting was shared between people – and it was interesting how much extra work was required simply to get the different extracts back to the same format – note to self: make sure this is defined properly in advance next time!

What was most interesting was the sense of personal involvement in people’s stories. We were transcribing a kind of visitors’ book. To follow small elements of the family history over the years was a surprisingly touching experience.

The slide problem

It seems that, absent large-scale facilities, I happened on the best way of scanning slides.

We’ve been using a flat-bed scanner with a frame to put the slides in. It is a cumbersome process. Slides have to be checked to be as clean as they can be, and placed into the frame on the scanner. If lucky, or more accurately, if the slides are “normal”, the software works out where they are and how big they are, and produces a fair scan. It’s possible to scan around 30 an hour. If unlucky, it may be necessary to go from an “auto” mode to a “professional” mode, to define the edges of the image accurately. This slows things down even more. The scanning software has built-in dust removal and colour correction algorithms, and will scan at … well, a higher resolution than I can imagine people asking for. But it’s laborious.

So I started looking at dedicated slide and film scanners. I have ordered one, which seemed to perform well according to reviewers (a 14 megapixel sensor, adaptors for various media, some ability to adjust the image, saving direct to a memory card). These scanners are substantially faster, but until you get to the really high-end ones (over £1000), they still seem to have their drawbacks. One of the major criticisms seems to be that they chop the edges of the image off – there are widespread claims from reviewers that 10-15% of the image is lost on a wide range of these scanners. That’s disappointing, as the slide frame provides what ought to be a good clear boundary for the image. The problem isn’t avoided entirely with flat-bed scanners: it is one of the reasons that I frequently had to engage “professional” mode when using ours – but at least there it’s possible to tell the scanner, “No, I definitely want you to scan up to these edges.”

Another discovery was that not all scans turn out to be worthy of keeping. Some people are in the position of being able to filter down the slides they definitely want to keep prior to getting them scanned. But in other cases, people don’t know what they’ve got until they start to look at them again.

So I’m thinking that the workflow needs adapting…

  • Clean slides as far as possible – get rid of markings, dust etc.
  • Do a “fast scan” with the slide scanner, and review the slides to determine which ones are to be scanned properly. This will still return a good high-resolution JPG file, but is not so time consuming.
  • For these, do a “slow scan” using the flatbed scanner, ensuring that the whole image is captured. Software dust removal or colour correction can be done on this, and image manipulation software can be used to improve it further.

Realistically, a “slow scan” takes a considerable amount of human input, and so will cost more. But if the number of “slow scans” can be reduced by having the slide scanner do “fast scans” first, then the overall cost could come down. And some of the grind of lining up slides in the flatbed scanner might also be eliminated!
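To put rough numbers on that trade-off, here is a minimal Python sketch. The flatbed rate of about 30 slides an hour is the figure mentioned above. Everything else is an assumption made purely for illustration: the slide scanner's rate of 120 an hour, the batch of 600 slides, and the 25% that turn out to be worth a “slow scan”. The function name and its parameters are made up for this example.

```python
def scanning_hours(total_slides, keep_fraction,
                   slow_per_hour=30.0, fast_per_hour=120.0):
    """Return (flatbed_only_hours, two_pass_hours) for a batch of slides.

    slow_per_hour: flatbed throughput (about 30 an hour, as noted above).
    fast_per_hour: slide-scanner throughput (ASSUMED, not measured).
    keep_fraction: share of slides judged worth a slow scan (ASSUMED).
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")

    # Option 1: every slide goes through the flatbed.
    flatbed_only = total_slides / slow_per_hour

    # Option 2: fast-scan everything, then slow-scan only the keepers.
    fast_pass = total_slides / fast_per_hour
    slow_pass = (total_slides * keep_fraction) / slow_per_hour

    return flatbed_only, fast_pass + slow_pass


# Hypothetical batch: 600 slides, a quarter of them worth keeping.
flatbed_only, two_pass = scanning_hours(600, 0.25)
print(f"Flatbed only: {flatbed_only:.1f} hours")
print(f"Fast scan, then slow scan of keepers: {two_pass:.1f} hours")
```

With these made-up figures the two-pass approach takes 10 hours against 20 for the flatbed alone. More generally, it only saves time when the fraction kept is below 1 minus (slow rate ÷ fast rate), which is 75% for the rates assumed here; if nearly every slide turns out to be a keeper, the fast pass is just extra work.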