One of the features of The British Newspaper Archive is the OCR editing tool, which allows every user to contribute to the improvement of The Archive. OCR stands for Optical Character Recognition, the process we use to rapidly digitise millions of pages of newspapers. The computer takes an image of the page and then turns each line into text. However, as many of us have encountered, the text is not always correct and needs improvement. As a group, we can improve the text, and with it the search feature on The British Newspaper Archive, by following the steps below.
How to correct the OCR text
- When you view a page of the newspaper, the article you want should be highlighted. If not, use your mouse to click on the column of the newspaper article you want to read.
- The OCR text is on the left side of the image. Select a line of text in the text box to edit it. The image will also zoom in to show you exactly where this text appears on the page.
This image is a clear example of the classic OCR error, recognising the letter ‘s’ as ‘f’. You can see in this image that the newspaper states ‘Evening Posts’; however, 18th-century typography often used the long s (ſ), which closely resembles an f, so the computer has recognised this text as ‘Evening Pofts’.
If you are searching early newspapers for a word that contains the letter s, try a second search and replace the letter s with the letter f.
- Use your mouse and click directly on the text, and you will be able to edit right there on the screen. In the image you can see that we have replaced Pofts with Posts.
- Once you are finished with all your changes, select the save button at the bottom of the text box. This will save your changes. It is not necessary to save after each line. Select save when you are completely finished with the article editing.
By editing the OCR, you will help us to greatly improve the site and the search function. It only takes seconds, and it is not necessary to edit an entire article. As you can see in the example, you can choose to only edit a single line in an article.
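The search tip above (retrying a search with s replaced by f) can be sketched in a few lines of Python. This is only an illustration, not part of The Archive's search tool: the function name `long_s_variants` is hypothetical, and it assumes that only a lowercase s that is not the last letter of a word is likely to be misread, since the long s was generally not used at the end of a word.

```python
from itertools import product


def long_s_variants(word):
    """Return spellings of `word` where each non-final 's' may be read as 'f'.

    Assumption (for illustration only): OCR may confuse any lowercase 's'
    that is not the last letter of the word with 'f'.
    """
    # Positions of lowercase 's', ignoring the last letter of the word.
    positions = [i for i, ch in enumerate(word[:-1]) if ch == "s"]
    variants = []
    # Try every combination of keeping 's' or swapping in 'f'.
    for choice in product("sf", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        variants.append("".join(chars))
    return variants


print(long_s_variants("Posts"))  # ['Posts', 'Pofts']
```

Each spelling in the returned list can then be tried as a separate search term. The original spelling always comes first.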
Why is the OCR text incorrect?
You may be wondering, ‘why does the computer get the text incorrect?’ OCR technology is constantly improving; however, much of our OCR work was completed in the early years of The Archive, from 2011. Early editions added to the site have therefore not benefited from our improvements to OCR. Because we continuously provide new newspaper titles and add more issues to existing titles, it is not possible to go back and re-digitise early editions with improved OCR software. A second reason that OCR technology can sometimes get the text wrong is the change in typography over the centuries. Many of you have come across the classic s and f confusion, which has been demonstrated in the steps above.
Finally, OCR can also make mistakes due to the quality of the original image. Many of our newspapers have been digitised from microfilm. If the quality of the original microfilming is not perfect, then that is reflected in the images and consequently in the OCR. In this example, you can see that the word correspondent is slightly faded. This slight difference in the quality of the image has led to incorrect text.
After a quick correction, the text is now correct and this change can benefit future researchers.
The article used above is an excellent example of using historical context when searching the newspaper archive. Until as late as 1949, the island of Haiti was known as Hayti. If you searched for Haiti, you would have never discovered this article.
If you love the BNA as much as we do, then become an active participant in our BNA community through OCR correction. While we work hard to continuously digitise new newspapers, you can help us to improve The Archive by taking a moment to correct the OCR text.