Join us in improving the BNA’s search | The British Newspaper Archive Blog

Join us in improving the BNA’s search

One of the features of The British Newspaper Archive is the OCR editing tool, which allows every user to contribute to the improvement of The Archive. OCR stands for Optical Character Recognition, the process we use to transform millions of digitised images into searchable text. The computer takes an image of the page and turns each line into text. However, as many of us have found, the text is not always correct and needs improvement. Together, we can improve the text, and in turn the search feature on The British Newspaper Archive, by following the steps below.

How to correct the OCR text

  1. When you view a page of the newspaper, the article you want should be highlighted. If not, use your mouse to click on the column of the newspaper article you want to read.

 

  2. The OCR text is on the left side of the image. Select a line of text in the text box to edit it. The image will also zoom in to show you exactly where this text appears.

 

This image is a clear example of the classic OCR error of recognising the letter ‘s’ as ‘f’. You can see in this image that the newspaper states ‘Evening Posts’; however, due to 18th-century typography, in which the long s (ſ) closely resembles an f, the computer has recognised this text as ‘Evening Pofts’.


Search tip
If you are searching early newspapers for a word that contains the letter s, try a second search with the letter s replaced by the letter f.

  3. Click directly on the text with your mouse, and you will be able to edit it right there on the screen. In the image you can see that we have replaced ‘Pofts’ with ‘Posts’.

 

  4. Once you have finished all your changes, select the save button at the bottom of the text box to save them. It is not necessary to save after each line; select save when you have completely finished editing the article.

By editing the OCR, you will help us to greatly improve the site and the search function.  It only takes seconds, and it is not necessary to edit an entire article.  As you can see in the example, you can choose to only edit a single line in an article.
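The search tip above (trying a second search with s replaced by f) can be sketched in a few lines of Python. This is only an illustration, not part of the BNA site or its search: the function name long_s_variants is hypothetical, and it assumes that only a lowercase s before the end of a word is likely to be misread, which matches the ‘Pofts’ example, where the final s was recognised correctly.

```python
from itertools import product


def long_s_variants(word: str) -> list[str]:
    """Return spellings of `word` with each non-final lowercase 's' kept or read as 'f'.

    Hypothetical helper for illustration only. Assumption: the final letter
    of a word and any capital 'S' are left unchanged.
    """
    last = len(word) - 1
    choices = []
    for i, ch in enumerate(word):
        if ch == "s" and i != last:
            # This 's' may have been printed as a long s and read as 'f'.
            choices.append(("s", "f"))
        else:
            choices.append((ch,))
    # Combine one option per character into every possible spelling.
    return ["".join(combo) for combo in product(*choices)]


for term in long_s_variants("posts"):
    print(term)
```

For ‘posts’ this prints ‘posts’ and ‘pofts’, the two spellings worth searching for. Words with several s letters produce more variants, so in practice you may only want to try the most likely ones.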

 

Why is the OCR text incorrect?

You may be wondering, ‘Why does the computer get the text wrong?’ OCR technology is constantly improving; however, much of our OCR work was completed in the early years of The Archive, from 2011. Early editions added to the site have therefore not benefited from our improvements to OCR. Because we continuously provide new newspaper titles and add more issues to existing titles, it is not possible to go back and re-digitise early editions with improved OCR software. A second reason that OCR technology can get the text wrong is the change in typography over the centuries. Many of you have come across the classic s & f confusion, which is demonstrated in the steps above.

Finally, OCR can also make mistakes due to the quality of the original image.  Many of our newspapers have been digitised from microfilm.  If the quality of the original microfilming is not perfect, then that is reflected in the images and consequently in the OCR.  In this example, you can see that the word correspondent is slightly faded.  This slight difference in the quality of the image has led to incorrect text.

 

After a quick correction, the text is now correct and this change can benefit future researchers.

 


Search tip
The article used above is an excellent example of why historical context matters when searching the newspaper archive. Until as late as 1949, the island of Haiti was known as Hayti. If you had searched for Haiti, you would never have discovered this article.

 

If you love the BNA as much as we do, then become an active participant in our BNA community through OCR correction. While we work hard to continuously digitise new newspapers, you can help us to improve The Archive by taking a moment to correct the OCR text.

 

 


10 comments On Join us in improving the BNA’s search

  • I think it is hardly admirable that people who have to pay as subscribers to view pages are then asked to ‘help’ out with text corrections. Perhaps their time correcting mistakes could be offset against their subscription costs, if you really are serious?

  • I already correct text for Australian newspapers and would love to do the same on the British ones. The only difference I’ve seen is that one can view and correct the Australian ones FREE.

  • I would agree with the above comments: I have added edits in the past, as a goodwill gesture, but do think you’d get more support for this if you offered some sort of credits. I assume that you have a system to review edits, so would have no additional costs in checking and not open to abuse? As it is, I would now probably only spend time editing items that affect articles of interest to me directly, whereas I would happily amend things I notice in passing or volunteer time to free/goodwill-based transcription projects.

  • I will try to do my best with items that are of interest to me Thank you

  • I agree with Howard, Jeff and Ruth about requesting fellow users to correct text in your newspapers while charging subscription fees. I love correcting the Australian newspaper text, but in Australia no one is charged to access newspapers. What I don’t understand is why the British National Archives requires patrons to subscribe to Find My Past to enable them to access the information.
    What has happened to the oath that all information held by all National Libraries and Archives is free to all users, and that information is not owned by any one person or entity? If the National Australian Library and Trove University can allow all users around the world free access to all information, why can’t the British National Archives do the same?

  • My view on the comments above is that we are very lucky to have such a brilliant device at our call, and the annual subscription is very reasonable.
    I have been compiling my family history for many years and the BNA has provided me with a lot of fine and interesting detail that has allowed me to “flesh” out some of my ancestors’ lives.
    Please carry on as you are doing and ignore the moaners who always want more and more and are never satisfied.
    They should have been researching in the days when you went to the record offices and had to seek/heave the information from the shelves.

  • We take instant access to information for granted, and having spent many long hours searching through physical archives, often to be disappointed at the relevance of an article, I am simply glad that I can now search and read such a mass of literature from home at any hour. My reaction to this post was pleasure that I could correct errors I find, and therefore help fellow researchers. If you don’t want to improve this resource without financial reward, then don’t – leave it to those who will.
    Resources at any archive are finite, and I’d rather my fees went to supporting the limited number of staff and volunteers to feed more titles into their scanners (a time-consuming job) than to proof-read what they have scanned. Access to the information in UK libraries and archives IS free, but if you want it delivered to your computer, then National Archives will charge Find My Past or other organisations for the delivery service.
    If the Australians can do it for free, that’s great, but this isn’t Australia.

  • I find this resource absolutely invaluable for research and interest purposes. The annual subscription is very reasonable considering what you get for your money. I will be happy to make amendments during my researches as I come across them. Thank you to everyone involved for such a wonderful resource.

  • In reply to J Moxon above, many of us have been researching for decades and are used to archives, libraries and the rest. I used to go to Collinwood and request the physical newspapers to pore over, long before this digital archive existed. Of course, this is a wonderful resource and it has saved me time and money when I can search thousands of newspapers from the comfort of my home. But that is not the point in relation to correcting OCR text. Putting it online is a commercial operation (and I have no problems with that) and commercial operations do not expect their paying customers to do their work for them. I look up thousands of pages a year and my work would soon grind to a halt if I started to correct all the mistakes (on every line!) I come across. But let’s say some altruistic people do this. Who benefits? The great majority who are too busy, too selfish or too lazy to do it. And DC Thomson, one of the largest publishing houses in the country. Sounds like a pointless transfer of benefits to me. Since this is a commercial operation done purely for profit, it is logical that if Thomson want to improve their product they should encourage it by offering credits/discounts to those who take the time & trouble to edit the text. I had the same problem with Findmypast census records. I used to post corrections to the many mistakes I found, but I never got even a thank you. I realised after a while that I was a mug. I did suggest to FMP that they would get almost everybody improving their product if they offered some kind of incentive, but they weren’t interested. Do it for nothing was their attitude. I did find the original FMP people aeons ago (when they were called 1837 online, or similar) rather more helpful and approachable than the current lot. It’s just big business now, and not worthy of my charitable support.

  • Well there are a few problems with this “correcting text” feature.

    Firstly it doesn’t always select an area correctly. It will select an area yes but not always where you want to make a correction.

    Secondly, once you have selected an area it is difficult to unselect it.

    Thirdly, quite often, the text itself is so faded one cannot make out what it says at all. Please can you re-scan faded pages or at least try to make the text more readable to begin with, make it darker etc?

    Lastly, OCR is all well and good but it often garbles the text beyond all recognition. I know this is improving but it is still an issue. And that complicates searches as well.
