Tuesday, 26 June 2012

Data merging and de-duplication

One of the challenges in bringing together bibliographic data from a number of libraries to form a consortium is merging the right records while avoiding incorrect merges. We are fortunate to have an agreed cataloguing standard that our suppliers work to, and most catalogue records are downloaded from P2, thereby increasing the level of standardisation.

However, a combination of factors - libraries making local changes, changes to cataloguing standards over time, and libraries adding records for stock not on P2 - means that we do have a diversity of records, often for what is essentially the same item.

To resolve as much of this issue as we can in an automated manner, we are undertaking a two-step process. The first stage is a "match & bump" process that happens as each library's data is added to the system. Where an exact match can be found, the bib records are merged and all item records are attached to the surviving bib record. There are many examples of this to be seen now, and it is this merging that allows holds from one library to flow onto the copies from other libraries. A rough sketch of the idea follows.
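
For the technically curious, here is a minimal sketch of the kind of logic involved. The record shapes and field names are illustrative assumptions on our part, not the actual SirsiDynix implementation.

```python
# Illustrative "match & bump" pass. The dict-based record shape and
# the 'isbn'/'items' field names are hypothetical, for explanation only.

def match_and_bump(existing_bibs, incoming_bibs):
    """Merge incoming bib records into existing ones on an exact key.

    existing_bibs: dict mapping match key (e.g. ISBN) -> bib record,
                   where each bib record is {'isbn': ..., 'items': [...]}.
    incoming_bibs: list of bib records from the library being loaded.
    """
    for bib in incoming_bibs:
        key = bib.get('isbn')
        if key and key in existing_bibs:
            # Exact match found: "bump" the item records onto the
            # existing bib record instead of adding a duplicate bib.
            existing_bibs[key]['items'].extend(bib['items'])
        else:
            # No exact match: keep the bib record as-is. Records
            # without an ISBN get a unique placeholder key.
            existing_bibs[key or id(bib)] = bib
    return existing_bibs
```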

This match & bump process can only be used for exact matches on keys such as ISBN. However, there are often copies of what is essentially the same work that were published in different countries or by different publishers, and which therefore have different ISBNs. A match & bump process will not address these cases, but a more nuanced de-duplication process can, using a number of match points besides ISBN to identify works that are essentially identical and then merge them. The sketch below illustrates the idea.
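
As a rough illustration of what "a number of match points" might look like in practice, this sketch builds a composite key from a normalised title, author and format. The field names are hypothetical, and the actual match points and rules used by SirsiDynix are documented separately and will be considerably more sophisticated than this.

```python
import re
from collections import defaultdict

def normalise(text):
    """Lower-case and strip punctuation/whitespace so that minor
    cataloguing variations do not defeat the match."""
    return re.sub(r'[^a-z0-9]', '', (text or '').lower())

def dedup_key(bib):
    # Hypothetical composite match key: normalised title + author +
    # format. Real de-duplication rules weigh more match points.
    return (normalise(bib.get('title')),
            normalise(bib.get('author')),
            bib.get('format'))

def find_merge_candidates(bibs):
    """Group bib records that share the composite key regardless of
    ISBN, so e.g. UK and US printings of one work fall together."""
    groups = defaultdict(list)
    for bib in bibs:
        groups[dedup_key(bib)].append(bib)
    # Only groups with more than one record are merge candidates.
    return [group for group in groups.values() if len(group) > 1]
```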

It has been agreed that the de-duplication process will occur once Port Adelaide Enfield goes live (planned for 5 July). This will take a number of days of consulting time by SirsiDynix staff and will therefore be a billable activity. We expect to see it completed during July. Leading up to this event, Chris Kennedy (from PLS) and a group of staff from a range of libraries have been working on the de-duplication documentation to ensure that we maximise the benefits of the process.

Once the de-duplication has occurred, the number of bibliographic records will decrease, and customers will find it easier to place holds because more items will be attached to each bib record. This will increase the likelihood of the system filling holds even more quickly than it does now.

Once the de-duplication process has been run, we will report on the outcomes - e.g. the success rate in merging records.

We are aware that as we add additional libraries after this de-duplication, we will again end up with multiple bib records for essentially the same items, so we will be looking to run further de-duplications at key points during the project.

Unfortunately we are unable to use this process to merge DVD records. There are several reasons for this: DVDs do not have ISBNs, and the data that distinguishes different versions of a similar work is not consistent enough to be resolved reliably by an automated process.
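
To illustrate the problem: two DVD records may share a title while being a theatrical release and a director's cut, or different region releases, and without an ISBN-style key an automated process cannot safely tell them apart. The safest automated behaviour is therefore to flag such pairs for a person to review rather than merge them, as this sketch (again with hypothetical field names) shows.

```python
import re

def _norm(text):
    # Same kind of normalisation as in the de-duplication sketch above.
    return re.sub(r'[^a-z0-9]', '', (text or '').lower())

def dvd_action(bib_a, bib_b):
    """Decide what to do with two DVD records that look similar.

    DVDs lack ISBNs, and fields such as region, running time and
    edition are catalogued inconsistently, so an automated process
    cannot merge with confidence. Field names here are illustrative.
    """
    if _norm(bib_a.get('title')) != _norm(bib_b.get('title')):
        return 'keep separate'
    # Same title, but possibly a different cut, region or box set:
    # too risky to merge automatically, so route to a human.
    return 'flag for manual review'
```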

Therefore the consortium members will need to clean up the DVD records manually. While this will be time consuming, it will have real benefits for customers, so it is something we will all want to pursue over time.
