Z39.50 Duplicate Detection Service

April 1999

This service definition and ASN.1 specification have been approved by the ZIG. An amendment (Z39.50-1995 Amendment 2) has been approved, adopting this definition into the Z39.50 Service Definition and Protocol.

The Z39.50 Duplicate Detection Service

The Z39.50 Duplicate Detection service allows the origin to request that the target analyze one or more result sets in terms of potential duplicates and to construct a new result set according to origin-specified criteria for detecting, retaining, grouping, and ordering the records including duplicates.

Parameters

The following notation is used in the "Origin Request" and "Target Response" columns:

[0,1] means parameter is optional, not repeatable; i.e. zero or one.

0+ means parameter is optional, repeatable; i.e. zero or more.

1 means parameter is mandatory, not repeatable; i.e. exactly one.

1+ means parameter is mandatory, repeatable; i.e. one or more.

Parameter	Origin Request	Target Response	Condition
Input Result Set Id	1+
Output Result Set Name	1
Applicable Portion of Record	[0,1]
Duplicate-detection Criterion	0+
Clustering	[0,1]		May be omitted if representative record only is to be retained (if Retention Criterion is 'number of entries' and its value is 1). Otherwise must be supplied.
Retention Criterion	1+
Sort Criterion	0+
Status		1
Result Count		[0,1]	Must occur if Status is 'success'.
Diagnostic		0+	Must occur if Status is 'failure'.

Input Result Set Id and Output Result Set Name
The origin identifies one or more transient result sets belonging to the current Z-association. The target is to logically merge the sets (removing duplicates and ordering equivalence classes according to the parameters below) into a single result set, specified by the parameter Output Result Set Name.

Applicable Portion of Record
The origin may specify what portion of the record is subject to matching (for example, one or more fields) for purposes of duplicate-detection. If this parameter is omitted, the target decides what portion of the record is subject to matching.

Duplicate-detection Criterion
For modeling purposes, a temporary, intermediate result set (not the output result set) is assumed to be created, which includes all of the result set items from all of the input result sets (including duplicate result set items). The target applies the duplicate-detection criteria supplied in this parameter (or, if the origin omits this parameter, whatever duplicate-detection criteria it chooses) to partition the intermediate result set into one or more equivalence classes, where two result set items are considered equivalent if they are duplicates. That is, the partitioning has the following properties:

Every result set item from one of the input result sets is in exactly one class.
Any two result set items are in the same class if and only if they are duplicates.
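
The partitioning described above can be sketched as follows. This is only an illustration, not part of the service definition: the function name partition, the is_duplicate predicate, and the sample titles are all hypothetical. It uses a union-find structure so that the result is a true partition even when the predicate is applied pairwise.

```python
def partition(items, is_duplicate):
    """Group items into equivalence classes under a pairwise duplicate test."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the class root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the classes of every pair the predicate reports as duplicates.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if is_duplicate(items[i], items[j]):
                parent[find(j)] = find(i)

    # Collect items by class root; every item lands in exactly one class.
    classes = {}
    for i, item in enumerate(items):
        classes.setdefault(find(i), []).append(item)
    return list(classes.values())


# Hypothetical intermediate result set, merged from several input sets.
merged = ["Moby Dick", "moby dick", "Emma", "EMMA", "Ulysses"]
classes = partition(merged, lambda a, b: a.lower() == b.lower())
print(classes)
```

The pairwise loop is quadratic in the size of the intermediate result set; a real target would likely index records by a match key instead.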

The target distinguishes a single result-set item within each equivalence class as the representative record for that class. The selection of representative record might be based on the value of the parameter Sort Criterion.

The origin may specify one or more criteria for detecting duplicates. These include the following (the list is subject to extension):

Level of match
If this criterion is included, the origin specifies a level of match in terms of a percentage. For example, fingerprints might be duplicates based on a 60% match; 100% might mean that records are duplicates only if they are identical.
Case sensitive
Punctuation sensitive
Regular expression
If this criterion is included, the origin supplies a regular expression to govern matching.
Result-set duplicates
(Two result set items are result-set duplicates if they point to the same database record.)
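
As an illustration of how a target might combine the level-of-match, case-sensitive and punctuation-sensitive criteria, consider the sketch below. The service definition does not say how a match percentage is computed, so the use of Python's difflib similarity ratio is purely an assumption, as are the function names, the default level of 100, and the reading that an absent sensitivity criterion means "insensitive".

```python
import string
from difflib import SequenceMatcher

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text, case_sensitive, punctuation_sensitive):
    """Fold case and drop punctuation unless the criterion says otherwise."""
    if not case_sensitive:
        text = text.lower()
    if not punctuation_sensitive:
        text = text.translate(_STRIP_PUNCT)
    return text


def is_duplicate(a, b, level_of_match=100,
                 case_sensitive=False, punctuation_sensitive=False):
    """Return True if the similarity of a and b reaches level_of_match percent."""
    a = normalize(a, case_sensitive, punctuation_sensitive)
    b = normalize(b, case_sensitive, punctuation_sensitive)
    # Assumption: similarity is difflib's ratio, scaled to a percentage.
    score = SequenceMatcher(None, a, b).ratio() * 100
    return score >= level_of_match


same = is_duplicate("Moby Dick.", "moby dick")
strict = is_duplicate("Moby Dick.", "moby dick", case_sensitive=True)
```

Here same is True because both strings normalize to the same text, while strict is False because the case difference keeps the similarity below 100 percent.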

Clustering
The origin indicates one of the following:

Clusters
The output result set is to contain one item for each equivalence class. For each equivalence class, create a result set item for the representative record only and maintain duplicates as metadata. (Records may subsequently be presented either (a) as a representative record with duplicates attached as metadata, using, for example, GRS; or (b) as a cluster record, using an appropriate cluster syntax.)
Individual Entries
Create individual result set items for representative records as well as duplicates that are to be retained (according to Retention Criterion). Order the output result set such that records within an equivalence class are grouped together. The parameter Sort Criterion may be supplied, to indicate how the records within a class are to be ordered.

This parameter may be omitted only if 'Number of entries' is supplied as a retention criterion (parameter Retention Criterion) and the value supplied is 1.
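
The difference between the two clustering choices can be sketched as follows. The function name build_output and the dictionary layout used for a cluster item are hypothetical; each equivalence class is assumed to be a list with the representative record first.

```python
def build_output(classes, clustered):
    """Build the output result set from equivalence classes.

    classes: list of lists, representative record first in each list.
    clustered: True for 'Clusters', False for 'Individual Entries'.
    """
    if clustered:
        # One item per class; duplicates travel with it as metadata.
        return [{"record": c[0], "duplicates": c[1:]} for c in classes]
    # One item per retained record, with each class kept contiguous.
    return [record for c in classes for record in c]


classes = [["r1", "r1-dup"], ["r2"]]
clusters = build_output(classes, clustered=True)
entries = build_output(classes, clustered=False)
```

With 'Clusters' the result count equals the number of equivalence classes; with 'Individual Entries' it equals the number of retained records.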

Retention Criterion
The origin specifies one or more criteria for how records are to be selected for inclusion in or exclusion from each equivalence class. These include the following (the list is subject to extension):

(1) Number of entries
If this criterion is selected, the origin supplies a number, N>0, meaning retain (up to) N entries in each equivalence class. N=1 means retain the representative record only. This criterion may be used in combination with (3) and/or (4), but not (2).
(2) Percent of entries
If this criterion is selected, the origin supplies a percentage, xx, meaning retain xx percent of the entries in each equivalence class. xx=100 means retain all entries. This criterion may be used in combination with (3) and/or (4), but not (1).
(3) Duplicates only
Discard the representative record. This criterion should not be specified unless the value of parameter Clustering is 'Individual Entries'. It may be used in combination with (1) or (2), and/or (4).
(4) Discard result-set duplicates
This criterion may be used in combination with (1) or (2), and/or (3). If used with (1) or (2), the result-set duplicates should be discarded first (before entries are selected).
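
A minimal sketch of applying these retention criteria to one equivalence class follows. The function name retain and the representation of an entry as a (database, record id) tuple are hypothetical. Two points are assumptions, because the text above does not settle them: the representative record is dropped before the number or percentage is applied, and a fractional percentage is rounded up.

```python
import math


def retain(equiv_class, number=None, percent=None,
           duplicates_only=False, discard_rs_duplicates=False):
    """Select the entries of one equivalence class that are to be kept.

    equiv_class: list of (database, record_id) tuples, representative first.
    """
    if number is not None and percent is not None:
        raise ValueError("number of entries and percent of entries cannot be combined")
    entries = list(equiv_class)
    if discard_rs_duplicates:
        # Result-set duplicates point to the same database record; they are
        # discarded first, keeping the earliest occurrence of each.
        entries = list(dict.fromkeys(entries))
    if duplicates_only:
        # Assumption: the representative is dropped before counting.
        entries = entries[1:]
    if number is not None:
        entries = entries[:number]
    elif percent is not None:
        # Assumption: fractional counts are rounded up.
        entries = entries[:math.ceil(len(entries) * percent / 100)]
    return entries


cls = [("A", 1), ("B", 7), ("A", 1), ("C", 3), ("B", 9)]
kept = retain(cls, number=2, discard_rs_duplicates=True)
```

In this example the repeated ("A", 1) entry is removed before the first two entries are selected, so kept is [("A", 1), ("B", 7)].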

Sort Criterion
The origin may provide one or more sort criteria for selecting the representative record as well as for ordering records within an equivalence class.

This parameter will affect the ordering of result set items only within an equivalence class (it does not affect the ordering of equivalence classes). If the value of parameter Clustering is 'Clusters' then this parameter will have no effect whatever on the result set order (though it may be supplied anyway, to govern the selection of representative records as well as the order in which duplicates are presented within a single cluster record).

More than a single sort criterion may be supplied; if so, the order in which they are supplied is from major to minor, and only the first criterion supplied is used to govern selection of a representative record. The sort criteria include the following (the list is subject to extension):

Most Comprehensive
Select the longest (most comprehensive) record as the representative record; order duplicates within an equivalence class by descending comprehensiveness.
Least Comprehensive
Select the shortest (least comprehensive) record as the representative record; order duplicates within an equivalence class by ascending comprehensiveness.
Most Recent
Select the most recent record as the representative record; order duplicates within an equivalence class by ascending age.
Oldest
Select the oldest record as the representative record; order duplicates within an equivalence class by descending age.
Least Cost
Select the least expensive record as the representative record; order duplicates within an equivalence class by ascending cost.
Preferred Database
Select a record from the most preferred database as the representative record; order duplicates within an equivalence class according to the order of preference of the databases. When this criterion is supplied, the origin includes a list of databases in order of preference.
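
The major-to-minor ordering described above can be sketched with a composite sort key. The record fields (length, year, cost, db), the function names, and the treatment of databases missing from the preference list are all hypothetical. The representative is taken to be the first record in the resulting order; that record is always optimal under the first criterion, and the minor criteria merely break ties, which is one possible reading of the rule that only the first criterion governs the selection.

```python
def _db_rank(record, preferred):
    # Assumption: databases missing from the list sort after all listed ones.
    if record["db"] in preferred:
        return preferred.index(record["db"])
    return len(preferred)


# Each function maps a record to a value where smaller sorts first.
KEY_FUNCS = {
    "mostComprehensive": lambda r, _: -r["length"],
    "leastComprehensive": lambda r, _: r["length"],
    "mostRecent": lambda r, _: -r["year"],
    "oldest": lambda r, _: r["year"],
    "leastCost": lambda r, _: r["cost"],
    "preferredDatabases": _db_rank,
}


def order_class(records, criteria):
    """Order one equivalence class; criteria is a list of (name, argument)."""
    def key(record):
        return tuple(KEY_FUNCS[name](record, arg) for name, arg in criteria)

    ordered = sorted(records, key=key)
    return ordered[0], ordered


records = [
    {"id": "a", "length": 120, "year": 1990, "cost": 5, "db": "X"},
    {"id": "b", "length": 300, "year": 1985, "cost": 2, "db": "Y"},
    {"id": "c", "length": 300, "year": 1998, "cost": 9, "db": "X"},
]
rep, ordered = order_class(records, [("mostComprehensive", None), ("mostRecent", None)])
```

Records b and c tie on length, so the minor criterion (most recent) places c first, and c becomes the representative.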

Status
The target indicates a status of 'success' or 'failure'.

Result Count
If the value of parameter Status is 'success' then the value of this parameter is the size of the output result set.

Diagnostic
The target may always include one or more diagnostics in the response. If the value of parameter Status is 'failure', at least one diagnostic must be included.
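
The conditions on Status, Result Count and Diagnostic can be expressed as a small consistency check. The class and function names below are hypothetical, and diagnostics are represented as plain strings only for illustration; the status values 0 and 1 follow the ASN.1 definition of DuplicateDetectionResponse.

```python
from dataclasses import dataclass, field
from typing import Optional

SUCCESS, FAILURE = 0, 1


@dataclass
class Response:
    status: int
    result_set_count: Optional[int] = None
    diagnostics: list = field(default_factory=list)


def check_response(resp):
    """Return a list of problems; an empty list means the response is consistent."""
    problems = []
    if resp.status not in (SUCCESS, FAILURE):
        problems.append("status must be success (0) or failure (1)")
    if resp.status == SUCCESS and resp.result_set_count is None:
        problems.append("result count must occur when status is success")
    if resp.status == FAILURE and not resp.diagnostics:
        problems.append("at least one diagnostic must occur when status is failure")
    return problems


ok = check_response(Response(SUCCESS, result_set_count=42))
bad = check_response(Response(FAILURE))
```

A successful response may still carry diagnostics, so the check does not reject that combination.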

ASN.1 Changes and Additions

Change to Definition of PDU

In Z39-50-APDU-1995 (OID 1.2.840.10003.2.1), the definition of PDU is:


PDU ::= CHOICE{
 initRequest		[20] IMPLICIT InitializeRequest,
 initResponse		[21] IMPLICIT InitializeResponse,
......
 close			[48] IMPLICIT Close}

Change it to:


PDU ::= CHOICE{
 initRequest  			[20] IMPLICIT InitializeRequest,
 initResponse			[21] IMPLICIT InitializeResponse,
......
 close				[48] IMPLICIT Close,
 duplicateDetectionRequest	[49] IMPLICIT DuplicateDetectionRequest,
 duplicateDetectionResponse	[50] IMPLICIT DuplicateDetectionResponse}



ASN.1 Definition for Duplicate Detection APDUs

DuplicateDetectionRequest ::= SEQUENCE {
   referenceId			    ReferenceId  OPTIONAL,
   inputResultSetIds 		[3] IMPLICIT SEQUENCE OF InternationalString,
   outputResultSetName		[4] IMPLICIT InternationalString,
   applicablePortionOfRecord	[5] IMPLICIT EXTERNAL OPTIONAL,
   duplicateDetectionCriteria	[6] IMPLICIT SEQUENCE OF
                                            DuplicateDetectionCriterion OPTIONAL,
   clustering			[7] IMPLICIT BOOLEAN OPTIONAL,
				  	   -- 'true' means "clustered".
					   -- This parameter may be omitted
					   -- only if retentionCriteria CHOICE is
					   -- 'numberOfEntries' and its value is 1.
   retentionCriteria		[8] IMPLICIT SEQUENCE OF
                                            RetentionCriterion,
   sortCriteria			[9] IMPLICIT SEQUENCE OF
                                            SortCriterion OPTIONAL,
   otherInfo 			    OtherInformation OPTIONAL}

		DuplicateDetectionCriterion ::= CHOICE{
		  	   levelOfMatch		[1] IMPLICIT INTEGER,
				-- a percentage; 1-100.
			   caseSensitive 	[2] IMPLICIT NULL,
			   punctuationSensitive	[3] IMPLICIT NULL,
			   regularExpression	[4] IMPLICIT EXTERNAL,
			   rsDuplicates		[5] IMPLICIT NULL
                 -- values 6-100 reserved for future assignment.
                                                       }

		RetentionCriterion ::= CHOICE{
			  numberOfEntries	[1] IMPLICIT INTEGER,
							--  greater than 0
			  percentOfEntries	[2] IMPLICIT INTEGER,
							-- 1-100,
			  duplicatesOnly 	[3] IMPLICIT NULL,
						 -- should not be chosen
						 -- if clustering is 'true'
			  discardRsDuplicates	[4] IMPLICIT NULL
                 -- values 5-100 reserved for future assignment.
                                                       }


		SortCriterion ::= CHOICE{
			 mostComprehensive	[1] IMPLICIT NULL,
			 leastComprehensive	[2] IMPLICIT NULL,
			 mostRecent		[3] IMPLICIT NULL,
			 oldest			[4] IMPLICIT NULL,
			 leastCost		[5] IMPLICIT NULL,
			 preferredDatabases	[6] IMPLICIT
                                     SEQUENCE OF InternationalString
                 -- values 7-100 reserved for future assignment.
                                                       }


DuplicateDetectionResponse ::= SEQUENCE {
   referenceId				ReferenceId  OPTIONAL,
   status	 		[3]	IMPLICIT INTEGER{
					  success 		(0),
					  failure 		(1)},
   resultSetCount		[4]	IMPLICIT INTEGER OPTIONAL,
   diagnostics			[5]	IMPLICIT SEQUENCE OF DiagRec OPTIONAL,
    otherInfo				OtherInformation OPTIONAL}
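
The constraints that the ASN.1 comments place on a DuplicateDetectionRequest cannot be enforced by the syntax alone, so an origin or target may want to check them explicitly. The sketch below is one hypothetical way to do so; the function name check_request and the representation of retention criteria as (name, value) tuples are assumptions, while the criterion names mirror the ASN.1 identifiers.

```python
def check_request(input_ids, output_name, retention, clustering=None):
    """Return a list of problems found in a duplicate detection request.

    retention: list of (name, value) tuples, e.g. ("numberOfEntries", 1).
    clustering: True, False, or None when the parameter is omitted.
    """
    problems = []
    if not input_ids:
        problems.append("at least one input result set id is required")
    if not output_name:
        problems.append("an output result set name is required")
    if not retention:
        problems.append("at least one retention criterion is required")
    names = [name for name, _ in retention]
    if "numberOfEntries" in names and "percentOfEntries" in names:
        problems.append("numberOfEntries and percentOfEntries cannot be combined")
    for name, value in retention:
        if name == "numberOfEntries" and value < 1:
            problems.append("numberOfEntries must be greater than 0")
        if name == "percentOfEntries" and not 1 <= value <= 100:
            problems.append("percentOfEntries must be in the range 1-100")
    # Clustering may be omitted only when the representative alone is retained.
    if clustering is None and ("numberOfEntries", 1) not in retention:
        problems.append("clustering may be omitted only if numberOfEntries is 1")
    if clustering is True and "duplicatesOnly" in names:
        problems.append("duplicatesOnly should not be chosen if clustering is true")
    return problems


valid = check_request(["rs1", "rs2"], "merged", [("numberOfEntries", 1)])
invalid = check_request(["rs1"], "merged", [("percentOfEntries", 50)])
```

Here valid is empty, while invalid reports that clustering was omitted although more than the representative record is to be retained.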




Option Bit Assigned to the Duplicate Detection Service

Option bit 18 is assigned to this service. See Z39.50 Option Bits.
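
As a sketch of how an implementation might test for this option, the function below checks one bit in the content octets of an options bit string. It relies on the ASN.1 convention that bit 0 is the most significant bit of the first octet; the function name has_option and the sample octets are hypothetical, and the octets are assumed to have already been extracted from the encoded Initialize APDU.

```python
DUPLICATE_DETECTION_BIT = 18


def has_option(bits: bytes, n: int) -> bool:
    """Return True if bit n is set, counting bit 0 as the leading bit."""
    byte_index, offset = divmod(n, 8)
    if byte_index >= len(bits):
        # Trailing bits that were not transmitted are treated as zero.
        return False
    return bool(bits[byte_index] & (0x80 >> offset))


# Bit 18 is the third bit of the third octet, i.e. mask 0x20 in octet 2.
negotiated = has_option(bytes([0xE0, 0x00, 0x20]), DUPLICATE_DETECTION_BIT)
```

A bit string shorter than three octets cannot carry bit 18, so the check returns False in that case.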


Library of Congress