Matching similar string values

The Fuzzy Matching action allows you to compare string values from one table with string values from a lookup table to match similar values. An output table is returned containing likely matches, along with match scores from 0 to 1. You can:

  • Define the matching logic to be used by selecting one of the three Fuzzy Matching Algorithms -
    • Jaro-Winkler measures similarity between two strings, and is best suited for shorter strings such as city or country names
    • Smith-Waterman compares segments of strings; for example, bc vs. abcde is a perfect match score of 1, but cb vs. abcde is a match score of 0.5
    • Levenshtein is best to use for longer strings, as the formula for match score is 1 - #Single Character Edits/Length of longer string
    • Overlap is effective across a variety of data characteristics; it breaks strings into all possible 3-character segments and then produces a score between 0-1 based on the formula -

      #matching segments / minimum number of segments of the two strings being compared

  • Define pass-through fields to be returned as additional columns in the output table
  • Adjust the fuzzy matching options to change the number of matches written to the output table, or alter the logic for case sensitivity and special characters

On the Configuration tab:

  1. Use Input Fields to identify fields from the input table to use for pass-through and matching.

  2. Drag fields from Input Fields to define pass-through fields.

  3. Drag fields from Input Fields to define the field to match.

  4. Use Lookup Fields to identify fields from the lookup table to use for pass-through and matching.

  5. Drag fields from Lookup Fields to define pass-through fields.

  6. Drag fields from Lookup Fields to define the field to match.

  7. Select matching options.

  8. Select the algorithm to be used for matching.

Possible applications include:

  • Find a match for missing/invalid string values in a “master” table
  • Compare two lists of string values to find those values without a match in the other list
  • Find matches for values that may be a result of user error or some other type of error

Use fuzzy matching to compare similar strings

  1. Drag the Fuzzy Matching icon onto the design surface.
  2. On the Connections tab , enter a Name and a Description to identify the action, then choose input tables and output options for the record matching.

    Input -

    • For Database Connection, select the database from which the tables originate, or choose New Database Connection to establish a new connection to another database.
    • For Input Table Name, select the table containing the candidate records.
    • For Lookup Table Name, select the table containing the master records.

    Output -

    • For Database Connection, select the database to which the output is written, or choose New Database Connection to establish a new connection to another database.
    • Enter or select the Output Table Name depending on the Output Mode you choose.
      • For Create/overwrite a table, enter the name of the table being created or overwritten, and indicate whether the table should be deleted after the macro or workflow has executed.
      • For Use an existing table, select the table to be used from the drop-down, and indicate whether the rows in the table should be deleted before the new data is written to the table.
  3. On the Configuration tab :

    Fields to Match -

    • Drag from Input Fields to the Input placeholder icon. This field must be a string data type.
    • Drag from Lookup Fields to the Lookup placeholder icon. This field must be a string data type.
    • Optionally, drag any input pass-through fields from Input Fields to the Input Pass Through Fields area.
    • Optionally, drag any lookup pass-through fields from Lookup Fields to the Lookup Pass Through Fields area.

    Options -

    • For Threshold, indicate the minimum percentage of a matching data characteristics required. The default is 85 (for 85%).
    • For Max Matches, enter the maximum number of matches within the threshold to be written to the output table.
    • Indicate whether to Ignore Case and/or Ignore Special Characters. If selected, these characters do not affect the threshold of the match.

    Fuzzy Matching Algorithm -

    Choose the matching algorithm to be used. Select each one for a description of the logic it employs:

    • Jaro-Winkler
    • Smith-Waterman
    • Levenshtein
    • Overlap

Fuzzy Matching data example

In the following example, fuzzy matching is used to find city names that might include misspellings:

Last modified: Thursday December 19, 2024

Is this useful?