String comparison is a key step in data pre-processing, but functions in Excel such as MATCH and VLOOKUP falter in fuzzy string matching. In this post, let’s explore how the Python library “FuzzyWuzzy” overcomes these limitations.
What is FuzzyWuzzy
FuzzyWuzzy is a widely used Python library that uses Levenshtein distance to calculate the similarity between two strings. Because it returns a graded similarity score rather than a simple match or no-match answer, FuzzyWuzzy offers a practical approach to fuzzy string matching (also known as approximate string matching), even when the strings differ only slightly.
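To make the idea of Levenshtein distance concrete, here is a minimal pure-Python sketch. It is not FuzzyWuzzy's own implementation, and the function names levenshtein and similarity are hypothetical helpers introduced only for illustration. The 0-100 scaling is one simple way to turn a distance into a score and may differ from the formula the library uses.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep only the previous row of the dynamic-programming table.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Illustrative 0-100 score (an assumption, not FuzzyWuzzy's exact formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))


print(levenshtein("kitten", "sitting"))        # 3
print(similarity("CH2SiMe3", "CH(2)SiMe(3)"))  # 67 (four inserted parentheses)
```

The second call shows how the two notations for the same chemical formula differ by only four edits, which is why a distance-based score still rates them as fairly similar.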
Matching Article Titles with FuzzyWuzzy
Matching article titles across different publication platforms is a common challenge because the same title is often recorded with slight variations, for example in letter case, punctuation, or the notation of chemical formulas (Figure 1).
Figure 1. Slight variation in an article title in Web of Science (a) and Scopus (b)
The following Python code demonstrates how FuzzyWuzzy can be utilized for article title matching across common scenarios.
Before we begin, let’s install the FuzzyWuzzy library using pip:
# Install the FuzzyWuzzy and python-Levenshtein libraries
!pip install fuzzywuzzy
!pip install python-Levenshtein
Use case: Get similarity score
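The fuzz.ratio() function calculates a similarity score between two strings based on Levenshtein distance. The score ranges from 0 to 100, where higher values indicate greater similarity. Because fuzz.ratio() compares the strings exactly as given, differences in letter case lower the score. The example below therefore uses fuzz.WRatio(), which by default lowercases both strings and replaces non-alphanumeric characters with spaces before combining several ratio variants into a weighted score.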
from fuzzywuzzy import fuzz
# Example article titles
wos_title = "SYNTHESIS AND ELECTROCHEMISTRY OF DIALKYLOSMIUM-(IV) AND DIALKYLOSMIUM-(V) PORPHYRINS - CRYSTAL-STRUCTURE OF [OS(TTP)(CH(2)SIME(3))(2)] [H(2)TTP=5,10,15,20-TETRA(P-TOLYL)PORPHYRIN]"
scopus_title = "Synthesis and electrochemistry of dialkylosmium-(IV) and -(V) porphyrins. Crystal structure of [Os(ttp)(CH2SiMe3)2] [H2ttp = 5,10,15,20-tetra(p-tolyl)porphyrin]"
# Calculate similarity score
similarity_score = fuzz.WRatio(wos_title, scopus_title)
# Print the similarity score
print("Similarity score: ", similarity_score)
Output:
Similarity score: 93
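To see why case and punctuation handling matters for titles like these, the sketch below compares a short fragment of the two titles before and after a simple clean-up step. It uses only the standard library's difflib as a stand-in scorer, so the numbers are not FuzzyWuzzy scores. The helpers simple_ratio and normalize are hypothetical names, and normalize only approximates the kind of preprocessing described above.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity from the standard library; compares strings as given."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def normalize(text: str) -> str:
    """Hypothetical helper: turn non-word characters into spaces,
    then lowercase and trim the result."""
    return re.sub(r"\W", " ", text).lower().strip()


wos_fragment = "CRYSTAL-STRUCTURE"
scopus_fragment = "Crystal structure"

raw_score = simple_ratio(wos_fragment, scopus_fragment)
clean_score = simple_ratio(normalize(wos_fragment), normalize(scopus_fragment))

print("Raw score:", raw_score)           # low: only the leading "C" matches
print("Normalized score:", clean_score)  # 100: both become "crystal structure"
```

A case-sensitive comparison treats "R" and "r" as different characters, so the all-caps Web of Science title would look almost unrelated to its Scopus counterpart without this kind of normalization.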
Use case: Find the best match of an article from a list of options
The extractOne function from the FuzzyWuzzy library is used to find the best match for a given string within a list of options. In the following example, the function is used to find the best match for the article title: “Synthesis and electrochemistry of dialkylosmium-(IV) and -(V) porphyrins. Crystal structure of [Os(ttp)(CH2SiMe3)2] [H2ttp = 5,10,15,20-tetra(p-tolyl)porphyrin]” within the wos_list (a list of article titles in Web of Science):
from fuzzywuzzy import process
wos_list = [
"Synthesis and reactivity of ruthenium(II) complex with an ortho-metalated bis(diphenylthiophosphoryl)imide",
"Synthesis, Structure, and Reductive Elimination of Cationic Monoarylpalladium(IV) Complexes Supported by a Tripodal Oxygen Ligand",
"SYNTHESIS AND ELECTROCHEMISTRY OF DIALKYLOSMIUM-(IV) AND DIALKYLOSMIUM-(V) PORPHYRINS - CRYSTAL-STRUCTURE OF [OS(TTP)(CH(2)SIME(3))(2)] [H(2)TTP=5,10,15,20-TETRA(P-TOLYL)PORPHYRIN]",
]
process.extractOne("Synthesis and electrochemistry of dialkylosmium-(IV) and -(V) porphyrins. Crystal structure of [Os(ttp)(CH2SiMe3)2] [H2ttp = 5,10,15,20-tetra(p-tolyl)porphyrin]", wos_list)
Output:
('SYNTHESIS AND ELECTROCHEMISTRY OF DIALKYLOSMIUM-(IV) AND DIALKYLOSMIUM-(V) PORPHYRINS - CRYSTAL-STRUCTURE OF [OS(TTP)(CH(2)SIME(3))(2)] [H(2)TTP=5,10,15,20-TETRA(P-TOLYL)PORPHYRIN]', 93)
Use case: Find similar articles from two lists
We can extend the above query to compare two lists by looping over each title in scopus_list. Inside the loop, process.extractOne() finds the best match in wos_list for the current scopus_title. Because wos_list is a plain list, extractOne() returns a tuple of two items: the best match and its score. (A third item, the key of the match, is returned only when the choices are passed as a dictionary.) The Scopus title, the matched WoS title, and the score are then appended to the matches list as a tuple. After the loop completes, a pandas DataFrame named df is built with the columns “Scopus Title”, “WoS Title”, and “Score”:
from fuzzywuzzy import process
import pandas as pd
wos_list = [
"Synthesis and reactivity of ruthenium(II) complex with an ortho-metalated bis(diphenylthiophosphoryl)imide",
"Synthesis, Structure, and Reductive Elimination of Cationic Monoarylpalladium(IV) Complexes Supported by a Tripodal Oxygen Ligand",
"SYNTHESIS AND ELECTROCHEMISTRY OF DIALKYLOSMIUM-(IV) AND DIALKYLOSMIUM-(V) PORPHYRINS - CRYSTAL-STRUCTURE OF [OS(TTP)(CH(2)SIME(3))(2)] [H(2)TTP=5,10,15,20-TETRA(P-TOLYL)PORPHYRIN]",
"Reaction of the [WSe4]2- anion with phosphines: Synthesis and crystal structure of a novel cubane-like cluster [W2Se4(η2-Se2CH2) 2{Cu(PCy3)}2] (Cy=cyclohexyl)"
]
scopus_list = [
"Synthesis and electrochemistry of dialkylosmium-(IV) and -(V) porphyrins. Crystal structure of [Os(ttp)(CH2SiMe3)2] [H2ttp = 5,10,15,20-tetra(p-tolyl)porphyrin]",
"Reaction of the [WSe4]2- anion with phosphines:: synthesis and crystal structure of a novel cubane-like cluster [W2Se4(η2-Se2CH2)2{Cu(PCy3)}2] (Cy=cyclohexyl)"
]
matches = []
for scopus_title in scopus_list:
    # For a list of choices, extractOne returns (best_match, score)
    match = process.extractOne(scopus_title, wos_list)
    matches.append((scopus_title, match[0], match[1]))
# Create a DataFrame from the fuzzy matching results
data = {'Scopus Title': [match[0] for match in matches],
        'WoS Title': [match[1] for match in matches],
        'Score': [match[2] for match in matches]}
df = pd.DataFrame(data)
# Display the DataFrame (a notebook renders the last expression in a cell)
df
Output:
| | Scopus Title | WoS Title | Score |
| 0 | Synthesis and electrochemistry of dialkylosmium-(IV) and -(V) porphyrins. Crystal structure of [Os(ttp)(CH2SiMe3)2] [H2ttp = 5,10,15,20-tetra(p-tolyl)porphyrin] | SYNTHESIS AND ELECTROCHEMISTRY OF DIALKYLOSMIUM-(IV) AND DIALKYLOSMIUM-(V) PORPHYRINS - CRYSTAL-STRUCTURE OF [OS(TTP)(CH(2)SIME(3))(2)] [H(2)TTP=5,10,15,20-TETRA(P-TOLYL)PORPHYRIN] | 93 |
| 1 | Reaction of the [WSe4]2- anion with phosphines:: synthesis and crystal structure of a novel cubane-like cluster [W2Se4(η2-Se2CH2)2{Cu(PCy3)}2] (Cy=cyclohexyl) | Reaction of the [WSe4]2- anion with phosphines: Synthesis and crystal structure of a novel cubane-like cluster [W2Se4(η2-Se2CH2) 2{Cu(PCy3)}2] (Cy=cyclohexyl) | 99 |
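A best match is always the highest-scoring candidate, even when a title has no true counterpart in the other list, so it can help to reject matches below a threshold. The sketch below illustrates the idea with the standard library only. The helper best_match and the cutoff of 90 are assumptions made for illustration, and the scores come from difflib rather than FuzzyWuzzy. In FuzzyWuzzy itself, process.extractOne() accepts a score_cutoff argument for a similar purpose and returns None when no choice passes it, so a loop that indexes the result should check for None first.

```python
from difflib import SequenceMatcher


def best_match(query, choices, cutoff=90):
    """Hypothetical helper: return (choice, score) for the best candidate,
    or None if the list is empty or the best score is below the cutoff."""
    def score(a, b):
        # Case-insensitive 0-100 similarity from the standard library
        return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

    if not choices:
        return None
    scored = [(choice, score(query, choice)) for choice in choices]
    top = max(scored, key=lambda pair: pair[1])
    return top if top[1] >= cutoff else None


wos_titles = [
    "Synthesis and reactivity of ruthenium(II) complex",
    "Reductive elimination of cationic monoarylpalladium(IV) complexes",
]

# Same title in a different case: accepted with a score of 100
print(best_match("SYNTHESIS AND REACTIVITY OF RUTHENIUM(II) COMPLEX", wos_titles))

# No real counterpart in the list: rejected, so None is printed
print(best_match("A title with no counterpart in the list", wos_titles))
```

The right threshold depends on the data, so it is worth inspecting borderline pairs by hand before settling on a value.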
FuzzyWuzzy offers additional functions, such as partial_ratio(), token_sort_ratio(), and token_set_ratio(), to further refine the matching process. Researchers can experiment with these functions and combine them as needed to achieve optimal results.
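As a rough illustration of the token-based idea behind token_sort_ratio(), the sketch below sorts the words of each string before comparing them, so that word order no longer affects the score. It is a standard-library approximation, not the library's code. The helpers simple_ratio and token_sort_score are hypothetical names, and the example titles are made up for the demonstration.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """0-100 similarity from the standard library; compares strings as given."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_score(a: str, b: str) -> int:
    """Illustrative only: lowercase, split on whitespace, sort the words,
    rejoin them, and then compare the two results."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return simple_ratio(sorted_a, sorted_b)


title_a = "crystal structure of osmium porphyrins"
title_b = "osmium porphyrins crystal structure of"

print(simple_ratio(title_a, title_b))      # lower, because the word order differs
print(token_sort_score(title_a, title_b))  # 100, because the sorted words are identical
```

This kind of scorer is useful when two sources list the same words in a different order, for example a title and subtitle that have been swapped.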
The notebook containing all the code is available to run on Google Colab.
– By Ernest Lam, Library
published December 1, 2023