Project P001 – Chinese Named-Entity Recognition (NER) Tool

HKUST Library Digital Scholarship DS CoLab Project P001

Project Background

Named-Entity Recognition (NER) is a natural language processing technique that can automatically identify and categorize key elements such as people, organizations, locations, dates, and other important concepts within a large amount of text. This technique enables researchers to conduct insightful analysis of textual information relevant to their research efficiently.

Various models and tools for NER are available online, but many of them require coding and time-consuming customization to produce the desired results. This technical complexity can be an obstacle for researchers without strong coding skills.

To address this issue, our student developers, Sherry and Berry, conducted an extensive environmental scan of existing NER tools, identified their notable features, and developed an NER tool with a particular focus on the Chinese language.

Project Goal

The aim of this project is to develop a user-friendly tool that streamlines the Chinese NER process, so that researchers without a technical background, particularly those in the Chinese-language research community, can use NER in their work more conveniently.

This is the NER tool developed by our student developers, Sherry and Berry.

Feature highlights

Auto Annotation

Our tool is built on Academia Sinica’s CKIP models, which perform the automated Chinese NER used for annotation.
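As a rough illustration of what automated annotation produces, the sketch below wraps recognized entities in inline markers. It is not the tool's actual code: `Entity`, `fake_chunker`, and `annotate` are hypothetical names, and the stand-in recognizer only mimics the kind of output (entity text, label, character span) that an NER model is assumed to return.

```python
from typing import Callable, List, NamedTuple, Tuple


class Entity(NamedTuple):
    word: str              # surface text of the entity
    label: str             # entity type, e.g. PERSON or GPE
    span: Tuple[int, int]  # (start, end) character offsets in the text


def fake_chunker(text: str) -> List[Entity]:
    """Hypothetical stand-in for a real NER model: looks up a tiny fixed lexicon."""
    lexicon = {"賈寶玉": "PERSON", "林黛玉": "PERSON", "金陵": "GPE"}
    found = []
    for word, label in lexicon.items():
        start = text.find(word)
        while start != -1:
            found.append(Entity(word, label, (start, start + len(word))))
            start = text.find(word, start + 1)
    # Return entities in reading order so they can be applied left to right.
    return sorted(found, key=lambda e: e.span)


def annotate(text: str, chunker: Callable[[str], List[Entity]]) -> str:
    """Wrap each recognized entity in [text|LABEL] markers."""
    out, cursor = [], 0
    for ent in chunker(text):
        start, end = ent.span
        out.append(text[cursor:start])                    # untouched text before the entity
        out.append(f"[{text[start:end]}|{ent.label}]")   # the annotated entity
        cursor = end
    out.append(text[cursor:])                             # remaining tail of the text
    return "".join(out)


sample = "賈寶玉與林黛玉同在金陵。"
annotated = annotate(sample, fake_chunker)
print(annotated)
```

Passing the recognizer in as a function keeps the annotation step independent of any particular model, so a real model wrapper could replace `fake_chunker` as long as it returns the same three fields.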

Add Entity Manually

Accurately identifying every entity remains challenging for current natural language processing technologies. When the model misses an entity, you can add it manually. Entities can also be edited or deleted as needed.

Group Entities by Group, Class, and Alias

You can group different recognized entities by customizable groups, classes, and aliases, and view the frequency of entities accordingly.

The table above defines the four terms (instance, class, group, and alias) that let you categorize entities in different ways.
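To make the idea of frequency by class, alias, and group concrete, here is a minimal sketch. The sample data and function names are hypothetical, and the readings of the terms are assumptions made only for this example: a class is taken to be the entity type, an alias maps variant names to one canonical name, and a group is a user-defined set of canonical names. The tool's own definitions may differ.

```python
from collections import Counter

# Recognized instances as (text, class) pairs; hypothetical sample data.
instances = [
    ("賈寶玉", "PERSON"), ("寶玉", "PERSON"), ("林黛玉", "PERSON"),
    ("黛玉", "PERSON"), ("寶玉", "PERSON"), ("金陵", "GPE"),
]

# Assumption: an alias maps a variant name to one canonical name.
aliases = {"寶玉": "賈寶玉", "黛玉": "林黛玉"}
# Assumption: a group is a user-defined collection of canonical names.
groups = {"主角": {"賈寶玉", "林黛玉"}}


def count_by_class(items):
    """Frequency of each class (entity type)."""
    return Counter(cls for _, cls in items)


def count_by_alias(items, alias_map):
    """Frequency of each canonical name after merging aliases."""
    return Counter(alias_map.get(text, text) for text, _ in items)


def count_by_group(items, alias_map, group_map):
    """Frequency of each group, summed over its members' merged counts."""
    canonical = count_by_alias(items, alias_map)
    return {
        name: sum(canonical[member] for member in members)
        for name, members in group_map.items()
    }


class_freq = count_by_class(instances)
alias_freq = count_by_alias(instances, aliases)
group_freq = count_by_group(instances, aliases, groups)
print(class_freq, alias_freq, group_freq)
```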

Visualizations

Various visualization charts let you see the frequency of the selected entities at a glance and support further analysis.

Upload Multiple Files for Annotation

You can upload multiple text files for analysis, and a chart shows the frequency of entities across the files. For example, if each file is a chapter, you can read this chart as a trend analysis of how the occurrence of certain entities varies from chapter to chapter.
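The per-file comparison can be sketched as follows. This is only an illustration under stated assumptions: the file names, sample sentences, and `frequency_by_file` are hypothetical, and plain substring counting stands in for the model-based recognition the tool performs.

```python
from collections import Counter
from typing import Dict, List


def frequency_by_file(texts: Dict[str, str], entities: List[str]) -> Dict[str, Counter]:
    """Count non-overlapping occurrences of each entity string in each file."""
    return {
        name: Counter({entity: body.count(entity) for entity in entities})
        for name, body in texts.items()
    }


# Hypothetical chapter files keyed by file name.
chapters = {
    "ch01.txt": "寶玉見了黛玉。寶玉笑了。",
    "ch02.txt": "黛玉葬花。",
}

freq = frequency_by_file(chapters, ["寶玉", "黛玉"])

# Reading one entity's counts in file order gives the chapter-by-chapter trend.
trend = [freq[name]["寶玉"] for name in sorted(freq)]
print(trend)
```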

Export Data

You can export the entity data in CSV format. The export contains the frequency of every recognized entity, the grouped entities and their frequencies, and the assigned alias groups and their frequencies.

Each entity, group, and alias is assigned a unique ID, and the relationships between them are indicated, so you can use these detailed CSV files for further analysis.
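A minimal sketch of how IDs can link entities to groups in a CSV export is shown below. The column names, ID format (`E001`, `G001`), and `export_csv` function are hypothetical and chosen only for illustration; the tool's actual file layout may differ, so please check the exported files themselves.

```python
import csv
import io

# Hypothetical frequencies and grouping, as might result from annotation.
entity_freq = {"賈寶玉": 3, "林黛玉": 2, "金陵": 1}
groups = {"主角": ["賈寶玉", "林黛玉"]}


def export_csv(entity_freq, groups):
    """Write one row per entity, with IDs linking each entity to its group."""
    entity_ids = {name: f"E{i:03d}" for i, name in enumerate(entity_freq, start=1)}
    group_of = {
        member: (f"G{i:03d}", group_name)
        for i, (group_name, members) in enumerate(groups.items(), start=1)
        for member in members
    }
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["entity_id", "entity", "frequency", "group_id", "group"])
    for name, freq in entity_freq.items():
        # Entities that belong to no group get empty group columns.
        group_id, group_name = group_of.get(name, ("", ""))
        writer.writerow([entity_ids[name], name, freq, group_id, group_name])
    return buf.getvalue()


csv_text = export_csv(entity_freq, groups)

# Reading the file back shows how the IDs support further analysis.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])
```

Because every row carries both an entity ID and a group ID, the file can be joined or filtered in a spreadsheet or data-analysis library without relying on the entity text itself.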

For more features and explanations, please see our user manual.

Download our tool and try it out!

URL: https://github.com/hkust-lib-ds/P001-PUBLIC_Chinese-NER-Tool

We would be delighted to learn how our tool can support your research. If you have the opportunity to utilize this tool in your work, we would greatly appreciate it if you could share your experience with us!

Words from students

YIP Sau Lai, Sherry

BSc in Data Science and Technology
Year 3


I realized the importance of dividing the project into smaller, independent modules and prioritizing them. This approach facilitated task division and procedural planning, making the implementation process more manageable and efficient…

HAN Liuruo, Berry

BSc in Data Science and Technology
Year 2


Throughout this project, I have had the opportunity to expand my knowledge and skills in various areas, such as website functionality planning, user-oriented design, collaborative workflow with GitHub, etc…

Project Team

Developers

  • YIP Sau Lai, Sherry ◇ Year 3 student, BSc in Data Science and Technology
  • HAN Liuruo, Berry ◇ Year 2 student, BSc in Data Science and Technology

Advisers

  • Holly CHAN ◇ Assistant Manager (Digital Humanities)
  • Leo WONG ◇ Librarian (Systems & Digital Services)
  • Jennifer GU ◇ Librarian (Research Support)
  • Aster ZHAO ◇ Librarian (Research Support)

Presentation

Yip, Sherry S.L., Han, Berry L., & Chan, Holly H.Y. (2024, November 27). From ideation to implementation: Develop a Chinese NER tool to enrich literary experiences (從構思到實現的全過程:製作中文自動實體標注工具,豐富文學體驗) [Conference presentation]. The 15th Conference on Cooperative Development and Sharing of Chinese Resources (CCDSCR), Hong Kong. https://www.hkpl.gov.hk/en/extension-activities/ccdscr2024/schedule.html

Yip, Sherry S.L., Han, Berry L., & Chan, Holly H.Y. (2024, December 1). Student-powered DS CoLab project: Develop a Chinese Named-Entity Recognition (NER) tool within one semester from the ground up [Conference presentation]. The 15th International Conference of Digital Archives and Digital Humanities (DADH), Taipei, Taiwan. https://sites.google.com/view/dadh2024/

Publication

Paper to be released soon. Stay tuned!