Project P001 – Chinese NER Tool – Sharing by Sherry and Berry

HKUST Library Digital Scholarship DS CoLab Project P001 Students Learning Journey

Students Learning Journey

YIP Sau Lai, Sherry
Year 3, BSc in Data Science and Technology


HAN Liuruo, Berry
Year 2, BSc in Data Science and Technology


Sherry and Berry are student interns who took an active part in our project to develop a Chinese Named-Entity Recognition (NER) tool during the 2023/24 Spring Semester (Feb-May 2024). Below, Sherry and Berry share their experience of the project!

Our role in this project

We are Sherry and Berry, Year 3 and Year 2 students respectively in the BSc in Data Science and Technology. We were involved in the whole process of platform development, including functional design and programming. Our main duty was to work as Python developers, building a web platform that performs Named-Entity Recognition (NER) on Chinese paragraphs or articles. The main objective of the project is to give users a straightforward process: they paste or upload a text file, and the web tool automatically annotates entities, identifies their occurrences, and provides other features and visualizations in one go. The aim is to enable non-technical users to carry out NER tasks effortlessly through a user-friendly web interface.
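As a rough illustration of the "identify their occurrences" step, the sketch below counts how often each recognized entity appears. It is not the project's actual code: the `Entity` structure, the `count_entities` helper and the sample data are all hypothetical, and the NER step itself (which would come from a model such as CKIP Transformer) is replaced by a hand-written list.

```python
from collections import Counter
from typing import NamedTuple


class Entity(NamedTuple):
    """One recognized entity (hypothetical structure, not the project's)."""
    word: str   # surface text of the entity
    label: str  # entity type, e.g. GPE or ORG


def count_entities(entities):
    """Count each (word, label) pair and return them most frequent first."""
    counts = Counter((e.word, e.label) for e in entities)
    return counts.most_common()


# Hand-written sample standing in for the output of an NER model.
sample = [
    Entity("香港", "GPE"),
    Entity("香港科技大學", "ORG"),
    Entity("香港", "GPE"),
]

for (word, label), n in count_entities(sample):
    print(f"{word}\t{label}\t{n}")
```

A table like this is the kind of summary that could then feed a frequency chart or a downloadable report.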

Challenges and obstacles

Our first challenge was to identify which NER model(s) to use and to decide what features the platform should include. We researched similar NER tools, such as Peking University’s “Wu Yu Dian” intelligent annotation platform, the CKIP recognition platform and CORPRO. After evaluating these platforms, we drew on our own experience as users to summarize their noteworthy functionalities and areas for improvement into a wish list.

During the development stage, the two of us co-developed the project, which initially posed communication and coordination challenges. For instance, we defined duplicate variables for the same purpose, which caused synchronization issues after some web form actions. To fine-tune the platform for better performance and fewer errors, we agreed on a standard list of shared variables. We also adopted GitHub for project management. We found that it made merging our code and recording changes easier, which helped to minimize programming errors.
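One way to keep such a list of shared variables in a single place is sketched below. This is only an assumption about how it could be done, not the project's real list: the key names and the `init_shared_state` helper are hypothetical. A plain dict stands in for the app's state container; in a Streamlit app, `st.session_state` supports the same `in` check and item assignment, so it could plausibly be passed in instead.

```python
# Hypothetical shared keys and defaults; the project's real list is not shown.
SHARED_DEFAULTS = {
    "input_text": "",        # text pasted or uploaded by the user
    "entities": [],          # entities found by the NER step
    "analysis_done": False,  # whether results are ready to display
}


def init_shared_state(state):
    """Add any missing shared key to `state` without overwriting existing values."""
    for key, default in SHARED_DEFAULTS.items():
        if key not in state:
            # Copy list defaults so separate states never share one list object.
            state[key] = list(default) if isinstance(default, list) else default
    return state


# A plain dict stands in for the app's state container here.
state = init_shared_state({"input_text": "香港科技大學"})
print(sorted(state))
```

Because every developer reads and writes the same named keys, two people are less likely to introduce separate variables for one purpose.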

Fig. 1: We conducted an environmental scan of the existing NER tools.
Fig. 2: List of tasks for keeping track of the progress
Fig. 3: Creating a framework at the beginning of the implementation stage was essential. This not only helped with code organization but also aided in structuring the data storage and overall architecture of the project.
Fig. 4: Providing clear explanations through comments in the code can improve collaboration by making it easier for others to understand the code and track responsibilities efficiently.
Fig. 5: We learned the significance of defining necessary functions, as they enhance code reusability. Moreover, we realized that implementing functions in the correct order can minimize duplication and overlapping tasks. Flexibility for future improvements or modifications should also be considered when designing a function.

What we have learned

Better communication and user-oriented design

In the design stage, we kick-started the user interface (UI) design on Figma based on the collected requirements and the wish list of features. After the first version of the UI design was completed, we sought further comments from library team members and potential end-users. We found that stakeholders' expectations and their interpretation of the operation flow may shift slightly between iterations. Setting up milestones and maintaining continuous communication are crucial to creating an intuitive and user-oriented application.

Fig. 6: We drafted the user interface (UI) design on Figma based on the collected requirements and the wish list of features.

 
IT proficiency in programming and operations in a server environment

During the testing stage, we first developed the platform locally on our personal computers and deployed it for review during the regular meetings. However, the environments on different computers were not identical, which caused various programming errors as well as performance issues. We then realized that running the platform in a shared server environment could minimize these problems and smooth the development process, while leveraging cloud-based computational power for efficient processing.
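A simple way to see how two machines differ is to print an environment report on each and compare the results. The sketch below is an illustration under assumptions, not something the project is known to have used: the `environment_report` and `diff_reports` helpers are hypothetical, and the package names are assumed PyPI names for the tools listed in this post.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Return the Python version, OS details and installed package versions."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


def diff_reports(a, b):
    """List the keys whose values differ between two reports."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Assumed PyPI package names for the tools named in this post.
local = environment_report(["streamlit", "plotly", "ckip-transformers"])
for key, value in local.items():
    print(f"{key}: {value}")
```

Running the same script on a laptop and on the shared server, then passing both reports to `diff_reports`, would show which versions are out of step.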

Because of the change in environment and in the way functions and features were delivered, we had to embrace a mindset of continuous learning and adopt better solutions to implement the platform. We actively sought out new information and explored alternative approaches, techniques and tools for implementing features. In the end, we adopted the following tools to support our development:

Tool | Function
Python | Programming language for building the platform and handling all logical processes.
Streamlit | Python framework for building and deploying web apps.
Plotly | Python graphing library for making interactive graphs.
CKIP Transformer | Python library for Chinese natural language processing (NLP).
GitHub | Developer platform for storing and managing source code.
Figma | Collaborative web application for interface design.