Project P003 – English and Chinese Topic Modeling Tool

HKUST Library Digital Scholarship DS CoLab Project P003

Project Background

Topic modeling is a natural language processing technique that uncovers the underlying themes, or topics, within a collection of documents. By analyzing large volumes of text, it helps researchers group related documents together, making it easier to understand the main subjects in each group. It is particularly useful in fields such as the social sciences, marketing, and digital humanities, where understanding the context and themes within large datasets is crucial.

While several topic modeling tools are available online, some require users to upload their documents, often limit the number of files, or require login credentials, which raises privacy concerns or creates access barriers for some users. In addition, more diverse visualizations tailored to different analytical purposes could make results easier to interpret. These considerations point to the value of developing a custom topic modeling tool to fill these gaps.

To achieve this, our student developers, Sherry and Yolanda, conducted a comprehensive environmental scan of existing topic modeling tools, identified their key features, and designed topic modeling tools for English-language and Chinese-language data respectively.

Project Goal

The aim of this project is to develop tools that streamline the topic modeling process, enabling users to uncover the underlying topics within a large collection of documents efficiently. Another goal is to customize the tools to fit our specific workflow. By developing our own solution, we aim to ensure that the output is structured conveniently for categorizing our collections, facilitating a seamless transition to the next stages of analysis.

We also hope our tools can help other researchers, particularly in the English and Chinese language research communities, perform topic modeling more efficiently and gain quick insights through the export function, the topics-over-time feature, and the other visualizations and functions.

Cover image: the topic modeling tools developed by our student developers, Sherry and Yolanda.

Feature highlights

Custom Parameters

You can fine-tune your topic modeling results by setting different parameters, such as the minimum number of documents per topic, the number of keywords for each topic, and the maximum number of topics to generate, giving you greater control and flexibility in analyzing your data.
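
To make the effect of these three parameters concrete, here is a minimal Python sketch. It is not the tool's actual code: the names `TopicModelParams`, `min_docs_per_topic`, `keywords_per_topic`, `max_topics`, and `apply_params` are hypothetical, and the candidate topics are made-up illustration data. Real topic models usually apply such settings during training rather than as a filter afterwards, so treat this only as a picture of what each parameter controls.

```python
from dataclasses import dataclass


@dataclass
class TopicModelParams:
    # Hypothetical parameter names; the tool's own labels may differ.
    min_docs_per_topic: int = 10   # drop topics backed by fewer documents
    keywords_per_topic: int = 10   # number of keywords kept for each topic
    max_topics: int = 20           # upper bound on the topics returned


def apply_params(candidates, params):
    """Filter and trim candidate topics.

    `candidates` is a list of (doc_count, keywords) pairs, with keywords
    already sorted from most to least representative.
    """
    # 1. Drop topics that are backed by too few documents.
    kept = [c for c in candidates if c[0] >= params.min_docs_per_topic]
    # 2. Keep only the largest topics, up to the maximum.
    kept.sort(key=lambda c: c[0], reverse=True)
    kept = kept[: params.max_topics]
    # 3. Truncate each keyword list.
    return [(n, kws[: params.keywords_per_topic]) for n, kws in kept]


# Made-up candidate topics: (number of documents, ranked keywords).
candidates = [
    (40, ["library", "archive", "catalog", "collection"]),
    (3, ["misc", "other"]),
    (25, ["student", "course", "exam"]),
    (12, ["budget", "funding", "grant"]),
]
params = TopicModelParams(min_docs_per_topic=5, keywords_per_topic=2, max_topics=2)
topics = apply_params(candidates, params)
print(topics)
```

With these settings, the 3-document topic falls below the minimum threshold, only the two largest remaining topics are kept, and each is described by its top two keywords.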

Custom Stopwords

To help you achieve more precise results, we provide several stopword lists, along with the option to add or remove specific words. You can also upload your own stopword list to suit the specific context of your data.
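
As a rough illustration of how built-in lists, additions, and removals can combine, the following Python sketch merges several stopword lists and then filters tokens. The function names `build_stopwords` and `remove_stopwords` and the tiny word lists are assumptions made for this example, not the tool's implementation. The Chinese tokens are assumed to be already segmented into words, since Chinese text has no spaces between words.

```python
def build_stopwords(base_lists, add=(), remove=()):
    """Merge base stopword lists, then apply user additions and removals."""
    stopwords = set()
    for words in base_lists:
        stopwords.update(w.lower() for w in words)
    stopwords.update(w.lower() for w in add)
    stopwords.difference_update(w.lower() for w in remove)
    return stopwords


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stopwords (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]


# Tiny made-up lists for illustration; real lists are much longer.
english_basic = ["the", "of", "and", "is", "not"]
chinese_basic = ["的", "了", "是"]

stopwords = build_stopwords(
    [english_basic, chinese_basic],
    add=["library"],   # domain word that appears in almost every document
    remove=["not"],    # keep negation because it carries meaning here
)

english_tokens = ["The", "library", "is", "not", "open"]
chinese_tokens = ["图书馆", "的", "开放", "时间"]  # assumed pre-segmented
filtered_en = remove_stopwords(english_tokens, stopwords)
filtered_zh = remove_stopwords(chinese_tokens, stopwords)
print(filtered_en)
print(filtered_zh)
```

Adding a domain word such as "library" keeps it from dominating every topic, while removing "not" from the list preserves negation when it matters for the analysis.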

Track Topic Changes Over Time

One of our notable features is the ability not only to generate topics within a collection but also to track how those topics evolve over time. This allows you to observe how certain topics dominate in specific periods and to identify trends in their rise or fall.
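
The idea behind tracking topics over time can be sketched in a few lines of Python: once each document has a date and an assigned topic, counting documents per topic in each period shows which topic dominates when. The function names, the yearly grouping, and the sample data below are assumptions for illustration only; the tool's own computation may differ.

```python
from collections import Counter, defaultdict
from datetime import date


def topics_over_time(assignments):
    """Count documents per topic within each year.

    `assignments` is an iterable of (document_date, topic_label) pairs.
    Returns a dict mapping year -> Counter of topic -> document count.
    """
    counts = defaultdict(Counter)
    for doc_date, topic in assignments:
        counts[doc_date.year][topic] += 1
    return dict(counts)


def dominant_topic_by_year(counts):
    """Return the most frequent topic for each year, in year order."""
    return {year: c.most_common(1)[0][0] for year, c in sorted(counts.items())}


# Made-up (date, topic) assignments for illustration.
assignments = [
    (date(2021, 3, 1), "teaching"),
    (date(2021, 6, 15), "teaching"),
    (date(2021, 9, 10), "research"),
    (date(2022, 1, 20), "research"),
    (date(2022, 5, 5), "research"),
    (date(2022, 11, 30), "teaching"),
    (date(2023, 2, 14), "research"),
]
counts = topics_over_time(assignments)
dominant = dominant_topic_by_year(counts)
print(dominant)
```

In this sample, "teaching" leads in 2021 and "research" takes over from 2022, which is the kind of rise and fall the feature is meant to surface.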

Visualizations

Our tool offers a variety of visualizations that let you quickly see how topics are distributed across documents and how different topics compare with one another.

Export and Import Custom Model

Our tool provides the option to export your own trained model, which is particularly useful for datasets that are continually growing. When new data is available, simply upload the model you previously trained and apply the same topic modeling approach to the new data.

Export Results

You can export the topic modeling results, which include the topic assigned to each document, in CSV format. This format is convenient for further analysis or categorization tasks later in your workflow.
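
To show why a per-document CSV is handy for later categorization, here is a small Python sketch that writes document-to-topic assignments and then reads them back grouped by topic. The column names (`document`, `topic_id`, `topic_keywords`), the function names, and the rows are hypothetical; the tool's actual export layout may differ. An in-memory buffer stands in for a file so the example is self-contained.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; check the tool's real export for its layout.
FIELDNAMES = ["document", "topic_id", "topic_keywords"]


def export_results(rows, fileobj):
    """Write one row per document with its assigned topic."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def group_documents_by_topic(fileobj):
    """Read an exported CSV and list the documents under each topic id."""
    groups = defaultdict(list)
    for row in csv.DictReader(fileobj):
        groups[int(row["topic_id"])].append(row["document"])
    return dict(groups)


# Made-up results for illustration.
rows = [
    {"document": "doc_001.txt", "topic_id": 0, "topic_keywords": "library, archive"},
    {"document": "doc_002.txt", "topic_id": 1, "topic_keywords": "student, course"},
    {"document": "doc_003.txt", "topic_id": 0, "topic_keywords": "library, archive"},
]

buffer = io.StringIO()  # stands in for a file on disk
export_results(rows, buffer)
buffer.seek(0)
groups = group_documents_by_topic(buffer)
print(groups)
```

For a real file, the buffer could be replaced with `open("results.csv", "w", newline="", encoding="utf-8")` when writing, where the file name is again only a placeholder.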

All visualizations generated in our tool are also downloadable for your use.

For more features and explanations, please see our user manual (available in Chinese and English).

Download our tool and try it out!

URL: https://github.com/hkust-lib-ds/P003-PUBLIC_English-and-Chinese-Topic-Modeling-Tool

Words from students

YIP Sau Lai, Sherry

BSc in Data Science and Technology
Year 4


This is my second time to participate in the Library’s DS CoLab project. This opportunity not only enables me to apply lessons from past projects but also to gain new insights to refine my approach…In this project, I found that getting into published papers really helps grasp a field faster…As I move forward, these insights will continue to influence my approach to technology development, always with an eye towards making complex tools usable and useful for everyone…

WANG Yuning, Yolanda

BEng in Computer Science
Year 3


In this project, I’m responsible for developing a topic modeling tool that focused on Chinese text….Prior to this, I didn’t know anything about topic modeling. This project provided me with the opportunity to research and understand its theoretical foundations and existing techniques…Throughout the past four months, I’ve been on a cool journey…this project has significantly broadened my knowledge in the field of natural language processing…

Project Team

Developers

  • YIP Sau Lai, Sherry ◇ Year 4 student, BSc in Data Science and Technology
  • WANG Yuning, Yolanda ◇ Year 3 student, BEng in Computer Science

Advisers

  • Holly CHAN ◇ Assistant Manager (Digital Humanities)
  • Amanda MAK ◇ Officer (Systems & Digital Services)
  • Winny MAK ◇ Officer (Research Support)

Presentation

Yip, Sherry S.L., & Wang, Yolanda Y. (2025, April 4). From challenges to solutions: Insights from our journey of developing a topic modeling tool for English and Chinese [Symposium online presentation]. Global Digital Humanities Symposium 2025, Michigan State University, the United States. https://msuglobaldh.org/abstracts/#yip-wang