In today’s data-driven research landscape, the ability to extract insights from vast collections of text is no longer just an advantage; it is a necessity. Text and Data Mining (TDM) transforms research by allowing computers to analyze massive text collections far faster than any human reader could. This computational method reveals patterns and relationships that traditional reading would miss. As AI capabilities grow, TDM has become essential for modern academic work. However, many researchers wonder: can library-licensed materials, such as journal articles, be used for TDM, or even for training AI models?
In this article, we’ll compare TDM policies, highlight useful tools, and showcase real-world applications to help you navigate this evolving field.
Navigating TDM Policies
Let’s start with policies. Understanding publisher policies is crucial when starting a TDM project. While many researchers are eager to apply TDM techniques to scholarly content, they often face uncertainty about what they can and cannot do with licensed materials. To make the landscape clearer, we’ve organized TDM policies into four main categories.
1. Publishers Offering TDM Services for Free
Many publishers, including Elsevier, Springer Nature, Taylor & Francis, and SAGE, offer TDM services to their subscribing institutions, often via APIs for non-commercial use, with basic access at no charge. These services typically cover both subscribed and open access (OA) content. Wiley restricts access to subscribed content only, while IEEE limits TDM to OA content and requires permission for non-OA content and for commercial use. Web of Science provides basic metadata for free but charges a fee for advanced use. Learn more: https://libguides.hkust.edu.hk/tdm
2. Publishers Requiring Separate Licenses
3. Platforms with Built-in TDM Support
Gale Digital Scholar Lab, JSTOR Data for Research (through Constellate), and ProQuest TDM Studio offer built-in tools with visualizations and tutorials to support TDM research, particularly in digital humanities and social sciences. (Note that Constellate will sunset on July 1, 2025, but the notebooks and tutorials will remain accessible via the Constellate GitHub.)
4. Open Sources for TDM Research
Open sources provide alternative routes into TDM, often with fewer barriers because the content is openly accessible. Popular sources include CrossRef, PubMed Central (PMC), arXiv, and OpenAlex. CrossRef provides metadata and publisher-supplied full-text links through a standardized API; arXiv offers preprints under open licenses; PMC provides specific datasets for text mining through various access methods; and OpenAlex is an open bibliographic database with extensive metadata on academic publications, accessible via API or direct download. Learn more: https://libguides.hkust.edu.hk/tdm/open-sources
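As a rough illustration of API access to an open source, the sketch below builds a query against the OpenAlex works endpoint and reduces the JSON response to a few fields. The helper names (`build_query_url`, `summarize_works`, `fetch_works`), the choice of fields, and the open-access filter are assumptions made for this example rather than a recommended workflow; check the current OpenAlex documentation before relying on any parameter.

```python
import json
import urllib.parse
import urllib.request

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_query_url(search, per_page=5, mailto=None):
    """Build a works query restricted to open access records."""
    params = {
        "search": search,
        "per-page": per_page,
        "filter": "open_access.is_oa:true",
    }
    if mailto:
        # Identifying yourself is a courtesy to the service (assumed optional).
        params["mailto"] = mailto
    return OPENALEX_WORKS_URL + "?" + urllib.parse.urlencode(params)


def summarize_works(payload):
    """Keep only title, year, and DOI from each record in the response."""
    return [
        {
            "title": work.get("title"),
            "year": work.get("publication_year"),
            "doi": work.get("doi"),
        }
        for work in payload.get("results", [])
    ]


def fetch_works(search, per_page=5, mailto=None):
    """Issue one HTTP request and return the summarized records."""
    url = build_query_url(search, per_page, mailto)
    with urllib.request.urlopen(url, timeout=30) as response:
        return summarize_works(json.load(response))
```

Calling `fetch_works("text and data mining")` would send a single request; splitting URL building and response parsing into separate functions keeps both parts testable without network access.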
Essential TDM Tools
A robust TDM workflow combines specialized tools for data access, analysis, and visualization.
- For data access, researchers commonly use publisher APIs like Elsevier’s API and Springer Nature’s API, along with open platforms like OpenAlex and CrossRef.
- For analysis, Python-based libraries form the backbone of many TDM projects: NLTK and spaCy handle natural language processing tasks, while scikit-learn provides machine learning capabilities. Data processing often relies on pandas and numpy for efficient handling of large datasets.
- For visualization, researchers can choose between programmatic options like matplotlib and seaborn (for Python), or no-code tools like VOSviewer for bibliometric network analysis.
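To give a feel for the analysis step, here is a minimal sketch that tokenizes a few titles and counts the most frequent terms. It deliberately uses only the Python standard library; the regular-expression tokenizer and the tiny stopword list are stand-ins, chosen for this example, for what NLTK or spaCy would do far more thoroughly, and the sample titles are invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real projects would use a library's list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "on", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def top_terms(documents, n=3):
    """Return the n most frequent terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts.most_common(n)


docs = [
    "Text mining of clinical notes",
    "Mining the literature for drug targets",
    "Deep learning for text classification",
]
print(top_terms(docs, n=2))
```

The same term counts could be loaded into a pandas DataFrame or plotted with matplotlib once the corpus grows beyond a handful of documents.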
Several integrated platforms simplify the TDM workflow: Gale Digital Scholar Lab provides humanities-focused analysis tools, while JSTOR’s Constellate (available until July 2025) combines content access with ready-to-use notebooks for text analysis.
Real-World TDM Applications
Here are several innovative applications showcasing how TDM enables researchers to systematically analyze vast amounts of scholarly content across different disciplines.
- Literature Review Acceleration
Researchers use TDM to rapidly identify relevant papers and extract key findings across thousands of publications. For example, biomedical researchers employed TDM to analyze PubMed articles, discovering hidden connections between diseases and drug compounds that weren’t obvious through traditional literature reviews (Tsuruoka et al., 2011).
- Research Trend Analysis
TDM helps track how research topics evolve over time. Climate scientists analyzed more than 400,000 papers indexed in Web of Science to map research trends and identify emerging areas needing more investigation (Callaghan et al., 2020).
- Market Intelligence
Financial analysts use TDM on news sources to gauge market sentiment and predict trends. One study developed models to analyze Reuters news articles, helping predict market movements based on news sentiment (Malo et al., 2014).
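The trend-analysis idea above can be sketched in a few lines: group records by publication year and compute the share that mention a term. This is a toy illustration, not the method used in any of the cited studies; the function name `yearly_share`, the substring matching, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict


def yearly_share(records, term):
    """Return {year: fraction of that year's abstracts mentioning term}.

    records is an iterable of (year, abstract) pairs.
    Matching is a plain case-insensitive substring test.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = term.lower()
    for year, abstract in records:
        totals[year] += 1
        if needle in abstract.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Invented sample data for illustration only.
records = [
    (2018, "Sea level rise along coastal cities"),
    (2018, "Crop yields under drought"),
    (2019, "Sea level and storm surge modelling"),
    (2019, "Adaptation planning for sea level change"),
]
print(yearly_share(records, "sea level"))
```

Reporting a share rather than a raw count guards against mistaking overall growth in publications for growing interest in a topic.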
TDM and AI Model Training
Training AI models on large text and image datasets typically involves TDM, so it must comply with the licensing agreements that govern the source material. While many publishers support TDM for research purposes, they often impose restrictions such as download limits or access fees. When using TDM for AI model training, consider these key points:
- Check publisher policies specifically regarding AI training.
- Open access content under CC BY licenses generally permits AI training, but third-party content within OA works may have stricter copyright terms.
- Commercial use typically requires separate agreements.
- Document your training data sources and permissions.
- Implement rate limiting to comply with publisher requirements.
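The last two points can be sketched together: a small limiter that spaces out requests, and a helper that records where each item came from. Everything here is a hypothetical illustration; the class and function names are invented, and the minimum interval must come from the publisher's own terms, which this article does not specify.

```python
import json
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def provenance_line(source, license_name, url, retrieved_on):
    """Serialize one training-data record as a JSON line for an audit log."""
    entry = {
        "source": source,
        "license": license_name,
        "url": url,
        "retrieved_on": retrieved_on,
    }
    return json.dumps(entry, sort_keys=True)
```

In use, a script would call `limiter.wait()` before each download and append the result of `provenance_line(...)` to a log file, so that the origin and license of every training item can be traced later.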
Conclusion
TDM offers researchers a powerful methodology, but effective and legal implementation requires understanding the policies and tools offered by publishers, aggregators, and open sources. Keep in mind that this field evolves rapidly, with policies frequently updating as AI technologies advance. Always verify the most current information before launching any new TDM project.
By Aster Zhao, Library
Category: Research Tools
Tags: TDM, text analysis, text and data mining
Published February 28, 2025