
Text and Data Mining: Full-text Databases

Library Research Support Services recently did a small-scale study to learn about the text and data mining (TDM) policies of some of our databases. We found that most of them allow TDM under certain conditions.

Needs for Data Mining Resources

The Library understands that our faculty and research students sometimes need massive amounts of data for their scientific exploration, especially for data-hungry research such as machine learning, deep learning, and semantic analysis. For this reason, we introduced the arXiv corpus in a previous post. We also wanted to find out whether resources within our own collections can be used for TDM, and databases are the natural target.

About the Study

The study covered 13 full-text databases, some with very high usage rates. We searched for the TDM policy on each publisher's website and examined the details. Factiva, Lexis/Nexis (Nexis Uni), and ProQuest do not permit data mining, and we found no TDM policy on EbscoHost's website. All four are aggregator databases that collect full text from other publishers and may not own the copyright.

Findings

The majority of the publishers that support TDM offer the service free of charge. However, there are usually rules and requirements to fulfill, and the data mining and data delivery methods can differ considerably from one publisher to another.

Commonly Seen Terms and Conditions

  • Use the data for non-commercial research purposes only.
  • Text-mine only subscribed and open access content.
  • Follow the download limit, e.g. 3 requests per second.
  • Do not share the data with third parties.
  • Delete the data once the project ends.
  • Use APIs to extract data instead of crawling the database with web robots, spiders, etc.
  • Obtain a license or sign an agreement where required.
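
To illustrate the download-limit condition, here is a minimal Python sketch of a client-side throttle. It assumes the "3 requests per second" example above; the `RateLimiter` class and its parameters are hypothetical names invented for this illustration and are not part of any publisher's API. The actual limit for a given database should be taken from its TDM terms.

```python
import time


class RateLimiter:
    """Space out calls so that no more than `max_per_second` start per second.

    Hypothetical helper for illustration; the limit itself must come from
    the publisher's TDM terms.
    """

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self.min_interval = 1.0 / max_per_second
        self._clock = clock          # injectable so the class can be tested
        self._sleep = sleep
        self._next_allowed = None    # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed to start."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            # Re-read the clock in case the sleep ran longer than requested.
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self.min_interval
```

A script would create one limiter, for example `limiter = RateLimiter(3)`, and call `limiter.wait()` immediately before each API request, so that requests are never issued faster than the agreed rate.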

Specific Requirements and Services

The following are examples of specific requirements and services offered by some publishers.

Elsevier: Text and data mining

  • Supplies over 40 APIs for Elsevier’s products including Scopus, ScienceDirect, SciVal, PlumX, and others.
  • Users need to obtain an API key via Elsevier’s Developer Portal.
  • An Object Retrieval API is available for mining images.
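
As a rough sketch of how an API key is typically attached to a request, the Python example below builds (but does not send) a full-text request by DOI. The endpoint URL and the `X-ELS-APIKey` header reflect our understanding of Elsevier's public API documentation and should be confirmed on the Developer Portal before use. The function name, the DOI, and the key shown in the usage note are hypothetical placeholders.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumed endpoint for full-text retrieval by DOI; confirm it against the
# Developer Portal documentation before relying on it.
ARTICLE_BY_DOI = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi, api_key, accept="text/xml"):
    """Return a prepared (unsent) request for one article's full text.

    `doi` and `api_key` are supplied by the caller; nothing is sent
    over the network here.
    """
    if not api_key:
        raise ValueError("an API key from the Developer Portal is required")
    url = ARTICLE_BY_DOI + quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the API key
        "Accept": accept,         # requested response format
    }
    return Request(url, headers=headers)
```

For example, `build_article_request("10.1016/j.example.2020.01.001", "YOUR-API-KEY")` prepares a request for a made-up DOI. Passing the result to `urllib.request.urlopen` would send it, which should only be done for subscribed or open access content and within the publisher's rate limits.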


Springer Nature: Text and Data Mining at Springer Nature

  • Offers various APIs to facilitate TDM, e.g. Citations API, SN SciGraph APIs, and more.
  • Provides a selection of metadata formats such as JATS, Dublin Core, ONIX, or MARC records.
  • Supports argumentation mining.

Wiley: Text and Data Mining

JSTOR: Dataset Services

Anyone can request a dataset through either of the two services below.

  • Self-service: limited to 25,000 documents; does not cover full text.
  • Large/full-text request: by special request and requires an agreement about the use of the data.

Gale (Cengage): Data Mining FAQs

  • The library needs to sign an addendum contract.
  • Delivers data on a hard drive; the library is responsible for arranging user access to the content on the drive.
  • Charges a fee based on the cost of production and delivery.

Other Databases

The following is a summary of the other four databases.

The TDM terms of use for Cambridge and Sage are very similar to the ones listed under Commonly Seen Terms and Conditions above.

The last two, IEEE and Oxford Academic, both permit TDM for non-commercial use. However, their web pages give no details, and users are told to contact them by email.

We recommend that researchers read through the TDM terms and conditions of the resources they want to use. It is important to find out details such as what is and is not allowed, who owns the copyright of the extracted results, and whether an acknowledgement is required in a specified format.
 

— By Poon Sau Ping, Research Support Services, Library

published September 29, 2020
last modified December 28, 2022
