Drug-induced liver injury (DILI) poses a considerable challenge in clinical and research contexts: it is a primary cause of acute liver failure in numerous Western countries and a significant factor in the attrition of new drug candidates. Manual literature review remains the predominant method for accessing DILI-related information, yet it is inefficient and susceptible to human error. Consequently, there is a growing demand for automated solutions that can search the extensive literature and extract pertinent data for the drug discovery community. Addressing this need, a recent study introduces DILIC (DILI classifier), an artificial intelligence (AI) model that combines natural language processing (NLP) and machine learning (ML) techniques.
DILIC aims to streamline the retrieval of DILI-related articles from the vast expanse of available literature. The methodology involves several pivotal stages:
Data Collection and Preparation: A dataset comprising over 28,000 articles annotated for DILI was curated and divided into discovery and validation sets. External datasets from the FDA and SIDER were also incorporated to give the model additional drug and adverse-event information.
Natural Language Processing (NLP): A bespoke NLP model was developed to extract pertinent keywords and patterns from article titles and abstracts. This encompassed tokenization, lemmatization, and filtering to generate meaningful keyword sets.
Pattern Mining: Utilizing the Apriori algorithm, frequent patterns were mined from the extracted keywords, with weights assigned based on their occurrence in DILI-positive and DILI-negative articles.
External Cohort Integration: Information from FDA-approved drug lists and the SIDER adverse events dataset was integrated to bolster the model’s accuracy in classifying DILI-related literature.
Classifier Training: Multiple ML classifiers were trained using the weighted keyword vectors, with gradient boosting machines (GBM) emerging as the most effective model.
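The stages above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the toy titles, the support threshold, the weighting formula (the fraction of DILI-positive articles containing a pattern minus the fraction of DILI-negative ones), and all function names are assumptions made for illustration. Lemmatization, the external FDA and SIDER data, and the classifier itself are omitted. The pattern miner is a small level-wise, Apriori-style routine written with the standard library only.

```python
import re
from collections import Counter
from itertools import combinations

# Toy stop-word list (assumption); a real pipeline would use a fuller list
# and would also lemmatize tokens.
STOP_WORDS = {"and", "the", "after", "for", "with", "was", "were"}


def extract_keywords(text):
    """Lower-case, tokenize on letters, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return frozenset(t for t in tokens if t not in STOP_WORDS and len(t) > 2)


def frequent_patterns(keyword_sets, min_support, max_size=2):
    """Level-wise (Apriori-style) mining: a size-k candidate is counted only
    if every one of its size-(k-1) subsets was frequent."""
    n = len(keyword_sets)
    frequent = {}
    counts = Counter(frozenset([k]) for ks in keyword_sets for k in ks)
    current = {p: c / n for p, c in counts.items() if c / n >= min_support}
    size = 1
    while current:
        frequent.update(current)
        if size >= max_size:
            break
        size += 1
        items = sorted(set().union(*current))
        candidates = [
            frozenset(c)
            for c in combinations(items, size)
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        ]
        counts = Counter(c for ks in keyword_sets for c in candidates if c <= ks)
        current = {p: c / n for p, c in counts.items() if c / n >= min_support}
    return frequent


def pattern_weights(patterns, pos_sets, neg_sets):
    """Hypothetical weighting: positive-article fraction minus negative-article fraction."""
    weights = {}
    for p in patterns:
        pos_frac = sum(p <= ks for ks in pos_sets) / len(pos_sets)
        neg_frac = sum(p <= ks for ks in neg_sets) / len(neg_sets)
        weights[p] = pos_frac - neg_frac
    return weights


def feature_vector(text, weights):
    """One entry per pattern (in a fixed order): its weight if present, else 0."""
    ks = extract_keywords(text)
    ordered = sorted(weights, key=lambda p: tuple(sorted(p)))
    return [weights[p] if p <= ks else 0.0 for p in ordered]


# Toy, made-up titles standing in for annotated articles.
positive = [
    "Hepatotoxicity and liver injury after amoxicillin",
    "Acute liver injury induced by isoniazid",
]
negative = [
    "Cardiac arrhythmia in elderly patients",
    "Kidney function after contrast imaging",
]

pos_sets = [extract_keywords(t) for t in positive]
neg_sets = [extract_keywords(t) for t in negative]
patterns = frequent_patterns(pos_sets + neg_sets, min_support=0.5)
weights = pattern_weights(patterns, pos_sets, neg_sets)

print(feature_vector("Liver injury in a patient taking statins", weights))
```

In a fuller pipeline, vectors like these would be the input to a classifier such as a gradient boosting machine. That step is left out here to keep the sketch free of third-party dependencies.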
DILIC proved robust, achieving an accuracy of 94.91% in cross-validation and 94.14% in external validation. To make the model more accessible, an R Shiny app was also developed that lets users classify single or multiple abstracts for DILI and provides supplementary resources for understanding and using the tool.
In conclusion, DILIC represents a significant stride in automating DILI literature search, offering a valuable tool for researchers engaged in drug discovery and risk assessment. Future endeavors will concentrate on refining the model’s accuracy and broadening its applicability to other adverse events beyond DILI.
References:
- Rathee, Sanjay, Meabh MacMahon, Anika Liu, Nicholas M. Katritsis, Gehad Youssef, Woochang Hwang, Lilly Wollman, and Namshik Han. “DILIC: An AI-based classifier to search for drug-induced liver injury literature.” Frontiers in Genetics 13 (2022): 867946.