
Your domain: Toxicology data

Introduction

Toxicology is focused on the study of the adverse effects that chemicals produce in living organisms. These chemicals range from substances found in nature to those made in the laboratory for many purposes (drugs, agrochemicals, pesticides, dyes, food additives, cosmetics, household products, etc.). Part of toxicological research is devoted to the adverse effects of chemicals on humans, while another part is devoted to their noxious effects on the environment. Adversity is observed for a compound only at a certain concentration; consequently, hazard characterization should always consider exposure data. Toxicology was traditionally an observational science that obtained an important part of its data from experiments carried out on animals. However, the limitations of animal models in producing human-relevant data, as well as the implementation of the 3R policies (reduction, replacement and refinement of animal experimentation), have motivated a change of paradigm towards a more mechanistic view. Many international initiatives are promoting this change.

On this page, the relevant data management issues for toxicological data from in vitro, animal and human assays and from ecotoxicology studies are explained, together with appropriate solutions for them. It should be pointed out that most toxicology data is generated in a regulatory context, following guidelines for obtaining marketing approval, and that it constitutes an extremely valuable resource that should be made available to the scientific community. For that reason, efforts are being made towards the systematic collection and storage of this data, as well as its standardization, which enables its integration and joint analysis.

Data from in vitro assays - Data analysis and modelling

Description

In vitro cell culture technologies are commonly used in toxicology. They provide an alternative to animal testing and allow the response of cells to toxicant exposure to be assessed. They also provide unique access to biochemical and morphological changes that cannot be observed in vivo. The most commonly used systems are immortalized cell lines and primary cell cultures.

Although two-dimensional cell cultures are very popular, it has been shown that they do not represent the in vivo situation well, as they are still far from the tissue organization and the cellular connections seen in an organism. Recent advances in three-dimensional cell culture technologies have allowed the widespread use of organoids. Organoids have been used for in vitro modelling of drug adverse effects, specifically in organs commonly susceptible to drug-induced toxicities (e.g. gastrointestinal tract, liver, kidney).

In vitro tests in toxicology typically consist of exposing cell cultures to increasing concentrations of the substance under study and recording changes using a wide variety of readouts, from high-content imaging to cell death assays. Among the diverse sources of toxicological in vitro data, it is worth mentioning the results of the Toxicology in the 21st Century program, or Tox21_Toolbox. The results of this project (data and tools) are publicly available.

Gene expression changes that occur in biological systems in response to exposure to xenobiotics may represent mechanistically relevant cellular events contributing to the onset and progression of xenobiotic-induced adverse health outcomes. Transcriptomics data can be used to identify changes in gene expression profiles that occur in response to drug treatment, which might provide predictive and mechanistic insight into the mode of action of a drug, as well as molecular clues linked to possible toxicity.

Considerations

Results of in vitro assays are typically collected as dose-response curves. These results should be processed to obtain indices such as the LC50, IC50 or benchmark concentration (BMC), which indicate the concentration at which relevant effects are observed. This procedure can involve non-linear curve fitting and outlier removal. It is advisable to report the details of the data processing in order to obtain reproducible and standardized results.
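
As an illustration of this kind of processing, the sketch below fits a four-parameter log-logistic model to a dose-response series with the drc R package and extracts the IC50. The data frame, column names and choice of a log-logistic model are assumptions made for the example; PROAST or BMDS (see the Solutions below) offer more complete benchmark-dose workflows.

```r
# Minimal dose-response fitting sketch (assumed example data, not a prescribed workflow)
library(drc)

# Hypothetical viability readouts at increasing concentrations (µM)
dr <- data.frame(
  conc      = c(0.1, 0.3, 1, 3, 10, 30, 100),
  viability = c(0.98, 0.95, 0.90, 0.71, 0.42, 0.15, 0.05)
)

# Four-parameter log-logistic fit; the inflection-point parameter is the IC50
fit <- drm(viability ~ conc, data = dr,
           fct = LL.4(names = c("slope", "lower", "upper", "IC50")))
summary(fit)

# Effective dose at 50% with a confidence interval
ED(fit, 50, interval = "delta")

# Inspect the fit and flag potential outliers via residuals before reporting
plot(fit, log = "x")
which(abs(residuals(fit)) > 2 * sd(residuals(fit)))
```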

Solutions

  • ToxCast_data has published an R package with the tools used to process its high-throughput chemical screening data.
  • Benchmark concentrations (and doses) can be computed with free software such as PROAST and BMDS.
  • For experiments where gene expression has been measured in response to a toxicant, R packages such as DESEq2 for RNA-Seq data and limma for microarray data are used to find genes that are differentially expressed (see the sketch after this list).
  • In silico prediction models can be developed starting from a series of compounds annotated with the results of in vitro methods. The quality of the predictions provided by these methods is often comparable with that obtained by experimental methods, particularly when the models are used within their applicability domain. Flame is an open-source modelling framework developed specifically for this purpose.
  • EDKB is a platform designed to foster the development of computational predictive toxicology. This platform allows direct access to ten libraries containing the following resources: a biological activity database, QSAR training sets, in vitro and in vivo experimental data for more than 3,000 chemicals, literature citations, chemical-structure search capabilities.
  • The T3DB is a bioinformatics resource that combines exhaustive toxin data with toxin target information. Currently it presents more than 42,000 toxin-target associations extracted from other databases, government documents, books and scientific literature. Each toxin record includes data on chemical properties and descriptors, toxicity values and medical information.
  • The Tox21_Toolbox is a unique collaboration between several federal agencies to develop new ways to rapidly test whether substances adversely affect human health. The Tox21 Toolbox contains data-analysis tools for accessing and visualizing Tox21 quantitative high-throughput screening (qHTS) 10K library data, as well as integrating with other publicly available data.
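
A minimal sketch of such a differential expression analysis with DESeq2 is shown below. The count matrix, sample table and dose_group column are placeholders assumed for the example, not part of any specific resource; an analogous analysis of microarray data would use limma.

```r
# Minimal differential expression sketch with DESeq2 (assumed inputs)
library(DESeq2)

# 'counts' : genes x samples matrix of raw RNA-Seq counts
# 'coldata': data.frame with one row per sample and a 'dose_group' column
#            (e.g. "control", "low", "high")
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ dose_group)
dds <- DESeq(dds)

# Genes responding to the highest dose relative to control
res <- results(dds, contrast = c("dose_group", "high", "control"))
head(res[order(res$padj), ])
```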

Data from animal assays - Existing data and vocabularies

Description

Animal assays are expensive. Most animal data come from compiling normative studies, which are compulsory for obtaining approval from diverse regulatory agencies prior to commercialization. The choice of species and strains is determined by their representativeness for the studied endpoints, is often defined in this legislation, and ranges from invertebrates (e.g., Daphnia is commonly used to study aquatic toxicity) and fish (e.g., zebrafish) to rodents (mice, rats, guinea pigs) and other mammals (rabbits, dogs, primates). The ability of animal data to predict human toxicity is questionable, and a precautionary approach or the use of extrapolation factors is recommended.

In spite of their inconveniences (high costs, time consumption, the significant amounts of test substance required, limited translatability of the observed results), in many cases there is no suitable replacement for in vivo tests. The replacement of in vivo data with alternative approaches (often called New Approach Methodologies, NAMs) is an active research field.

Two important toxicogenomics resources containing animal data are TG-GATES and Drug Matrix. These resources contain gene expression data in several rat tissues for a large number of compounds, at several doses and exposure times. They also include histopathology annotations and biochemistry measurements.

Considerations

Data generated in normative studies were obtained under Good Laboratory Practice (GLP) conditions, and therefore the quality of the data is high. However, these studies were oriented towards characterizing a single compound, not towards comparative analyses. Also, the doses used in the studies were designed to detect adversity and may not be representative of the exposure reached by consumers or patients of the marketed substances. Most of the time, the data is not modelled using standards; for example, drugs are annotated using common names, and histopathology annotations are not coded in a controlled vocabulary.

Solutions

  • Use information about genes, and variants associated with human adverse effects, from platforms such as DisGeNET, CTD, and PharmGKB.
  • Histopathology data requires the use of a controlled vocabulary like CDISC/SEND (a minimal term-mapping sketch follows this list).
  • The extension and curation of ontologies like CDISC/SEND to specific domains is facilitated by tools like ONTOBROWSER.
  • In order to reduce the number of animals used in toxicological studies, it has been suggested to replace control groups with historically collected data from studies carried out in comparable conditions (so-called Virtual Control Groups, VCGs). VCGs are being developed within the eTRANSAFE project.
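
As an illustration of coding free-text findings against a controlled terminology, the sketch below joins raw histopathology terms to a small lookup table. The lookup entries and codes are purely hypothetical placeholders; in practice the preferred terms would come from the CDISC/SEND controlled terminology, curated for instance with OntoBrowser.

```r
# Toy normalization of free-text histopathology findings (hypothetical terms and codes)
library(dplyr)

# Raw findings as they might appear in legacy study reports
findings <- data.frame(
  animal_id = c("R001", "R002", "R003"),
  finding   = c("hepatocellular hypertrophy", "Hypertrophy, hepatocellular", "liver necrosis")
)

# Hypothetical controlled-terminology lookup (stand-in for real CDISC/SEND terms)
ct_lookup <- data.frame(
  synonym        = c("hepatocellular hypertrophy", "hypertrophy, hepatocellular", "liver necrosis"),
  preferred_term = c("HYPERTROPHY, HEPATOCELLULAR", "HYPERTROPHY, HEPATOCELLULAR", "NECROSIS"),
  ct_code        = c("C000001", "C000001", "C000002")   # placeholder codes, not real SEND codes
)

coded <- findings %>%
  mutate(finding_norm = tolower(trimws(finding))) %>%
  left_join(ct_lookup, by = c("finding_norm" = "synonym"))

coded
```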

Data from human assays - Existing data and vocabularies

Description

Human response to toxic agents is generally excluded from toxicity assays as it entails major ethical issues. Although relevant information on potential adverse effects is available from animal and in vitro assays, human data is crucial for the accurate calibration of toxicity models based on these studies. Traditionally, exposure to an unknown or unexpected toxic agent was often identified only after the fact, as the trigger of a health problem for which no evident cause existed. In this way, unintentional human exposure to toxic agents has yielded toxicological data. Two main types of sources exist in this regard:

  • Individual or group case reports are a fundamental source of information when no previously reported human toxicity information exists. They include exhaustive medical information on a single patient, or on a set of patients with similar symptomatology, gathered from health care facilities to identify the etiology.
  • Epidemiologic studies (ESs) are focused on the possible association between the exposure to a substance and the potential adverse reactions observed in a given human population. ESs are classified as occupational (individuals are exposed in the workplace), or environmental (individuals are exposed through daily living).

In the pharmaceutical context, though, intentional human exposure to drug candidates is a necessary step in the development of medications. During the clinical trial stage, human exposure to substances is required to characterize efficacy and safety. This process consists of several phases which are exhaustively controlled and subjected to strict regulations and ethical review. Adverse-event monitoring and reporting is a key issue in the assessment of the risk-benefit balance of the medication, which is established from the clinical trial data. After the medication is released to the market, it is subjected to an exhaustive pharmacovigilance process focused on the identification of safety concerns. Reports of serious and non-serious adverse effects from several sources are collected over a period and the risk-benefit balance of the medication is re-evaluated.

Considerations

Data from human assays are highly heterogeneous, and their integration with in vitro and animal data is a challenging task. There is a broad range of publicly available resources containing human data, but sometimes data access is limited. The nature of toxicological data has evolved in recent times, and the available resources and repositories comprise a variety of data types. Some data sources are nicely structured, while others provide detailed information in an unstructured format. Data should be harmonized before integration, since disparate data sources are organized differently and also use different terminologies:

  • Resources providing access to occupational epidemiologic studies report health risks using condition-centred vocabularies (such as ICD9-CM and ICD10-CM) or simply uncoded terms, whereas databases reporting possible links between observed adverse reactions and medications usually code them according to the MedDRA ontology (see the mapping sketch after this list).
  • Different chemical identifiers are used depending on the toxic agent.
  • Similarly, medication identifiers are not always consistent among different sources. This is a challenging issue as many medicinal products have different denominations and available commercial presentations depending on the country/region where the product is commercialized.
  • Usually, structured resources provide metadata explaining how the data is organized, thus enabling an easy data transformation process. Conversely, non-structured resources are not easy to harmonize, as data organization is not consistent across the available documents.
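
To make the mapping problem concrete, the sketch below uses a local copy of the UMLS MRCONSO.RRF file (available to UMLS licensees) to link ICD-10-CM codes to MedDRA terms through their shared concept identifiers (CUIs). The file location and the source abbreviations ('ICD10CM', 'MDR') are assumptions that should be checked against the UMLS release being used; this is an illustration, not a complete mapping pipeline.

```r
# Rough ICD-10-CM -> MedDRA mapping sketch via UMLS CUIs (assumes a local MRCONSO.RRF)
library(dplyr)
library(readr)

# MRCONSO.RRF is pipe-delimited without a header; only a few columns are needed here
mrconso_cols <- c("CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
                  "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF")

mrconso <- read_delim("MRCONSO.RRF", delim = "|",
                      col_names = c(mrconso_cols, "X"),  # trailing delimiter adds an empty column
                      col_types = cols(.default = "c")) %>%
  filter(LAT == "ENG")

icd10  <- mrconso %>% filter(SAB == "ICD10CM") %>% select(CUI, icd10_code = CODE, icd10_term = STR)
meddra <- mrconso %>% filter(SAB == "MDR", TTY == "PT") %>% select(CUI, meddra_pt = STR)

# Terms sharing a CUI are treated here as cross-vocabulary matches
icd10_to_meddra <- inner_join(icd10, meddra, by = "CUI") %>% distinct()
head(icd10_to_meddra)
```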

Databases containing clinical toxicological data on drugs can contain the results of clinical studies (ClinicalTrials.gov), frequent adverse reactions (Medline), or pharmacovigilance data (FAERS). Depending on the data being incorporated, the interpretation is different. For example, in the case of spontaneous reporting systems, the frequency with which an adverse event is reported should be considered relative to the time the compound has been on the market and the frequency of these adverse events in the treated population.
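
One common way to put reported frequencies into context is a disproportionality measure such as the reporting odds ratio (ROR), computed from a 2x2 contingency table of reports. The counts below are invented for illustration, and the source does not prescribe this particular statistic.

```r
# Toy reporting odds ratio (ROR) from spontaneous-report counts (invented numbers)

# n11: reports mentioning the drug of interest AND the event of interest
# n10: reports mentioning the drug but other events
# n01: reports mentioning other drugs and the event
# n00: reports mentioning other drugs and other events
n11 <- 40; n10 <- 960; n01 <- 2000; n00 <- 197000

ror    <- (n11 / n10) / (n01 / n00)
se_log <- sqrt(1/n11 + 1/n10 + 1/n01 + 1/n00)
ci95   <- exp(log(ror) + c(-1.96, 1.96) * se_log)

ror    # point estimate
ci95   # approximate 95% confidence interval
```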

Solutions

Examples of databases containing drug toxicological data:

  • ClinicalTrials.gov is a resource maintained by the National Library of Medicine which makes available privately and publicly funded clinical trials.
  • The FDA Adverse Event Reporting System FAERS contains adverse event reports, medication error reports and product quality complaints submitted by healthcare professionals, consumers, and manufacturers.
  • EudraVigilance, the European database of suspected adverse drug reaction reports, is a public resource aimed at providing access to reported suspected side-effects of drugs. Side-effects are defined according to the MedDRA ontology.
  • The TXG-MAPr is a tool that contains weighted gene co-expression networks obtained from the Primary Human Hepatocytes, rat kidney, and liver TG-GATEs datasets.

Harmonization of terminologies can be achieved by using different resources:

  • The Unified Medical Language System UMLS provides mappings between different medical vocabularies. It includes common ontologies within the condition/diagnosis domain like SNOMED, ICD9CM, ICD10CM, and also the MedDRA ontology.
  • The OHDSI initiative for health data harmonization is an alternative solution for the mapping of vocabularies needed for the harmonization of different resources. This initiative maintains the ATHENA set of vocabularies, which is in constant evolution and covers the relevant domains in the realm of health care. The OHDSI community pays special attention to the mappings between medication identifiers coming from the national regulatory agencies of the countries where the participating institutions are based and RxNorm, the standard drug vocabulary used by OHDSI.
  • Resources in the context of environmental (ITER, IRIS) or occupational (Haz-Map) toxicity that use CAS Registry Number identifiers can be connected with those in the pharmaceutical field, which tend to use ChEMBL identifiers, via molecular identifiers available in both types of resources, such as the standard InChI or standard InChI Key representations. Services like EBI's UniChem can help translate between different chemical identifiers (see the sketch after this list).
  • The GHS Classification was developed by the United Nations in an attempt to align standards and chemical regulations in different countries. GHS includes criteria for the classification of health, physical and environmental hazards, and what information should be included on labels of hazardous chemicals and safety data sheets.
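
The sketch below queries UniChem's REST interface for all source identifiers associated with a single InChIKey (here aspirin's, used as an assumed example). The endpoint path reflects the legacy UniChem API and should be verified against the current UniChem documentation before use.

```r
# Look up cross-references for one InChIKey via UniChem (endpoint path is an assumption;
# verify against the current UniChem API documentation)
library(httr)
library(jsonlite)

inchikey <- "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"   # aspirin, used as an example
resp <- GET(paste0("https://www.ebi.ac.uk/unichem/rest/inchikey/", inchikey))
stop_for_status(resp)

# Each entry pairs a source database id (src_id) with that source's compound identifier
xrefs <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
head(xrefs)
```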

Importing unstructured data sources into structured schemas is a really challenging task, as it involves the application of natural language processing technologies. The development of these tools in the field of toxicology is still at an early stage, but several initiatives exist (a toy co-mention sketch follows this list):

  • The LimTox system is a text mining approach devoted to the extraction of associations between chemical agents and hepatotoxicity.
  • The AOP4EUpest webserver is a resource for the identification of annotated pesticide-biological event associations involved in Adverse Outcome Pathways (AOPs) via text mining approaches.
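
To illustrate the simplest form of such text mining, the toy sketch below flags sentences that co-mention a chemical name and a hepatotoxicity-related keyword. The sentences and term lists are invented; real systems such as LimTox combine rules, patterns and machine-learning classifiers rather than bare keyword matching.

```r
# Toy co-mention screen for chemical / hepatotoxicity terms (invented example sentences)
sentences <- c(
  "Acetaminophen overdose was associated with hepatocellular necrosis.",
  "No treatment-related findings were observed in the kidney.",
  "Elevated ALT and hepatotoxicity were reported after compound X administration."
)

chemical_terms <- c("acetaminophen", "compound x")          # placeholder chemical list
liver_terms    <- c("hepatotox", "hepatocellular", "liver") # crude hepatotoxicity keywords

has_chemical <- grepl(paste(chemical_terms, collapse = "|"), tolower(sentences))
has_liver    <- grepl(paste(liver_terms, collapse = "|"), tolower(sentences))

# Sentences co-mentioning both a chemical and a liver-toxicity term
sentences[has_chemical & has_liver]
```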

Ecotoxicology data - Existing data

Description

Substances can also be characterized according to their potential to affect the environment. This data is collected by national and international regulatory agencies (e.g., ECHA in the EU and EPA in the USA) aiming to control the production, distribution, and use of potentially hazardous substances. Data collection is largely guided by legislation, which defines the tests that should be carried out and the data that should be collected.

Considerations

When considering the effect of a substance on the environment, in addition to its hazard characterization, it is important to consider its environmental fate in terrestrial and aqueous environments, and its properties with respect to degradation by diverse routes (chemical, biodegradation, photodegradation).

Solutions

  • The ECOTOXicology Knowledgebase (ECOTOX) is a comprehensive, publicly available Knowledgebase providing single chemical environmental toxicity data on aquatic life, terrestrial plants, and wildlife.
  • The Comptox dashboard provides toxicological information for over 800,000 chemical compounds, including experimental and predicted fate information.
  • The NBP is a public resource that offers an assessment of nutritional status and the exposure of the U.S. population to environmental chemicals and toxic substances.
  • The NPDS is a resource that provides data on poison exposures occurring in the US and some freely associated states.
  • Pharos provides hazard, use, and exposure information on 140,872 chemicals and 180 different kinds of building products.
  • The REACH registered substances is a portal with public data submitted to ECHA in REACH registration dossiers by substance manufacturers, importers, or their representatives, as laid out by the REACH Regulation (see Understanding REACH regulation).

Related pages

More information

FAIR Cookbook is an online, open and live resource for the Life Sciences with recipes that help you to make and keep data Findable, Accessible, Interoperable and Reusable; in one word FAIR.

With Data Stewardship Wizard (DSW), you can create, plan, collaborate, and bring your data management plans to life with a tool trusted by thousands of people worldwide — from data management pioneers, to international research institutes.

Tool or resource Description Related pages Registry
AOP4EUpest AOP4EUpest web server is devoted to the identification of pesticides involved in an Adverse Outcome Pathway via text mining approaches. Tool info
BMDS EPA's Benchmark Dose Software (BMDS) collects and provides easy access to numerous mathematical models that help risk assessors estimate the quantitative relationship between a chemical dose and the test subject’s response.
CAS Registry The CAS Registry (Chemical Abstracts Service Registry) includes more than 188 million unique chemicals. CAS Registry Numbers are broadly used as a unique identifier for chemical substances. The Registry is maintained by CAS, a subdivision of the American Chemical Society. Standards/Databases
CDISC/SEND CDISC SEND Controlled Terminology
ChEMBL Database of bioactive drug-like small molecules, it contains 2-D structures, calculated properties and abstracted bioactivities. Tool info Standards/Databases Training
ClinicalTrials.gov ClinicalTrials.gov is a resource maintained by the National Library of Medicine which makes available privately and publicly funded clinical trials. Standards/Databases
Comptox The CompTox Chemicals Dashboard provides toxicological information for over 800,000 chemical compounds. It is a part of a suite of databases and web applications developed by the US Environmental Protection Agency's Chemical Safety for Sustainability Research Program. These databases and apps support EPA's computational toxicology research efforts to develop innovative methods to change how chemicals are currently evaluated for potential health risks. Tool info Standards/Databases
CTD A database that aims to advance understanding about how environmental exposures affect human health. Tool info
DESEq2 Differential gene expression analysis based on the negative binomial distribution Tool info Training
DisGeNET A discovery platform containing collections of genes and variants associated to human diseases. Human data Tool info Standards/Databases Training
Drug Matrix A toxicogenomic resource that provides access to the gene expression profiles of over 600 different compounds in several cell types from rats and primary rat hepatocytes.
ECOTOX The ECOTOXicology Knowledgebase (ECOTOX) is a comprehensive, publicly available Knowledgebase providing single chemical environmental toxicity data on aquatic life, terrestrial plants, and wildlife. Standards/Databases
EDKB Endocrine Disruptor Knowledge Base is a platform designed to foster the development of computational predictive toxicology. This platform allows direct access to ten libraries containing the following resources: a biological activity database, QSAR training sets, in vitro and in vivo experimental data for more than 3,000 chemicals, literature citations, chemical-structure search capabilities.
EudraVigilance The European database of suspected adverse drug reaction reports is a public resource aimed at providing access to reported suspected side-effects of drugs. Side-effects are defined according to the MedDRA ontology.
FAERS The FDA Adverse Event Reporting System (FAERS) is a US resource that contains adverse event reports, medication error reports and product quality complaints submitted by healthcare professionals, consumers, and manufacturers. The MedDRA ontology is used for coding adverse effects. Note that reports available in FAERS do not require a causal relationship between a product and an adverse event, and further evaluations are conducted by the FDA to monitor the safety of products. Tool info
Flame Flame is a flexible framework supporting predictive modeling and similarity search within the eTRANSAFE project. Tool info
GHS Classification GHS (Globally Harmonized System of Classification and Labelling of Chemicals) classification was developed by the United Nations in an attempt to align standards and chemical regulations in different countries. GHS includes criteria for the classification of health, physical and environmental hazards, and what information should be included on labels of hazardous chemicals and safety data sheets.
Haz-Map Haz-Map is an occupational health database that makes available information about the adverse effects of exposures to chemical and biological agents at the workplace. These associations have been established using current scientific evidence.
IRIS The Integrated Risk Information System (IRIS) resource evaluates information on health that might arise after exposure to environmental contaminants. Tool info Training
ITER ITER is an Internet database of human health risk values and cancer classifications for over 680 chemicals of environmental concern from multiple organizations worldwide. Training
limma Linear Models for Microarray Data Tool info Training
LimTox The LiMTox system is a text mining approach that tries to extract associations between compounds and a particular toxicological endpoint at various levels of granularity and evidence types, all inspired by the content of toxicology reports. It integrates direct ranking of associations between compounds and hepatotoxicity through combination of heterogeneous complementary strategies from term co-mention, rules, and patterns to machine learning-based text classification. It also provides indirect associations to hepatotoxicity through the extraction of relations reflecting the effect of compounds at the level of metabolism and liver enzymes. Tool info
NBP The National Biomonitoring Program (NBP) is a public resource that offers an assessment of nutritional status and the exposure of the U.S. population to environmental chemicals and toxic substances.
NPDS The National Poison Data System (NPDS) is a resource that provides data on poison exposures occurring in the US and some freely associated states.
OHDSI Multi-stakeholder, interdisciplinary collaborative to bring out the value of health data through large-scale analytics. All its solutions are open source. TransMed Data quality Tool info
ONTOBROWSER The OntoBrowser tool was developed to manage ontologies and code lists. Tool info
PharmGKB A resource that curates knowledge about the impact of genetic variation on drug response. Tool info
Pharos Pharos provides hazard, use, and exposure information on 140,872 chemicals and 180 different kinds of building products. Tool info
PROAST PROAST (copyright RIVM National Institute for Public Health and the Environment) is a software package for the statistical analysis of dose-response data.
REACH registered substances Portal with public data submitted to ECHA in REACH registration dossiers by substance manufacturers, importers, or their representatives, as laid out by the REACH Regulation (see Understanding REACH regulation).
RxNorm RxNorm is a normalized naming system for medications that is maintained by the National Library of Medicine. RxNorm provides unique identifiers and allows unambiguous communication of drug-related information across American health computer systems. Health data Tool info Standards/Databases
T3DB The Toxin and Toxin Target Database is a bioinformatics resource that combines exhaustive toxin data with toxin target information. Currently it presents more than 42,000 toxin-target associations extracted from other databases, government documents, books and scientific literature. Each toxin record includes data on chemical properties and descriptors, toxicity values and medical information. Tool info
TG-GATES A toxicogenomics database that stores gene expression data and biochemistry, hematology, and histopathology findings derived from in vivo (rat) and in vitro (primary rat hepatocytes, primary human hepatocytes) exposure to 170 compounds at multiple dosages and time points. Tool info
Tox21_Toolbox The Toxicology in the 21st Century program, or Tox21, is a unique collaboration between several federal agencies to develop new ways to rapidly test whether substances adversely affect human health. The Tox21 Toolbox contains data-analysis tools for accessing and visualizing Tox21 quantitative high-throughput screening (qHTS) 10K library data, as well as integrating with other publicly available data.
ToxCast_data ToxCast (Toxicity Forecaster) is the US EPA's high-throughput chemical screening programme, which contributes data to the Toxicology in the 21st Century (Tox21) collaboration. This portal contains diverse downloadable results of the ToxCast project.
TXG-MAPr A tool that contains weighted gene co-expression networks obtained from the Primary Human Hepatocytes, rat kidney, and liver TG-GATEs dataset. Tool info
UMLS The Unified Medical Language System (UMLS) is a set of tools that establishes a mapping structure among different vocabularies in the biomedical sciences field to enable interoperability between computer systems.
UniChem UniChem is a very simple, large-scale non-redundant database of pointers between chemical structures and EMBL-EBI chemistry resources. Primarily, this service has been designed to maintain cross references between EBI chemistry resources. These include primary chemistry resources (ChEMBL and ChEBI), and other resources where the main focus is not small molecules, but which may nevertheless contain some small molecule information (e.g., Gene Expression Atlas, PDBe). Training
Contributors