Data Sources

Data in Data Commons comes from the following sources. Each source often includes different surveys. Some sources/surveys include a very large number of variables, some of which might not yet have been imported into Data Commons. The sources are listed alphabetically.

Bureau of Economic Analysis (BEA)

Terms of Use.

Bureau of Labor Statistics (BLS)

Terms of Service.

Centers for Disease Control and Prevention

CDC Data Terms of Service.

Census Bureau

  • American Community Survey: covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population. The ACS 5-year (and 1-year) estimates are updated every year, based on the last 5 years (1 year) of collected data. Data Commons includes thousands of variables across the full range of ACS topics at the country, state, county, city, ZIP Code Tabulation Area, school district, and census tract levels, among others.
  • American Community Survey Education Tabulation (ACS-ED): The National Center for Education Statistics collaborates with the US Census Bureau to create a variety of custom data files that describe the condition of school-age children in the United States at the country, state, and school district level. ACS-ED is updated annually based on ACS five-year period estimates.
  • Cartographic Boundary Files: KML files for counties, states, congressional districts, etc.
  • County Business Patterns: per-industry number of establishments, employment, payroll, and annual payroll, by county, metropolitan statistical area, and ZIP code.
  • Economic Census: number of businesses and amount of revenue, by business payroll status, industry, operation type, and tax status.
  • Census Bureau - Gazetteer Files: basic geographic information about counties, county subdivisions, congressional districts, states, etc.
  • Census Bureau - Population Estimates Program: The Census Bureau's Population Estimates Program (PEP) produces yearly estimates of the population for the United States, its states, counties, cities, and towns, as well as for the Commonwealth of Puerto Rico and its municipios. Data Commons imports the total population estimate data for the US and its states, counties, and cities.
  • Census Bureau - Small Area Health Insurance Estimates (SAHIE): The Small Area Health Insurance Estimates program provides yearly estimates of health insurance coverage status for all counties and states. Data Commons includes all estimates, available by age, race, sex, and income.

US Census Terms of Service.

College Scorecard

  • University Data: data about all undergraduate degree-granting institutions of higher education.

Terms of Service.

Department of Labor

Terms of Service.

Drug Enforcement Administration

US Department of Justice Legal Policies and Disclaimers Terms of Use.

Federal Bureau of Investigation

US Department of Justice Legal Policies and Disclaimers Terms of Use.

Federal Election Commission

FEC Terms of Use.

Federal Reserve

The data is in the public domain.

National Center for Education Statistics

  • Public School and School District Data: general descriptive information such as name, address, and phone number; select demographic characteristics about students and staff; and fiscal data such as revenues and current expenditures. Data Commons includes school and school district level data about student populations by race, gender, lunch eligibility, and grade, as well as student-teacher ratio and teacher count statistics.

NCES Data Usage Agreement and US Department of Education Copyright Status Notice.

National Oceanic and Atmospheric Administration

  • National Climatic Data Center Storm Events Database: occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce; rare, unusual, weather phenomena that generate media attention, such as snow flurries in South Florida or the San Diego coastal area; and other significant meteorological events, such as record maximum or minimum temperatures or precipitation that occur in connection with another event.

National Weather Service Use of NOAA/NWS Data and Products Terms of Service.

National Wildfire Coordinating Group

US Forest Service Terms of Service.

United States Geological Survey

USGS Copyrights and Credits Terms of Service.

The Dartmouth Atlas of Health Care

The Dartmouth Atlas Project "uses Medicare and Medicaid data to provide information and analysis about national, regional, and local markets, as well as hospitals and their affiliated physicians." Data Commons includes the Medicare Reimbursements, Medicare Mortality Rates, and Selected Primary Care Access and Quality Measures datasets.

Data is made available under the Dartmouth Atlas Project Terms of Use.

Opportunity Insights

  • Outcomes (social mobility and a variety of other outcomes, from life expectancy to patent rates) by neighbourhood, college, parental income level, and racial background. Available for Census tracts, counties, and commuting zones.
  • Neighbourhood characteristics for Census tracts, counties, and commuting zones.

Terms of Use.

Eurostat Regional Statistics by NUTS Classification

Terms of Use.

National Oceanic and Atmospheric Administration

Terms of Use.

UNdata

  • Population data: for countries, capital cities, urban and rural areas not covered by other sources.

Terms of Use.

Wikidata

Terms of Use.

World Bank

Terms of Use.

ChEMBL

"ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs." It includes information on drugs at all stages of drug discovery.

This data is made available under the EMBL-EBI Terms of Use. This data was formatted for Data Commons in part through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

Disease Ontology

Disease Ontology was developed as a project by the Institute of Genome Sciences at the University of Maryland School of Medicine. It "is a community driven, open source ontology that is designed to link disparate datasets through disease concepts". It provides a "standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable and sustainable descriptions of human disease terms, phenotype characteristics and related medical vocabulary disease concepts".

The data is made available under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Data Commons includes the 3/7/19 update of the Disease Ontology. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

Encyclopedia of DNA Elements (ENCODE) - BED (Browser Extensible Data) Files

The ENCODE dataset contains information for approximately 7,000 experiments along with 14,000 BED files collected by the Encyclopedia of DNA Elements (ENCODE) Consortium. Examples of experiment metadata captured include the target biosample, assay type, and genome assembly. Each BED file links to its individual BED lines, which state the genomic positions of individual peaks. Data Commons ingested all experimental data in BED format.

Data made available under: ENCODE Data Use Policy for External Users. This data was formatted for Data Commons through a collaboration with Dr. Anthony Oro’s group at Stanford University.
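
As a rough illustration of what a single BED line carries, the sketch below parses the three required BED columns (chromosome, 0-based start, exclusive end) from a tab-delimited line. The `Peak` type, the `parse_bed_line` helper, and the sample line are hypothetical and are not taken from the ENCODE files or from the Data Commons import code.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    """Genomic position of one peak, taken from the first three BED columns."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def parse_bed_line(line: str) -> Peak:
    """Parse one tab-delimited BED line; columns after the third are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start, and end")
    return Peak(fields[0], int(fields[1]), int(fields[2]))


# Hypothetical BED line, invented for illustration.
peak = parse_bed_line("chr1\t1000\t1500\tpeak_1\t850\n")
length = peak.end - peak.start  # half-open interval, so no +1 is needed
```

Because BED intervals are half-open, the length of a peak is simply `end - start`.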

FDA-Approved Drugs

"Drugs@FDA includes information about drugs, including biological products, approved for human use in the United States." Data Commons includes the information about the FDA application for the drug as well as the drug’s strength, active ingredients, dosage forms, administration routes, FDA therapeutic equivalence code, and marketing status.

This data is made available through openFDA terms of service.

FDA - Pharmacologic Class

The FDA established pharmacologic classes "associated with an approved indication of an active moiety that the FDA has determined to be scientifically valid and clinically meaningful". This includes (1) the description of each pharmacologic class, (2) the active moiety code and description, and (3) the compounds associated with each class.

This data is made available through openFDA terms of service. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

Genotype-Tissue Expression (GTEx)

The GTEx eGene and significant variant-gene association data were generated from samples "collected from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. Remaining samples are available from the GTEx Biobank." The single-tissue cis-eQTL data from the v8 release was used. Due to the size of the datasets, only Skin - Not Sun Exposed and Skin - Sun Exposed are made available on the main graph. The data for all tissues can be accessed on the Biomedical Data Commons knowledge graph.

GTEx is an NIH unrestricted-access repository for human genomic data, and the data was made available in compliance with the GTEx Data Release and Publication Policy. GTEx outlines how to cite use of GTEx data in journal publications.

HUPO-PSI Working Groups and Outputs

The Molecular Interactions Controlled Vocabulary from the HUPO Proteomics Standards Initiative working groups is "a structured controlled vocabulary for the annotation of experiments concerned with protein-protein interactions". The ontologies dictionary is represented in a tree structure in the EMBL-EBI Ontology Lookup Service. Data Commons includes three subsets of the ontologies: "interaction detection method", "interaction type" and "database citation", which are commonly used in protein-protein interactions.

Data made available under the Apache License 2.0. The license information of HUPO PSI can be found at the Community Practice. See also the EBI terms of use.

Medical Subject Headings (MeSH)

The MeSH "thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information". Data Commons includes the Descriptor, Concept, and Term elements of MeSH as described here.

This data is from the National Library of Medicine (NLM) and is not subject to copyright and is freely reproducible as stated in the NLM’s copyright policy. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

The Molecular INTeraction (MINT) Database

The MINT Database "focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators."

MINT is one of the ELIXIR Core Data Resources, all of which are committed to open access. Any use of this database should cite:

Licata, Luana, Leonardo Briganti, Daniele Peluso, Livia Perfetto, Marta Iannuccelli, Eugenia Galeota, Francesca Sacco et al. "MINT, the molecular interaction database: 2012 update." Nucleic acids research 40, no. D1 (2012): D857-D861.

NIH National Center for Biotechnology Information ClinVar

"ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence." It contains reports of genetic "variants found in patient samples, assertions made regarding their clinical significance, information about the submitter, and other supporting data." Data Commons includes the January 6, 2020 release of the ClinVar archive supporting both hg19 and hg38 genome assemblies.

This data is from an NIH human genome unrestricted-access data repository and made accessible under the NIH Genomic Data Sharing (GDS) Policy.

NIH National Center for Biotechnology Information Gene

The NIH NCBI gene info datasets from NCBI Gene for a subset of species contain "gene-specific content based on NCBI's RefSeq project, information from model organism databases, and links to other resources." The NCBI RefSeq project is "a comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein". The datasets included are from the February 19, 2020 update. The gene info files for the following species have been added:

  • Caenorhabditis elegans
  • Danio rerio
  • Drosophila melanogaster
  • Gallus gallus
  • Homo sapiens
  • Mus musculus
  • Saccharomyces cerevisiae
  • Xenopus laevis

This data is from an NIH human genome unrestricted-access data repository and made accessible under the NIH Genomic Data Sharing (GDS) Policy.

Side Effect Resource (SIDER)

SIDER is a database of adverse drug reactions curated by the EMBL collaboration. "SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts. The available information include side effect frequency, drug and side effect classifications as well as links to further information, for example drug–target relations." Data Commons hosts version 4.1 of SIDER, released on October 21, 2015.

This data is made available under the Creative Commons Attribution-Noncommercial-Share Alike 4.0 License. Information about citing SIDER can be found here. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

SPOKE Disease Symptom Associations

These are statistical associations between diseases and symptoms, computed by applying Fisher’s exact test to the co-occurrence of disease and symptom terms in PubMed entries, as described in Himmelstein et al. (2017).

The data was previously hosted by UCSF Scalable Precision Medicine Knowledge Engine SPOKE. It was made available by the data’s owner, Sergio Baranzini, for use on Data Commons. This data was formatted for Data Commons through a collaboration with Dr. Sergio Baranzini’s group at UCSF.
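
To make the co-occurrence test above concrete, here is a minimal sketch of a one-sided Fisher's exact test on a 2x2 table of article counts, written with only the Python standard library. The function name, the choice of the one-sided ("greater") alternative, and the counts are assumptions made for illustration; they are not taken from Himmelstein et al. (2017) or from the SPOKE data.

```python
from math import comb


def fisher_exact_greater(both: int, disease_only: int,
                         symptom_only: int, neither: int) -> float:
    """One-sided Fisher's exact test for a 2x2 co-occurrence table.

    Returns the probability, under independence with fixed margins, of seeing
    at least `both` articles that mention the disease and the symptom together.
    """
    disease_total = both + disease_only       # articles mentioning the disease
    other_total = symptom_only + neither      # articles not mentioning it
    symptom_total = both + symptom_only       # articles mentioning the symptom
    n = disease_total + other_total
    denominator = comb(n, symptom_total)
    k_max = min(disease_total, symptom_total)
    # Sum hypergeometric probabilities for every table at least as extreme.
    numerator = sum(
        comb(disease_total, k) * comb(other_total, symptom_total - k)
        for k in range(both, k_max + 1)
    )
    return numerator / denominator


# Hypothetical counts: 3 articles mention both terms, 1 only the disease,
# 1 only the symptom, and 3 mention neither.
p_value = fisher_exact_greater(3, 1, 1, 3)
```

A small p-value would suggest that the disease and symptom terms appear together more often than chance alone would explain; with real PubMed counts the table entries would be far larger than in this toy example.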

The Tissue Atlas

The Human Protein Tissue Atlas contains information about the distribution of proteins in human tissues, derived from antibody-based protein profiling of 44 normal human tissue types and mRNA expression data from 37 different normal tissue types.

This dataset is available under CC BY-SA 3.0. Please also see their Disclaimer and Licence & Citation.

UCSC Genome Browser: Chromosome, Gene, RNA Transcript, and Genetic Variant Annotations

The UCSC Genome Browser originated from The Human Genome Project in 2000 to share and visualize genome data. It has grown to include an agglomeration of various genome assemblies and annotations. Data Commons includes data annotating chromosomes, genes, RNA transcripts, and genetic variants from the UCSC Genome Browser. The .chrom.sizes.txt files were downloaded from the UCSC Genome Browser Downloads page on August 13, 2019. The NCBI RefSeq files were downloaded from the UCSC Table Browser on August 2, 2019 for the following genome assemblies:

  • ce10
  • ce11
  • danRer10
  • danRer11
  • dm3
  • dm6
  • galGal5
  • galGal6
  • hg19
  • hg38
  • mm9
  • mm10
  • sacCer3
  • xenLae2

The All SNPs files were downloaded from the UCSC Table Browser on August 13, 2019 for the following genome assemblies and dbSNP builds:

  • galGal5 (dbSNP Build 147)
  • hg19 (dbSNP Build 151)
  • hg38 (dbSNP Build 151)
  • mm9 (dbSNP Build 128)
  • mm10 (dbSNP Build 142)

The annotation data is made freely available under the UCSC Genome Browser terms of use. The UCSC Genome Browser states how to cite use of their data in a journal article publication.

UniProt

Data Commons includes protein sequence and functional information including protein interaction with chemical compounds maintained by the UniProt Consortium.

The data is made available under the Creative Commons Attribution (CC BY 4.0) License. Further information on the UniProt License and Disclaimer can be found here. The UniProt Consortium states how to cite UniProt data used in a journal article. This data was formatted for Data Commons in part through a collaboration with Dr. Sergio Baranzini’s group at UCSF.

UniProt Controlled Vocabulary of Species

UniProt’s Controlled Vocabulary of Species contains organism species UniProt identification codes, NCBI Taxonomy database identifiers, scientific names, common names, synonyms, and organism kingdoms.

The dataset is available under the CC BY 4.0 license, as shown by the UniProt License and Disclaimer.

New York Botanical Garden (NYBG) - C. V. Starr Virtual Herbarium (Collaboration)

C. V. Starr Virtual Herbarium is a public specimen database with photos and detailed records about millions of plants, fungi, and algae.

The COVID Tracking Project

Terms of Use.

Google Community Mobility Reports

  • Google's COVID-19 Community Mobility Reports "chart movement trends over time by geography, across different categories of places such as retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential." Data Commons includes all statistics for countries, US states, and US counties.

Data made publicly available under the standard Google Terms of Service.

Google Health Reconciled COVID-19 Data

  • Google Health reconciles COVID data from The COVID Tracking Project, Johns Hopkins University, and California Health and Human Services, and makes the reconciled dataset publicly available for research and prediction use via Data Commons.

The New York Times Coronavirus (Covid-19) Data in the United States

  • The New York Times releases cumulative counts of coronavirus cases in the United States at the country, state, and county level, over time. The New York Times compiles this time series data from state and local governments and health departments in an attempt to provide a complete record of the ongoing outbreak. Data Commons imports this data and computes incremental counts for users.

Data made available for non-commercial purposes only with proper citation.
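
The step from cumulative to incremental counts can be sketched as a difference between consecutive days. The snippet below is only an illustration: the `incremental_counts` helper and the sample figures are hypothetical, and it assumes the count before the first reported date was zero, which may not match how Data Commons handles the first observation.

```python
from typing import Dict, List, Tuple


def incremental_counts(cumulative: List[Tuple[str, int]]) -> Dict[str, int]:
    """Turn (ISO date, cumulative count) pairs into per-day increments."""
    increments: Dict[str, int] = {}
    previous = 0  # assumption: no cases before the first reported date
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    for date, total in sorted(cumulative):
        increments[date] = total - previous
        previous = total
    return increments


# Made-up cumulative case counts for a single place.
series = [("2020-03-01", 2), ("2020-03-02", 5),
          ("2020-03-03", 5), ("2020-03-04", 9)]
daily = incremental_counts(series)
```

A day on which the cumulative total does not change yields an increment of zero, and a downward revision in the source data would show up as a negative increment.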

WHO Coronavirus Disease (COVID-19) Dashboard

The World Health Organization publishes national COVID-19 cases and death counts for countries across the world. Data Commons imports this data on a daily basis.

Data made available under CC BY-NC-SA 3.0 IGO.