MetaZooGene Barcode Database (MZGdb):   Version v3.0

The MZGdb database component is a work in progress.   Things are being added/changed as we realize what is useful, not useful, and missing.   During this development and fine-tuning process, your feedback/comments/suggestions are very important!   Please let Todd.OBrien@noaa.gov know if you have any suggestions, corrections, or comments.   We want the MZGdb to be usable, useful, and user-friendly.

As of now, the MZGdb is available in CSV format (comma-separated values, which Excel and other programs prefer) as well as in a "pipe" ("|") separated values format (PSV). The CSV format should open directly in Excel, but be sure to check that the "MZGdb__EOL" column ("AI" or #35) is clean and contains only the string "MZGdb__EOL". (If that column is empty or contains other text, the data columns have shifted for that record. Report this to Todd immediately.) Click here if you want to use the alternate PSV files and need instructions for loading this PSV file type.
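The column-35 check described above can also be run outside Excel. The following is a minimal sketch, not part of the MZGdb tooling: the function name `find_shifted_rows` and the `skip_header` option are hypothetical, the sentinel is passed as a parameter (defaulting to the spelling used in this paragraph) so it can be matched to what an actual file contains, and whether the files carry a header row is an assumption to verify.

```python
import csv
import io

# Column "AI" / #35 in the text, expressed as a zero-based index.
EOL_COLUMN_INDEX = 34


def find_shifted_rows(lines, delimiter=",", eol_marker="MZGdb__EOL", skip_header=False):
    """Return (row_number, value_found) for every row whose 35th column
    is not exactly the end-of-line marker.

    `lines` is any iterable of text lines, e.g. an open file object.
    Use delimiter="|" for the PSV variant.
    """
    bad_rows = []
    reader = csv.reader(lines, delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        if skip_header and row_number == 1:
            continue  # assumption: the first row may hold column headings
        # A row that is too short has no 35th column; treat it as empty.
        value = row[EOL_COLUMN_INDEX].strip() if len(row) > EOL_COLUMN_INDEX else ""
        if value != eol_marker:
            bad_rows.append((row_number, value))
    return bad_rows


# Tiny in-memory demonstration: row 1 is aligned, row 2 has one extra column
# before the marker, which pushes the marker into column 36.
aligned = ["x"] * 34 + ["MZGdb__EOL", "ref1", "ref2", "ref3"]
shifted = ["x"] * 35 + ["MZGdb__EOL", "ref1", "ref2"]
demo = io.StringIO("\n".join(",".join(r) for r in (aligned, shifted)) + "\n")
print(find_shifted_rows(demo))
```

For a real file, pass an open file handle instead of the in-memory demo (opened with `newline=""`, as the Python `csv` documentation recommends). Any row numbers returned are the records to report.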

Excel Col | Col # | Column Heading (v3p0) | Source | Content Description (v3p0)
A | 1 | MZGdb Group:Region ID | MZG | This data file source identifier (created by the MZGdb) indicates the Taxonomic-Group and Ocean-Basin of the current data file and its records. For example, "Crustacea-Copepoda-Calanoida:B02" indicates that the current data record comes from the "calanoid copepod species from the North Atlantic" data file. This distinguishes it from the "*:B03" (South Atlantic copepoda) and/or "*:B07" (North Pacific copepoda) files.

A 00 = World, A/B 01 = Arctic, A/B 02 = North Atlantic, A/B 03 = South Atlantic, A/B 04 = Southern Ocean, A/B 05 = Indian Ocean, A/B 06 = South Pacific, A/B 07 = North Pacific, A/B 21 = Baltic Sea, A/B 22 = Mediterranean Sea, A/B 24 = Black Sea, A/B 25 = Red Sea, A/B 26 = North Sea
B | 2 | Original Species Description | GenBank/BOLD | This is the currently recorded species name presented in the online GenBank or BOLD database. In the "Mode-C/Mode-D" files, if the exact species was not known, this GenBank or BOLD description may include "Genus sp." plus a sample or cruise identifier (e.g., "Calanus sp. HLC-25836").

Work by MZG may provide an alternate species identification (e.g., based on original author updates not yet corrected in GenBank or BOLD) in the MZG-adjusted-Species-ID field below.
C | 3 | Original Description Taxonomic Status | WoRMS | The original species description is checked in WoRMS. A status of "accepted", "unaccepted", or "alternate_representation" indicates that WoRMS recognized the full original description as a taxonomic name. If the species is "accepted", it will match the "coded-as" name below. If the species is a synonym (e.g., "unaccepted" or "alternate_representation"), it will be coded as the valid (WoRMS-accepted) name below.

A status of "unknown" indicates that WoRMS did not recognize the species component of the original description. This could be because it contains non-taxonomic text (e.g., "Calanus sp. A12345"), or because the species name is not currently in the WoRMS database. In both cases, the description will be coded at the genus level below.
D | 4 | MZGdb-coded-as TaxaNAME | WoRMS* | This field shows the taxa name that the original description was coded as after verification in WoRMS. In most cases it should be identical to the original description, but where the original description was not a true species (e.g., "Calanus sp. HLC-25836"), this field contains only the genus name, and the coded-as TaxaCODE and TaxaHLEV indicate the genus-level translation.

Note that "Calanus sp. ABC" and "Calanus sp. XYZ" would both be coded only to their genus (Calanus) and would have matching MZGdb-coded-as TaxaCODEs. The full original description is used in the "MZG-UID (Mothur unique)" ID.
E | 5 | MZGdb-coded-as TaxaHLEV | WoRMS* | This is the taxonomic level of the coded-as TaxaNAME. In most cases it should be species or subspecies. A "genus" in this field means a temporary species name was provided in the original GenBank record (e.g., "Calanus sp. HLC-25836").
F | 6 | MZGdb-coded-as TaxaCODE | COPEPOD | This is the COPEPEDIA taxonomic code for the coded-as TaxaNAME above.
G | 7 | Plankton Grouping Code | COPEPOD | The Plankton Grouping Code (PGC) is a smart code used by COPEPOD to identify major taxonomic groups (e.g., P42120 is a copepod, P42180 is a euphausiid).
H | 8 | MZGdb Holo/Mero/Nekton | MZG | This is an MZG-added field used to indicate whether a species is considered "holoplankton", "meroplankton", or "nekton". This field will contain "unclassified" for taxonomic groups or species that have not yet been reviewed.

This classification is currently incomplete.
I | 9 | GenBank Accession | GenBank | This is the GenBank Accession associated with this record. If this column is empty, it indicates that the sequence was only found in BOLD.

If both a GenBank Accession and BOLD ID exist, the sequence was present in both databases but was downloaded from GenBank.
J | 10 | BOLD ID | BOLD | This is the BOLD ID associated with this record. If this column is empty, it indicates the sequence was only found in GenBank.

If both a GenBank Accession and BOLD ID exist, the sequence was present in both databases but was downloaded from GenBank.
K | 11 | GenBank Record Inventory Date | GenBank | This is the "GenBank Record Inventory Date" that is provided in the GenBank record. It is NOT the date of the zooplankton specimen sampling event (which is provided below).

This column is mainly used for internal original sequence version tracking.
L | 12 | MZGdb Version (source) | MZG | This field gives the format version of MZGdb that this record is stored in, along with the original information source and gene region (e.g., "MZGdb-v2023-m01-12 (GenBank:16s)" or "MZGdb-v2023-m01-2 (BOLD:coi)").

If you come across a file with mixed version dates, please report this error to Todd.OBrien@noaa.gov.
M | 13 | MZG-adjusted TaxaNAME | MZG | This field may contain an alternative species name to what was provided in the original description or WoRMS encoding. This could be an author-submitted correction or a BLAST matching result.
N | 14 | MZG-adjusted TaxaCODE | COPEPOD | This is the COPEPEDIA Taxonomic Code assigned to the MZG-adjusted-Species-ID.
O | 15 | MZG Review/QC Flag | MZG | This is a QC or REVIEW code provided by MZG. It may indicate a suspicious record (e.g., this species is not generally found in the claimed ocean, or BLAST comparison suggests it is a different species).

As of 2023-01-12, these flags are still being tested and may be absent.
P | 16 | Specimen Sample Date | GenBank | This is the date when the zooplankton specimen was caught in the ocean (if provided).
Q | 17 | Sample Longitude | GenBank* | This is the longitude where the zooplankton specimen was caught in the ocean (if provided).
R | 18 | Sample Latitude | GenBank* | This is the latitude where the zooplankton specimen was caught in the ocean (if provided).
S | 19 | iGEOsrc | MZG | "1" = lon/lat provided in the original record. "0" = lon/lat not provided. "2" = lon/lat translated from a geographic region (e.g., "Gulf of Maine").
T | 20 | GEO-reference Barcode OCEAN | MZG / COPEPOD | This smartcode indicates the ocean (_o##), subarea (_s##), and North Atlantic ICES sub-area (_i####) of the barcode IF geographic information was provided. Note that code "o99" means no geographic info, while code "o00" means unknown ocean, possibly land. o01 = Arctic, o02 = NATL, o03 = SATL, o04 = Indian Ocean, o05 = Southern Ocean, o06 = SPAC, o07 = NPAC, o21 = Baltic Sea, o22 = Mediterranean Sea, o24 = Black Sea.
U | 21 | Region (sometimes this is the collecting location) | GenBank* | This is an originator-provided regional/area description of the specimen sampling location (if/when provided). For example, the original record may not have provided a longitude/latitude, but listed a descriptive location like "Gulf of Naples" or "Georges Bank". In these cases, it may be possible for MZGdb to provide an estimated latitude/longitude for the sample.
V | 22 | Original Collected-by Info | GenBank | This is the content from the GenBank or BOLD "collected by" fields in the original record.
W | 23 | Original Identified-by Info | GenBank | This is the content from the GenBank or BOLD "identified by" fields in the original record.
X | 24 | Original Voucher Info | GenBank | This is the content from the GenBank or BOLD "voucher" fields in the original record.
Y | 25 | Original Tech/Method | GenBank | This is the content from the GenBank or BOLD methods fields in the original record.
Z | 26 | PCR_FWDname | GenBank* | This is the PCR Primer Forward Name (if provided in GenBank or other source, as indicated by "PCR Info Source" column below).
AA | 27 | PCR_FWDseq | GenBank* | This is the PCR Primer Forward Sequence (if provided in GenBank or other source, as indicated by "PCR Info Source" column below).
AB | 28 | PCR-REVname | GenBank* | This is the PCR Primer Reverse Name (if provided in GenBank or other source, as indicated by "PCR Info Source" column below).
AC | 29 | PCR-REVseq | GenBank* | This is the PCR Primer Reverse Sequence (if provided in GenBank or other source, as indicated by "PCR Info Source" column below).
AD | 30 | #BP | GenBank | This is the number of letter characters present in the COX1 Sequence.

This is used for data file integrity and quality control.
AE | 31 | Sequence | GenBank | This is the COX1 Sequence, with a length of "#BP" characters.
AF | 32 | MZG notes or comments | GenBank | Notes or comments from MZG or from the original BOLD/GenBank record.
AG | 33 | MZG-UID / Mothur-unique | MZG | This unique sequence identifier (created by the MZGdb) is a combination of the GenBank Accession and the species name. It is intended to uniquely identify each sequence and connect it to its GenBank record, while also providing simple species information.

To make this safe for use in the ".fasta" and ".mothur" files, spaces have been replaced with "_" (e.g., "Calanus_finmarchicus") and special characters or accents removed/replaced.
AH | 34 | MZG TaxaMap for Mothur | MZG | This is a semicolon-separated taxonomic ranking map for use with the Mothur software. Each entry has exactly 20 semicolon-separated subfields, representing select/major taxonomic levels from "Kingdom" to "Species". In this mapping, the 16th field is Family, the 18th field is Genus, and the 19th field is Subgenus. The 20th field contains the "Genus+species" (or "Genus+subgenus+species") name for the critter. If a standard taxonomic level is absent for a group, that field will contain the name of the previous hierarchy level that is present, with "_EXT" added to the end of it. For example, since "Calanus helgolandicus" does not have a subgenus, its subgenus field will be "Calanus_EXT". Note that in the "Mode-C/Mode-D" data files, which allow genus-only records, the 20th field (species) might contain Genus_EXT to represent cases where identification was only down to the genus level.

In Excel, it is possible to split this column into 20 individual columns. To do this, select the column, go to the Excel "Data" tab, go to its "Text to Columns" pulldown option, select Delimited and "Semicolon" then finish.
AI | 35 | MZG AlignCheck | MZG | This column should ONLY contain the string "MZGdb_EOL". If this column is empty or contains something else, there is a column-shifting error in the data file (report it to Todd immediately). This is sadly a frequent error, due to glitches in the ever-changing GenBank file formats.
AJ | 36 | GenBank Ref1 | GenBank | This column contains GenBank-provided "Ref 1".

BOLD and UNPUBLISHED do not include these Reference fields.
AK | 37 | GenBank Ref2 | GenBank | This column contains GenBank-provided "Ref 2".

BOLD and UNPUBLISHED do not include these Reference fields.
AL | 38 | GenBank Ref3 | GenBank | This column contains GenBank-provided "Ref 3".

BOLD and UNPUBLISHED do not include these Reference fields.
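
The "MZG TaxaMap for Mothur" column (Excel column AH) can be split in code as well as with Excel's "Text to Columns". The sketch below is illustrative only. The function name `parse_taxamap` is hypothetical, the field positions (16th = Family, 18th = Genus, 19th = Subgenus, 20th = species) are taken from the column description, and tolerating one trailing semicolon is an assumption based on how Mothur taxonomy strings are usually written. The example entry uses placeholder names for the levels the description does not spell out.

```python
TAXAMAP_FIELD_COUNT = 20
EXT_SUFFIX = "_EXT"

# Zero-based positions of the levels named in the column description.
NAMED_LEVELS = {"family": 15, "genus": 17, "subgenus": 18, "species": 19}


def parse_taxamap(taxamap):
    """Split one TaxaMap entry and return the named levels as a dict.

    A level that holds an "_EXT" placeholder (meaning the level is absent
    for this group) is returned as None.
    """
    fields = [field.strip() for field in taxamap.split(";")]
    # Assumption: an entry may end with a semicolon, leaving an empty last item.
    if fields and fields[-1] == "":
        fields.pop()
    if len(fields) != TAXAMAP_FIELD_COUNT:
        raise ValueError(
            f"expected {TAXAMAP_FIELD_COUNT} fields, found {len(fields)}"
        )
    return {
        level: (None if fields[index].endswith(EXT_SUFFIX) else fields[index])
        for level, index in NAMED_LEVELS.items()
    }


# Illustrative entry: levels 1-15 and 17 are placeholders, not real MZGdb data.
example = ";".join(
    [f"level{n:02d}" for n in range(1, 16)]
    + ["Calanidae", "level17", "Calanus", "Calanus_EXT", "Calanus helgolandicus"]
)
print(parse_taxamap(example))
```

A length check like the one above also doubles as a quick integrity test: an entry that does not split into 20 fields is worth inspecting alongside the AlignCheck column.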




Last auto-build on:   2023-Jan-13