Use the original files
The assessment of sample data quality only works on properly formatted files. Please refer to the format recommendations first. If the original VCF file needs to be edited, this is possible but requires special care. The absolute minimum requirements are shown below. Note that the VCF file may contain much more, as long as it conforms to the defined standards (see Standard file formats).
Comments on the structure of the VCF files
The VCF file is a kind of hybrid content file. It consists of three sections, each holding a different kind of information, and any change made in one section must be kept consistent with the other two.
- Section 1: At the beginning there is a header area, where every line starts with the characters “##”. The number of lines in the header varies, but can be around 100. The content of this section has to cover the data present in section 3. In the example Figure below, the FORMAT column (lines 4-5) contains the compulsory data “GT”, therefore section 1 must contain a line defining the “GT” data type (line 2) used in the FORMAT column.
- Section 2: The last line of the VCF file header is one single line that starts with a single “#” (#CHROM). This line contains a list of “header cells” for the variant data that follows in section 3. On this line, every “header cell” is separated from the next by a tabulator character (green in the Figure).
- Section 3: This is the data section listing all variants. The data section, together with the section 2 line, is structured as a table, and the number of “cells” on each data row must match the number of “header cells” specified in section 2. Each line corresponds to one variant, and the different pieces of information (the “cells”) concerning the variant, such as chromosome, position, observed allele, quality, etc., are separated by a tabulator character. If the INFO column contains data, corresponding header area definitions must be present in section 1. When the FORMAT column contains more than the GT (genotype) data, corresponding header area definitions must also be present in section 1.
Note about the Figure (below): Do not add space characters unless they are part of the data in the field. The spaces in the figure are there solely to help visualise the coordinated table-like structure of sections 2 and 3.
Figure: This minimal VCF file has a header area (lines 1-3) that matches the content of the variant data (lines 4-5). More complex files that follow the standard are also accepted as input.
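The consistency rules between the three sections can be checked mechanically. The following Python sketch is illustrative only: the function name `check_vcf_structure` and the `SAMPLE` text are assumptions made for this example (the sample is not a copy of the Figure), and the check covers only the rules described above, not the full VCF standard.

```python
import re


def check_vcf_structure(text):
    """Return a list of problems found in a VCF file given as one string."""
    problems = []
    defined = {"INFO": set(), "FORMAT": set()}
    columns = None
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("##"):
            # Section 1: collect the IDs declared for INFO and FORMAT.
            match = re.match(r"##(INFO|FORMAT)=<ID=([^,>]+)", line)
            if match:
                defined[match.group(1)].add(match.group(2))
        elif line.startswith("#"):
            # Section 2: the single line of TAB-separated header cells.
            columns = line[1:].split("\t")
        elif line:
            # Section 3: one variant per line.
            if columns is None:
                problems.append(f"line {number}: data before the #CHROM line")
                continue
            cells = line.split("\t")
            if len(cells) != len(columns):
                problems.append(
                    f"line {number}: {len(cells)} cells, expected {len(columns)}"
                )
                continue
            row = dict(zip(columns, cells))
            # Every INFO key used must be defined in section 1 ("." means empty).
            info = row.get("INFO", ".")
            if info != ".":
                for entry in info.split(";"):
                    key = entry.split("=", 1)[0]
                    if key not in defined["INFO"]:
                        problems.append(
                            f"line {number}: INFO key {key} not defined in header"
                        )
            # Every FORMAT key used (including GT) must be defined in section 1.
            fmt = row.get("FORMAT")
            if fmt:
                for key in fmt.split(":"):
                    if key not in defined["FORMAT"]:
                        problems.append(
                            f"line {number}: FORMAT key {key} not defined in header"
                        )
    return problems


# Hypothetical minimal file: 3 header lines followed by 2 variant lines.
SAMPLE = "\n".join([
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "1\t12345\t.\tA\tG\t.\t.\t.\tGT\t0/1",
    "2\t67890\t.\tC\tT\t.\t.\t.\tGT\t1/1",
])

print(check_vcf_structure(SAMPLE))
```

An empty list means that none of the checked rules is violated. It does not prove that the file is valid in every other respect.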
Typical errors when the VCF files are edited manually
Due to the structure of the VCF file, it is possible to open it in a normal text editor (e.g. Notepad++) or in a spreadsheet editor such as Excel. However, errors can be introduced at every stage: when the file is opened, when it is edited, and when it is saved.
- Open
  - Opening a VCF file in some spreadsheet editors, including Excel, might lead to automatic conversion of certain fields. For example, numbers might be converted into dates (1/1 becomes Jan. 1). To avoid this, the whole content of the file must be opened “as text”. Check the result visually: if a conversion has taken place, it is not reverted to the original format when the file is saved.
- Edit
  - Section 1 contains critical information. Modifying anything in this section may introduce errors in the interpretation of the VCF file.
  - The number of headers in section 2 has to be the same as the number of pieces of information on every row of section 3. In other words, every line in both sections must contain the same number of tabulator characters (separators).
  - The special characters at the beginning of each line in sections 1 and 2 (“##” and “#”) must not be modified, and no similar character may be added elsewhere on those lines.
- Save
  - When saving from a spreadsheet editor, it is crucial to save the file as a CSV file. It must not be saved in the native format of the spreadsheet editor. To do so, use the “Save as…” option in the File menu.
  - When saving from a spreadsheet editor, the column separator must remain a tabulator character (TAB) and must not be converted into a comma (,), a space, or some other character. The separator can be specified during saving.
  - When saving, the cell content must not be enclosed in quotation marks (“). (Error message: Extra “ character in meta-information line)
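Several of the errors listed above leave recognisable traces in the saved file, so a quick scan before submission can catch them. The Python sketch below is an assumption-laden illustration, not a complete validator: the function name `find_editing_damage` and the `DAMAGED` sample are invented for this example, and the genotype check relies on the standard VCF column layout, where FORMAT is the ninth column and the sample columns follow it.

```python
import re

# A genotype is a list of allele indices (or ".") joined by "/" or "|".
GENOTYPE = re.compile(r"^(\.|\d+)([/|](\.|\d+))*$")


def find_editing_damage(text):
    """Return a list of likely spreadsheet-induced problems in a VCF string."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue
        # Save error: the whole line was wrapped in quotation marks.
        if line.startswith('"'):
            problems.append(f"line {number}: line starts with a quotation mark")
            continue
        if line.startswith("##"):
            continue  # section 1 lines are not TAB-separated tables
        # Save error: TAB separators replaced by commas, spaces, etc.
        if "\t" not in line:
            problems.append(f"line {number}: no TAB separator found")
            continue
        cells = line.split("\t")
        # Save error: individual cells enclosed in quotation marks.
        for cell in cells:
            if cell.startswith('"'):
                problems.append(f"line {number}: quoted cell {cell}")
        if line.startswith("#"):
            continue  # section 2 has no genotype cells
        # Open error: genotype such as 1/1 converted into a date.
        if len(cells) > 9 and cells[8].split(":")[0] == "GT":
            for cell in cells[9:]:
                genotype = cell.split(":")[0]
                if not GENOTYPE.match(genotype):
                    problems.append(
                        f"line {number}: genotype {genotype} looks converted"
                    )
    return problems


# Hypothetical file showing three kinds of damage.
DAMAGED = "\n".join([
    "##fileformat=VCFv4.2",
    '"##FORMAT=<ID=GT,Number=1,Type=String,Description=""Genotype"">"',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "1\t12345\t.\tA\tG\t.\t.\t.\tGT\t1-Jan",
    "2,67890,.,C,T,.,.,.,GT,1/1",
])

for problem in find_editing_damage(DAMAGED):
    print(problem)
```

For the sample above, the scan reports the quoted header line, the date-like genotype, and the comma-separated row. A file that passes this scan may still contain other errors, so it complements rather than replaces a visual check.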