AAV Data Dictionary

These definitions are intended for professionals working in adeno-associated virus (AAV) gene therapy research and development. AAV has become one of the most promising and widely used vectors for gene therapy, in part because it is non-pathogenic, and the definitions below cover the fundamentals of AAV gene therapy R&D.

Category

File Formats/Terms

Type
Alternate Terms
Subtype
Definition with references
BAM / uBAM
Read, alignment, mapped reads, read set

A file format for sequencing reads aligned to a reference genome, or unaligned (uBAM), including quality metrics. Stands for “Binary Alignment/Map” format; a compressed form of the plain-text “Sequence Alignment/Map” format (SAM).

BED

The BED (Browser Extensible Data) file format describes sequence features that can be displayed in a genome browser as an annotation track.14 BED files are tab-delimited, with 3 required fields (columns) and up to 9 optional ones, for a maximum of 12. The number of columns must be consistent across all rows of a file for it to be read correctly.
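As an illustration of the column layout, here is a minimal Python sketch that reads one BED line. The function name `parse_bed_line` and the example feature name are hypothetical, and only the first six of the standard columns (chrom, start, end, name, score, strand) are handled.

```python
def parse_bed_line(line):
    """Parse one tab-delimited BED line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start, and end")
    record = {"chrom": fields[0], "start": int(fields[1]), "end": int(fields[2])}
    # Optional columns 4 to 6, stored only when present.
    for name, value in zip(("name", "score", "strand"), fields[3:6]):
        record[name] = value
    # BED starts are 0-based and ends are exclusive, so length is end - start.
    record["length"] = record["end"] - record["start"]
    return record


bed_record = parse_bed_line("chr1\t100\t250\tITR_left\t0\t+")
```

Because BED coordinates are 0-based and half-open, the feature above spans 150 bases.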

CRAM
Read, alignment, mapped reads, read set

A file format for sequencing reads aligned to a reference genome, including quality metrics. Stands for “Compressed Reference-oriented Alignment Map” format; an even more compact form of the BAM format.

CSV

CSV (comma-separated values; file extension .csv) is a plain-text format in which each line is a row and columns are delimited with commas.18 It can store different types of sequencing data and can be opened using common spreadsheet programs like Microsoft Excel.

Demultiplexing

Demultiplexing is the process of sorting sequenced reads into separate files for each sample in a sequenced run.
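The idea can be sketched in a few lines of Python. This sketch assumes an inline barcode at the start of each read and exact barcode matching; production demultiplexers typically read barcodes from separate index reads and tolerate mismatches. The names `demultiplex`, `barcode_to_sample`, and the sample labels are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_length):
    """Group (read_id, sequence) pairs by the barcode at the start of each sequence."""
    by_sample = defaultdict(list)
    for read_id, sequence in reads:
        barcode = sequence[:barcode_length]
        # Reads whose barcode is not in the sample sheet go to a catch-all bin.
        sample = barcode_to_sample.get(barcode, "undetermined")
        # Keep the insert only; the barcode itself is trimmed off.
        by_sample[sample].append((read_id, sequence[barcode_length:]))
    return dict(by_sample)


reads = [("r1", "ACGTTTGA"), ("r2", "TGCAGGCC"), ("r3", "NNNNAAAA")]
samples = {"ACGT": "sample_A", "TGCA": "sample_B"}
result = demultiplex(reads, samples, barcode_length=4)
```

In a real pipeline each group would be written to its own FASTQ or BAM file rather than kept in memory.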

FastA
Sequencing reads

File containing sequencing reads or nucleotide sequences. The FastA file format is the simplest way of representing nucleic acid or protein sequences, using single-letter codes for nucleotides or amino acids.
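A FastA record is a header line starting with ">" followed by one or more sequence lines. The following is a minimal Python sketch of a reader, not a replacement for an established parser; the function name `parse_fasta` and the record names are made up for illustration.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after '>'; the rest is a description.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Sequences may be wrapped over several lines, so join the pieces.
    return {n: "".join(chunks) for n, chunks in records.items()}


fasta_text = ">vector example construct\nACGT\nTTGA\n>helper\nGGCC\n"
sequences = parse_fasta(fasta_text)
```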

FastQ
(Demultiplexed) sequencing reads

File containing sequencing reads or nucleotide sequences and a quality score for each base position. The result of base-calling and demultiplexing of raw data from a sequencing instrument.
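Each FastQ record occupies four lines: a header starting with "@", the sequence, a separator starting with "+", and a quality string with one character per base. The Python sketch below decodes the quality string under the assumption that it uses the Phred+33 encoding; the function name `parse_fastq_record` is hypothetical.

```python
def parse_fastq_record(lines):
    """Parse the four lines of one FASTQ record into (read_id, sequence, scores)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so the score is the character code minus 33.
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores


read_id, seq, scores = parse_fastq_record(["@read1\n", "ACGT\n", "+\n", "II5!\n"])
```

Here "I" decodes to a score of 40, "5" to 20, and "!" to 0.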

GFF

A GFF (general feature format; file extension .gff2 or .gff3) describes the various sequence elements that make up a gene and is a standard way of annotating genomes.12 It defines the features present within a gene in the body of the GFF file, including transcripts, regulatory regions, untranslated regions, exons, introns, and coding sequences. As with the VCF, it uses a header region with a “##” string to include metadata.

GTF

The GTF (gene transfer format) file type shares the same format as GFF files, though it is used to define gene- and transcript-related features exclusively.13 The attribute/group field (the ninth column) is used in the GTF to include a gene_id or transcript_id value, a unique identifier for the genomic source or predicted transcript, respectively.
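As a rough illustration of the attribute field, the Python sketch below splits the ninth column of a GTF line into key/value pairs. It assumes attributes are separated by semicolons and that values contain no semicolons themselves; the function name `parse_gtf_attributes` and the identifiers in the example line are invented.

```python
def parse_gtf_attributes(attribute_field):
    """Split a GTF attribute column into a dict of key -> unquoted value."""
    attributes = {}
    for item in attribute_field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        # Each item looks like: key "value"
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


gtf_line = 'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "G1.T1";'
columns = gtf_line.split("\t")
attrs = parse_gtf_attributes(columns[8])
```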

HiFi

High-fidelity reads extracted from raw PacBio long-read sequencing data. HiFi reads are produced using circular consensus sequencing (CCS) mode on PacBio long-read systems. HiFi reads provide base-level resolution with 99.9% single-molecule read accuracy.

JSON

JSON (JavaScript Object Notation) is a general-purpose text format that is common in many other industries and is used in a growing number of bioinformatics applications.

Library Preparation

Library preparation is the first step in sequencing. Before samples can be sequenced, nucleic acids must be isolated, fragmented, end-repaired, and covalently linked to adapters using ligation or tagmentation methods.

MAP

MAP (.map file extension) is a file format that accompanies the PED file format when using the PLINK program.17 It contains variant information.

PDB

PDB file formats contain atomic coordinates and are used for storing 3D protein structures by the Protein Data Bank. These PDB files can be opened in a program called pyMOL to visualize these protein structures. For more information on the format, see the full guide.15

PED

PED (.ped file extension) is a file format for pedigree analysis, which records familial relationships between different samples.16 It is used with the PLINK command-line bioinformatics program.

Primary Analysis
Demultiplexing, bcl2fastq, data prep

The first phase of a typical workflow to analyze sequencing datasets, in which raw, multiplexed sequencing instrument data is converted into nucleotide-level base calls and quality scores and separated into distinct samples (demultiplexed).

Reads
Sequencing reads
SAM
Alignment Reads

A text-based file format for sequencing reads aligned to a reference genome, or sometimes unaligned, including quality metrics. Stands for “Sequence Alignment/Map” format. Compressed versions of this form include BAM and CRAM.
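For orientation, a SAM alignment line has 11 mandatory tab-separated fields, and the second field (FLAG) is a bit field. The Python sketch below reads a few of them; the function name `parse_sam_line` and the example read are hypothetical, and real workflows would normally use an established library instead.

```python
def parse_sam_line(line):
    """Parse selected mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    flag = int(fields[1])
    return {
        "qname": fields[0],
        "flag": flag,
        "rname": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        # FLAG bits defined by the SAM specification.
        "is_unmapped": bool(flag & 0x4),
        "is_reverse": bool(flag & 0x10),
        "is_supplementary": bool(flag & 0x800),
    }


sam_line = "read1\t16\tvector\t101\t60\t8M\t*\t0\t0\tACGTTTGA\tIIIIIIII"
alignment = parse_sam_line(sam_line)
```

A FLAG of 16 means the read aligned to the reverse strand of the reference.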

Secondary Analysis

The second phase of a typical workflow to analyze genomic variants, in which each read set (usually from a FASTQ file or uBAM file) is filtered for quality and aligned to a reference genome.

Sequencing Machine
Sequencing instrument, sequencer

A physical lab machine that automates the process of detecting sequences of individual nucleotide bases from a given DNA or RNA library.

Subreads

In PacBio single-molecule, real-time (SMRT) sequencing technology, each polymerase read is partitioned to form one or more subreads, which contain sequence from a single pass of a polymerase on a single strand of an insert within a SMRTbell™ template and no adapter sequences. The subreads contain the full set of quality values and kinetic measurements. Subreads are useful for applications like de novo assembly, resequencing, base modification analysis, and so on.

Tertiary Analysis

The third phase of a typical workflow to analyze genomic variants, in which each variant call set is analyzed together with additional samples and/or auxiliary datasets such as functional and clinical annotations.

VCF

VCF (Variant Call Format; file extension .vcf) files store gene sequence variation, such as single nucleotide polymorphisms (SNPs), and are used in genotyping projects. A VCF file contains a header with metadata lines, each preceded by a “##” string. Best practice is to describe in the header any INFO, FILTER, and FORMAT entries used in the body.10,11
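The split between "##" metadata lines, the single "#CHROM" column line, and the variant records can be sketched in Python as follows. The function name `read_vcf` and the example variant are invented for illustration, and the sketch keeps every value as a string.

```python
def read_vcf(text):
    """Separate VCF text into metadata lines and variant records."""
    meta, columns, records = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):
            # Metadata, e.g. file format version and INFO/FILTER/FORMAT descriptions.
            meta.append(line[2:])
        elif line.startswith("#"):
            # The column header line, e.g. #CHROM POS ID REF ALT ...
            columns = line[1:].split("\t")
        elif line:
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, records


vcf_text = (
    "##fileformat=VCFv4.2\n"
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "vector\t1200\t.\tA\tG\t50\tPASS\tDP=100\n"
)
meta, records = read_vcf(vcf_text)
```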

VCF
Variant calls

Variant Call Format, a specification for representing genomic variant calls in a text file; or a file in that format.

PacBio AAV Sequencing

Type
Alternate Terms
Subtype
Definition with references
AAV-Mode

On the PacBio Sequencing Machine, use the on-instrument AAV mode to generate HiFi reads for proper handling of scAAV and ssAAV structures.

Flip/Flop

One ITR is formed by two palindromic arms, called B–B′ and C–C′, embedded in a larger one, A–A′. The order of these palindromic sequences defines the flip or flop orientation of the ITR.

Recall Adapters

recalladapters provides a way to change the adapter calls in a PacBio BAM file. It accepts a subreads.bam and a scraps.bam, and emits a subreads.bam and scraps.bam.

chimeric

Read consists of fragments that map to one or more "genomes" (e.g. vector and host; helper and repcap).

host

Read originates from the host genome that is given (e.g. hg38, CHM13).

pRepCap
Repcap plasmid

Read originates from the repcap plasmid. The Rep gene encodes four proteins (Rep78, Rep68, Rep52, and Rep40), which are required for viral genome replication and packaging, while Cap expression gives rise to the viral capsid proteins (VP; VP1/VP2/VP3), which form the outer capsid shell that protects the viral genome, as well as being actively involved in cell binding and internalization.

phelper
Helper plasmid

Read originates from the helper plasmid. In addition to Rep and Cap, AAV requires a helper plasmid containing genes from adenovirus. These genes (E4, E2a and VA) mediate AAV replication.

scAAV

full

Read consists of a DNA sequence which includes the entire ITR-to-ITR target vector sequence. Read is present in both (+) and (-) polarities.

scAAV

backbone

Read consists of a fragment originating solely from the plasmid backbone sequence. Read is present in both (+) and (-) polarities.

scAAV

left-partial

Read consists of a fragment of the vector originating from the upstream (left ITR) of the vector while not covering the right ITR. Read is present in both (+) and (-) polarities.

scAAV

partial

Read consists of a fragment of the vector originating from within the ITR sequences. Read is present in both (+) and (-) polarities.

scAAV

right-partial

Read consists of a fragment of the vector originating from the downstream (right ITR) of the vector while not covering the left ITR. Read is present in both (+) and (-) polarities.

scAAV

vector+backbone

Read consists of a fragment including the vector as well as plasmid backbone sequence. Read is present in both (+) and (-) polarities.

ssAAV

full

Read consists of a DNA sequence which includes the entire ITR-to-ITR target vector sequence. Read is present only in either (+) or (-) polarity.

ssAAV

backbone

Read consists of a fragment originating solely from the plasmid backbone sequence. Read is present only in either (+) or (-) polarity.

ssAAV

left-partial

Read consists of a fragment of the vector originating from the upstream (left ITR) of the vector while not covering the right ITR. Read is present only in either (+) or (-) polarity.

ssAAV

right-partial

Read consists of a fragment of the vector originating from the downstream (right ITR) of the vector while not covering the left ITR. Read is present only in either (+) or (-) polarity.

ssAAV

partial

Read consists of a fragment of the vector originating from within the ITR sequences. Read is present only in either (+) or (-) polarity.

ssAAV

vector+backbone

Read consists of a fragment including the vector as well as plasmid backbone sequence. Read is present only in either (+) or (-) polarity.

unknown/other

Read consists of a fragment mapping to the vector but with unexpected polarities (e.g. +,-,- or +,+,+) and cannot be well-defined at the moment.
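The positional subtypes above (full, left-partial, right-partial, partial, backbone, vector+backbone) can be illustrated with a simple coordinate comparison. The Python sketch below is only an illustration under stated assumptions and is not the classification logic of any PacBio tool: it takes a single alignment to the plasmid reference in 0-based, half-open coordinates, treats a read as reaching an ITR if it overlaps it at all, and ignores polarity (so it does not distinguish scAAV from ssAAV) as well as chimeric reads. The function name `classify_vector_read` and the ITR coordinates are hypothetical.

```python
def classify_vector_read(aln_start, aln_end, left_itr, right_itr):
    """Assign a positional subtype from alignment and ITR coordinates.

    left_itr and right_itr are (start, end) tuples on the plasmid reference.
    """
    vector_start, vector_end = left_itr[0], right_itr[1]
    # Entirely outside the ITR-to-ITR region.
    if aln_end <= vector_start or aln_start >= vector_end:
        return "backbone"
    # Overlaps the vector but also extends past an outer ITR boundary.
    if aln_start < vector_start or aln_end > vector_end:
        return "vector+backbone"
    touches_left = aln_start < left_itr[1]
    touches_right = aln_end > right_itr[0]
    if touches_left and touches_right:
        return "full"
    if touches_left:
        return "left-partial"
    if touches_right:
        return "right-partial"
    return "partial"


left_itr, right_itr = (100, 245), (4555, 4700)
labels = [
    classify_vector_read(100, 4700, left_itr, right_itr),
    classify_vector_read(120, 3000, left_itr, right_itr),
    classify_vector_read(2000, 4690, left_itr, right_itr),
    classify_vector_read(1000, 2000, left_itr, right_itr),
    classify_vector_read(0, 90, left_itr, right_itr),
    classify_vector_read(50, 3000, left_itr, right_itr),
]
```

A practical classifier would also need a tolerance around the ITR boundaries and a way to combine multiple alignments from the same read.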

Viral Physiology & Molecular Biology

Type
Alternate Terms
Subtype
Definition with references
2° structure

double helices, stem-loop structures, pseudoknots

Secondary structures consist of regions stabilized by hydrogen bonds between atoms in the nucleotide chain.

3° structure

Coaxial structures, A-minor motif, ribose zipper, G-quadruplexes, triplexes

Tertiary structures are complex nucleotide structures stabilized by higher-order interactions beyond those of secondary structure.

Adeno Associated Virus
NA

AAV, rAAV

AAV (wild-type) is a small non-enveloped virus from the Parvoviridae family with a genome of about 4.7 kilobases. It is made of two parts: an icosahedral (20-sided) protein shell, called a capsid, and up to 4,700 bases of single-stranded DNA.

Adenovirus
NA

Ad

A member of a family of non-enveloped, double-stranded deoxyribonucleic acid (DNA) viruses that can cause infections in the respiratory tract, eye, and gastrointestinal tract.

Allogenic

allogenic

Cells or tissues that are genetically different because they are derived from separate individuals of the same species.

Capsid
NA

NA

A capsid refers to the complete protein shell that encloses and protects the genetic material (such as DNA or RNA) of a virus.

Capsid Fill Rate
Encapsidation rate
Empty Capsid
NA

NA

A viral capsid that encloses no vector genome product.

Endosomal escape

A process by which therapeutic biologics (such as DNA, siRNA or protein) escape from the endosome and enter the cell cytoplasm.

Full Capsid
NA

NA

A full capsid refers to a viral capsid that encloses a complete vector genome.

GOI plasmid
Transfer plasmid, vector, construct

Plasmid containing the transgene of interest. GOI stands for gene of interest.

ITR
NA

AAV2, AAV8, etc

Inverted terminal repeat, contains the origin of replication (ori) for a particular viral serotype. DNA synthesis begins at one end and completes at the other.

Lox sites

Lox sites, in combination with the Cre recombinase, are used to spatially and temporally control transgene expression. Lox sites are directional 34 bp sequences that can flank the transgene and are located between the ITRs. Transfer plasmids often have two pairs of lox sites which are part of a dual lox system called FLEX or DIO.

MCS
NA

NA

The multiple cloning site (MCS), positioned immediately downstream of the promoter, contains multiple restriction sites, each of which is unique to the plasmid.

Non-empty capsids
NA

Full or partial capsid

A non-empty capsid is a complete protein shell that encapsulates genetic material - DNA or RNA - within its structure.

Nuclear transfection efficiency
Nuclear delivery efficiency

Nuclear transfection efficiency refers to the effectiveness of delivering foreign material into the nucleus of a cell during the transfection process.

Nuclear uptake

Nuclear uptake is the process by which substances are transported into the nucleus of a cell.

Payload
GOI, vector genome

The vector genome, encoding a gene designed to treat disease, that is packaged into the capsid.

Post-translational elements
Post-translational modifications

WPRE, polyA tail

Post-translational elements increase the functional diversity of the proteome by the covalent addition of functional groups or proteins, proteolytic cleavage of regulatory subunits, or degradation of entire proteins.

Probability 2°/3° structure
NA

NA

The likelihood of a series of nucleotides in a gene sequence folding into a 2° or 3° structure, such as double helices, stem-loop structures, pseudoknots, or G-quadruplexes.

Producer Cell line

Producer cell lines serve as durable rAAV factories: upon infection with adenovirus, they will churn out fully functional gene therapy particles.

Recombinant Adeno-Associated Viral Vectors

rAAV

AAV is transformed from a naturally occurring virus into a delivery mechanism for gene therapy. The viral DNA is replaced with new DNA, and it becomes a precisely coded vector and is no longer considered a virus, as most of the viral components have been replaced.

SV40
NA

NA

Simian vacuolating virus 40, components of which are commonly used for transgene delivery.

Transfection efficiency

Transfection efficiency quantifies how successfully cells are transfected. It is calculated as the number of expressing cells divided by the total number of cells in culture.
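As a worked example of this ratio, the short Python sketch below computes the efficiency from two cell counts. The function name `transfection_efficiency` and the counts are made up for illustration.

```python
def transfection_efficiency(expressing_cells, total_cells):
    """Return the fraction of cells in culture that express the transgene."""
    if total_cells <= 0:
        raise ValueError("total_cells must be positive")
    if not 0 <= expressing_cells <= total_cells:
        raise ValueError("expressing_cells must be between 0 and total_cells")
    return expressing_cells / total_cells


# Hypothetical counts: 420 expressing cells out of 600 counted.
efficiency = transfection_efficiency(420, 600)
```

With these counts the efficiency is 0.7, or 70%.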

Transfection rate
Transfection efficiency rate
Truncated Capsid
NA

NA

A viral capsid that encloses a truncated vector genome product.

Unlabeled Region
NA

NA

Regions of a viral vector that do not have a specific annotation (promoter, CDS, etc).

WPRE

The Woodchuck Hepatitis Virus (WHP) Posttranscriptional Regulatory Element (WPRE) forms a tertiary structure when transcribed which aids in nuclear export of mRNA and helps increase the expression of the transgene. The WPRE is located at the end of the AAV genome, just upstream of the final ITR.

enhancer
NA

NA

A regulatory element that promotes robust and long-term expression of the transgene (GOI).

intron
NA

SV40

An intron is a region that resides within a gene but does not remain in the final mature mRNA molecule following transcription of that gene, and does not code for amino acids that make up the protein encoded by that gene. Regulatory and other key elements can be incorporated into introns.

non-B DNA

A region of DNA with any secondary or tertiary molecular structure other than the “B form” right-handed double-helix structure. These include A-form and Z-form secondary DNA structures, as well as hairpin, G-quadruplex, and other tertiary structures.

non-coding region
NA

NA

Non-coding region, a sequence region outside the transgene in a viral vector.

pLannotate
NA

NA

pLannotate is an open-source web server for comprehensive annotation of plasmid features. It employs a filtering algorithm for relevant feature matches, reports feature fragments, displays a graphical map of the annotated plasmid, and allows results to be downloaded in various formats.

polyA tail
NA

bGH, SV40

This sequence is added to the end of an mRNA sequence to ensure mRNA stability during export from the nucleus. It can be coded in the DNA and may be optimized by viral serotype.

promoter
NA

CMV, CBA, CBh, SV40, JET

A regulatory element that controls expression of a gene. A critical component in a transgene delivery vehicle (virus, lipid nanoparticle, etc).

regulatory elements
NA

Promoter, enhancer

Part of a viral vector containing the promoter, enhancer, and any auxiliary elements that control transgene expression.

transgene
GOI, CDS

NA

The gene of interest (GOI) or coding sequence (CDS) that will be expressed in the appropriate tissue after delivery by an rAAV vehicle in a gene therapy.

truncation propensity
NA

NA

A FormsightAI output, this is the predicted likelihood of a truncation in a rAAV gene therapy product during replication and packaging. Higher truncation propensities indicate a higher likelihood that the gene product (mRNA, protein) will be shortened and non-therapeutic.

viral vector
construct, plasmid

AAV, lentivirus

A viral vehicle to deliver genetic material to a cell/tissue/organism.
