Creating a Global Research Resource
The Pediatric Brain Tumor Atlas (PBTA) is a collaborative effort to accelerate the discovery of therapeutic interventions for children diagnosed with a brain tumor. The first PBTA dataset release occurred in September 2018. It spans multiple tumor types and includes matched tumor/normal data, whole genome sequencing (WGS), RNA-seq, proteomics, longitudinal clinical data, imaging data (MRIs and radiology reports), histology slide images, and pathology reports. Funding for this initiative was provided by more than 50 foundation sponsors.
The PBTA is currently available through the Gabriella Miller Kids First Data Resource Portal, which enables users to identify genomic and other file types, view associated metadata, and create workspaces in Cavatica, a cloud-based platform that allows researchers to view and analyze raw genomic data. PBTA summary data and data visualizations are available through PedcBioPortal. Researchers can request access to raw genomic data by submitting a CBTN Data Access Agreement and a CBTN Data Access Request Form.
A digital object identifier (DOI) has been assigned to the PBTA and can be used for reference: https://doi.org/10.24370/SD_BHJXBDQK
Genomic Harmonization Details for the PBTA
The Gabriella Miller Kids First Data Resource Center (DRC) has developed and applied alignment and joint genotyping workflows that follow the GATK Best Practices recommendations, with the goal of being functionally equivalent to other current, large genomic research efforts. Data processing is done on the Cavatica platform within an Amazon Web Services (AWS) environment. The harmonized results are stored in AWS, made searchable via the DRC Portal, and can be analyzed further on Cavatica.
In more detail, the harmonization process starts with an alignment workflow that accepts BAM, CRAM, or FASTQ input, or a mix of these, and converts it into unmapped BAM (uBAM) with Picard RevertSam. The uBAM is then aligned, read group by read group, to the human genome reference hg38, which includes improved ALT contigs and HLA loci. Next, Picard MarkDuplicates, SortSam, and MergeBamAlignment are applied in a scatter fashion, with jobs parallelized over split chromosome intervals. Base Quality Score Recalibration (BQSR) is then applied using a model built from the known SNPs and indels of HapMap, 1000 Genomes, dbSNP138, and the Mills Gold Standard calls. Finally, GATK4 HaplotypeCaller generates single-sample gVCFs, and the merged BAM is converted into CRAM. The gVCFs and the CRAM are the final alignment outputs.
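The scatter execution described above can be illustrated with a minimal Python sketch. It is not taken from the Kids First workflow: the contig lengths are toy values, the chunk size is arbitrary, and `process_interval` is a hypothetical placeholder standing in for per-interval work such as duplicate marking or recalibration. It only shows the general pattern of splitting contigs into intervals, processing them in parallel, and gathering the results in order.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy contig lengths (hypothetical; real hg38 contigs are far longer).
CONTIG_LENGTHS = {"chr1": 2500, "chr2": 1800, "chrX": 900}


def split_into_intervals(contig_lengths, chunk_size):
    """Scatter step: cut each contig into 1-based, inclusive intervals."""
    intervals = []
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(start + chunk_size - 1, length)
            intervals.append((contig, start, end))
            start = end + 1
    return intervals


def process_interval(interval):
    """Hypothetical placeholder for per-interval work.

    It only reports how many bases the interval spans.
    """
    contig, start, end = interval
    return {"interval": interval, "bases": end - start + 1}


def scatter_gather(contig_lengths, chunk_size, max_workers=4):
    """Process every interval in parallel, then gather the results."""
    intervals = split_into_intervals(contig_lengths, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so results stay in interval order.
        return list(pool.map(process_interval, intervals))


if __name__ == "__main__":
    for result in scatter_gather(CONTIG_LENGTHS, chunk_size=1000):
        print(result)
```

Because the intervals do not overlap and together cover each contig, the per-interval results can be concatenated in order during the gather step without double counting.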
For the joint genotyping workflow, trio-based and cohort-based gVCFs are imported into a GenomicsDB by GATK4 and passed to GenotypeGVCFs. Variant Quality Score Recalibration (VQSR) is applied to SNPs and indels separately, scattered over calling intervals. A final VCF is assembled by GATK4 GatherVcfs and reviewed for QC with CollectVariantCallingMetrics. All outputs are then registered in the Gabriella Miller Kids First Data Service for tracking and final checking of results and, after approval, are released to the DRC Portal and Cavatica.
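To make the separate handling of SNPs and indels concrete, here is a minimal Python sketch that groups variant records by type before any class-specific step. It is an illustration only, not GATK's logic: the classification uses allele lengths alone, it ignores symbolic and multi-allelic alleles, and the function names and example records are hypothetical.

```python
from collections import defaultdict


def variant_type(ref, alt):
    """Classify one REF/ALT pair by allele length (a simplification)."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "INDEL"
    return "OTHER"  # e.g. multi-nucleotide substitutions


def split_by_type(records):
    """Group (chrom, pos, ref, alt) tuples so each class can be handled separately."""
    groups = defaultdict(list)
    for chrom, pos, ref, alt in records:
        groups[variant_type(ref, alt)].append((chrom, pos, ref, alt))
    return dict(groups)


# Hypothetical records for illustration only.
EXAMPLE_RECORDS = [
    ("chr1", 101, "A", "G"),
    ("chr1", 250, "AT", "A"),
    ("chr2", 77, "C", "CTT"),
    ("chr2", 90, "GC", "TA"),
]

if __name__ == "__main__":
    for kind, recs in split_by_type(EXAMPLE_RECORDS).items():
        print(kind, recs)
```

Splitting first lets each class be modeled with its own parameters, which is the motivation for recalibrating SNPs and indels separately.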
Kids First DRC pipelines are open source and made available to the public via GitHub:
Alignment workflow: https://github.com/kids-first/kf-alignment-workflow
Joint genotyping workflow: https://github.com/kids-first/kf-jointgenotyping-workflow