Data Processing

Main Workflow

File: main.nf

The entry point controls all sub-workflows. It automatically determines read file locations based on whether the default example data or custom data is being used:

  • If the sample table matches the default example, reads are resolved relative to $baseDir.

  • If a custom sample table is provided, reads are resolved relative to $launchDir.
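
The resolution rule above can be sketched in Python. This is only an illustration: the pipeline implements it in Nextflow, the function and argument names are hypothetical, and "matches the default example" is assumed here to mean that both paths point to the same file.

```python
from pathlib import Path


def resolve_reads_prefix(sample_table, default_sample_table, base_dir, launch_dir):
    """Return the directory that relative read paths are resolved against.

    Assumption: a sample table "matches the default example" when it points
    to the same file as the default sample table.
    """
    if Path(sample_table).resolve() == Path(default_sample_table).resolve():
        # Default example data ships with the pipeline code ($baseDir).
        return Path(base_dir)
    # Custom data lives where the user launched the run ($launchDir).
    return Path(launch_dir)
```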

1. Alignment Workflow (ALIGN)

File: workflows/alignment.nf

Processes raw FASTQ reads into aligned count matrices.

Steps:

  1. Validate Sample Table (validate-sample-table.py)

    • Ensures fastq_filepath column is present.

    • Auto-generates integer sample_id indices.

    • Validates that the control_status column exists and contains at least two beads_only samples (when Z-score analysis is enabled).

  2. Validate Peptide Table (validate-peptide-table.py)

    • Ensures oligo column is present.

    • Auto-generates peptide_id indices if not provided.

  3. Generate FASTA Reference (generate-fasta.py)

    • Converts peptide oligo sequences to FASTA format (uppercase characters only).

  4. Generate Bowtie2 Index (bowtie2-build)

    • Builds a Bowtie2 index from the peptide FASTA reference.

  5. Nanopore Alignment (templates/nanopore_alignment.sh)

    • Tool: Bowtie2 with configurable options (default: --local --very-sensitive-local).

    • Aligns each sample’s FASTQ reads against the peptide index.

    • Output: SAM files per sample.

  6. SAM to Counts (templates/sam_to_counts.sh)

    • Tools: SAMtools (view, sort, index, idxstats).

    • Converts SAM → sorted BAM → indexed BAM → per-peptide read counts.

  7. SAM to Stats (templates/sam_to_stats.sh)

    • Tool: SAMtools stats.

    • Extracts alignment statistics (raw total sequences, reads mapped, etc.).

  8. Collect PhIP Data (merge-counts-stats.py)

    • Merges all per-sample counts and stats into a single phippery xarray dataset (.phip format).

    • Computes: percent_mapped, percent_peptides_detected, percent_peptides_between_10_and_100.

  9. Replicate Counts (replicate-counts.py, optional)

    • When replicate_sequence_counts = true, aggregates counts for peptides sharing the same oligo sequence.
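
The replicate aggregation in step 9 can be sketched as follows. This is a minimal illustration, not the logic of replicate-counts.py: it assumes that "aggregates" means summing, and that every peptide sharing an oligo receives the summed count. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def aggregate_replicate_counts(oligos, counts):
    """Sum counts across peptides that share an oligo sequence.

    oligos: dict mapping peptide_id -> oligo sequence
    counts: dict mapping peptide_id -> raw read count for one sample
    Returns a dict mapping peptide_id -> summed count (assumed behaviour).
    """
    totals = defaultdict(int)
    for peptide_id, oligo in oligos.items():
        totals[oligo] += counts.get(peptide_id, 0)
    # Every replicate of an oligo is assigned the same aggregated total.
    return {peptide_id: totals[oligo] for peptide_id, oligo in oligos.items()}
```

For example, two peptides with the oligo ACGT and counts 5 and 7 would both be reported with a count of 12.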

2. Statistics Workflow (STATS)

File: workflows/statistics.nf

Applies normalization and statistical enrichment modeling using phippery.

Steps:

  1. Counts Per Million (CPM)

    • Normalizes raw counts to counts per million using phippery.normalize.counts_per_million().

  2. Size Factors

    • Estimates size factors for each sample using phippery.normalize.size_factors().

  3. CPM Fold Enrichment (optional, run_cpm_enr_workflow = true)

    • Computes fold enrichment of each sample’s CPM relative to library controls using phippery.normalize.enrichment().

  4. Z-Score Fit-Predict (optional, run_zscore_fit_predict = true)

    • Tool: fit-predict-zscore.py using phippery.modeling.zscore().

    • Fits a regression model on bead-only controls and predicts Z-scores for all samples.

    • Parameters: min_Npeptides_per_bin=300, quantile limits 0.05–0.95.

  5. Merge Binary Datasets

    • Merges all enrichment layer outputs into a single consolidated .phip dataset using phippery merge.
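
The counts-per-million normalization in step 1 follows the standard formula: each raw count is divided by the sample's total count and multiplied by one million. The sketch below shows that arithmetic in plain Python. It does not call phippery, whose implementation may differ in detail, and the function name and data layout are assumptions.

```python
def counts_per_million(counts):
    """Normalize raw counts to counts per million, per sample.

    counts: dict mapping sample_id -> list of per-peptide raw counts
    Returns a dict with the same shape holding CPM values.
    """
    cpm = {}
    for sample_id, values in counts.items():
        total = sum(values)
        # Guard against empty samples to avoid division by zero.
        cpm[sample_id] = [v / total * 1e6 if total else 0.0 for v in values]
    return cpm
```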

3. Output Workflow (DSOUT)

File: workflows/output.nf

Exports the merged dataset in multiple formats.

Output          Condition                      Format
--------------  -----------------------------  --------------------------------
Binary pickle   output_pickle_xarray = true    .phip
Wide CSV        output_wide_csv = true         Gzipped CSV per enrichment layer
Tall CSV        output_tall_csv = true         Gzipped tall-format CSV

The wide CSV output produces files like dataset_counts.csv.gz, dataset_cpm.csv.gz, dataset_zscore.csv.gz, etc.
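
As an illustration of the naming scheme, the sketch below writes one enrichment layer to a gzipped wide CSV. The function name, the argument layout, and the orientation (peptides as rows, samples as columns) are assumptions, not the pipeline's actual export code.

```python
import csv
import gzip
from pathlib import Path


def write_wide_csv(layer_name, matrix, sample_ids, peptide_ids, out_dir, prefix="dataset"):
    """Write one enrichment layer as <prefix>_<layer_name>.csv.gz.

    matrix: list of rows, one per peptide, each with one value per sample
    (the row/column orientation is an assumption).
    """
    path = Path(out_dir) / f"{prefix}_{layer_name}.csv.gz"
    with gzip.open(path, "wt", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["peptide_id", *sample_ids])
        for peptide_id, row in zip(peptide_ids, matrix):
            writer.writerow([peptide_id, *row])
    return path
```

Calling it once per layer (counts, cpm, zscore, and so on) yields the file names listed above.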

4. FHIR Report Workflow (FHIR)

File: workflows/fhir_report.nf

Generates HL7 FHIR R4 transaction bundles from PhIP-Seq Z-score data.

Steps:

  1. CREATE_FHIR (bin/phipseq_to_fhir.py)

    • Reads Z-score CSV and sample table.

    • For each sample, generates a FHIR Bundle containing:

      • Patient resource

      • Specimen resource (serum for PhIP-Seq)

      • Organization resource

      • Practitioner / PractitionerRole resources

      • Observation resources per peptide (Z-score values, interpreted as positive when the Z-score is > 3.5 and negative otherwise)

    • Method coded as SNOMED CT 708049000 (Phage immunoprecipitation sequencing).
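
A per-peptide Observation of the kind described above might look like the following sketch. It is not the output of phipseq_to_fhir.py: the function name, the free-text code, and the choice of fields are assumptions. Only the 3.5 threshold and the SNOMED CT method code come from this document. The interpretation codes POS and NEG are taken from the HL7 v3 ObservationInterpretation code system.

```python
def zscore_observation(patient_id, peptide_name, zscore, threshold=3.5):
    """Build a minimal FHIR R4 Observation (as a dict) for one peptide Z-score."""
    positive = zscore > threshold  # strict ">" as stated in the step above
    return {
        "resourceType": "Observation",
        "status": "final",
        # Hypothetical free-text code; a real bundle would likely use a coded concept.
        "code": {"text": f"PhIP-Seq Z-score: {peptide_name}"},
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": zscore},
        "interpretation": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/v3-ObservationInterpretation",
                "code": "POS" if positive else "NEG",
                "display": "Positive" if positive else "Negative",
            }]
        }],
        # Method code as given in this document.
        "method": {"coding": [{"system": "http://snomed.info/sct", "code": "708049000"}]},
    }
```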

5. FDR Workflow (FDR)

File: workflows/fdr.nf

Performs false discovery rate analysis on Z-scores.

Steps:

  1. Z-score FDR Analysis (zscore_fdr_analysis)

    • Converts Z-scores to two-sided p-values using the normal distribution CDF.

    • Applies Benjamini-Hochberg FDR correction per sample.
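
The two steps above can be sketched with the standard library alone. For a standard normal variable, the two-sided p-value of a Z-score z equals erfc(|z| / sqrt(2)). The Benjamini-Hochberg step below is the textbook procedure applied to one sample's p-values. The function names are hypothetical and the pipeline's own implementation may differ.

```python
import math


def two_sided_pvalue(z):
    """Two-sided p-value of a Z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

Applied per sample, this means calling benjamini_hochberg once on the list of p-values for all peptides in that sample.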

6. Virus Score Workflow (VIRUSSCORE)

File: workflows/virusscore.nf

Calculates per-species virus exposure scores based on enriched peptide hits.

Steps:

  1. Split Hits — Extracts significant hit matrices per sample.

  2. Calculate Scores (bin/calc_scores_nofilter.py)

    • Groups peptides by species.

    • Applies novel epitope filtering: a peptide is counted only if it does not share a subsequence of length ≥ virus_score_epitope_len (default: 7 amino acids) with any previously assigned peptide.

    • Output: Per-sample virus score CSV files.
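
The novel epitope filter can be sketched as a greedy pass over one species' hit peptides. Two peptides share a subsequence of length ≥ k exactly when they share at least one k-mer, so a set of previously seen k-mers is sufficient. The function names and the processing order are assumptions. This is not the code of calc_scores_nofilter.py.

```python
def kmers(sequence, k):
    """Return the set of all contiguous substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def virus_score(hit_peptides, epitope_len=7):
    """Count hit peptides that share no k-mer with previously counted peptides.

    hit_peptides: amino acid sequences of enriched peptides for one species,
    in the order they should be considered (the order is an assumption).
    """
    seen = set()
    score = 0
    for sequence in hit_peptides:
        peptide_kmers = kmers(sequence, epitope_len)
        if peptide_kmers & seen:
            continue  # overlaps an already counted peptide
        seen |= peptide_kmers  # only counted peptides contribute k-mers
        score += 1
    return score
```

Note that in this sketch a peptide shorter than epitope_len has no k-mers and is therefore always counted.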

7. IEDB Annotation Workflow (IEDB)

File: workflows/iedb_annotation.nf

Cross-references enriched peptides with the Immune Epitope Database.

Steps:

  1. Extract Sample Names — Parses column headers from Z-score file.

  2. IEDB Annotation (per sample)

    • Loads IEDB TSV database and extracts valid epitope sequences.

    • For each peptide, performs substring matching against the IEDB epitope set.

    • Classifies peptides as:

      • Known — matches at least one IEDB epitope.

      • Novel — no IEDB match found.

      • Significant — Z-score ≥ threshold (default: 3.5).

    • Outputs per sample:

      • {sample}_annotated_peptides.csv

      • {sample}_novel_peptides.csv

      • {sample}_significant_epitopes.csv

      • {sample}_annotation_summary.txt
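
The classification logic can be sketched as follows. The sketch assumes that "substring matching" means an IEDB epitope occurring inside the peptide sequence. The function name and the returned fields are hypothetical.

```python
def annotate_peptide(peptide_seq, zscore, iedb_epitopes, threshold=3.5):
    """Classify one peptide against a collection of IEDB epitope sequences.

    Assumption: a peptide is "Known" when at least one epitope is a
    substring of the peptide sequence.
    """
    matches = sorted(e for e in iedb_epitopes if e and e in peptide_seq)
    return {
        "peptide": peptide_seq,
        "iedb_status": "Known" if matches else "Novel",
        "significant": zscore >= threshold,
        "matched_epitopes": matches,
    }
```

Known/Novel and Significant are independent labels in this sketch, so a peptide can be both Novel and Significant.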

8. Visualization Workflow (VISUALIZE)

File: workflows/visualization.nf

Generates interactive Plotly heatmaps.

Outputs:

  1. Virus Score Heatmap (virus_score_heatmap.html)

    • Merges per-sample virus score files.

  2. Z-Score Heatmap (zscore_heatmap.html)

    • Per-organism dropdown selector.

    • Heatmap of Z-scores: Peptide Oligos × Samples.

    • Hover shows peptide name, sample ID, and Z-score value.

9. Streamlit Dashboard Workflow (STREAMLIT)

File: workflows/streamlit.nf

Creates an interactive 3D protein visualization web application.

Steps:

  1. Generate JSmol HTML (GENERATE_JSMOL_HTML)

    • Parses Z-score data and peptide metadata with PDB IDs.

    • Copies local PDB/CIF structure files.

    • Generates per-structure HTML files with epitope annotations.

    • Exports epitope_data.csv for the Streamlit app.

  2. Create Streamlit App (CREATE_STREAMLIT_APP)

    • Generates a complete Streamlit application with:

      • Mol* 3D web component for interactive protein structure viewing.

      • Sequence-to-structure mapping.

      • Per-sample epitope selection with color-coded highlighting.

10. Neutralization Prediction Workflow (NEUTRALIZATION_PREDICTION)

File: workflows/neutralization_score.nf

Predicts neutralizing antibody potential of detected epitopes using multi-factor scoring (in development).

Steps:

  1. Parse 3D Structures

    • Computes solvent-accessible surface area (SASA) using the Shrake-Rupley algorithm.

    • Extracts B-factors and normalizes them per chain.

    • Caches chain sequences and coordinates for matching epitopes.

  2. Calculate Peptide Scores

    • Conservation Score: Log-weighted IEDB epitope match count.

    • IEDB Evidence Score: Shannon entropy of source organisms.

    • Epitope Coverage Score: Fraction of protein positions with IEDB hits.

    • Neutralization DB Score: Tiered scoring for exact/substring/kmer matches in neutralization database.

    • Structural Features:

      • SASA normalization (0–1 scale per residue type).

      • B-factor (normalized per chain).

      • 3D centroid coordinates of matched epitope region.

    • Sequence Properties:

      • GRAVY (Kyte-Doolittle hydropathy): influences surface accessibility multiplier.

      • Flexibility Bonus: loop-forming residues (G, P, S, T).

      • N-Glycosylation Penalty: counts NxS/T motifs, reduces score up to 0.75×.

    • Context Factor: 1.12× for spike protein, 1.06× for envelope/membrane proteins.

  3. Sample-Level Scoring

    • For each Z-score hit ≥ threshold (default: 3.5) per sample:

      • Converts Z-score to PhIP signal.

      • Combines the static scores with the Q-value using a weighted linear combination (default weights: 0.15 conservation, 0.25 PhIP, 0.1 IEDB, 0.25 neut DB, 0.15 coverage, 0.1 B-factor).

      • Applies multipliers: GRAVY, glycan, context, flexibility bonus.

      • Filters by SASA (hard threshold: <0.05 → score × 0.0).

    • Prediction category:

      • High: composite ≥ 3.0 (default threshold).

      • Moderate: composite ≥ 1.95 (65% of the High threshold).

      • Low: composite < 1.95.

  4. Spatial Clustering

    • Groups peptides by sample + PDB ID; clusters epitope coordinates within 8.0 Å (configurable).

    • Annotates conformational epitope clusters.

  5. Outputs

    • neutralization_scores_per_sample.csv

    • high_confidence_candidates.csv

    • neutralization_summary.txt

    • detailed_analysis.json

    • conformational_epitope_clusters.csv
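
The sample-level scoring in step 3 can be sketched as a weighted sum followed by multipliers and a SASA filter. Only the default weights, the SASA cutoff, and the category thresholds come from this document. The function names, the component keys, and the scale of each component are assumptions, and the workflow itself is described as in development.

```python
DEFAULT_WEIGHTS = {
    "conservation": 0.15,
    "phip": 0.25,
    "iedb": 0.10,
    "neut_db": 0.25,
    "coverage": 0.15,
    "bfactor": 0.10,
}


def composite_score(components, multipliers=(), sasa=1.0,
                    weights=DEFAULT_WEIGHTS, sasa_min=0.05):
    """Weighted linear combination of component scores, then multipliers.

    components: dict with one value per key in `weights` (hypothetical keys).
    multipliers: e.g. GRAVY, glycan, context, and flexibility factors.
    """
    if sasa < sasa_min:
        return 0.0  # hard SASA filter: buried epitopes score zero
    score = sum(weights[name] * components[name] for name in weights)
    for factor in multipliers:
        score *= factor
    return score


def categorize(score, high=3.0, moderate=1.95):
    """Map a composite score to a prediction category."""
    if score >= high:
        return "High"
    if score >= moderate:
        return "Moderate"
    return "Low"
```

The Moderate cutoff is passed as an explicit value rather than computed as 0.65 × 3.0, because that product does not evaluate to exactly 1.95 in floating point.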

Workflow Parameters

All parameters are defined in nextflow.config.