Data Processing

Main Workflow

File: main.nf

The entry point controls all sub-workflows. It automatically determines where read files are located, depending on whether the default example data or custom data is being used:

- If the sample table matches the default example, reads are resolved relative to $baseDir.
- If a custom sample table is provided, reads are resolved relative to $launchDir.
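
The resolution rule above can be sketched in Python as follows. This is only an illustration of the logic; main.nf itself is Nextflow, and the function name, argument names, and example paths are hypothetical.

```python
from pathlib import Path


def resolve_reads_prefix(sample_table, default_sample_table, base_dir, launch_dir):
    """Return the directory that relative FASTQ paths are resolved against.

    Assumption: "matches the default example" means the sample table path
    equals the path of the bundled example table.
    """
    uses_default = Path(sample_table) == Path(default_sample_table)
    return Path(base_dir) if uses_default else Path(launch_dir)


# Hypothetical paths, for illustration only.
default_table = "/pipeline/data/example/sample_table.csv"
prefix_default = resolve_reads_prefix(default_table, default_table, "/pipeline", "/work")
prefix_custom = resolve_reads_prefix("/work/my_samples.csv", default_table, "/pipeline", "/work")
```

With these inputs, the default table resolves against the pipeline directory and the custom table resolves against the launch directory.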

1. Alignment Workflow (ALIGN)

File: workflows/alignment.nf

Processes raw FASTQ reads into aligned count matrices.

Steps:

Validate Sample Table (validate-sample-table.py)
- Ensures fastq_filepath column is present.
- Auto-generates integer sample_id indices.
- Validates that control_status column exists with ≥2 beads_only samples (when Z-score analysis is enabled).
Validate Peptide Table (validate-peptide-table.py)
- Ensures oligo column is present.
- Auto-generates peptide_id indices if not provided.
Generate FASTA Reference (generate-fasta.py)
- Converts peptide oligo sequences to FASTA format (uppercase characters only).
Generate Bowtie2 Index (bowtie2-build)
- Builds a Bowtie2 index from the peptide FASTA reference.
Nanopore Alignment (templates/nanopore_alignment.sh)
- Tool: Bowtie2 with configurable options (default: --local --very-sensitive-local).
- Aligns each sample’s FASTQ reads against the peptide index.
- Output: SAM files per sample.
SAM to Counts (templates/sam_to_counts.sh)
- Tools: SAMtools (view, sort, index, idxstats).
- Converts SAM → sorted BAM → indexed BAM → per-peptide read counts.
SAM to Stats (templates/sam_to_stats.sh)
- Tool: SAMtools stats.
- Extracts alignment statistics (raw total sequences, reads mapped, etc.).
Collect PhIP Data (merge-counts-stats.py)
- Merges all per-sample counts and stats into a single phippery xarray dataset (.phip format).
- Computes: percent_mapped, percent_peptides_detected, percent_peptides_between_10_and_100.
Replicate Counts (replicate-counts.py, optional)
- When replicate_sequence_counts = true, aggregates counts for peptides sharing the same oligo sequence.
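
As a rough illustration of the SAM to Counts and Collect PhIP Data steps, the sketch below parses the tab-delimited text produced by `samtools idxstats` (reference name, length, mapped reads, unmapped reads) and derives the three summary metrics. The exact metric definitions are assumptions (for example, that "detected" means at least one read and that the 10 to 100 range is inclusive); the real logic lives in merge-counts-stats.py.

```python
def parse_idxstats(text):
    """Map reference (peptide) name to mapped read count."""
    counts = {}
    for line in text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":  # trailing row for reads not assigned to any reference
            continue
        counts[name] = int(mapped)
    return counts


def summarize(counts, raw_total_sequences, reads_mapped):
    """Compute per-sample summary metrics (definitions are assumptions)."""
    n_peptides = len(counts)
    detected = sum(1 for c in counts.values() if c > 0)
    mid_range = sum(1 for c in counts.values() if 10 <= c <= 100)
    return {
        "percent_mapped": 100.0 * reads_mapped / raw_total_sequences,
        "percent_peptides_detected": 100.0 * detected / n_peptides,
        "percent_peptides_between_10_and_100": 100.0 * mid_range / n_peptides,
    }


# Made-up idxstats output for four peptides.
example = "0\t117\t12\t0\n1\t117\t0\t0\n2\t117\t250\t0\n3\t117\t38\t0\n*\t0\t0\t100\n"
counts = parse_idxstats(example)
stats = summarize(counts, raw_total_sequences=400, reads_mapped=300)
```

The `raw_total_sequences` and `reads_mapped` arguments correspond to the values reported by SAMtools stats in the SAM to Stats step.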

2. Statistics Workflow (STATS)

File: workflows/statistics.nf

Applies normalization and statistical enrichment modeling using phippery.

Steps:

Counts Per Million (CPM)
- Normalizes raw counts to counts per million using phippery.normalize.counts_per_million().
Size Factors
- Estimates size factors for each sample using phippery.normalize.size_factors().
CPM Fold Enrichment (optional, run_cpm_enr_workflow = true)
- Computes fold enrichment of each sample’s CPM relative to library controls using phippery.normalize.enrichment().
Z-Score Fit-Predict (optional, run_zscore_fit_predict = true)
- Tool: fit-predict-zscore.py using phippery.modeling.zscore().
- Fits a regression model on bead-only controls and predicts Z-scores for all samples.
- Parameters: min_Npeptides_per_bin=300, quantile limits 0.05–0.95.
Merge Binary Datasets
- Merges all enrichment layer outputs into a single consolidated .phip dataset using phippery merge.
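
For readers unfamiliar with the CPM normalization, here is a minimal pure-Python sketch of the arithmetic. It is not phippery's implementation; the function below and its dict-of-lists input are assumptions made for illustration.

```python
def counts_per_million(counts):
    """Scale each sample's raw counts so that they sum to one million.

    counts: dict mapping sample ID to a list of per-peptide raw counts.
    """
    cpm = {}
    for sample, values in counts.items():
        total = sum(values)
        # Guard against empty samples to avoid division by zero.
        cpm[sample] = [v * 1e6 / total if total else 0.0 for v in values]
    return cpm


raw = {"s1": [10, 30, 60], "s2": [0, 5, 5]}
cpm = counts_per_million(raw)
```

Because each sample is scaled by its own library size, CPM values are comparable across samples with different sequencing depths.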

3. Output Workflow (DSOUT)

File: workflows/output.nf

Exports the merged dataset in multiple formats.

Output	Condition	Format
Binary pickle	`output_pickle_xarray = true`	`.phip`
Wide CSV	`output_wide_csv = true`	Gzipped CSV per enrichment layer
Tall CSV	`output_tall_csv = true`	Gzipped tall-format CSV

The wide CSV output produces files like dataset_counts.csv.gz, dataset_cpm.csv.gz, dataset_zscore.csv.gz, etc.
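
To make the difference between the wide and tall formats concrete, the sketch below reshapes a small wide table into tall records. The layout is an assumption: peptides as rows, samples as columns, and a first column named peptide_id. The actual column names in the pipeline output may differ.

```python
import csv
import io


def wide_to_tall(wide_csv_text, value_name):
    """Turn a peptide-by-sample matrix into one record per (peptide, sample)."""
    reader = csv.reader(io.StringIO(wide_csv_text))
    header = next(reader)
    sample_ids = header[1:]  # first column holds the peptide identifier
    rows = []
    for record in reader:
        peptide_id, values = record[0], record[1:]
        for sample_id, value in zip(sample_ids, values):
            rows.append(
                {"peptide_id": peptide_id, "sample_id": sample_id, value_name: value}
            )
    return rows


# Made-up wide table: two peptides, two samples.
wide = "peptide_id,0,1\n0,5,7\n1,0,3\n"
tall = wide_to_tall(wide, "counts")
```

Each wide file holds a single enrichment layer, whereas a tall table can carry several layers as additional columns on the same (peptide, sample) rows.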

4. FHIR Report Workflow (FHIR)

File: workflows/fhir_report.nf

Generates HL7 FHIR R4 transaction bundles from PhIP-Seq Z-score data.

Steps:

CREATE_FHIR (bin/phipseq_to_fhir.py)
- Reads Z-score CSV and sample table.
- For each sample, generates a FHIR Bundle containing:
  - Patient resource
  - Specimen resource (serum for PhIP-Seq)
  - Organization resource
  - Practitioner / PractitionerRole resources
  - Observation resources per peptide (Z-score values, interpreted as positive when Z-score > 3.5 and negative otherwise)
- Method coded as SNOMED CT 708049000 (Phage immunoprecipitation sequencing).
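
A per-peptide Observation might look roughly like the dictionary built below. This is a sketch of a plausible FHIR R4 shape, not the output of bin/phipseq_to_fhir.py: the function name, the free-text `code`, the `unit` string, and the `Patient/{id}` reference style are all assumptions. The SNOMED CT method code and the 3.5 threshold are taken from the description above.

```python
import json


def zscore_observation(patient_id, peptide_name, zscore, threshold=3.5):
    """Build a FHIR R4 Observation dict for one peptide Z-score (illustrative)."""
    positive = zscore > threshold
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": f"PhIP-Seq Z-score for {peptide_name}"},
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": zscore, "unit": "Z-score"},
        "interpretation": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/v3-ObservationInterpretation",
                "code": "POS" if positive else "NEG",
                "display": "Positive" if positive else "Negative",
            }]
        }],
        "method": {
            "coding": [{
                "system": "http://snomed.info/sct",
                "code": "708049000",
                "display": "Phage immunoprecipitation sequencing",
            }]
        },
    }


obs = zscore_observation("sample-1", "peptide_42", 4.2)
obs_json = json.dumps(obs, indent=2)
```

In a transaction bundle, each such resource would be wrapped in a Bundle entry with its own request element.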

5. FDR Workflow (FDR)

File: workflows/fdr.nf

Performs false discovery rate analysis on Z-scores.

Steps:

Z-score FDR Analysis (zscore_fdr_analysis)
- Converts Z-scores to two-sided p-values using the normal distribution CDF.
- Applies Benjamini-Hochberg FDR correction per sample.
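
The two steps above can be sketched with the standard library alone. The two-sided p-value under a standard normal is 2(1 - Φ(|z|)), which equals erfc(|z| / √2). The function names are illustrative and are not taken from the pipeline.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a Z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up Z-scores for one sample.
zscores = [0.3, 4.1, 2.2, 5.0]
pvals = [two_sided_p(z) for z in zscores]
qvals = benjamini_hochberg(pvals)
```

Running the correction separately for each sample, as described above, means the number of tests `m` is the number of peptides in that sample.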

6. Virus Score Workflow (VIRUSSCORE)

File: workflows/virusscore.nf

Calculates per-species virus exposure scores based on enriched peptide hits.

Steps:

Split Hits
- Extracts significant hit matrices per sample.
Calculate Scores (bin/calc_scores_nofilter.py)
- Groups peptides by species.
- Applies novel epitope filtering: a peptide is counted only if it does not share a subsequence of length ≥ virus_score_epitope_len (default: 7 amino acids) with any previously assigned peptide.
- Output: Per-sample virus score CSV files.
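
The novel epitope filter can be sketched as below. Two peptides share a subsequence of length ≥ k exactly when they share at least one k-mer, so it is enough to track the k-mers of peptides already counted. This is a simplified illustration, not the code in bin/calc_scores_nofilter.py; in particular, the order in which peptides are visited is an assumption and affects which peptides are counted.

```python
def kmers(seq, k):
    """All contiguous substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def virus_score(peptides, epitope_len=7):
    """Count peptides that share no k-mer with any previously counted peptide."""
    seen = set()
    score = 0
    for seq in peptides:
        current = kmers(seq, epitope_len)
        if seen.isdisjoint(current):
            score += 1
            seen |= current  # only counted peptides contribute k-mers
    return score


# Made-up hits for one species: the second overlaps the first by 7 residues.
hits = ["MKTAYIAKQR", "AYIAKQRQIS", "GDSLVPWNHE"]
score = virus_score(hits)
```

Here the second peptide is skipped because it shares the 7-mer AYIAKQR with the first, so the score is 2 rather than 3.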

7. IEDB Annotation Workflow (IEDB)

File: workflows/iedb_annotation.nf

Cross-references enriched peptides with the Immune Epitope Database.

Steps:

Extract Sample Names
- Parses column headers from Z-score file.
IEDB Annotation (per sample)
- Loads IEDB TSV database and extracts valid epitope sequences.
- For each peptide, performs substring matching against the IEDB epitope set.
- Classifies peptides as:
  - Known — matches at least one IEDB epitope.
  - Novel — no IEDB match found.
  - Significant — Z-score ≥ threshold (default: 3.5).
- Outputs per sample:
  - {sample}_annotated_peptides.csv
  - {sample}_novel_peptides.csv
  - {sample}_significant_epitopes.csv
  - {sample}_annotation_summary.txt
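
The classification logic can be sketched as follows. The direction of the substring match (an IEDB epitope contained within the peptide sequence) is an assumption, as are the function name and the record fields.

```python
def annotate(peptides, zscores, epitopes, threshold=3.5):
    """Label each peptide as Known or Novel and flag significant Z-scores.

    peptides: dict of peptide name to amino acid sequence.
    zscores:  dict of peptide name to Z-score for one sample.
    epitopes: iterable of IEDB epitope sequences.
    """
    rows = []
    for name, seq in peptides.items():
        matches = sorted(e for e in epitopes if e in seq)
        rows.append({
            "peptide": name,
            "status": "Known" if matches else "Novel",
            "matched_epitopes": matches,
            "significant": zscores[name] >= threshold,
        })
    return rows


# Made-up inputs for illustration.
peptides = {"pep1": "MKTAYIAKQRQISFVK", "pep2": "GDSLVPWNHEAAGGTT"}
zscores = {"pep1": 6.2, "pep2": 1.1}
annotated = annotate(peptides, zscores, {"AYIAKQR", "WWWWWWW"})
```

Known/Novel and Significant are independent labels here: a peptide can be both Novel and Significant.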

8. Visualization Workflow (VISUALIZE)

File: workflows/visualization.nf

Generates interactive Plotly heatmaps.

Outputs:

Virus Score Heatmap (virus_score_heatmap.html)
- Merges per-sample virus score files.
Z-Score Heatmap (zscore_heatmap.html)
- Per-organism dropdown selector.
- Heatmap of Z-scores: Peptide Oligos × Samples.
- Hover shows peptide name, sample ID, and Z-score value.

9. Streamlit Dashboard Workflow (STREAMLIT)

File: workflows/streamlit.nf

Creates an interactive 3D protein visualization web application.

Steps:

Generate JSmol HTML (GENERATE_JSMOL_HTML)
- Parses Z-score data and peptide metadata with PDB IDs.
- Copies local PDB/CIF structure files.
- Generates per-structure HTML files with epitope annotations.
- Exports epitope_data.csv for the Streamlit app.
Create Streamlit App (CREATE_STREAMLIT_APP)
- Generates a complete Streamlit application with:
  - Mol* 3D web component for interactive protein structure viewing.
  - Sequence-to-structure mapping.
  - Per-sample epitope selection with color-coded highlighting.

10. Neutralization Prediction Workflow (NEUTRALIZATION_PREDICTION)

File: workflows/neutralization_score.nf

Predicts the neutralizing antibody potential of detected epitopes using multi-factor scoring. This workflow is in development.

Steps:

Parse 3D Structures
- Computes Solvent-Accessible Surface Area (SASA) using Shrake-Rupley algorithm.
- Extracts B-factor normalization per chain.
- Caches chain sequences and coordinates for matching epitopes.
Calculate Peptide Scores
- Conservation Score: Log-weighted IEDB epitope match count.
- IEDB Evidence Score: Shannon entropy of source organisms.
- Epitope Coverage Score: Fraction of protein positions with IEDB hits.
- Neutralization DB Score: Tiered scoring for exact/substring/kmer matches in neutralization database.
- Structural Features:
  - SASA normalization (0–1 scale per residue type).
  - B-factor (normalized per chain).
  - 3D centroid coordinates of matched epitope region.
- Sequence Properties:
  - GRAVY (Kyte-Doolittle hydropathy): influences surface accessibility multiplier.
  - Flexibility Bonus: loop-forming residues (G, P, S, T).
  - N-Glycosylation Penalty: counts NxS/T motifs, reduces score up to 0.75×.
- Context Factor: 1.12× for spike protein, 1.06× for envelope/membrane proteins.
Sample-Level Scoring
- For each Z-score hit ≥ threshold (default: 3.5) per sample:
  - Converts Z-score to PhIP signal.
  - Combines static scores with Q-value using weighted linear combination (default weights: 0.15 conservation, 0.25 PhIP, 0.1 IEDB, 0.25 neut DB, 0.15 coverage, 0.1 B-factor).
  - Applies multipliers: GRAVY, glycan, context, flexibility bonus.
  - Filters by SASA: peptides with SASA below the hard threshold of 0.05 have their score multiplied by 0.0.
- Prediction category:
  - High: composite ≥ 3.0 (default threshold).
  - Moderate: composite ≥ 1.95 (65% of the High threshold).
  - Low: composite < 1.95.
Spatial Clustering
- Groups peptides by sample + PDB ID; clusters epitope coordinates within 8.0 Å (configurable).
- Annotates conformational epitope clusters.
Outputs
- neutralization_scores_per_sample.csv
- high_confidence_candidates.csv
- neutralization_summary.txt
- detailed_analysis.json
- conformational_epitope_clusters.csv
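
The sample-level scoring described above can be sketched as a weighted sum followed by multipliers and a SASA gate. The default weights, the 0.05 SASA cutoff, and the category thresholds come from the description; the function names, the component keys, and the scale of each component are assumptions, and the workflow itself is still in development.

```python
DEFAULT_WEIGHTS = {
    "conservation": 0.15,
    "phip": 0.25,
    "iedb": 0.10,
    "neut_db": 0.25,
    "coverage": 0.15,
    "b_factor": 0.10,
}


def composite_score(components, multipliers=(), sasa=1.0,
                    weights=DEFAULT_WEIGHTS, sasa_min=0.05):
    """Weighted linear combination, then multipliers, gated by SASA."""
    if sasa < sasa_min:  # buried epitopes are zeroed out
        return 0.0
    score = sum(weights[key] * components[key] for key in weights)
    for factor in multipliers:  # e.g. GRAVY, glycan, context, flexibility
        score *= factor
    return score


def categorize(score, high=3.0, moderate=1.95):
    """Map a composite score to a prediction category."""
    if score >= high:
        return "High"
    if score >= moderate:
        return "Moderate"
    return "Low"


# Made-up component scores; 1.12 is the spike protein context factor.
components = {key: 1.0 for key in DEFAULT_WEIGHTS}
spike_score = composite_score(components, multipliers=[1.12], sasa=0.4)
```

The moderate cutoff is passed as an explicit 1.95 rather than computed as 0.65 × 3.0, which avoids a floating-point rounding surprise at the boundary.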

Workflow Parameters

All parameters are defined in nextflow.config.