Data Processing
Main Workflow
File: main.nf
The entry point controls all sub-workflows. It automatically determines read file locations based on whether the default example data or custom data is being used:
If the sample table matches the default example, reads are resolved relative to $baseDir.
If a custom sample table is provided, reads are resolved relative to $launchDir.
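The resolution rule above can be sketched in Python as follows. This is only an illustration: the function name and the uses_default_example flag are hypothetical, and how main.nf actually detects the default sample table is not shown in this document.

```python
from pathlib import Path


def resolve_read_path(fastq_filepath: str, uses_default_example: bool,
                      base_dir: str, launch_dir: str) -> Path:
    """Resolve a read file against the pipeline or the launch directory.

    base_dir stands in for Nextflow's $baseDir and launch_dir for $launchDir.
    uses_default_example is a hypothetical flag for "the sample table matches
    the default example".
    """
    root = Path(base_dir) if uses_default_example else Path(launch_dir)
    return root / fastq_filepath
```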
1. Alignment Workflow (ALIGN)
File: workflows/alignment.nf
Processes raw FASTQ reads into aligned count matrices.
Steps:
Validate Sample Table (validate-sample-table.py)
Ensures the fastq_filepath column is present.
Auto-generates integer sample_id indices.
Validates that the control_status column exists with ≥2 beads_only samples (when Z-score analysis is enabled).
Validate Peptide Table (validate-peptide-table.py)
Ensures the oligo column is present.
Auto-generates peptide_id indices if not provided.
Generate FASTA Reference (generate-fasta.py)
Converts peptide oligo sequences to FASTA format (uppercase characters only).
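A minimal sketch of this conversion, assuming "uppercase characters only" means that lowercase characters in the oligo (for example adapter sequence) are dropped rather than upper-cased. The function name and the use of peptide_id as the FASTA header are illustrative choices, not taken from generate-fasta.py.

```python
def oligos_to_fasta(peptides: dict[int, str]) -> str:
    """Build FASTA text from a mapping of peptide_id -> oligo sequence.

    Assumption: only the uppercase characters of each oligo are kept.
    """
    records = []
    for peptide_id, oligo in peptides.items():
        sequence = "".join(base for base in oligo if base.isupper())
        records.append(f">{peptide_id}\n{sequence}\n")
    return "".join(records)
```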
Generate Bowtie2 Index (bowtie2-build)
Builds a Bowtie2 index from the peptide FASTA reference.
Nanopore Alignment (templates/nanopore_alignment.sh)
Tool: Bowtie2 with configurable options (default: --local --very-sensitive-local).
Aligns each sample’s FASTQ reads against the peptide index.
Output: SAM files per sample.
SAM to Counts (templates/sam_to_counts.sh)
Tools: SAMtools (view, sort, index, idxstats).
Converts SAM → sorted BAM → indexed BAM → per-peptide read counts.
SAM to Stats (templates/sam_to_stats.sh)
Tool: SAMtools stats.
Extracts alignment statistics (raw total sequences, reads mapped, etc.).
Collect PhIP Data (merge-counts-stats.py)
Merges all per-sample counts and stats into a single phippery xarray dataset (.phip format).
Computes: percent_mapped, percent_peptides_detected, percent_peptides_between_10_and_100.
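The three summary metrics could be computed per sample roughly as below. The exact definitions used by merge-counts-stats.py are not given here, so the formulas are assumptions inferred from the metric names: "detected" is taken as a count above zero, and the 10 to 100 range is treated as inclusive.

```python
def sample_qc_metrics(counts: list[int], raw_total_sequences: int,
                      reads_mapped: int) -> dict[str, float]:
    """Per-sample QC percentages from peptide counts and alignment stats.

    Assumed definitions (not confirmed against the pipeline code):
    detected means count > 0; the 10-100 range includes both endpoints.
    """
    n_peptides = len(counts)
    detected = sum(1 for c in counts if c > 0)
    mid_range = sum(1 for c in counts if 10 <= c <= 100)
    return {
        "percent_mapped": 100.0 * reads_mapped / raw_total_sequences,
        "percent_peptides_detected": 100.0 * detected / n_peptides,
        "percent_peptides_between_10_and_100": 100.0 * mid_range / n_peptides,
    }
```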
Replicate Counts (replicate-counts.py, optional)
When replicate_sequence_counts = true, aggregates counts for peptides sharing the same oligo sequence.
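One way to picture this aggregation is shown below. It assumes that "aggregates" means summing and that every peptide in a group of identical oligos then receives the group total; replicate-counts.py may handle this differently.

```python
from collections import defaultdict


def aggregate_replicate_counts(oligos: dict[int, str],
                               counts: dict[int, int]) -> dict[int, int]:
    """Give each peptide the summed count of all peptides with the same oligo.

    oligos maps peptide_id -> oligo sequence, counts maps peptide_id -> count.
    Summation is an assumption about what "aggregates" means.
    """
    totals: dict[str, int] = defaultdict(int)
    for peptide_id, oligo in oligos.items():
        totals[oligo] += counts.get(peptide_id, 0)
    return {peptide_id: totals[oligo] for peptide_id, oligo in oligos.items()}
```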
2. Statistics Workflow (STATS)
File: workflows/statistics.nf
Applies normalization and statistical enrichment modeling using phippery.
Steps:
Counts Per Million (CPM)
Normalizes raw counts to counts per million using phippery.normalize.counts_per_million().
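For reference, counts per million for one sample is each peptide count divided by the sample total and scaled by one million. The sketch below shows that arithmetic with plain Python lists; it is not phippery's implementation, which operates on the xarray dataset.

```python
def counts_per_million(counts: list[int]) -> list[float]:
    """Scale one sample's raw counts so that they sum to one million."""
    total = sum(counts)
    if total == 0:
        # An empty sample has no meaningful CPM; return zeros instead of dividing by zero.
        return [0.0] * len(counts)
    return [c / total * 1_000_000 for c in counts]
```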
Size Factors
Estimates size factors for each sample using phippery.normalize.size_factors().
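A common way to estimate size factors is the median-of-ratios method: each sample's factor is the median ratio of its counts to the per-peptide geometric mean across samples. The sketch below assumes that estimator; whether phippery.normalize.size_factors() behaves identically is not confirmed by this document.

```python
import math
from statistics import median


def size_factors(counts_by_sample: dict[str, list[int]]) -> dict[str, float]:
    """Median-of-ratios size factors (assumed estimator, see lead-in).

    Only peptides with a positive count in every sample contribute,
    because the geometric mean is undefined otherwise.
    """
    samples = list(counts_by_sample)
    n_peptides = len(counts_by_sample[samples[0]])
    log_geo_means = []
    for i in range(n_peptides):
        values = [counts_by_sample[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means.append((i, sum(math.log(v) for v in values) / len(values)))
    if not log_geo_means:
        raise ValueError("no peptide has a positive count in every sample")
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts_by_sample[s][i]) - lg for i, lg in log_geo_means]
        factors[s] = math.exp(median(log_ratios))
    return factors
```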
CPM Fold Enrichment (optional, run_cpm_enr_workflow = true)
Computes fold enrichment of each sample’s CPM relative to library controls using phippery.normalize.enrichment().
Z-Score Fit-Predict (optional, run_zscore_fit_predict = true)
Tool: fit-predict-zscore.py using phippery.modeling.zscore().
Fits a regression model on bead-only controls and predicts Z-scores for all samples.
Parameters: min_Npeptides_per_bin=300, quantile limits 0.05–0.95.
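The parameter names suggest a binning scheme: peptides are grouped by their abundance in the bead-only controls, and within each bin a mean and standard deviation are estimated from values inside the quantile limits. The sketch below illustrates that idea only. It is an assumption about the general approach, not a description of phippery.modeling.zscore(), and all function and argument names are hypothetical.

```python
from statistics import mean, pstdev


def _quantile(sorted_values: list[float], q: float) -> float:
    """Simple lower-index quantile on an already sorted list."""
    return sorted_values[int(q * (len(sorted_values) - 1))]


def binned_zscores(beads_mean: list[float], sample_values: list[float],
                   min_peptides_per_bin: int = 300,
                   lower_q: float = 0.05, upper_q: float = 0.95) -> list[float]:
    """Z-scores for one sample using bins defined on bead-only abundance."""
    order = sorted(range(len(beads_mean)), key=lambda i: beads_mean[i])
    n_bins = max(1, len(order) // min_peptides_per_bin)
    zscores = [0.0] * len(order)
    for b in range(n_bins):
        start = b * min_peptides_per_bin
        # The last bin absorbs any leftover peptides.
        stop = len(order) if b == n_bins - 1 else start + min_peptides_per_bin
        members = order[start:stop]
        values = sorted(sample_values[i] for i in members)
        low, high = _quantile(values, lower_q), _quantile(values, upper_q)
        trimmed = [v for v in values if low <= v <= high]
        mu, sigma = mean(trimmed), pstdev(trimmed)
        for i in members:
            zscores[i] = (sample_values[i] - mu) / sigma if sigma > 0 else 0.0
    return zscores
```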
Merge Binary Datasets
Merges all enrichment layer outputs into a single consolidated .phip dataset using phippery merge.
3. Output Workflow (DSOUT)
File: workflows/output.nf
Exports the merged dataset in multiple formats.
| Output | Condition | Format |
|---|---|---|
| Binary pickle | | |
| Wide CSV | | Gzipped CSV per enrichment layer |
| Tall CSV | | Gzipped tall-format CSV |
The wide CSV output produces files like dataset_counts.csv.gz, dataset_cpm.csv.gz, dataset_zscore.csv.gz, etc.
4. FHIR Report Workflow (FHIR)
File: workflows/fhir_report.nf
Generates HL7 FHIR R4 transaction bundles from PhIP-Seq Z-score data.
Steps:
CREATE_FHIR (bin/phipseq_to_fhir.py)
Reads Z-score CSV and sample table.
For each sample, generates a FHIR Bundle containing:
Patient resource
Specimen resource (serum for PhIP-Seq)
Organization resource
Practitioner / PractitionerRole resources
Observation resources per peptide (Z-score values with positive/negative interpretation at threshold > 3.5)
Method coded as SNOMED CT 708049000 (Phage immunoprecipitation sequencing).
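As an illustration of the per-peptide Observation described above, the sketch below builds one such resource as a Python dictionary. The element names are standard FHIR R4 Observation fields, and the SNOMED code and threshold come from the text. The function name, the "final" status, and the free-text code are assumptions; the actual output of phipseq_to_fhir.py may be structured differently.

```python
def zscore_observation(patient_id: str, peptide_name: str, zscore: float,
                       threshold: float = 3.5) -> dict:
    """Build a FHIR-style Observation dict for one peptide Z-score."""
    positive = zscore > threshold  # strict comparison, as in "threshold > 3.5"
    return {
        "resourceType": "Observation",
        "status": "final",  # assumed status
        "code": {"text": f"PhIP-Seq Z-score for {peptide_name}"},  # hypothetical label
        "subject": {"reference": f"Patient/{patient_id}"},
        "valueQuantity": {"value": zscore},
        "interpretation": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/v3-ObservationInterpretation",
                "code": "POS" if positive else "NEG",
                "display": "Positive" if positive else "Negative",
            }]
        }],
        "method": {
            "coding": [{
                "system": "http://snomed.info/sct",
                "code": "708049000",
                "display": "Phage immunoprecipitation sequencing",
            }]
        },
    }
```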
5. FDR Workflow (FDR)
File: workflows/fdr.nf
Performs false discovery rate analysis on Z-scores.
Steps:
Z-score FDR Analysis (zscore_fdr_analysis)
Converts Z-scores to two-sided p-values using the normal distribution CDF.
Applies Benjamini-Hochberg FDR correction per sample.
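Both steps can be written with the standard library alone. The sketch below uses the identity p = erfc(|z| / √2) for the two-sided normal p-value and the standard Benjamini-Hochberg step-up adjustment; the function names are illustrative and the pipeline's own code may rely on SciPy or statsmodels instead.

```python
import math


def two_sided_pvalue(z: float) -> float:
    """Two-sided p-value of a Z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Applied per sample, the p-values of all peptides in one sample column are passed to benjamini_hochberg together.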
6. Virus Score Workflow (VIRUSSCORE)
File: workflows/virusscore.nf
Calculates per-species virus exposure scores based on enriched peptide hits.
Steps:
Split Hits — Extracts significant hit matrices per sample.
Calculate Scores (bin/calc_scores_nofilter.py)
Groups peptides by species.
Applies novel epitope filtering: a peptide is counted only if it does not share a subsequence of length ≥ virus_score_epitope_len (default: 7 amino acids) with any previously assigned peptide.
Output: Per-sample virus score CSV files.
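The novel epitope filter for a single species can be sketched as a greedy pass over the hit peptides. Two sequences share a subsequence of length ≥ k exactly when they share at least one k-mer, so a set of seen k-mers is enough. The processing order of peptides and the function name are assumptions; calc_scores_nofilter.py may order or weight hits differently.

```python
def virus_score(peptide_sequences: list[str], epitope_len: int = 7) -> int:
    """Count hit peptides that share no k-mer with previously counted ones.

    peptide_sequences holds the enriched peptides of one species, in the
    order in which they should be considered (assumed to be given).
    """
    seen_kmers: set[str] = set()
    score = 0
    for seq in peptide_sequences:
        kmers = {seq[i:i + epitope_len] for i in range(len(seq) - epitope_len + 1)}
        if kmers & seen_kmers:
            continue  # overlaps an already counted peptide, so not novel
        seen_kmers |= kmers
        score += 1
    return score
```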
7. IEDB Annotation Workflow (IEDB)
File: workflows/iedb_annotation.nf
Cross-references enriched peptides with the Immune Epitope Database.
Steps:
Extract Sample Names — Parses column headers from Z-score file.
IEDB Annotation (per sample)
Loads IEDB TSV database and extracts valid epitope sequences.
For each peptide, performs substring matching against the IEDB epitope set.
Classifies peptides as:
Known — matches at least one IEDB epitope.
Novel — no IEDB match found.
Significant — Z-score ≥ threshold (default: 3.5).
Outputs per sample:
{sample}_annotated_peptides.csv
{sample}_novel_peptides.csv
{sample}_significant_epitopes.csv
{sample}_annotation_summary.txt
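The classification step can be illustrated as follows. The sketch assumes that a match means an IEDB epitope occurring as a substring of the peptide sequence; the function name and the returned field names are hypothetical.

```python
def annotate_peptide(peptide_seq: str, zscore: float, iedb_epitopes: set[str],
                     threshold: float = 3.5) -> dict:
    """Classify one peptide as Known or Novel and flag significance."""
    # Assumed match direction: the epitope is contained in the peptide.
    matches = sorted(e for e in iedb_epitopes if e in peptide_seq)
    return {
        "status": "Known" if matches else "Novel",
        "matched_epitopes": matches,
        "significant": zscore >= threshold,
    }
```

For large epitope sets, a linear scan per peptide becomes slow; indexing epitopes by k-mer is one possible optimization.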
8. Visualization Workflow (VISUALIZE)
File: workflows/visualization.nf
Generates interactive Plotly heatmaps.
Outputs:
Virus Score Heatmap (virus_score_heatmap.html)
Merges per-sample virus score files.
Z-Score Heatmap (zscore_heatmap.html)
Per-organism dropdown selector.
Heatmap of Z-scores: Peptide Oligos × Samples.
Hover shows peptide name, sample ID, and Z-score value.
9. Streamlit Dashboard Workflow (STREAMLIT)
File: workflows/streamlit.nf
Creates an interactive 3D protein visualization web application.
Steps:
Generate JSmol HTML (GENERATE_JSMOL_HTML)
Parses Z-score data and peptide metadata with PDB IDs.
Copies local PDB/CIF structure files.
Generates per-structure HTML files with epitope annotations.
Exports epitope_data.csv for the Streamlit app.
Create Streamlit App (CREATE_STREAMLIT_APP)
Generates a complete Streamlit application with:
Mol* 3D web component for interactive protein structure viewing.
Sequence-to-structure mapping.
Per-sample epitope selection with color-coded highlighting.
10. Neutralization Prediction Workflow (NEUTRALIZATION_PREDICTION)
File: workflows/neutralization_score.nf
Predicts neutralizing antibody potential of detected epitopes using multi-factor scoring (In development).
Steps:
Parse 3D Structures
Computes Solvent-Accessible Surface Area (SASA) using Shrake-Rupley algorithm.
Extracts B-factors and normalizes them per chain.
Caches chain sequences and coordinates for matching epitopes.
Calculate Peptide Scores
Conservation Score: Log-weighted IEDB epitope match count.
IEDB Evidence Score: Shannon entropy of source organisms.
Epitope Coverage Score: Fraction of protein positions with IEDB hits.
Neutralization DB Score: Tiered scoring for exact/substring/kmer matches in neutralization database.
Structural Features:
SASA normalization (0–1 scale per residue type).
B-factor (normalized per chain).
3D centroid coordinates of matched epitope region.
Sequence Properties:
GRAVY (Kyte-Doolittle hydropathy): influences surface accessibility multiplier.
Flexibility Bonus: loop-forming residues (G, P, S, T).
N-Glycosylation Penalty: counts NxS/T motifs, reduces score up to 0.75×.
Context Factor: 1.12× for spike protein, 1.06× for envelope/membrane proteins.
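Two of the sequence properties above are simple to compute directly. The sketch below calculates GRAVY as the mean Kyte-Doolittle hydropathy and counts N-glycosylation sequons. It assumes the conventional sequon definition N-x-S/T with x not proline, whereas the text only says "NxS/T". How these values are turned into multipliers is not specified here and is not shown.

```python
import re

# Kyte-Doolittle hydropathy index per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value of the residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def count_nglyc_motifs(sequence: str) -> int:
    """Count N-x-S/T sequons, including overlapping ones.

    Assumption: x may be any residue except proline.
    """
    return len(re.findall(r"(?=N[^P][ST])", sequence))
```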
Sample-Level Scoring
For each Z-score hit ≥ threshold (default: 3.5) per sample:
Converts Z-score to PhIP signal.
Combines static scores with Q-value using weighted linear combination (default weights: 0.15 conservation, 0.25 PhIP, 0.1 IEDB, 0.25 neut DB, 0.15 coverage, 0.1 B-factor).
Applies multipliers: GRAVY, glycan, context, flexibility bonus.
Filters by SASA (hard threshold: <0.05 → score × 0.0).
Prediction category:
High: composite ≥ 3.0 (default threshold).
Moderate: composite ≥ 1.95 (65% threshold).
Low: composite < 1.95.
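The scoring steps above can be sketched as a weighted sum followed by multipliers, a SASA gate, and threshold-based categorization. The weights and thresholds are the defaults listed in the text. The component names, their scales, and the treatment of every adjustment as a plain multiplier are assumptions, since the document does not define them precisely.

```python
DEFAULT_WEIGHTS = {
    "conservation": 0.15, "phip": 0.25, "iedb": 0.10,
    "neut_db": 0.25, "coverage": 0.15, "bfactor": 0.10,
}


def composite_score(components: dict[str, float], multipliers: list[float],
                    sasa: float, weights: dict[str, float] = DEFAULT_WEIGHTS,
                    min_sasa: float = 0.05) -> float:
    """Weighted linear combination, scaled by multipliers, gated by SASA."""
    if sasa < min_sasa:
        return 0.0  # hard SASA filter: buried epitopes score zero
    score = sum(weights[name] * components[name] for name in weights)
    for factor in multipliers:  # e.g. GRAVY, glycan, context, flexibility
        score *= factor
    return score


def prediction_category(composite: float, high_threshold: float = 3.0,
                        moderate_threshold: float = 1.95) -> str:
    """Map a composite score to High, Moderate, or Low.

    The default moderate threshold is 65% of the default high threshold.
    """
    if composite >= high_threshold:
        return "High"
    if composite >= moderate_threshold:
        return "Moderate"
    return "Low"
```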
Spatial Clustering
Groups peptides by sample + PDB ID; clusters epitope coordinates within 8.0 Å (configurable).
Annotates conformational epitope clusters.
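A minimal sketch of the distance-based grouping for one sample and PDB ID is given below. It assumes single-linkage clustering, meaning two epitopes end up in the same cluster if a chain of pairwise distances within the cutoff connects them; the workflow's actual clustering rule is not stated, and the function name is hypothetical.

```python
import math


def cluster_epitopes(centroids: list[tuple[float, float, float]],
                     max_distance: float = 8.0) -> list[int]:
    """Return a cluster label per centroid using single-linkage grouping."""
    n = len(centroids)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(centroids[i], centroids[j]) <= max_distance:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive integers in order of first appearance.
    labels: dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]
```

Clusters with more than one member are candidates for annotation as conformational epitopes.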
Outputs
neutralization_scores_per_sample.csv
high_confidence_candidates.csv
neutralization_summary.txt
detailed_analysis.json
conformational_epitope_clusters.csv
Workflow Parameters
All parameters are defined in nextflow.config.