# Data Processing

## Main Workflow
**File:** `main.nf`
The entry point controls all sub-workflows. It automatically determines read file locations based on whether the default example data or custom data is being used:
- If the sample table matches the default example, reads are resolved relative to `$baseDir`.
- If a custom sample table is provided, reads are resolved relative to `$launchDir`.

## 1. Alignment Workflow (ALIGN)
**File:** `workflows/alignment.nf`
Processes raw FASTQ reads into aligned count matrices.
### Steps:
1. **Validate Sample Table** (`validate-sample-table.py`)
   - Ensures `fastq_filepath` column is present.
   - Auto-generates integer `sample_id` indices.
   - Validates that `control_status` column exists with ≥2 `beads_only` samples (when Z-score analysis is enabled).
2. **Validate Peptide Table** (`validate-peptide-table.py`)
   - Ensures `oligo` column is present.
   - Auto-generates `peptide_id` indices if not provided.
3. **Generate FASTA Reference** (`generate-fasta.py`)
   - Converts peptide oligo sequences to FASTA format (uppercase characters only).
4. **Generate Bowtie2 Index** (`bowtie2-build`)
   - Builds a Bowtie2 index from the peptide FASTA reference.
5. **Nanopore Alignment** (`templates/nanopore_alignment.sh`)
   - Tool: **Bowtie2** with configurable options (default: `--local --very-sensitive-local`).
   - Aligns each sample's FASTQ reads against the peptide index.
   - Output: SAM files per sample.
6. **SAM to Counts** (`templates/sam_to_counts.sh`)
   - Tools: **SAMtools** (`view`, `sort`, `index`, `idxstats`).
   - Converts SAM → sorted BAM → indexed BAM → per-peptide read counts.
7. **SAM to Stats** (`templates/sam_to_stats.sh`)
   - Tool: **SAMtools** `stats`.
   - Extracts alignment statistics (raw total sequences, reads mapped, etc.).
8. **Collect PhIP Data** (`merge-counts-stats.py`)
   - Merges all per-sample counts and stats into a single `phippery` xarray dataset (`.phip` format).
   - Computes: `percent_mapped`, `percent_peptides_detected`, `percent_peptides_between_10_and_100`.
9. **Replicate Counts** (`replicate-counts.py`, optional)
   - When `replicate_sequence_counts = true`, aggregates counts for peptides sharing the same oligo sequence.
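
The three summary metrics computed in step 8 are simple percentages. The sketch below shows the arithmetic in plain Python; it is not the code in `merge-counts-stats.py`. The function name `sample_qc_metrics` is hypothetical, and treating the 10 to 100 window as inclusive on both ends is an assumption.

```python
def sample_qc_metrics(peptide_counts, raw_total_sequences, reads_mapped):
    """Summarize one sample's alignment results as percentages.

    peptide_counts: per-peptide read counts for a single sample.
    raw_total_sequences, reads_mapped: values as reported by `samtools stats`.
    """
    n_peptides = len(peptide_counts)
    detected = sum(1 for c in peptide_counts if c > 0)
    # Assumption: the window is inclusive, i.e. 10 <= count <= 100.
    mid_range = sum(1 for c in peptide_counts if 10 <= c <= 100)
    return {
        "percent_mapped": 100.0 * reads_mapped / raw_total_sequences,
        "percent_peptides_detected": 100.0 * detected / n_peptides,
        "percent_peptides_between_10_and_100": 100.0 * mid_range / n_peptides,
    }


metrics = sample_qc_metrics(
    [0, 5, 10, 100, 250], raw_total_sequences=1000, reads_mapped=365
)
```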

## 2. Statistics Workflow (STATS)
**File:** `workflows/statistics.nf`
Applies normalization and statistical enrichment modeling using `phippery`.
### Steps:
1. **Counts Per Million (CPM)**
   - Normalizes raw counts to counts per million using `phippery.normalize.counts_per_million()`.
2. **Size Factors**
   - Estimates size factors for each sample using `phippery.normalize.size_factors()`.
3. **CPM Fold Enrichment** (optional, `run_cpm_enr_workflow = true`)
   - Computes fold enrichment of each sample's CPM relative to library controls using `phippery.normalize.enrichment()`.
4. **Z-Score Fit-Predict** (optional, `run_zscore_fit_predict = true`)
   - Tool: `fit-predict-zscore.py` using `phippery.modeling.zscore()`.
   - Fits a regression model on bead-only controls and predicts Z-scores for all samples.
   - Parameters: `min_Npeptides_per_bin=300`, quantile limits `0.05–0.95`.
5. **Merge Binary Datasets**
   - Merges all enrichment layer outputs into a single consolidated `.phip` dataset using `phippery merge`.
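
To make the CPM step concrete, the sketch below shows the normalization arithmetic on plain Python lists. It illustrates the formula only and is not the `phippery` implementation; the function name and the dict-of-lists layout are assumptions made for this example.

```python
def counts_per_million(counts):
    """Scale each sample's raw counts so that they sum to one million.

    counts: dict mapping sample_id -> list of per-peptide raw counts.
    """
    cpm = {}
    for sample_id, sample_counts in counts.items():
        total = sum(sample_counts)
        if total == 0:
            # A sample with no reads cannot be scaled; keep zeros.
            cpm[sample_id] = [0.0] * len(sample_counts)
        else:
            cpm[sample_id] = [c * 1_000_000 / total for c in sample_counts]
    return cpm


raw = {"s1": [10, 30, 60], "s2": [0, 0, 0]}
cpm = counts_per_million(raw)
```

Because each sample is divided by its own total, CPM values are comparable across samples with different sequencing depths.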

## 3. Output Workflow (DSOUT)
**File:** `workflows/output.nf`
Exports the merged dataset in multiple formats.

| Output | Condition | Format |
| :--- | :--- | :--- |
| Binary pickle | `output_pickle_xarray = true` | `.phip` |
| Wide CSV | `output_wide_csv = true` | Gzipped CSV per enrichment layer |
| Tall CSV | `output_tall_csv = true` | Gzipped tall-format CSV |

The wide CSV output produces files like `dataset_counts.csv.gz`, `dataset_cpm.csv.gz`, `dataset_zscore.csv.gz`, etc.
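
The relationship between the wide and tall layouts can be sketched as follows. This assumes a wide file with peptide identifiers in the first column and one column per sample; the column names `peptide_id` and `sample_id` and the helper `wide_to_tall` are hypothetical, and the pipeline's actual tall output may carry additional annotation columns.

```python
import csv
import io


def wide_to_tall(wide_text, value_name):
    """Reshape a wide (peptides x samples) CSV into one row per pair."""
    reader = csv.reader(io.StringIO(wide_text))
    header = next(reader)
    sample_ids = header[1:]  # first column is assumed to hold peptide IDs
    rows = []
    for line in reader:
        peptide_id, values = line[0], line[1:]
        for sample_id, value in zip(sample_ids, values):
            rows.append(
                {
                    "peptide_id": peptide_id,
                    "sample_id": sample_id,
                    value_name: float(value),
                }
            )
    return rows


wide = "peptide_id,s1,s2\np1,1.5,0.2\np2,4.0,3.9\n"
tall = wide_to_tall(wide, "zscore")
```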

## 4. FHIR Report Workflow (FHIR)
**File:** `workflows/fhir_report.nf`

Generates HL7 FHIR R4 transaction bundles from PhIP-Seq Z-score data.
### Steps:
1. **CREATE_FHIR** (`bin/phipseq_to_fhir.py`)
   - Reads Z-score CSV and sample table.
   - For each sample, generates a FHIR Bundle containing:
     - **Patient** resource
     - **Specimen** resource (serum for PhIP-Seq)
     - **Organization** resource
     - **Practitioner** / **PractitionerRole** resources
     - **Observation** resources per peptide (Z-score values, interpreted as positive when the Z-score is > 3.5 and negative otherwise)
   - Method coded as SNOMED CT `708049000` (Phage immunoprecipitation sequencing).
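
A per-peptide Observation might look roughly like the dictionary built below. This is a minimal sketch, not the output of `bin/phipseq_to_fhir.py`: the function name, the `code.text` wording, the `unit` string, and the `urn:uuid:` references are all assumptions. The SNOMED CT code and the 3.5 cutoff are taken from the description above, and the interpretation codes `POS`/`NEG` come from the standard HL7 v3 ObservationInterpretation code system.

```python
def zscore_observation(peptide_name, zscore, patient_ref, specimen_ref,
                       threshold=3.5):
    """Build a FHIR R4 Observation dict for one peptide Z-score."""
    positive = zscore > threshold  # strictly greater than the cutoff
    return {
        "resourceType": "Observation",
        "status": "final",
        # Hypothetical free-text code; a real report may use a coded concept.
        "code": {"text": f"PhIP-Seq Z-score for {peptide_name}"},
        "subject": {"reference": patient_ref},
        "specimen": {"reference": specimen_ref},
        "valueQuantity": {"value": zscore, "unit": "Z-score"},
        "interpretation": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/"
                          "v3-ObservationInterpretation",
                "code": "POS" if positive else "NEG",
                "display": "Positive" if positive else "Negative",
            }]
        }],
        "method": {
            "coding": [{
                "system": "http://snomed.info/sct",
                "code": "708049000",
                "display": "Phage immunoprecipitation sequencing",
            }]
        },
    }


obs = zscore_observation("peptide_42", 5.1, "urn:uuid:patient-1",
                         "urn:uuid:specimen-1")
borderline = zscore_observation("peptide_43", 3.5, "urn:uuid:patient-1",
                                "urn:uuid:specimen-1")
```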

## 5. FDR Workflow (FDR)
**File:** `workflows/fdr.nf`
Performs false discovery rate analysis on Z-scores.
### Steps:
1. **Z-score FDR Analysis** (`zscore_fdr_analysis`)
   - Converts Z-scores to two-sided p-values using the normal distribution CDF.
   - Applies **Benjamini-Hochberg** FDR correction per sample.
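
The two operations above can be sketched with the standard library alone. The function names are illustrative and not those used in `workflows/fdr.nf`; the p-value uses the identity 2·(1 − Φ(|z|)) = erfc(|z|/√2), and the correction is the textbook Benjamini-Hochberg step-up procedure applied to one sample's p-values.

```python
import math


def two_sided_pvalue(z):
    """Two-sided p-value of a Z-score under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


zscores = [0.0, 1.96, 4.5]
pvals = [two_sided_pvalue(z) for z in zscores]
qvals = benjamini_hochberg(pvals)
```

Running the correction per sample, as the workflow does, means the number of tests `m` is the number of peptides in that sample rather than the size of the whole matrix.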

## 6. Virus Score Workflow (VIRUSSCORE)
**File:** `workflows/virusscore.nf`
Calculates per-species virus exposure scores based on enriched peptide hits.
### Steps:
1. **Split Hits** — Extracts significant hit matrices per sample.
2. **Calculate Scores** (`bin/calc_scores_nofilter.py`)
   - Groups peptides by species.
   - Applies novel epitope filtering: a peptide is counted only if it does not share a subsequence of length ≥ `virus_score_epitope_len` (default: 7 amino acids) with any previously assigned peptide.
   - Output: Per-sample virus score CSV files.
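
The novel epitope filter described in step 2 can be sketched as a greedy pass over one species' hit peptides. This is an illustration under stated assumptions, not the logic of `bin/calc_scores_nofilter.py`: "subsequence" is taken to mean a contiguous stretch, peptides are processed in the order given, and the score is the count of accepted peptides. Two peptides share a contiguous stretch of length ≥ k exactly when they share at least one k-mer, which is what the set intersection checks.

```python
def kmers(sequence, k):
    """All contiguous substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def virus_score(hit_peptides, epitope_len=7):
    """Count hits that share no k-mer with previously accepted hits."""
    seen = set()  # k-mers from peptides already counted
    score = 0
    for seq in hit_peptides:
        peptide_kmers = kmers(seq, epitope_len)
        if peptide_kmers & seen:
            continue  # overlaps an earlier epitope, so not counted
        score += 1
        seen |= peptide_kmers
    return score


hits = ["ACDEFGHIKL", "WWDEFGHIKWW", "MNPQRSTVWY"]
score = virus_score(hits)
```

Here the second peptide shares `DEFGHIK` with the first and is skipped, so only two peptides contribute.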

## 7. IEDB Annotation Workflow (IEDB)
**File:** `workflows/iedb_annotation.nf`
Cross-references enriched peptides with the Immune Epitope Database.
### Steps:
1. **Extract Sample Names** — Parses column headers from Z-score file.
2. **IEDB Annotation** (per sample)
   - Loads IEDB TSV database and extracts valid epitope sequences.
   - For each peptide, performs substring matching against the IEDB epitope set.
   - Classifies peptides as:
     - **Known** — matches at least one IEDB epitope.
     - **Novel** — no IEDB match found.
     - **Significant** — Z-score ≥ threshold (default: 3.5).
   - Outputs per sample:
     - `{sample}_annotated_peptides.csv`
     - `{sample}_novel_peptides.csv`
     - `{sample}_significant_epitopes.csv`
     - `{sample}_annotation_summary.txt`
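
The classification in step 2 can be sketched as below. The function name and return layout are hypothetical. The sketch assumes the match direction is "IEDB epitope occurs inside the peptide sequence" and treats significance as a flag independent of the Known/Novel label; the source does not spell out either point.

```python
def annotate_peptide(peptide_seq, zscore, iedb_epitopes, threshold=3.5):
    """Label a peptide as Known/Novel and flag it if significant."""
    # Assumption: an epitope matches when it is a substring of the peptide.
    matches = sorted(e for e in iedb_epitopes if e in peptide_seq)
    return {
        "status": "Known" if matches else "Novel",
        "significant": zscore >= threshold,
        "matched_epitopes": matches,
    }


epitopes = {"DEFGHIK", "QRSTV"}
known = annotate_peptide("ACDEFGHIKL", 4.2, epitopes)
novel = annotate_peptide("WWWWWWWWWW", 1.0, epitopes)
```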

## 8. Visualization Workflow (VISUALIZE)
**File:** `workflows/visualization.nf`
Generates interactive Plotly heatmaps.
### Outputs:
1. **Virus Score Heatmap** (`virus_score_heatmap.html`)
   - Merges per-sample virus score files.
2. **Z-Score Heatmap** (`zscore_heatmap.html`)
   - Per-organism dropdown selector.
   - Heatmap of Z-scores: Peptide Oligos × Samples.
   - Hover shows peptide name, sample ID, and Z-score value.

## 9. Streamlit Dashboard Workflow (STREAMLIT)
**File:** `workflows/streamlit.nf`
Creates an interactive 3D protein visualization web application.
### Steps:
1. **Generate JSmol HTML** (`GENERATE_JSMOL_HTML`)
   - Parses Z-score data and peptide metadata with PDB IDs.
   - Copies local PDB/CIF structure files.
   - Generates per-structure HTML files with epitope annotations.
   - Exports `epitope_data.csv` for the Streamlit app.
2. **Create Streamlit App** (`CREATE_STREAMLIT_APP`)
   - Generates a complete Streamlit application with:
     - **Mol\*** 3D web component for interactive protein structure viewing.
     - Sequence-to-structure mapping.
     - Per-sample epitope selection with color-coded highlighting.

## 10. Neutralization Prediction Workflow (NEUTRALIZATION_PREDICTION)
**File:** `workflows/neutralization_score.nf`
Predicts the neutralizing antibody potential of detected epitopes using multi-factor scoring. This workflow is in development.
### Steps:
1. **Parse 3D Structures**
   - Computes **Solvent-Accessible Surface Area (SASA)** using the Shrake-Rupley algorithm.
   - Normalizes B-factors per chain.
   - Caches chain sequences and coordinates for matching epitopes.
2. **Calculate Peptide Scores**
   - **Conservation Score**: Log-weighted IEDB epitope match count.
   - **IEDB Evidence Score**: Shannon entropy of source organisms.
   - **Epitope Coverage Score**: Fraction of protein positions with IEDB hits.
   - **Neutralization DB Score**: Tiered scoring for exact/substring/kmer matches in neutralization database.
   - **Structural Features**:
     - SASA normalization (0–1 scale per residue type).
     - B-factor (normalized per chain).
     - 3D centroid coordinates of matched epitope region.
   - **Sequence Properties**:
     - **GRAVY** (Kyte-Doolittle hydropathy): influences surface accessibility multiplier.
     - **Flexibility Bonus**: loop-forming residues (G, P, S, T).
     - **N-Glycosylation Penalty**: counts NxS/T motifs, reduces score up to 0.75×.
   - **Context Factor**: 1.12× for spike protein, 1.06× for envelope/membrane proteins.
3. **Sample-Level Scoring**
   - For each Z-score hit ≥ threshold (default: 3.5) per sample:
     - Converts Z-score to PhIP signal.
     - Combines static scores with Q-value using a weighted linear combination (default weights: 0.15 conservation, 0.25 PhIP, 0.1 IEDB, 0.25 neut DB, 0.15 coverage, 0.1 B-factor).
     - Applies multipliers: GRAVY, glycan, context, flexibility bonus.
     - Filters by SASA (hard threshold: SASA < 0.05 sets the score to 0).
   - Prediction category:
     - **High**: composite ≥ 3.0 (default threshold).
     - **Moderate**: composite ≥ 1.95 (65% of the High threshold).
     - **Low**: composite < 1.95.
4. **Spatial Clustering**
   - Groups peptides by sample + PDB ID; clusters epitope coordinates within 8.0 Å (configurable).
   - Annotates conformational epitope clusters.
5. **Outputs**
   - `neutralization_scores_per_sample.csv`
   - `high_confidence_candidates.csv`
   - `neutralization_summary.txt`
   - `detailed_analysis.json`
   - `conformational_epitope_clusters.csv`
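
The sequence properties listed under "Calculate Peptide Scores" are standard quantities and can be sketched as follows. GRAVY is the mean Kyte-Doolittle hydropathy over the sequence. The sequon pattern uses the conventional N-x-S/T definition with x ≠ proline, which is an assumption about how the workflow counts motifs. The function names are hypothetical, and the mapping from these values to the pipeline's multipliers is not shown because the source does not specify it.

```python
import re

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def count_n_glyc_sequons(sequence):
    """Count N-x-S/T motifs (x != P), including overlapping ones."""
    # The lookahead matches at every start position without consuming text.
    return len(re.findall(r"(?=N[^P][ST])", sequence))


def loop_residue_fraction(sequence):
    """Fraction of residues that are G, P, S, or T."""
    return sum(aa in "GPST" for aa in sequence) / len(sequence)


peptide = "NGSAKNPTLV"
properties = {
    "gravy": gravy(peptide),
    "sequons": count_n_glyc_sequons(peptide),
    "loop_fraction": loop_residue_fraction(peptide),
}
```

In the example peptide, `NGS` counts as a sequon while `NPT` does not, because proline in the middle position is excluded.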

## Workflow Parameters
All parameters are defined in `nextflow.config`.