# 🧬 PyHIV: A Python Package for Local HIV‑1 Sequence Alignment, Subtyping and Gene Splitting ![CI](https://github.com/anaapspereira/PyHIV/actions/workflows/ci.yml/badge.svg) [![codecov](https://codecov.io/gh/anaapspereira/PyHIV/branch/main/graph/badge.svg)](https://codecov.io/gh/anaapspereira/PyHIV) ![Python Version](https://img.shields.io/pypi/pyversions/pyhiv-tools) ![OS Supported](https://img.shields.io/badge/OS-Windows%20%7C%20Linux%20%7C%20macOS-blue) ![PyPI version](https://img.shields.io/pypi/v/pyhiv-tools) ![Documentation Status](https://readthedocs.org/projects/pyhiv/badge/?version=latest) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) ![GitHub issues](https://img.shields.io/github/issues/anaapspereira/PyHIV) --- ## πŸ“– Overview **PyHIV** is a Python tool that aligns HIV nucleotide sequences against reference genomes to determine the **most similar subtype** and optionally **split the aligned sequences into gene regions**. It produces: - Best reference alignment per sequence - Subtype and reference metadata - Gene-region–specific FASTA files (optional) - A final summary table (`final_table.tsv`) --- ## βš™οΈ How It Works ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ User FASTA sequences β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β–Ό Read and preprocess input β”‚ β–Ό Align sequences against reference genomes β”‚ β–Ό Identify best matching reference β”‚ β–Ό (Optional) Split by gene region β”‚ β–Ό Save results and summary table (.tsv) ``` --- ## πŸ“¦ Installation You can install PyHIV using pip: ```bash pip install pyhiv-tools ``` Alternatively, you can clone the repository and install it manually: ```bash git clone https://github.com/anaapspereira/PyHIV.git cd PyHIV python setup.py install ``` ## πŸš€ Getting Started Basic usage: ```python from pyhiv import PyHIV PyHIV( fastas_dir="path/to/fasta/files", subtyping=True, splitting=True, output_dir="results_folder", n_jobs=4 ) ``` ### Parameters: | Parameter | Type | Default | Description | | ------------ | ------ | ----------------- | -------------------------------------------------------------------------- | | `fastas_dir` | `str` | *Required* | Directory containing user FASTA files. | | `subtyping` | `bool` | `True` | Aligns against subtype reference genomes. If `False`, aligns only to HXB2. | | `splitting` | `bool` | `True` | Splits aligned sequences into gene regions. | | `output_dir` | `str` | `"PyHIV_results"` | Output directory for results. | | `n_jobs` | `int` | `None` | Number of parallel jobs for alignment. | ### πŸ“‚ Output Structure After running PyHIV, your output directory (default: PyHIV_results/) will contain: ``` PyHIV_results/ β”‚ β”œβ”€β”€ best_alignment_.fasta # Alignment to best reference β”œβ”€β”€ final_table.tsv # Summary of results β”‚ β”œβ”€β”€ gag/ β”‚ β”œβ”€β”€ _gag.fasta β”‚ └── ... β”œβ”€β”€ pol/ β”‚ β”œβ”€β”€ _pol.fasta β”‚ └── ... └── env/ β”œβ”€β”€ _env.fasta └── ... ``` ### Final Table Columns: | Column | Description | | ----------------------------- | ----------------------------------------------- | | **Sequence** | Input sequence name | | **Reference** | Best matching reference accession | | **Subtype** | Predicted HIV-1 subtype | | **Most Matching Gene Region** | Region with highest similarity | | **Present Gene Regions** | All detected gene regions with valid alignments | --- ## πŸ“Ÿ Command Line Interface PyHIV provides a user-friendly CLI for HIV-1 sequence analysis. ### πŸš€ Getting Started ```bash # Basic usage pyhiv run sequences/ # With custom options pyhiv run sequences/ -o results/ -j 4 -v # Validate inputs first pyhiv validate sequences/ ``` ### βš™οΈ Main Options | Option | Description | |--------|-------------| | `--subtyping` / `--no-subtyping` | Enable/disable HIV-1 subtyping (default: enabled) | | `--splitting` / `--no-splitting` | Enable/disable gene region splitting (default: enabled) | | `-o`, `--output-dir PATH` | Output directory (default: `PyHIV_results`) | | `-j`, `--n-jobs INTEGER` | Number of parallel jobs (default: all CPUs) | | `-v`, `--verbose` | Detailed output | | `-q`, `--quiet` | Suppress non-error output | ### πŸ’Ό Common Use Cases **Full analysis with subtyping and splitting:** ```bash pyhiv run data/sequences/ ``` **Alignment only:** ```bash pyhiv run data/sequences/ --no-subtyping --no-splitting ``` **Parallel processing:** ```bash pyhiv run data/sequences/ -j 8 -o results/batch1/ ``` **Validation:** ```bash pyhiv validate data/sequences/ ``` ### πŸ“€ Output PyHIV generates: - `final_table.tsv` - Summary with sequence IDs, references, subtypes, and gene regions - `best_alignment_*.fasta` - Best alignment for each sequence - Gene-specific folders (when `--splitting` is enabled) with extracted regions ### πŸ†˜ Getting Help ```bash pyhiv --help # Show all commands pyhiv run --help # Show options for run command pyhiv --version # Show version ``` For comprehensive documentation, see [CLI_README.md](CLI.md). --- ## πŸ—‚οΈ Citation Manuscript in preparation. Please cite this repository if you use PyHIV in your research. --- ## 🧾 License This project is licensed under the MIT License β€” see the LICENSE file for details.