PyHIV Command Line Interface
A comprehensive command-line interface for HIV-1 sequence alignment, subtyping, and gene region splitting.
Installation
Install PyHIV using pip:
pip install pyhiv-tools
Or install from source:
git clone https://github.com/anaapspereira/pyhiv.git
cd pyhiv
pip install -e .
Verify installation:
pyhiv --version
Getting Started
Run PyHIV with default settings:
pyhiv run /path/to/fastas/
This will:
Align sequences with reference genomes
Perform HIV-1 subtyping
Split sequences into gene regions
Save results to PyHIV_results/
Commands
pyhiv run
Main command to process HIV-1 sequences.
pyhiv run [OPTIONS] FASTAS_DIR
Arguments:
FASTAS_DIR: Directory containing input FASTA files (required)
pyhiv validate
Validate input directory without processing.
pyhiv validate FASTAS_DIR
Checks:
Directory exists and is readable
FASTA files are present
Lists found files (up to 10)
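The checks above can be approximated in a few lines of Python. This is only a sketch of the idea, not PyHIV's actual implementation: the function names find_fasta_files and summarize are hypothetical, the extension set is taken from the Supported Formats list in this document, and case-insensitive extension matching is an assumption.

```python
from pathlib import Path

# Extensions from "Supported Formats"; case-insensitive matching is an assumption.
FASTA_EXTENSIONS = {".fasta", ".fa", ".fna", ".ffn"}


def find_fasta_files(fastas_dir):
    """Return a sorted list of FASTA files under fastas_dir, searched recursively."""
    root = Path(fastas_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"Not a directory: {root}")
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in FASTA_EXTENSIONS
    )


def summarize(fastas_dir, limit=10):
    """Build a short report: the file count plus at most `limit` file names."""
    files = find_fasta_files(fastas_dir)
    if not files:
        return "No FASTA files found"
    lines = [f"Found {len(files)} FASTA file(s)"]
    lines += [f"  {p.name}" for p in files[:limit]]
    if len(files) > limit:
        lines.append("  ...")  # signal that the listing was truncated
    return "\n".join(lines)
```

Calling print(summarize("sequences/")) gives a quick preview of what a run would pick up, without invoking PyHIV itself.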
Options
Processing Options
| Option | Default | Description |
|---|---|---|
| --subtyping / --no-subtyping | Enabled | Enable/disable HIV-1 subtyping |
| --splitting / --no-splitting | Enabled | Enable/disable gene region splitting |
Output Options
| Option | Default | Description |
|---|---|---|
| -o | PyHIV_results/ | Output directory for results |
Performance Options
| Option | Default | Description |
|---|---|---|
| -j | All CPUs | Number of parallel jobs |
Display Options
| Option | Description |
|---|---|
| -v | Enable detailed output |
| -q | Suppress all non-error output |
| --version | Show version and exit |
| --help | Show help message and exit |
Usage Examples
Basic Usage
Default processing:
pyhiv run sequences/
Custom output directory:
pyhiv run sequences/ -o my_results/
Parallel processing with 8 jobs:
pyhiv run sequences/ -j 8
Advanced Options
Alignment only (no subtyping or splitting):
pyhiv run sequences/ --no-subtyping --no-splitting
Subtyping without gene splitting:
pyhiv run sequences/ --no-splitting
Verbose output with timing:
pyhiv run sequences/ -v
Quiet mode for scripting:
pyhiv run sequences/ -q
Validation
Check input files before processing:
pyhiv validate sequences/
Example output:
✓ Found 15 FASTA file(s)
Files:
  • sequence1.fasta
  • sequence2.fasta
  • sequence3.fa
  ...
Pipeline Examples
Complete workflow:
# 1. Validate inputs
pyhiv validate data/raw_sequences/
# 2. Process with verbose output
pyhiv run data/raw_sequences/ -o results/run1/ -v
# 3. Process subset without splitting
pyhiv run data/subset/ -o results/run2/ --no-splitting -j 4
Integration with shell scripts:
#!/bin/bash
INPUT_DIR="sequences/"
OUTPUT_DIR="results_$(date +%Y%m%d_%H%M%S)"

# Validate first
if pyhiv validate "$INPUT_DIR"; then
    echo "Validation passed, starting processing..."
    pyhiv run "$INPUT_DIR" -o "$OUTPUT_DIR" -j 8
else
    echo "Validation failed!"
    exit 1
fi
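The same validate-then-run pattern can be driven from Python with the standard subprocess module. The sketch below assumes the pyhiv executable is on PATH and that pyhiv validate exits with a non-zero code on failure, as the shell script above implies; the helper names build_commands and validate_then_run are hypothetical and not part of PyHIV.

```python
import subprocess
from datetime import datetime


def build_commands(input_dir, output_dir, jobs=8):
    """Return the validate and run command lines as argument lists."""
    validate_cmd = ["pyhiv", "validate", input_dir]
    run_cmd = ["pyhiv", "run", input_dir, "-o", output_dir, "-j", str(jobs)]
    return validate_cmd, run_cmd


def validate_then_run(input_dir, output_dir=None, jobs=8):
    """Run validation first; start processing only if it exits with code 0."""
    if output_dir is None:
        # Timestamped directory, mirroring OUTPUT_DIR in the shell script
        output_dir = "results_" + datetime.now().strftime("%Y%m%d_%H%M%S")
    validate_cmd, run_cmd = build_commands(input_dir, output_dir, jobs)
    if subprocess.run(validate_cmd).returncode != 0:
        print("Validation failed!")
        return 1
    print("Validation passed, starting processing...")
    return subprocess.run(run_cmd).returncode
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.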
Input Requirements
Supported Formats
PyHIV accepts FASTA files with the following extensions:
.fasta
.fa
.fna (nucleic acid)
.ffn (nucleotide coding regions)
Directory Structure
sequences/
├── sample1.fasta
├── sample2.fa
├── sample3.fasta
└── subfolder/
    └── sample4.fasta
PyHIV recursively searches for FASTA files in all subdirectories.
File Requirements
Valid FASTA format
HIV-1 sequences (DNA or RNA)
Sequence IDs should be unique
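Because sequence IDs should be unique, it can help to check for duplicates before a run. The following is a minimal sketch with hypothetical helper names; it assumes the ID is the first whitespace-delimited token of each header line, which is a common FASTA convention but not something this document specifies.

```python
from collections import Counter


def read_fasta_ids(path):
    """Yield the ID (first token after '>') of every record in a FASTA file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                yield header.split()[0] if header else ""


def find_duplicate_ids(paths):
    """Return a sorted list of sequence IDs that occur more than once across paths."""
    counts = Counter(seq_id for path in paths for seq_id in read_fasta_ids(path))
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)
```

An empty list means no ID is repeated across the given files.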
Output Structure
Default Output (PyHIV_results/)
PyHIV_results/
├── final_table.tsv               # Summary table
├── best_alignment_sample1.fasta  # Best alignments
├── best_alignment_sample2.fasta
├── gag/                          # Gene regions (if --splitting)
│   ├── sample1_gag.fasta
│   └── sample2_gag.fasta
├── pol/
│   ├── sample1_pol.fasta
│   └── sample2_pol.fasta
└── env/
    └── ...
Output Files
final_table.tsv
Summary table with columns:
| Column | Description |
|---|---|
| Sequence | Input sequence ID |
| Reference | Best matching reference accession |
| Subtype | HIV-1 subtype (if subtyping is enabled) |
| Most Matching Gene Region | Gene with most matches |
| Present Gene Regions | All detected gene regions |
Example:
Sequence  Reference  Subtype  Most Matching Gene Region  Present Gene Regions
seq001    K03455     B        pol                        gag, pol, env
seq002    AF004885   C        env                        pol, env
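A table with these columns can be summarized with Python's standard csv module. The sketch below assumes the file is tab-separated with the header row shown above; read_table and count_subtypes are hypothetical helper names.

```python
import csv
from collections import Counter


def read_table(handle):
    """Parse a tab-separated table with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def count_subtypes(rows):
    """Count how many sequences were assigned to each value of the Subtype column."""
    return Counter(row["Subtype"] for row in rows)


# Example usage (the path assumes the default output directory):
# with open("PyHIV_results/final_table.tsv", newline="", encoding="utf-8") as fh:
#     print(count_subtypes(read_table(fh)))
```

For the two example rows above, the counts would be one sequence each for subtypes B and C.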
Alignment Files
best_alignment_<sequence_id>.fasta: Contains the reference and query alignment
Format: Multi-FASTA with reference sequence and aligned query
Gene Region Files
When --splitting is enabled:
Organized by gene (gag, pol, env, etc.)
One file per sequence per gene
Contains extracted gene region from alignment
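To get a quick overview of what was extracted, the per-gene subdirectories can be counted. This is a sketch that assumes the layout shown under Default Output (one subdirectory per gene, files ending in .fasta); count_gene_files is a hypothetical helper, not part of PyHIV.

```python
from pathlib import Path


def count_gene_files(results_dir):
    """Map each subdirectory name (assumed to be a gene) to its number of .fasta files."""
    root = Path(results_dir)
    return {
        sub.name: sum(1 for _ in sub.glob("*.fasta"))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


# Example usage:
# for gene, n in count_gene_files("PyHIV_results").items():
#     print(f"{gene}: {n} sequence(s)")
```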
Advanced Usage
Performance Tuning
Optimize for large datasets:
# Use all CPUs
pyhiv run sequences/ -j -1
# Limit to 4 cores to avoid memory issues
pyhiv run sequences/ -j 4
Memory considerations:
Each job loads reference sequences
Reduce the -j value if you encounter memory errors
Process in batches for very large datasets
Batch Processing
# Process multiple directories
for dir in batch1/ batch2/ batch3/; do
    pyhiv run "$dir" -o "results_$(basename "$dir")" -q
done
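The loop above presumes the batch directories already exist. One possible way to create them from a flat list of FASTA files is sketched below; make_batches and stage_batches are hypothetical helpers, and copying (rather than linking) the files is a choice made here for portability.

```python
import shutil
from pathlib import Path


def make_batches(files, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


def stage_batches(files, dest_dir, batch_size):
    """Copy files into dest_dir/batch1, dest_dir/batch2, ... and return those directories.

    Files are copied under their base name, so inputs from different
    subfolders that share a name would overwrite each other.
    """
    dest = Path(dest_dir)
    batch_dirs = []
    for n, batch in enumerate(make_batches(list(files), batch_size), start=1):
        batch_dir = dest / f"batch{n}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for src in batch:
            shutil.copy2(src, batch_dir / Path(src).name)
        batch_dirs.append(batch_dir)
    return batch_dirs
```

Each returned directory can then be passed to pyhiv run as in the loop above.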
Integration with Other Tools
Export to CSV:
pyhiv run sequences/ -o results/
python -c "import pandas as pd; df = pd.read_csv('results/final_table.tsv', sep='\t'); df.to_csv('results.csv', index=False)"
Filter by subtype:
pyhiv run sequences/ -o results/ -v
awk -F'\t' '$3 == "B"' results/final_table.tsv > subtype_B.tsv
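Note that the awk filter above drops the header row, since the header's third field is not "B". If the header should be kept, a small Python filter is one alternative. This is a sketch using only the standard csv module; filter_by_subtype is a hypothetical helper, and it assumes the input has the header row described under final_table.tsv.

```python
import csv


def filter_by_subtype(src_handle, dst_handle, subtype):
    """Copy the header and all rows whose Subtype matches; return the number of rows kept."""
    reader = csv.DictReader(src_handle, delimiter="\t")
    writer = csv.DictWriter(
        dst_handle,
        fieldnames=reader.fieldnames,  # reuse the input header as-is
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if row["Subtype"] == subtype:
            writer.writerow(row)
            kept += 1
    return kept


# Example usage:
# with open("results/final_table.tsv", newline="", encoding="utf-8") as src, \
#      open("subtype_B.tsv", "w", newline="", encoding="utf-8") as dst:
#     filter_by_subtype(src, dst, "B")
```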
Troubleshooting
Common Issues
No FASTA files found:
Error: No FASTA files found in the input directory.
Check file extensions (must be .fasta, .fa, .fna, or .ffn)
Verify directory path is correct
Use pyhiv validate to diagnose
Output directory exists:
Warning: Output directory 'PyHIV_results' already exists. Files may be overwritten.
Existing files may be overwritten
Use -o to specify a different directory
Or remove the existing directory:
rm -rf PyHIV_results/
Memory errors:
MemoryError: Unable to allocate array
Reduce parallel jobs: -j 2 or -j 1
Process sequences in smaller batches
Close other applications
Import errors:
Error: Could not import PyHIV module
Verify installation: pip list | grep pyhiv-tools
Reinstall: pip install --force-reinstall pyhiv-tools
Check Python version compatibility
Debug Mode
Enable verbose output for debugging:
pyhiv run sequences/ -v
This shows:
Version information
Number of input files
Processing parameters
Elapsed time
Generated output files
Full stack traces on errors
Getting Help
# Show all commands
pyhiv --help
# Show help for specific command
pyhiv run --help
pyhiv validate --help
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Error during processing |
| 130 | Interrupted by user (Ctrl+C) |
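When PyHIV is launched from another program, these codes can be mapped back to readable messages. The snippet below is a minimal sketch; describe_exit_code is a hypothetical helper, and any code not listed in the table is simply reported as undocumented.

```python
# Exit codes as listed in the table above
EXIT_CODES = {
    0: "Success",
    1: "Error during processing",
    130: "Interrupted by user (Ctrl+C)",
}


def describe_exit_code(code):
    """Return the documented meaning of an exit code, or a fallback message."""
    return EXIT_CODES.get(code, f"Undocumented exit code {code}")


# Example usage (assumes pyhiv is on PATH):
# import subprocess
# result = subprocess.run(["pyhiv", "run", "sequences/", "-q"])
# print(describe_exit_code(result.returncode))
```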
Performance Tips
Use validation first - pyhiv validate is fast and catches input errors
Adjust parallelism - Start with default (all CPUs), reduce if memory is limited
Disable unused features - Use --no-splitting if you only need alignments
Batch processing - For thousands of sequences, split into smaller batches
SSD storage - Use an SSD for the output directory to improve I/O performance
Contributing
Found a bug or have a feature request? Please open an issue on GitHub.
License
PyHIV is released under the MIT License. See LICENSE file for details.
Citation
If you use PyHIV in your research, please cite:
Manuscript in preparation. Please cite this repository if you use PyHIV in your research.