Frustratometer Functions
Overview
The frustratometer package includes several submodules each designed to handle different aspects of the frustration analysis pipeline. Below are the main modules and their descriptions.
Protein Database Handling
pdb: PDB functions This module includes functions to download and manipulate PDB files.
If you want to download a PDB file from the RCSB database, use the “download” function.
If you want to extract the structure’s sequence, use the “get_sequence” function.
If you want to extract the distance matrix of the structure’s contacts, use the “get_distance_matrix” function.
If you want to map the full sequence residue positions to the aligned sequence residue positions (may be needed if applying a distance threshold in DCA-related calculations), use the “full_to_filtered_aligned_mapping” function.
- frustratometer.pdb.download(pdbID: str, directory: Path | str = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/frustratometer/checkouts/latest/docs')) Path[source]
Downloads a single pdb file
- Parameters:
pdbID (str,) – PDB ID
directory (Path or str,) – Directory where PDB file will be downloaded.
- Returns:
pdb_file – PDB file location.
- Return type:
Path
- frustratometer.pdb.get_sequence(pdb_file: str, chain: str) str[source]
Get a protein sequence from a pdb file
- Parameters:
pdb_file (str,) – PDB file location.
chain (str,) – Chain ID of the selected protein.
- Returns:
sequence – Protein sequence.
- Return type:
- frustratometer.pdb.get_distance_matrix(pdb_file: Path | str, chain: str, method: str = 'CB') array[source]
Calculate the distance matrix of the specified atoms in a PDB file.
- Parameters:
pdb_file (Path or str) – The path to the PDB file.
chain (str) – The chainID or chainIDs (space separated) of the protein.
method (str) – The method to use for calculating the distance matrix. Defaults to ‘CB’, which uses the CB atom for all residues except GLY, which uses the CA atom. Options:
‘CA’ for using only the CA atom, ‘minimum’ for using the minimum distance between all atoms in each residue, ‘CB_force’ computes a new coordinate for the CB atom based on the CA, C, and N atoms and uses CB distance even for glycine.
Returns – np.array: The distance matrix of the selected atoms.
Raises – IndexError: If the selection of atoms is empty. ValueError: If the method is not recognized.
- frustratometer.pdb.full_to_filtered_aligned_mapping(aligned_sequence: str, filtered_aligned_sequence: str) dict[source]
Get a dictionary mapping residue positions in the full pdb sequence to the aligned pdb sequence
- Parameters:
aligned_sequence (str,) – Raw aligned PDB sequence.
filtered_aligned_sequence (str,) – Filtered aligned PDB sequence (columns with insertions and deletions, i.e. dashes, that are typically filtered in MSA file processing are removed)
- Returns:
full_to_aligned_index_dict – Dictionary
- Return type:
- frustratometer.pdb.repair_pdb(pdb_file: str, chain: str, pdb_directory: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/frustratometer/checkouts/latest/docs')) <MagicMock id='131595671639376'>[source]
Repairs a pdb or cif file using pdbfixer. Note that a pdb file will be produced, regardless of input file format
- Parameters:
pdb_file (str,) – PDB file location.
chain (str,) – Chain ID
pdb_directory (str,) – PDB file location
- Returns:
fixer – Repaired PDB Object
- Return type:
Protein Family Analysis
pfam: PFAM functions
This module includes functions to download and manipulate PFAM alignments.
There are two options to obtain alignments from PFAM: downloading the database locally or downloading specific alignments.
If you want to download the database locally, use the “download_database” function to download or update the database. Then, use the “get_alignment” function to select a single alignment from the downloaded database.
If you want to retrieve a single alignment from the PFAM database, use the “download_alignment” function
Examples:
$ frustratometer.pfam.database(‘/media/databases’, https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.uniprot.gz)
- Todo:
Create a function to get single alignments
- frustratometer.pfam.download_database(path: Path | str, name: str = 'PFAM_current', url: Path | str = 'https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.uniprot.gz') Path[source]
Downloads and creates a pfam database in the Database folder
- Parameters:
path (str) – Adress of pfam database
name (str) – Name of the new database folder
- Returns:
alignments_path – Path of the local database of alignments
- Return type:
Path
- frustratometer.pfam.get_alignment(pfamid: str, database_path: Path | str) Path[source]
Retrieves a pfam family alignment from local database
- Parameters:
pfamid (str) – ID of PFAM family. ex: PF00001
database_path (str) – Address of local database
- Returns:
alignment_file – location of alignment
- Return type:
Path
- frustratometer.pfam.download_aligment(pfamid: str, output_file: Path | str, alignment_type: str = 'full') bytes[source]
Retrieves a pfam family alignment from interpro
- Parameters:
pfamid (str,) – ID of PFAM family. ex: PF00001
output_file (Path) – location of the output file. Default: Temporary file
alignment_type (str,) – alignment type to retrieve. Options: full, seed, uniprot
- Returns:
output_file – Output file (in stockholm format)
- Return type:
Path
Sequence Alignment
Align functions
- frustratometer.align.jackhmmer(sequence, output_file, database, log=-3, dry_run: bool = False, **kwargs) Path[source]
Generates alignment using jackhmmer
- Parameters:
sequence (str) – Protein sequence
output_file (str) – If True, will record alignment output messages; Arguments: True OR False (Default is False)
database (str) – Location of the sequence database used to generation MSA
log (File handle) – jackhmmer output Default:DEVNULL
dry_run (bool (default: False)) – Save the temporary fasta_file and print the command instead of running
**kwargs – Other arguments that can be passed to jackhmmer. More information can be found by executing jackhmmer -h arguments without a value such as –noali should be passed as noali=True Common kwargs:
N: number of iterations E: E-value threshold
- Returns:
output_file – Path of the alignment file
- Return type:
Path
MSA Filtering
Filter functions
- frustratometer.filter.filter_alignment(alignment_file: Path | str, output_file: str, alignment_format: str = 'stockholm') Path[source]
Filter stockholm alignment by removing unaligned sections. Filters by saving the alignment into memory.
- Parameters:
alignment_file (Path or str) – Alignment file name
output_file (str) – Output file name
alignment_format (str) – Biopython alignment format (Default: stockholm)
- Returns:
output_file – location of the output_file
- Return type:
Path
- frustratometer.filter.filter_alignment_lowmem(alignment_file: str, output_file: str, alignment_format: str = 'stockholm') Path[source]
Filter stockholm alignment by removing unaligned sections. Filters by reading the file without saving to memory.
- Parameters:
alignment_file (str) – location of file
output_file (str) – location of the output_file
alignment_format (str) – Biopython alignment format (Default: stockholm)
- Returns:
output_file – location of the output_file
- Return type:
Mapping Utilities
map: MAP functions
This module contains functions that identify the PFAM ID of the protein family associated with the given PDB.
Direct Coupling Analysis
DCA functions
- frustratometer.dca.pydca.plmdca(filtered_alignment_file: str, sequence_type: str = 'protein', seqid: float = 0.8, lambda_h: float = 1.0, lambda_J: float = 20.0, num_threads: int = 10, max_iterations: int = 500, verbose: bool = False, msa_file_format: str = 'fasta') dict[source]
Frustration Analysis
frustration: Frustration functions
This module includes functions related to frustration index, as well as native and decoy energy calculations. This module also features functions such as “write_tcl_script” and “call_vmd,” which can be used to visualize frustration patterns onto a PDB structure.
- frustratometer.frustration.compute_mask(distance_matrix: array, maximum_contact_distance: float | None = None, minimum_sequence_separation: int | None = None) array[source]
Computes a 2D Boolean mask from a given distance matrix based on a distance cutoff and a sequence separation cutoff.
- Parameters:
distance_matrix (np.array) – A 2D array where the element at index [i, j] represents the spatial distance between residues i and j. This matrix is assumed to be symmetric.
maximum_contact_distance (float, optional) – The maximum distance of a contact. Pairs of residues with distances less than this threshold are marked as True in the mask. If None, the spatial distance criterion is ignored and all distances are included. Default is None.
minimum_sequence_separation (int, optional) – A minimum sequence distance threshold. Pairs of residues with sequence indices differing by at least this value are marked as True in the mask. If None, the sequence distance criterion is ignored. Default is None.
- Returns:
mask – A 2D Boolean array of the same dimensions as distance_matrix. Elements of the mask are True where the residue pairs meet the specified distance_cutoff and sequence_distance_cutoff criteria.
- Return type:
np.array
Examples
>>> import numpy as np >>> dm = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]]) >>> print(compute_mask(dm, distance_cutoff=1.5, sequence_distance_cutoff=1)) [[False True False] [ True False True] [False True False]]
Todo
Add chain information for sequence separation
- frustratometer.frustration.compute_native_energy(seq: str, potts_model: dict, mask: array, ignore_gap_couplings: bool = False, ignore_gap_fields: bool = False) float[source]
Computes the native energy of a protein sequence based on a given Potts model and an interaction mask.
\[E = \sum_i h_i + \frac{1}{2} \sum_{i,j} J_{ij} \Theta_{ij}\]- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
ignore_couplings_of_gaps (bool, optional) – If True, couplings involving gaps (‘-’) in the sequence are set to 0 in the energy calculation. Default is False.
ignore_fields_of_gaps (bool, optional) – If True, fields corresponding to gaps (‘-’) in the sequence are set to 0 in the energy calculation. Default is False.
- Returns:
energy – The computed energy of the protein sequence based on the Potts model and the interaction mask.
- Return type:
Examples
>>> seq = "ACDEFGHIKLMNPQRSTVWY" >>> potts_model = { 'h': np.random.rand(20, 20), # Random fields 'J': np.random.rand(20, 20, 20, 20) # Random couplings } >>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs >>> energy = compute_native_energy(seq, potts_model, mask) >>> print(f"Computed energy: {energy:.2f}")
Notes
The energy is computed as the sum of the fields and the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.
Todo
Optimize the computation.
- frustratometer.frustration.compute_fields_energy(seq: str, potts_model: dict, ignore_fields_of_gaps: bool = False) float[source]
Computes the fields energy of a protein sequence based on a given Potts model.
\[E = \sum_i h_i\]- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
ignore_fields_of_gaps (bool, optional) – If True, fields corresponding to gaps (‘-’) in the sequence are set to 0 in the energy calculation. Default is False.
- Returns:
fields_energy – The computed fields energy of the protein sequence based on the Potts model
- Return type:
Examples
>>> seq = "ACDEFGHIKLMNPQRSTVWY" >>> potts_model = { 'h': np.random.rand(20, 20), # Random fields 'J': np.random.rand(20, 20, 20, 20) # Random couplings } >>> fields_energy = compute_fields_energy(seq, potts_model) >>> print(f"Computed fields energy: {fields_energy:.2f}")
- frustratometer.frustration.compute_couplings_energy(seq: str, potts_model: dict, mask: array, ignore_couplings_of_gaps: bool = False) float[source]
Computes the couplings energy of a protein sequence based on a given Potts model and an interaction mask.
\[E = \frac{1}{2} \sum_{i,j} J_{ij} \Theta_{ij}\]- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
ignore_couplings_of_gaps (bool, optional) – If True, couplings involving gaps (‘-’) in the sequence are set to 0 in the energy calculation. Default is False.
- Returns:
couplings_energy – The computed couplings energy of the protein sequence based on the Potts model and the interaction mask.
- Return type:
Examples
>>> seq = "ACDEFGHIKLMNPQRSTVWY" >>> potts_model = { 'h': np.random.rand(20, 20), # Random fields 'J': np.random.rand(20, 20, 20, 20) # Random couplings } >>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs >>> couplings_energy = compute_couplings_energy(seq, potts_model, mask) >>> print(f"Computed couplings energy: {couplings_energy:.2f}")
Notes
The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.
Todo
Optimize the computation.
- frustratometer.frustration.compute_sequences_energy(seqs: list, potts_model: dict, mask: array, split_couplings_and_fields=False) array[source]
Computes the energy of multiple protein sequences based on a given Potts model and an interaction mask.
\[E = \sum_i h_i + \frac{1}{2} \sum_{i,j} J_{ij} \Theta_{ij}\]- Parameters:
seqs (list) – List of amino acid sequences in string format, separated by commas. The sequences are assumed to be in one-letter code. Gaps are represented as ‘-’. The length of each sequence (L) should all match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequences and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequences.
split_couplings_and_fields (bool, optional) – If True, two lists of the sequences’ couplings and fields energies are returned. Default is False.
- Returns:
energy (if split_couplings_and_fields==False) (float) – The computed energies of the protein sequences based on the Potts model and the interaction mask.
fields_couplings_energy (if split_couplings_and_fields==True) (np.array) – Array containing computed fields and couplings energies of the protein sequences based on the Potts model and the interaction mask.
Examples
>>> seq_list = ["ACDEFGHIKLMNPQRSTVWY","AKLWYMNPQRSTCDEFGHIV"] >>> potts_model = { 'h': np.random.rand(20, 20), # Random fields 'J': np.random.rand(20, 20, 20, 20) # Random couplings } >>> mask = np.ones((len(seq_list[0]), len(seq_list[0])), dtype=bool) # Include all pairs >>> energies = compute_sequences_energy(seq_list, potts_model, mask) >>> print(f"Sequence 1 energy: {energies[0]:.2f}") >>> print(f"Sequence 2 energy: {energies[1]:.2f}")
Notes
The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.
Todo
Optimize the computation.
- frustratometer.frustration.compute_singleresidue_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) array[source]
Computes a (Lx21) matrix for a sequence of length L. Row i contains all possible changes in energy upon mutating residue i.
\[\Delta H_i = \Delta h_i + \sum_k \Delta j_{ik}\]- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
- Returns:
decoy_energy – (Lx21) matrix describing the energetic changes upon mutating a single residue.
- Return type:
np.array
Examples
>>> seq = "ACDEFGHIKLMNPQRSTVWY" >>> potts_model = { 'h': np.random.rand(20, 20), # Random fields 'J': np.random.rand(20, 20, 20, 20) # Random couplings } >>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs >>> decoy_energy = compute_singleresidue_decoy_energy_fluctuation(seq, potts_model, mask) >>> print(f"Matrix of Residue Decoy Energy Fluctuations: "); print(decoy_energy) >>> print(f"Matrix Size: "); print(shape(decoy_energy))
Notes
The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.
Todo
Optimize the computation.
- frustratometer.frustration.compute_mutational_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) array[source]
Computes a (LxLx21x21) matrix for a sequence of length L. Matrix[i,j] describes all possible changes in energy upon mutating residue i and j simultaneously.
\[\Delta H_{ij} = H_i - H_{i'} + H_{j}-H_{j'} + J_{ij} -J_{ij'} + J_{i'j'} - J_{i'j} + \sum_k {J_{ik} - J_{i'k} + J_{jk} -J_{j'k}}\]- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
- Returns:
decoy_energy2 – (LxLx21x21) matrix describing the energetic changes upon mutating two residues simultaneously.
- Return type:
np.array
Examples
>>> seq = "ACDEFGHIKLMNPQRSTVWY" >>> potts_model = { 'h': np.random.rand(20, 20), # Random fields 'J': np.random.rand(20, 20, 20, 20) # Random couplings } >>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs >>> decoy_energy2 = compute_mutational_decoy_energy_fluctuation(seq, potts_model, mask) >>> print(f"Matrix of Contact Mutational Decoy Energy Fluctuations: "); print(decoy_energy2) >>> print(f"Matrix Size: "); print(shape(decoy_energy2))
Notes
The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.
Todo
Optimize the computation.
- frustratometer.frustration.compute_configurational_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) array[source]
Computes a (LxLx21x21) matrix for a sequence of length L. Matrix[i,j] describes all possible changes in energy upon mutating and altering the local densities of residue i and j simultaneously.
\[\Delta H_{ij} = H_i - H_{i'} + H_{j}-H_{j'} + J_{ij} -J_{ij'} + J_{i'j'} - J_{i'j} + \sum_k {J_{ik} - J_{i'k} + J_{jk} -J_{j'k}}\]- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
- Returns:
decoy_energy2 – (LxLx21x21) matrix describing the energetic changes upon mutating and altering the local densities of two residues simultaneously.
- Return type:
np.array
Examples
>>> seq = "ACDEFGHIKLMNPQRSTVWY" >>> potts_model = { 'h': np.random.rand(20, 20), # Random fields 'J': np.random.rand(20, 20, 20, 20) # Random couplings } >>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs >>> decoy_energy2 = compute_configurational_decoy_energy_fluctuation(seq, potts_model, mask) >>> print(f"Matrix of Contact Configurational Decoy Energy Fluctuations: "); print(decoy_energy2) >>> print(f"Matrix Size: "); print(shape(decoy_energy2))
Notes
The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.
Todo
Optimize the computation.
- frustratometer.frustration.compute_contact_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) array[source]
$$ Delta DCA_{ij} = Delta j_{ij} $$ :param seq: :param potts_model: :param mask: :return:
- frustratometer.frustration.compute_decoy_energy(seq: str, potts_model: dict, mask: array, kind='singleresidue') array[source]
Computes all possible decoy energies.
- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
kind (str) – Kind of decoys generated. Options: “singleresidue,” “mutational,” “configurational,” and “contact.”
- Returns:
decoy_energy – Matrix describing all possible decoy energies.
- Return type:
np.array
Examples
>>> seq = "ACDEFGHIKLMNPQRSTVWY" >>> potts_model = { 'h': np.random.rand(20, 20), # Random fields 'J': np.random.rand(20, 20, 20, 20) # Random couplings } >>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs >>> kind = "singleresidue" >>> decoy_energy = compute_decoy_energy(seq, potts_model, mask, kind) >>> print(f"Matrix of Single Residue Decoy Energo: "); print(decoy_energy2) >>> print(f"Matrix Size: "); print(shape(decoy_energy2))
Notes
The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.
Todo
Optimize the computation.
- frustratometer.frustration.compute_aa_freq(seq, include_gaps=True)[source]
Calculates amino acid frequencies in given sequence
- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-‘.
include_gaps (bool) – If True, frequencies of gaps (‘-’) in the sequence are set to 0. Default is True.
- Returns:
aa_freq – Array of frequencies of all 21 possible amino acids within sequence
- Return type:
np.array
- frustratometer.frustration.compute_contact_freq(seq)[source]
Calculates contact frequencies in given sequence
- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-‘.
- Returns:
contact_freq – 21x21 array of frequencies of all possible contacts within sequence.
- Return type:
np.array
- frustratometer.frustration.compute_single_frustration(decoy_fluctuation, aa_freq=None, correction=0)[source]
Calculates single residue frustration indices
- Parameters:
decoy_fluctuation (np.array) – (Lx21) matrix for a sequence of length L, describing the energetic changes upon mutating a single residue.
aa_freq (np.array) – Array of frequencies of all 21 possible amino acids within sequence
- Returns:
frustration – Array of length L featuring single residue frustration indices.
- Return type:
np.array
- frustratometer.frustration.compute_pair_frustration(decoy_fluctuation, contact_freq: None | array, correction=0) array[source]
Calculates pair residue frustration indices
- Parameters:
decoy_fluctuation (np.array) – (LxLx21x21) matrix for a sequence of length L, describing the energetic changes upon mutating two residues simultaneously.
contact_freq (np.array) – 21x21 array of frequencies of all possible contacts within sequence.
- Returns:
contact_frustration – LxL array featuring pair frustration indices (mutational or configurational frustration, depending on decoy_fluctuation matrix provided)
- Return type:
np.array
- frustratometer.frustration.compute_scores(potts_model: dict) array[source]
Computes contact scores based on the Frobenius norm
\[CN[i,j] = \frac{F[i,j] - F[i,:] * F[:,j}{F[:,:]}\]- Parameters:
potts_model (dict) – Potts model containing the couplings in the “J” key
- Returns:
corr_norm – Contact score matrix (N x N)
- Return type:
np.array
- frustratometer.frustration.compute_roc(scores, distance_matrix, cutoff)[source]
Computes Receiver Operating Characteristic (ROC) curve of predicted and true contacts (identified from the distance matrix).
- Parameters:
scores (np.array) – Contact score matrix (N x N)
distance_matrix (np.array) – LxL array for sequence of length L, describing distances between contacts
cutoff (float) – Distance cutoff for contacts
- Returns:
roc_score – Array containing lists of false and true positive rates
- Return type:
np.array
- frustratometer.frustration.compute_auc(roc_score)[source]
Computes Area Under Curve (AUC) of calculated ROC distribution
- Parameters:
roc_score (np.array) – Array containing lists of false and true positive rates
- Returns:
auc – AUC value
- Return type:
- frustratometer.frustration.plot_roc(roc_score)[source]
Plot ROC distribution
- Parameters:
roc_score (np.array) – Array containing lists of false and true positive rates
- frustratometer.frustration.plot_singleresidue_decoy_energy(decoy_energy, native_energy, method='clustermap')[source]
Plot comparison of single residue decoy energies, relative to the native energy
- Parameters:
decoy_energy (np.array) – Lx21 array of decoy energies
native_energy (float) – Native energy value
method (str) – Options: “clustermap”, “heatmap”
- frustratometer.frustration.write_tcl_script(pdb_file: Path | str, chain: str, mask: array, distance_matrix: array, distance_cutoff: float, single_frustration: array, pair_frustration: array, tcl_script: Path | str = 'frustration.tcl', max_connections: int = None, movie_name: Path | str = None, still_image_name: Path | str = None) Path | str[source]
Writes a tcl script that can be run with VMD to superimpose the frustration patterns onto the corresponding PDB structure.
- Parameters:
pdb_file (Path or str) – pdb file name
chain (str) – Select chain from pdb
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
distance_matrix (np.array) – LxL array for sequence of length L, describing distances between contacts
distance_cutoff (float) – Maximum distance at which a contact occurs
single_frustration (np.array) – Array containing single residue frustration index values
pair_frustration (np.array) – Array containing pair (ex. configurational, mutational, contact) frustration index values
tcl_script (Path or str) – Output tcl script file with static structure
max_connections (int) – Maximum number of pair frustration values visualized in tcl file
movie_name (Path or str) – Output movie file with rotating structure
still_image_name (Path or str) – Output image file with still image
- Returns:
tcl_script – tcl script file
- Return type:
Path or str
- frustratometer.frustration.call_vmd(pdb_file: Path | str, tcl_script: Path | str)[source]
Calls VMD program with given pdb file and tcl script to visualize frustration patterns
- Parameters:
pdb_file (Path or str) – pdb file name
tcl_script (Path or str) – Output tcl script file with static structure
- frustratometer.frustration.canvas(with_attribution=True)[source]
Placeholder function to show example docstring (NumPy format).
Replace this function and doc string for your own project.
- Parameters:
with_attribution (bool, Optional, default: True) – Set whether or not to display who the quote is from.
- Returns:
quote – Compiled string including quote and optional attribution.
- Return type:
- frustratometer.frustration.make_decoy_seqs(seq, ndecoys=1000)[source]
Creates permutated, decoy sequences using a given sequence residue composition and length.
- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
n_decoys (int) – Number of sequence decoys to create
- Returns:
decoy_seqs – List of decoy sequences. The sequences are assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequences (L) should match the dimensions of the Potts model.
- Return type:
- frustratometer.frustration.compute_fragment_mask(mask: array, fragment_pos: array) array[source]
Creates a mask for a sequence fragment such that: - position i belongs to the fragment, all j - position j belongs to the fragment, all i
The new mask consider all the interactions within the fragment and also the interactions between the fragment and other sequence positions.
- masknp.array
A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
- fragment_pos: np:array
Array of sequence positions selected.
- fragment_mask: np.array
New 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
- frustratometer.frustration.compute_fragment_total_native_energy(seq: str, potts_model: dict, mask: array, fragment_pos: None | array = None, fragment_in_context=False) float[source]
Calculates the energy for the complete protein or for a fragment in context
- seqstr
The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
- potts_modeldict
A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
- masknp.array
A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
- fragment_pos: np:array
Array of sequence positions selected.
- fragment_in_context: bool
If True, the energetics calculations take into account the interactions between the fragment and other sequence positions
- energy: float
Native energy of the protein
- frustratometer.frustration.compute_fragment_total_decoy_energy(decoy_seqs: list, potts_model: dict, mask: array, fragment_pos: None | array = None, fragment_in_context=False, split_couplings_and_fields=False, config_decoys=False, msa_mask=1) array[source]
Calculates decoy energies for the complete protein or for a fragment in context
- decoy_seqslist
List of decoy sequences. The sequences are assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequences (L) should match the dimensions of the Potts model.
- potts_modeldict
A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
- masknp.array
A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
- fragment_pos: np:array
Array of sequence positions selected.
- fragment_in_context: bool
If True, the energetics calculations take into account the interactions between the fragment and other sequence positions
- split_couplings_and_fields: bool
Separate output into coupling and local fields contribution to energy.
- config_decoys: bool
If True, use the configurational decoys approximation, shuffling index positions for configurational decoys energy calculation. If False, mutational decoys.
- msa_mask: np.array
Extra mask to use a Multiple Sequence Alignment that do not cover completely the reference PDB
- energy: np.array
Decoy energies
- frustratometer.frustration.compute_total_frustration(seq, potts_model, mask, ndecoys=1000, config_decoys=False, msa_mask=1, fragment_pos=None, fragment_in_context=False, output_kind='frustration')[source]
Calculates the total frustration of a protein fragment.
- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
n_decoys (int) – Number of sequence decoys to create
config_decoys (bool) – If True, use the configurational decoys approximation, shuffling index positions for configurational decoys energy calculation. If False, mutational decoys.
msa_mask (np.array) – Extra mask to use a Multiple Sequence Alignment that do not cover completely the reference PDB
fragment_pos (np.array) – Fragment positions. If None, use the complete model
fragment_in_context (bool) – If True, the energetics calculations take into account the interactions between the fragment and other sequence positions
output_kind (str) – If ‘frustration’, returns frustration. If not, returns native energy, decoy energy average and decoy energy standard deviation.
- Returns:
total_frustration (float) – Total frustration of the fragment or complete protein
native_energy (float) – Native energy of the given sequence
decoy_energy_average (float) – Average of the decoy energy distribution
decoy_energy_std (float) – Standard deviation of the decoy energy distribution
- frustratometer.frustration.compute_native_h_J(seq: str, potts_model: dict, mask: array) tuple[source]
Computes the applied fields h_i(a_i) and J_{ij}(a_i,a_j) for the residues a_i of the sequence
- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
- Returns:
h (np.array) – Values of the local field for the sequence.
j (np.array) – Values of the coupling field for the sequence.
- frustratometer.frustration.compute_decoy_h_J(decoy_seqs: list, potts_model: dict, mask: array, config_decoys: bool = False) tuple[source]
Computes the applied fields h_i(a_i) and J_{ij}(a_i,a_j) for the residues a_i of a set of decoy sequences
- Parameters:
decoy_seqs (list) – List of decoy sequences. The sequences are assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequences (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
config_decoys (bool) – If True, use the configurational decoys approximation, shuffling index positions for configurational decoys energy calculation. If False, mutational decoys.
- Returns:
h (np.array) – Values of the local field for each decoy sequence.
j (np.array) – Values of the coupling field for each decoy sequence.
- frustratometer.frustration.compute_native_fragment_energy_from_h_j(fragment_pos: array, h: array, j: array, mask: array) float[source]
Computes the energy from the applied fields h_i(a_i) and J_{ij}(a_i,a_j)
- Parameters:
fragment_pos (np:array) – Array of sequence positions selected.
h (np.array) – Values of the local field for each decoy sequence.
j (np.array) – Values of the coupling field for each decoy sequence.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
- Returns:
energy – Native energy of the protein
- Return type:
- frustratometer.frustration.compute_decoy_fragment_energy_from_h_j(fragment_pos: array, h: array, j: array, mask: array) tuple[source]
Computes the energy from the applied fields h_i(a_i) and J_{ij}(a_i,a_j) for each decoy sequence.
- Parameters:
fragment_pos (np:array) – Array of sequence positions selected.
h (np.array) – Values of the local field for each decoy sequence.
j (np.array) – Values of the coupling field for each decoy sequence.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
- Returns:
energy_average (float) – Average of the decoy energies
energy_std (float) – Standard deviation of the decoy energies
- frustratometer.frustration.compute_energy_sliding_window(seq: str, potts_model: dict, mask: array, win_size: int, ndecoys: int, config_decoys: bool) dict[source]
Computes the total frustration, the native energy, the decoy average energy and the decoy standard deviation for fragments on a sliding window
- Parameters:
seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
win_size (int) – Size of the sliding window
ndecoys (int) – Number of decoy sequences to use
config_decoys (bool) – If True, use the configurational decoys approximation, shuffling index positions for configurational decoys energy calculation. If False, mutational decoys.
- Returns:
results – Dictionary with the results, containing ‘fragment_center’: center position of each window ‘win_size’: size of the sliding windows ‘native_energy’: native energy for each window ‘decoy_energy_av’: decoy energy average for each window ‘decoy_energy_std’: decoy energy standard deviation for each window ‘frustration’: total frustration index for each window
- Return type: