Frustratometer Functions

Overview

The frustratometer package includes several submodules each designed to handle different aspects of the frustration analysis pipeline. Below are the main modules and their descriptions.

Protein Database Handling

pdb: PDB functions This module includes functions to download and manipulate PDB files.

If you want to download a PDB file from the RCSB database, use the “download” function.

If you want to extract the structure’s sequence, use the “get_sequence” function.

If you want to extract the distance matrix of the structure’s contacts, use the “get_distance_matrix” function.

If you want to map the full sequence residue positions to the aligned sequence residue positions (may be needed if applying a distance threshold in DCA-related calculations), use the “full_to_filtered_aligned_mapping” function.

frustratometer.pdb.download(pdbID: str, directory: Path | str = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/frustratometer/checkouts/latest/docs')) Path[source]

Downloads a single pdb file

Parameters:
  • pdbID (str,) – PDB ID

  • directory (Path or str,) – Directory where PDB file will be downloaded.

Returns:

pdb_file – PDB file location.

Return type:

Path

frustratometer.pdb.get_sequence(pdb_file: str, chain: str) str[source]

Get a protein sequence from a pdb file

Parameters:
  • pdb_file (str,) – PDB file location.

  • chain (str,) – Chain ID of the selected protein.

Returns:

sequence – Protein sequence.

Return type:

str

frustratometer.pdb.get_distance_matrix(pdb_file: Path | str, chain: str, method: str = 'CB') array[source]

Calculate the distance matrix of the specified atoms in a PDB file.

Parameters:
  • pdb_file (Path or str) – The path to the PDB file.

  • chain (str) – The chainID or chainIDs (space separated) of the protein.

  • method (str) – The method to use for calculating the distance matrix. Defaults to ‘CB’, which uses the CB atom for all residues except GLY, which uses the CA atom. Options:

    ‘CA’ for using only the CA atom, ‘minimum’ for using the minimum distance between all atoms in each residue, ‘CB_force’ computes a new coordinate for the CB atom based on the CA, C, and N atoms and uses CB distance even for glycine.

  • Returns – np.array: The distance matrix of the selected atoms.

  • Raises – IndexError: If the selection of atoms is empty. ValueError: If the method is not recognized.

frustratometer.pdb.full_to_filtered_aligned_mapping(aligned_sequence: str, filtered_aligned_sequence: str) dict[source]

Get a dictionary mapping residue positions in the full pdb sequence to the aligned pdb sequence

Parameters:
  • aligned_sequence (str,) – Raw aligned PDB sequence.

  • filtered_aligned_sequence (str,) – Filtered aligned PDB sequence (columns with insertions and deletions, i.e. dashes, that are typically filtered in MSA file processing are removed)

Returns:

full_to_aligned_index_dict – Dictionary

Return type:

dict

frustratometer.pdb.repair_pdb(pdb_file: str, chain: str, pdb_directory: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/frustratometer/checkouts/latest/docs')) <MagicMock id='131595671639376'>[source]

Repairs a pdb or cif file using pdbfixer. Note that a pdb file will be produced, regardless of input file format

Parameters:
  • pdb_file (str,) – PDB file location.

  • chain (str,) – Chain ID

  • pdb_directory (str,) – PDB file location

Returns:

fixer – Repaired PDB Object

Return type:

object

Protein Family Analysis

pfam: PFAM functions

This module includes functions to download and manipulate PFAM alignments.

There are two options to obtain alignments from PFAM: downloading the database locally or downloading specific alignments.

If you want to download the database locally, use the “download_database” function to download or update the database. Then, use the “get_alignment” function to select a single alignment from the downloaded database.

If you want to retrieve a single alignment from the PFAM database, use the “download_alignment” function

Examples:

$ frustratometer.pfam.database(‘/media/databases’, https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.uniprot.gz)

Todo:
  • Create a function to get single alignments

frustratometer.pfam.download_database(path: Path | str, name: str = 'PFAM_current', url: Path | str = 'https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.uniprot.gz') Path[source]

Downloads and creates a pfam database in the Database folder

Parameters:
  • path (str) – Adress of pfam database

  • name (str) – Name of the new database folder

Returns:

alignments_path – Path of the local database of alignments

Return type:

Path

frustratometer.pfam.get_alignment(pfamid: str, database_path: Path | str) Path[source]

Retrieves a pfam family alignment from local database

Parameters:
  • pfamid (str) – ID of PFAM family. ex: PF00001

  • database_path (str) – Address of local database

Returns:

alignment_file – location of alignment

Return type:

Path

frustratometer.pfam.download_aligment(pfamid: str, output_file: Path | str, alignment_type: str = 'full') bytes[source]

Retrieves a pfam family alignment from interpro

Parameters:
  • pfamid (str,) – ID of PFAM family. ex: PF00001

  • output_file (Path) – location of the output file. Default: Temporary file

  • alignment_type (str,) – alignment type to retrieve. Options: full, seed, uniprot

Returns:

output_file – Output file (in stockholm format)

Return type:

Path

Sequence Alignment

Align functions

frustratometer.align.jackhmmer(sequence, output_file, database, log=-3, dry_run: bool = False, **kwargs) Path[source]

Generates alignment using jackhmmer

Parameters:
  • sequence (str) – Protein sequence

  • output_file (str) – If True, will record alignment output messages; Arguments: True OR False (Default is False)

  • database (str) – Location of the sequence database used to generation MSA

  • log (File handle) – jackhmmer output Default:DEVNULL

  • dry_run (bool (default: False)) – Save the temporary fasta_file and print the command instead of running

  • **kwargs – Other arguments that can be passed to jackhmmer. More information can be found by executing jackhmmer -h arguments without a value such as –noali should be passed as noali=True Common kwargs:

    N: number of iterations E: E-value threshold

Returns:

output_file – Path of the alignment file

Return type:

Path

frustratometer.align.extract_sequences_from_alignment(alignment_file, output_file, database)[source]

Extracts the complete sequences of an stockholm alignment from a database to refine the alignment if needed

MSA Filtering

Filter functions

frustratometer.filter.filter_alignment(alignment_file: Path | str, output_file: str, alignment_format: str = 'stockholm') Path[source]

Filter stockholm alignment by removing unaligned sections. Filters by saving the alignment into memory.

Parameters:
  • alignment_file (Path or str) – Alignment file name

  • output_file (str) – Output file name

  • alignment_format (str) – Biopython alignment format (Default: stockholm)

Returns:

output_file – location of the output_file

Return type:

Path

frustratometer.filter.filter_alignment_lowmem(alignment_file: str, output_file: str, alignment_format: str = 'stockholm') Path[source]

Filter stockholm alignment by removing unaligned sections. Filters by reading the file without saving to memory.

Parameters:
  • alignment_file (str) – location of file

  • output_file (str) – location of the output_file

  • alignment_format (str) – Biopython alignment format (Default: stockholm)

Returns:

output_file – location of the output_file

Return type:

str

Mapping Utilities

map: MAP functions

This module contains functions that identify the PFAM ID of the protein family associated with the given PDB.

frustratometer.map.get_pfamID(pdb_file, chain)[source]

Returns PFAM and Uniprot IDs

Parameters:
  • pdb_file (str) – pdb file name

  • chain (str) – Select chain from pdb

Returns:

pfamID – PFAM family ID

Return type:

str

Direct Coupling Analysis

DCA functions

frustratometer.dca.pydca.plmdca(filtered_alignment_file: str, sequence_type: str = 'protein', seqid: float = 0.8, lambda_h: float = 1.0, lambda_J: float = 20.0, num_threads: int = 10, max_iterations: int = 500, verbose: bool = False, msa_file_format: str = 'fasta') dict[source]
frustratometer.dca.pydca.mfdca(filtered_alignment_file: str) dict[source]
frustratometer.dca.matlab.compute_plm(protein_name, reweighting_threshold=0.1, nr_of_cores=1)[source]

Calculate Potts Model Fields and Couplings Terms :param protein_name :param reweighting_threshold :param nr_of_cores

frustratometer.dca.matlab.load_potts_model(potts_model_file)[source]

Frustration Analysis

frustration: Frustration functions

This module includes functions related to frustration index, as well as native and decoy energy calculations. This module also features functions such as “write_tcl_script” and “call_vmd,” which can be used to visualize frustration patterns onto a PDB structure.

frustratometer.frustration.compute_mask(distance_matrix: array, maximum_contact_distance: float | None = None, minimum_sequence_separation: int | None = None) array[source]

Computes a 2D Boolean mask from a given distance matrix based on a distance cutoff and a sequence separation cutoff.

Parameters:
  • distance_matrix (np.array) – A 2D array where the element at index [i, j] represents the spatial distance between residues i and j. This matrix is assumed to be symmetric.

  • maximum_contact_distance (float, optional) – The maximum distance of a contact. Pairs of residues with distances less than this threshold are marked as True in the mask. If None, the spatial distance criterion is ignored and all distances are included. Default is None.

  • minimum_sequence_separation (int, optional) – A minimum sequence distance threshold. Pairs of residues with sequence indices differing by at least this value are marked as True in the mask. If None, the sequence distance criterion is ignored. Default is None.

Returns:

mask – A 2D Boolean array of the same dimensions as distance_matrix. Elements of the mask are True where the residue pairs meet the specified distance_cutoff and sequence_distance_cutoff criteria.

Return type:

np.array

Examples

>>> import numpy as np
>>> dm = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]])
>>> print(compute_mask(dm, distance_cutoff=1.5, sequence_distance_cutoff=1))
[[False  True False]
 [ True False  True]
 [False  True False]]

Todo

Add chain information for sequence separation

frustratometer.frustration.compute_native_energy(seq: str, potts_model: dict, mask: array, ignore_gap_couplings: bool = False, ignore_gap_fields: bool = False) float[source]

Computes the native energy of a protein sequence based on a given Potts model and an interaction mask.

\[E = \sum_i h_i + \frac{1}{2} \sum_{i,j} J_{ij} \Theta_{ij}\]
Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

  • ignore_couplings_of_gaps (bool, optional) – If True, couplings involving gaps (‘-’) in the sequence are set to 0 in the energy calculation. Default is False.

  • ignore_fields_of_gaps (bool, optional) – If True, fields corresponding to gaps (‘-’) in the sequence are set to 0 in the energy calculation. Default is False.

Returns:

energy – The computed energy of the protein sequence based on the Potts model and the interaction mask.

Return type:

float

Examples

>>> seq = "ACDEFGHIKLMNPQRSTVWY"
>>> potts_model = {
    'h': np.random.rand(20, 20),  # Random fields
    'J': np.random.rand(20, 20, 20, 20)  # Random couplings
}
>>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs
>>> energy = compute_native_energy(seq, potts_model, mask)
>>> print(f"Computed energy: {energy:.2f}")

Notes

The energy is computed as the sum of the fields and the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.

Todo

Optimize the computation.

frustratometer.frustration.compute_fields_energy(seq: str, potts_model: dict, ignore_fields_of_gaps: bool = False) float[source]

Computes the fields energy of a protein sequence based on a given Potts model.

\[E = \sum_i h_i\]
Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • ignore_fields_of_gaps (bool, optional) – If True, fields corresponding to gaps (‘-’) in the sequence are set to 0 in the energy calculation. Default is False.

Returns:

fields_energy – The computed fields energy of the protein sequence based on the Potts model

Return type:

float

Examples

>>> seq = "ACDEFGHIKLMNPQRSTVWY"
>>> potts_model = {
    'h': np.random.rand(20, 20),  # Random fields
    'J': np.random.rand(20, 20, 20, 20)  # Random couplings
}
>>> fields_energy = compute_fields_energy(seq, potts_model)
>>> print(f"Computed fields energy: {fields_energy:.2f}")
frustratometer.frustration.compute_couplings_energy(seq: str, potts_model: dict, mask: array, ignore_couplings_of_gaps: bool = False) float[source]

Computes the couplings energy of a protein sequence based on a given Potts model and an interaction mask.

\[E = \frac{1}{2} \sum_{i,j} J_{ij} \Theta_{ij}\]
Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

  • ignore_couplings_of_gaps (bool, optional) – If True, couplings involving gaps (‘-’) in the sequence are set to 0 in the energy calculation. Default is False.

Returns:

couplings_energy – The computed couplings energy of the protein sequence based on the Potts model and the interaction mask.

Return type:

float

Examples

>>> seq = "ACDEFGHIKLMNPQRSTVWY"
>>> potts_model = {
    'h': np.random.rand(20, 20),  # Random fields
    'J': np.random.rand(20, 20, 20, 20)  # Random couplings
}
>>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs
>>> couplings_energy = compute_couplings_energy(seq, potts_model, mask)
>>> print(f"Computed couplings energy: {couplings_energy:.2f}")

Notes

The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.

Todo

Optimize the computation.

frustratometer.frustration.compute_sequences_energy(seqs: list, potts_model: dict, mask: array, split_couplings_and_fields=False) array[source]

Computes the energy of multiple protein sequences based on a given Potts model and an interaction mask.

\[E = \sum_i h_i + \frac{1}{2} \sum_{i,j} J_{ij} \Theta_{ij}\]
Parameters:
  • seqs (list) – List of amino acid sequences in string format, separated by commas. The sequences are assumed to be in one-letter code. Gaps are represented as ‘-’. The length of each sequence (L) should all match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequences and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequences.

  • split_couplings_and_fields (bool, optional) – If True, two lists of the sequences’ couplings and fields energies are returned. Default is False.

Returns:

  • energy (if split_couplings_and_fields==False) (float) – The computed energies of the protein sequences based on the Potts model and the interaction mask.

  • fields_couplings_energy (if split_couplings_and_fields==True) (np.array) – Array containing computed fields and couplings energies of the protein sequences based on the Potts model and the interaction mask.

Examples

>>> seq_list = ["ACDEFGHIKLMNPQRSTVWY","AKLWYMNPQRSTCDEFGHIV"]
>>> potts_model = {
    'h': np.random.rand(20, 20),  # Random fields
    'J': np.random.rand(20, 20, 20, 20)  # Random couplings
}
>>> mask = np.ones((len(seq_list[0]), len(seq_list[0])), dtype=bool) # Include all pairs
>>> energies = compute_sequences_energy(seq_list, potts_model, mask)
>>> print(f"Sequence 1 energy: {energies[0]:.2f}")
>>> print(f"Sequence 2 energy: {energies[1]:.2f}")

Notes

The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.

Todo

Optimize the computation.

frustratometer.frustration.compute_singleresidue_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) array[source]

Computes a (Lx21) matrix for a sequence of length L. Row i contains all possible changes in energy upon mutating residue i.

\[\Delta H_i = \Delta h_i + \sum_k \Delta j_{ik}\]
Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

Returns:

decoy_energy – (Lx21) matrix describing the energetic changes upon mutating a single residue.

Return type:

np.array

Examples

>>> seq = "ACDEFGHIKLMNPQRSTVWY"
>>> potts_model = {
    'h': np.random.rand(20, 20),  # Random fields
    'J': np.random.rand(20, 20, 20, 20)  # Random couplings
}
>>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs
>>> decoy_energy = compute_singleresidue_decoy_energy_fluctuation(seq, potts_model, mask)
>>> print(f"Matrix of Residue Decoy Energy Fluctuations: "); print(decoy_energy)
>>> print(f"Matrix Size: "); print(shape(decoy_energy))

Notes

The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.

Todo

Optimize the computation.

frustratometer.frustration.compute_mutational_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) array[source]

Computes a (LxLx21x21) matrix for a sequence of length L. Matrix[i,j] describes all possible changes in energy upon mutating residue i and j simultaneously.

\[\Delta H_{ij} = H_i - H_{i'} + H_{j}-H_{j'} + J_{ij} -J_{ij'} + J_{i'j'} - J_{i'j} + \sum_k {J_{ik} - J_{i'k} + J_{jk} -J_{j'k}}\]
Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

Returns:

decoy_energy2 – (LxLx21x21) matrix describing the energetic changes upon mutating two residues simultaneously.

Return type:

np.array

Examples

>>> seq = "ACDEFGHIKLMNPQRSTVWY"
>>> potts_model = {
    'h': np.random.rand(20, 20),  # Random fields
    'J': np.random.rand(20, 20, 20, 20)  # Random couplings
}
>>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs
>>> decoy_energy2 = compute_mutational_decoy_energy_fluctuation(seq, potts_model, mask)
>>> print(f"Matrix of Contact Mutational Decoy Energy Fluctuations: "); print(decoy_energy2)
>>> print(f"Matrix Size: "); print(shape(decoy_energy2))

Notes

The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.

Todo

Optimize the computation.

frustratometer.frustration.compute_configurational_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) array[source]

Computes a (LxLx21x21) matrix for a sequence of length L. Matrix[i,j] describes all possible changes in energy upon mutating and altering the local densities of residue i and j simultaneously.

\[\Delta H_{ij} = H_i - H_{i'} + H_{j}-H_{j'} + J_{ij} -J_{ij'} + J_{i'j'} - J_{i'j} + \sum_k {J_{ik} - J_{i'k} + J_{jk} -J_{j'k}}\]
Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

Returns:

decoy_energy2 – (LxLx21x21) matrix describing the energetic changes upon mutating and altering the local densities of two residues simultaneously.

Return type:

np.array

Examples

>>> seq = "ACDEFGHIKLMNPQRSTVWY"
>>> potts_model = {
    'h': np.random.rand(20, 20),  # Random fields
    'J': np.random.rand(20, 20, 20, 20)  # Random couplings
}
>>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs
>>> decoy_energy2 = compute_configurational_decoy_energy_fluctuation(seq, potts_model, mask)
>>> print(f"Matrix of Contact Configurational Decoy Energy Fluctuations: "); print(decoy_energy2)
>>> print(f"Matrix Size: "); print(shape(decoy_energy2))

Notes

The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.

Todo

Optimize the computation.

frustratometer.frustration.compute_contact_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) array[source]

$$ Delta DCA_{ij} = Delta j_{ij} $$ :param seq: :param potts_model: :param mask: :return:

frustratometer.frustration.compute_decoy_energy(seq: str, potts_model: dict, mask: array, kind='singleresidue') array[source]

Computes all possible decoy energies.

Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

  • kind (str) – Kind of decoys generated. Options: “singleresidue,” “mutational,” “configurational,” and “contact.”

Returns:

decoy_energy – Matrix describing all possible decoy energies.

Return type:

np.array

Examples

>>> seq = "ACDEFGHIKLMNPQRSTVWY"
>>> potts_model = {
    'h': np.random.rand(20, 20),  # Random fields
    'J': np.random.rand(20, 20, 20, 20)  # Random couplings
}
>>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs
>>> kind = "singleresidue"
>>> decoy_energy = compute_decoy_energy(seq, potts_model, mask, kind)
>>> print(f"Matrix of Single Residue Decoy Energo: "); print(decoy_energy2)
>>> print(f"Matrix Size: "); print(shape(decoy_energy2))

Notes

The couplings energy is computed as the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.

Todo

Optimize the computation.

frustratometer.frustration.compute_aa_freq(seq, include_gaps=True)[source]

Calculates amino acid frequencies in given sequence

Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-‘.

  • include_gaps (bool) – If True, frequencies of gaps (‘-’) in the sequence are set to 0. Default is True.

Returns:

aa_freq – Array of frequencies of all 21 possible amino acids within sequence

Return type:

np.array

frustratometer.frustration.compute_contact_freq(seq)[source]

Calculates contact frequencies in given sequence

Parameters:

seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-‘.

Returns:

contact_freq – 21x21 array of frequencies of all possible contacts within sequence.

Return type:

np.array

frustratometer.frustration.compute_single_frustration(decoy_fluctuation, aa_freq=None, correction=0)[source]

Calculates single residue frustration indices

Parameters:
  • decoy_fluctuation (np.array) – (Lx21) matrix for a sequence of length L, describing the energetic changes upon mutating a single residue.

  • aa_freq (np.array) – Array of frequencies of all 21 possible amino acids within sequence

Returns:

frustration – Array of length L featuring single residue frustration indices.

Return type:

np.array

frustratometer.frustration.compute_pair_frustration(decoy_fluctuation, contact_freq: None | array, correction=0) array[source]

Calculates pair residue frustration indices

Parameters:
  • decoy_fluctuation (np.array) – (LxLx21x21) matrix for a sequence of length L, describing the energetic changes upon mutating two residues simultaneously.

  • contact_freq (np.array) – 21x21 array of frequencies of all possible contacts within sequence.

Returns:

contact_frustration – LxL array featuring pair frustration indices (mutational or configurational frustration, depending on decoy_fluctuation matrix provided)

Return type:

np.array

frustratometer.frustration.compute_scores(potts_model: dict) array[source]

Computes contact scores based on the Frobenius norm

\[CN[i,j] = \frac{F[i,j] - F[i,:] * F[:,j}{F[:,:]}\]
Parameters:

potts_model (dict) – Potts model containing the couplings in the “J” key

Returns:

corr_norm – Contact score matrix (N x N)

Return type:

np.array

frustratometer.frustration.compute_roc(scores, distance_matrix, cutoff)[source]

Computes Receiver Operating Characteristic (ROC) curve of predicted and true contacts (identified from the distance matrix).

Parameters:
  • scores (np.array) – Contact score matrix (N x N)

  • distance_matrix (np.array) – LxL array for sequence of length L, describing distances between contacts

  • cutoff (float) – Distance cutoff for contacts

Returns:

roc_score – Array containing lists of false and true positive rates

Return type:

np.array

frustratometer.frustration.compute_auc(roc_score)[source]

Computes Area Under Curve (AUC) of calculated ROC distribution

Parameters:

roc_score (np.array) – Array containing lists of false and true positive rates

Returns:

auc – AUC value

Return type:

float

frustratometer.frustration.plot_roc(roc_score)[source]

Plot ROC distribution

Parameters:

roc_score (np.array) – Array containing lists of false and true positive rates

frustratometer.frustration.plot_singleresidue_decoy_energy(decoy_energy, native_energy, method='clustermap')[source]

Plot comparison of single residue decoy energies, relative to the native energy

Parameters:
  • decoy_energy (np.array) – Lx21 array of decoy energies

  • native_energy (float) – Native energy value

  • method (str) – Options: “clustermap”, “heatmap”

frustratometer.frustration.write_tcl_script(pdb_file: Path | str, chain: str, mask: array, distance_matrix: array, distance_cutoff: float, single_frustration: array, pair_frustration: array, tcl_script: Path | str = 'frustration.tcl', max_connections: int = None, movie_name: Path | str = None, still_image_name: Path | str = None) Path | str[source]

Writes a tcl script that can be run with VMD to superimpose the frustration patterns onto the corresponding PDB structure.

Parameters:
  • pdb_file (Path or str) – pdb file name

  • chain (str) – Select chain from pdb

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

  • distance_matrix (np.array) – LxL array for sequence of length L, describing distances between contacts

  • distance_cutoff (float) – Maximum distance at which a contact occurs

  • single_frustration (np.array) – Array containing single residue frustration index values

  • pair_frustration (np.array) – Array containing pair (ex. configurational, mutational, contact) frustration index values

  • tcl_script (Path or str) – Output tcl script file with static structure

  • max_connections (int) – Maximum number of pair frustration values visualized in tcl file

  • movie_name (Path or str) – Output movie file with rotating structure

  • still_image_name (Path or str) – Output image file with still image

Returns:

tcl_script – tcl script file

Return type:

Path or str

frustratometer.frustration.call_vmd(pdb_file: Path | str, tcl_script: Path | str)[source]

Calls VMD program with given pdb file and tcl script to visualize frustration patterns

Parameters:
  • pdb_file (Path or str) – pdb file name

  • tcl_script (Path or str) – Output tcl script file with static structure

frustratometer.frustration.canvas(with_attribution=True)[source]

Placeholder function to show example docstring (NumPy format).

Replace this function and doc string for your own project.

Parameters:

with_attribution (bool, Optional, default: True) – Set whether or not to display who the quote is from.

Returns:

quote – Compiled string including quote and optional attribution.

Return type:

str

frustratometer.frustration.make_decoy_seqs(seq, ndecoys=1000)[source]

Creates permutated, decoy sequences using a given sequence residue composition and length.

Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • n_decoys (int) – Number of sequence decoys to create

Returns:

decoy_seqs – List of decoy sequences. The sequences are assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequences (L) should match the dimensions of the Potts model.

Return type:

list

frustratometer.frustration.compute_fragment_mask(mask: array, fragment_pos: array) array[source]

Creates a mask for a sequence fragment such that: - position i belongs to the fragment, all j - position j belongs to the fragment, all i

The new mask consider all the interactions within the fragment and also the interactions between the fragment and other sequence positions.

masknp.array

A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

fragment_pos: np:array

Array of sequence positions selected.

fragment_mask: np.array

New 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

frustratometer.frustration.compute_fragment_total_native_energy(seq: str, potts_model: dict, mask: array, fragment_pos: None | array = None, fragment_in_context=False) float[source]

Calculates the energy for the complete protein or for a fragment in context

seqstr

The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

potts_modeldict

A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

masknp.array

A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

fragment_pos: np:array

Array of sequence positions selected.

fragment_in_context: bool

If True, the energetics calculations take into account the interactions between the fragment and other sequence positions

energy: float

Native energy of the protein

frustratometer.frustration.compute_fragment_total_decoy_energy(decoy_seqs: list, potts_model: dict, mask: array, fragment_pos: None | array = None, fragment_in_context=False, split_couplings_and_fields=False, config_decoys=False, msa_mask=1) array[source]

Calculates decoy energies for the complete protein or for a fragment in context

decoy_seqslist

List of decoy sequences. The sequences are assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequences (L) should match the dimensions of the Potts model.

potts_modeldict

A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

masknp.array

A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

fragment_pos: np:array

Array of sequence positions selected.

fragment_in_context: bool

If True, the energetics calculations take into account the interactions between the fragment and other sequence positions

split_couplings_and_fields: bool

Separate output into coupling and local fields contribution to energy.

config_decoys: bool

If True, use the configurational decoys approximation, shuffling index positions for configurational decoys energy calculation. If False, mutational decoys.

msa_mask: np.array

Extra mask to use a Multiple Sequence Alignment that do not cover completely the reference PDB

energy: np.array

Decoy energies

frustratometer.frustration.compute_total_frustration(seq, potts_model, mask, ndecoys=1000, config_decoys=False, msa_mask=1, fragment_pos=None, fragment_in_context=False, output_kind='frustration')[source]

Calculates the total frustration of a protein fragment.

Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

  • n_decoys (int) – Number of sequence decoys to create

  • config_decoys (bool) – If True, use the configurational decoys approximation, shuffling index positions for configurational decoys energy calculation. If False, mutational decoys.

  • msa_mask (np.array) – Extra mask to use a Multiple Sequence Alignment that do not cover completely the reference PDB

  • fragment_pos (np.array) – Fragment positions. If None, use the complete model

  • fragment_in_context (bool) – If True, the energetics calculations take into account the interactions between the fragment and other sequence positions

  • output_kind (str) – If ‘frustration’, returns frustration. If not, returns native energy, decoy energy average and decoy energy standard deviation.

Returns:

  • total_frustration (float) – Total frustration of the fragment or complete protein

  • native_energy (float) – Native energy of the given sequence

  • decoy_energy_average (float) – Average of the decoy energy distribution

  • decoy_energy_std (float) – Standard deviation of the decoy energy distribution

frustratometer.frustration.compute_native_h_J(seq: str, potts_model: dict, mask: array) tuple[source]

Computes the applied fields h_i(a_i) and J_{ij}(a_i,a_j) for the residues a_i of the sequence

Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

Returns:

  • h (np.array) – Values of the local field for the sequence.

  • j (np.array) – Values of the coupling field for the sequence.

frustratometer.frustration.compute_decoy_h_J(decoy_seqs: list, potts_model: dict, mask: array, config_decoys: bool = False) tuple[source]

Computes the applied fields h_i(a_i) and J_{ij}(a_i,a_j) for the residues a_i of a set of decoy sequences

Parameters:
  • decoy_seqs (list) – List of decoy sequences. The sequences are assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequences (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

  • config_decoys (bool) – If True, use the configurational decoys approximation, shuffling index positions for configurational decoys energy calculation. If False, mutational decoys.

Returns:

  • h (np.array) – Values of the local field for each decoy sequence.

  • j (np.array) – Values of the coupling field for each decoy sequence.

frustratometer.frustration.compute_native_fragment_energy_from_h_j(fragment_pos: array, h: array, j: array, mask: array) float[source]

Computes the energy from the applied fields h_i(a_i) and J_{ij}(a_i,a_j)

Parameters:
  • fragment_pos (np:array) – Array of sequence positions selected.

  • h (np.array) – Values of the local field for each decoy sequence.

  • j (np.array) – Values of the coupling field for each decoy sequence.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

Returns:

energy – Native energy of the protein

Return type:

float

frustratometer.frustration.compute_decoy_fragment_energy_from_h_j(fragment_pos: array, h: array, j: array, mask: array) tuple[source]

Computes the energy from the applied fields h_i(a_i) and J_{ij}(a_i,a_j) for each decoy sequence.

Parameters:
  • fragment_pos (np:array) – Array of sequence positions selected.

  • h (np.array) – Values of the local field for each decoy sequence.

  • j (np.array) – Values of the coupling field for each decoy sequence.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

Returns:

  • energy_average (float) – Average of the decoy energies

  • energy_std (float) – Standard deviation of the decoy energies

frustratometer.frustration.compute_energy_sliding_window(seq: str, potts_model: dict, mask: array, win_size: int, ndecoys: int, config_decoys: bool) dict[source]

Computes the total frustration, the native energy, the decoy average energy and the decoy standard deviation for fragments on a sliding window

Parameters:
  • seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.

  • potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.

  • mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.

  • win_size (int) – Size of the sliding window

  • ndecoys (int) – Number of decoy sequences to use

  • config_decoys (bool) – If True, use the configurational decoys approximation, shuffling index positions for configurational decoys energy calculation. If False, mutational decoys.

Returns:

results – Dictionary with the results, containing ‘fragment_center’: center position of each window ‘win_size’: size of the sliding windows ‘native_energy’: native energy for each window ‘decoy_energy_av’: decoy energy average for each window ‘decoy_energy_std’: decoy energy standard deviation for each window ‘frustration’: total frustration index for each window

Return type:

dict

Utility Functions

frustratometer.utils.create_directory(path)[source]

Creates a directory after checking it does not exists

Parameters:

path (str or Path) – Location of the new directory

Returns:

path – Path of the new directory

Return type:

Path

frustratometer.utils.get_pfamID(pdbID, chain)[source]

Returns PFAM and Uniprot IDs

Parameters:
  • pdbID (str) – pdbID (4 characters)

  • chain (str) – Select chain from pdb

Returns:

pfamID – PFAM family ID

Return type:

str