Frustratometer Functions

Overview

The frustratometer package includes several submodules each designed to handle different aspects of the frustration analysis pipeline. Below are the main modules and their descriptions.

Protein Database Handling

PDB functions

frustratometer.pdb.download(pdbID: str, directory: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/frustratometer/checkouts/latest/docs')) → Path[source]: Downloads a single pdb file

frustratometer.pdb.get_sequence(pdb_file: str, chain: str) → str[source]

Get a protein sequence from a pdb file

Parameters:

pdb (str,) – PDB file location.
chain (str,) – Chain ID of the selected protein.

Returns:

sequence – Protein sequence.

Return type:

str

frustratometer.pdb.get_distance_matrix(pdb_file: Path, chain: str, method: str = 'CB') → array[source]

Calculate the distance matrix of the specified atoms in a PDB file.

Parameters:: pdb_file (str): The path to the PDB file. chain (str): The chainID or chainIDs (space separated) of the protein. method (str): The method to use for calculating the distance matrix.

Defaults to ‘CB’, which uses the CB atom for all residues except GLY, which uses the CA atom. Options:

‘CA’ for using only the CA atom, ‘minimum’ for using the minimum distance between all atoms in each residue, ‘CB_force’ computes a new coordinate for the CB atom based on the CA, C, and N atoms and uses CB distance even for glycine.
Returns:: np.array: The distance matrix of the selected atoms.
Raises:: IndexError: If the selection of atoms is empty. ValueError: If the method is not recognized.

frustratometer.pdb.full_to_filtered_aligned_mapping(aligned_sequence: str, filtered_aligned_sequence: str)[source]

frustratometer.pdb.repair_pdb(pdb_file: str, chain: str, pdb_directory: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/frustratometer/checkouts/latest/docs')) → <MagicMock id='139837922944144'>[source]: Repairs a pdb file using pdbfixer

Protein Family Analysis

PFAM functions

This module includes functions to download and manipulate PFAM alignments. There are two options to obtain alignments from PFAM: downloading the database locally or downloading specific alignments.

If you want to download the database locally use the download function to download or update the database and the get function to select a single alignment.

If you want to retrieve a single alignment use the alignemnt function

Examples:

$ frustratometer.pfam.database(‘/media/databases’, https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.uniprot.gz)

Todo:

Create a function to get single alignments

frustratometer.pfam.download_database(path, name='PFAM_current', url='https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.uniprot.gz')[source]

Downloads and creates a pfam database in the Database folder

Parameters:

url (str) – Adress of pfam database
name (str) – Name of the new database folder

Returns:

alignment_path – Path of the alignments

Return type:

Path

frustratometer.pfam.get_alignment(pfamid, database_path)[source]

frustratometer.pfam.download_aligment(pfamid, output_file, alignment_type='full')[source]

‘ Retrieves a pfam family alignment from interpro

Parameters:

pfamid (str,) – ID of PFAM family. ex: PF00001
alignment_type (str,) – alignment type to retrieve. Options: full, seed, uniprot
output_file (str) – location of the output file. Default: Temporary file

Returns:

output – location of alignment

Return type:

Path

Sequence Alignment

Align functions

frustratometer.align.jackhmmer(sequence, output_file, database, log=-3, dry_run: bool = False, **kwargs)[source]

Generates alignment using jackhmmer

Parameters:

sequence (str) – Protein sequence
output_file (str) – If True, will record alignment output messages; Arguments: True OR False (Default is False)
database (str) – Location of the sequence database used to generation MSA
log (File handle) – jackhmmer output Default:DEVNULL
dry_run (bool (default: False)) – Save the temporary fasta_file and print the command instead of running
**kwargs – Other arguments that can be passed to jackhmmer. More information can be found by executing jackhmmer -h arguments without a value such as –noali should be passed as noali=True Common kwargs:

N: number of iterations E: E-value threshold

Returns:

output_file – Path of the alignment file

Return type:

Path

frustratometer.align.extract_sequences_from_alignment(alignment_file, output_file, database)[source]: Extracts the complete sequences of an stockholm alignment from a database to refine the alignment if needed

MSA Filtering

Filter functions

frustratometer.filter.filter_alignment(alignment_file, output_file=None, alignment_format='stockholm')[source]

Filter stockholm alignment by removing unaligned sections. Filters by saving the alignment into memory.

Parameters:

alignment_file (Path) – location of file
output_file (Path) – location of the output_file
alignment_format (str) – Biopython alignment format (Default: stockholm)

Returns:

output_file – location of the output_file

Return type:

Path

frustratometer.filter.filter_alignment_lowmem(alignment_file, output_file, alignment_format='stockholm')[source]

Filter stockholm alignment by removing unaligned sections. Filters by reading the file without saving to memory.

Parameters:

alignment_file (Path) – location of file
output_file (Path) – location of the output_file
alignment_format (str) – Biopython alignment format (Default: stockholm)

Returns:

output_file – location of the output_file

Return type:

Path

Mapping Utilities

MAP functions

frustratometer.map.get_pfamID(pdb_file, chain)[source]

Returns PFAM and Uniprot IDs

Parameters:

pdb_file (str) – pdb file name
chain (str) – Select chain from pdb

Returns:

pfamID – PFAM family ID

Return type:

str

frustratometer.map.get_pfam_map(pdbID, chain)[source]

Direct Coupling Analysis

DCA functions

frustratometer.dca.pydca.plmdca(filtered_alignment_file, sequence_type='protein', seqid=0.8, lambda_h=1.0, lambda_J=20.0, num_threads=10, max_iterations=500, verbose=False, msa_file_format='fasta')[source]

frustratometer.dca.pydca.mfdca(filtered_alignment_file)[source]

frustratometer.dca.matlab.compute_plm(protein_name, reweighting_threshold=0.1, nr_of_cores=1)[source]: Calculate Potts Model Fields and Couplings Terms :param protein_name :param reweighting_threshold :param nr_of_cores

frustratometer.dca.matlab.load_potts_model(potts_model_file)[source]

Frustration Analysis

Frustration functions

frustratometer.frustration.compute_mask(distance_matrix: array, maximum_contact_distance: float | None = None, minimum_sequence_separation: int | None = None) → array[source]

Computes a 2D Boolean mask from a given distance matrix based on a distance cutoff and a sequence separation cutoff.

Parameters:

distance_matrix (np.array) – A 2D array where the element at index [i, j] represents the spatial distance between residues i and j. This matrix is assumed to be symmetric.
maximum_contact_distance (float, optional) – The maximum distance of a contact. Pairs of residues with distances less than this threshold are marked as True in the mask. If None, the spatial distance criterion is ignored and all distances are included. Default is None.
minimum_sequence_separation (int, optional) – A minimum sequence distance threshold. Pairs of residues with sequence indices differing by at least this value are marked as True in the mask. If None, the sequence distance criterion is ignored. Default is None.

Returns:

mask – A 2D Boolean array of the same dimensions as distance_matrix. Elements of the mask are True where the residue pairs meet the specified distance_cutoff and sequence_distance_cutoff criteria.

Return type:

np.array

Examples

>>> import numpy as np
>>> dm = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]])
>>> print(compute_mask(dm, distance_cutoff=1.5, sequence_distance_cutoff=1))
[[False  True False]
 [ True False  True]
 [False  True False]]

Todo

Add chain information for sequence separation

frustratometer.frustration.compute_native_energy(seq: str, potts_model: dict, mask: array, ignore_gap_couplings: bool = False, ignore_gap_fields: bool = False) → float[source]

Computes the native energy of a protein sequence based on a given Potts model and an interaction mask.

\[E = \sum_i h_i + \frac{1}{2} \sum_{i,j} J_{ij} \Theta_{ij}\]

Parameters:

seq (str) – The amino acid sequence of the protein. The sequence is assumed to be in one-letter code. Gaps are represented as ‘-’. The length of the sequence (L) should match the dimensions of the Potts model.
potts_model (dict) – A dictionary containing the Potts model parameters ‘h’ (fields) and ‘J’ (couplings). The fields are a 2D array of shape (L, 20), where L is the length of the sequence and 20 is the number of amino acids. The couplings are a 4D array of shape (L, L, 20, 20). The fields and couplings are assumed to be in units of energy.
mask (np.array) – A 2D Boolean array that determines which residue pairs should be considered in the energy computation. The mask should have dimensions (L, L), where L is the length of the sequence.
ignore_couplings_of_gaps (bool, optional) – If True, couplings involving gaps (‘-’) in the sequence are set to 0 in the energy calculation. Default is False.
ignore_fields_of_gaps (bool, optional) – If True, fields corresponding to gaps (‘-’) in the sequence are set to 0 in the energy calculation. Default is False.

Returns:

energy – The computed energy of the protein sequence based on the Potts model and the interaction mask.

Return type:

float

Examples

>>> seq = "ACDEFGHIKLMNPQRSTVWY"
>>> potts_model = {
    'h': np.random.rand(20, 20),  # Random fields
    'J': np.random.rand(20, 20, 20, 20)  # Random couplings
}
>>> mask = np.ones((len(seq), len(seq)), dtype=bool) # Include all pairs
>>> energy = compute_native_energy(seq, potts_model, mask)
>>> print(f"Computed energy: {energy:.2f}")

Notes

The energy is computed as the sum of the fields and the half-sum of the couplings for all pairs of residues where the mask is True. The division by 2 for the couplings accounts for double-counting in symmetric matrices.

Todo

Optimize the computation.

frustratometer.frustration.compute_fields_energy(seq: str, potts_model: dict, ignore_fields_of_gaps: bool = False) → float[source]

frustratometer.frustration.compute_couplings_energy(seq: str, potts_model: dict, mask: array, ignore_couplings_of_gaps: bool = False) → float[source]

frustratometer.frustration.compute_sequences_energy(seqs: list, potts_model: dict, mask: array, split_couplings_and_fields=False) → array[source]

frustratometer.frustration.compute_singleresidue_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) → array[source]

$ Delta H_i = Delta h_i + sum_kDelta j_{ik} $

Parameters:

seq
potts_model
mask

Returns:

frustratometer.frustration.compute_mutational_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) → array[source]: $$ Delta DCA_{ij} = H_i - H_{i’} + H_{j}-H_{j’} + J_{ij} -J_{ij’} + J_{i’j’} - J_{i’j} + sum_k {J_{ik} - J_{i’k} + J_{jk} -J_{j’k}} $$ :param seq: :param potts_model: :param mask: :return:

frustratometer.frustration.compute_configurational_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) → array[source]: $$ Delta DCA_{ij} = H_i - H_{i’} + H_{j}-H_{j’} + J_{ij} -J_{ij’} + J_{i’j’} - J_{i’j} + sum_k {J_{ik} - J_{i’k} + J_{jk} -J_{j’k}} $$ :param seq: :param potts_model: :param mask: :return:

frustratometer.frustration.compute_contact_decoy_energy_fluctuation(seq: str, potts_model: dict, mask: array) → array[source]: $$ Delta DCA_{ij} = Delta j_{ij} $$ :param seq: :param potts_model: :param mask: :return:

frustratometer.frustration.compute_decoy_energy(seq: str, potts_model: dict, mask: array, kind='singleresidue')[source]: Calculates the decoy energy (Obsolete) :param seq: :param potts_model: :param mask: :param kind: :return:

frustratometer.frustration.compute_aa_freq(sequence, include_gaps=True)[source]

frustratometer.frustration.compute_contact_freq(sequence)[source]

frustratometer.frustration.compute_single_frustration(decoy_fluctuation, aa_freq=None, correction=0)[source]

frustratometer.frustration.compute_pair_frustration(decoy_fluctuation, contact_freq: None | array, correction=0) → array[source]

frustratometer.frustration.compute_scores(potts_model: dict) → array[source]

Computes contact scores based on the Frobenius norm

CN[i,j] = F[i,j] - F[i,:] * F[:,j] / F[:,:]

Parameters:: potts_model (dict) – Potts model containing the couplings in the “J” key
Returns:: scores – Score matrix (N x N)
Return type:: np.array

frustratometer.frustration.compute_roc(scores, distance_matrix, cutoff)[source]

frustratometer.frustration.compute_auc(roc)[source]

frustratometer.frustration.plot_roc(roc)[source]

frustratometer.frustration.plot_singleresidue_decoy_energy(decoy_energy, native_energy, method='clustermap')[source]

frustratometer.frustration.write_tcl_script(pdb_file, chain, single_frustration, pair_frustration, tcl_script='frustration.tcl', max_connections=100)[source]

frustratometer.frustration.call_vmd(pdb_file, tcl_script)[source]

frustratometer.frustration.canvas(with_attribution=True)[source]

Placeholder function to show example docstring (NumPy format).

Replace this function and doc string for your own project.

Parameters:: with_attribution (bool, Optional, default: True) – Set whether or not to display who the quote is from.
Returns:: quote – Compiled string including quote and optional attribution.
Return type:: str

Utility Functions

frustratometer.utils.create_directory(path)[source]

Creates a directory after checking it does not exists

Parameters:: path (str or Path) – Location of the new directory
Returns:: path – Path of the new directory
Return type:: Path

frustratometer.utils.get_pfamID(pdbID, chain)[source]

Returns PFAM and Uniprot IDs

Parameters:

pdbID (str) – pdbID (4 characters)
chain (str) – Select chain from pdb

Returns:

pfamID – PFAM family ID

Return type:

str