xenonpy.inverse.iqspr package

Submodules

xenonpy.inverse.iqspr.estimator module

class xenonpy.inverse.iqspr.estimator.GaussianLogLikelihood(descriptor, *, targets={}, **estimators)[source]

Bases: BaseLogLikelihood

Gaussian log-likelihood.

Parameters:
  • descriptor (BaseFeaturizer or BaseDescriptor) – Descriptor calculator.

  • estimators (BaseEstimator) –

    Gaussian estimators that follow the scikit-learn style. Each estimator must provide a method named predict that accepts descriptors as input and returns (mean, std), in that order. By default, BayesianRidge is used.

  • targets (dict) – Lower and upper bounds for each property, used to calculate the Gaussian CDF probability.

fit(smiles, y=None, *, X_scaler=None, y_scaler=None, **kwargs)[source]

By default, data rows containing NaN are removed automatically.

Parameters:
  • smiles (list[str]) – SMILES for training.

  • y (pandas.DataFrame) – Target properties for training.

  • X_scaler (Scaler (optional, not implemented)) – Scaler for transforming X.

  • y_scaler (Scaler (optional, not implemented)) – Scaler for transforming y.

  • kwargs (dict) – Parameters passed to the BayesianRidge initializer.

log_likelihood(smis, *, log_0=-1000.0, **targets)[source]

Log likelihood

Parameters:
  • smis (list[object]) – Input samples for likelihood calculation.

  • targets (tuple[float, float]) – Target area for each property, given as a (lower, upper) tuple. For example, target1=(10, 20) means target1 should fall in the range [10, 20].

Returns:

log_likelihood – Estimated log-likelihood of each sample’s property values. Cannot be pd.Series!

Return type:

pd.DataFrame of float (columns: properties, rows: samples)
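
The idea of scoring a sample by the Gaussian probability that its property falls inside a target range can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: normal_cdf and interval_log_likelihood are hypothetical helper names, and log_0 is assumed here to be the value substituted when the log-probability is undefined (the signature above only shows its default).

```python
import math

def normal_cdf(x, mean, std):
    """CDF of the normal distribution N(mean, std**2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def interval_log_likelihood(mean, std, low, high, log_0=-1000.0):
    """Log of P(low <= y <= high) for y ~ N(mean, std**2).

    Assumption: log_0 stands in for log(0) when the probability
    underflows to zero.
    """
    prob = normal_cdf(high, mean, std) - normal_cdf(low, mean, std)
    return math.log(prob) if prob > 0.0 else log_0

# A prediction of 15 +/- 5 against the target range (10, 20)
# covers one standard deviation on each side of the mean.
score = interval_log_likelihood(mean=15.0, std=5.0, low=10.0, high=20.0)
```

A prediction centred in the target range with a small standard deviation scores close to 0, while a prediction far outside the range falls back to log_0.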

predict(smiles, **kwargs)[source]
remove_estimator(*properties)[source]

Remove estimators from estimator set.

Parameters:

properties (str) – Names of the properties whose estimators will be removed from the estimator set.

update_targets(*, reset=False, **targets)[source]

Update/set the target area.

Parameters:
  • reset (bool) – If True, reset the target area.

  • targets (tuple[float, float]) – Target area for each property, given as a (lower, upper) tuple. For example, target1=(10, 20) means target1 should fall in the range [10, 20].
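
The update/reset behaviour described above can be pictured with a standalone function over a plain dict. This is only a sketch: the function below is hypothetical, and it assumes that reset=True discards all existing targets before the new ones are applied, which the description does not spell out.

```python
def update_targets(current, *, reset=False, **targets):
    """Return a new target mapping: property name -> (lower, upper).

    Assumption: with reset=True the existing targets are discarded
    before the keyword targets are applied.
    """
    updated = {} if reset else dict(current)
    updated.update(targets)
    return updated

targets = update_targets({}, bandgap=(1.0, 2.0))
targets = update_targets(targets, density=(0.8, 1.2))            # adds a property
targets = update_targets(targets, reset=True, bandgap=(2.0, 3.0))  # starts over
```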

property timer

xenonpy.inverse.iqspr.iqspr module

class xenonpy.inverse.iqspr.iqspr.IQSPR(*, estimator, modifier, r_ESS=1)[source]

Bases: BaseSMC

SMC iqspr runner (assume data type of samples = list or np.array).

Parameters:
  • estimator (BaseLogLikelihood or BaseLogLikelihoodSet) – Log likelihood estimator for given input samples.

  • modifier (BaseProposal) – Modify given input samples to new ones.

  • r_ESS (float) – r_ESS * sample_size is the upper threshold of the ESS (effective sample size) used in SMC resampling. Resampling happens only if the calculated ESS is smaller than or equal to this threshold. Since 1 <= ESS <= sample_size, any r_ESS < 1/sample_size means resampling never happens, and any r_ESS >= 1 means resampling always happens. Default is 1, i.e., resample at each step of SMC.
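
The r_ESS rule can be illustrated with a short sketch. It assumes the common ESS definition 1 / sum(w_i^2) over normalized weights, which is consistent with the bounds 1 <= ESS <= sample_size stated above but is not confirmed by this page; both function names are hypothetical.

```python
def effective_sample_size(weights):
    """ESS = 1 / sum of squared normalized weights (assumed definition)."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)

def should_resample(weights, r_ess=1.0):
    """Resample only if ESS <= r_ess * sample_size."""
    return effective_sample_size(weights) <= r_ess * len(weights)

uniform = [1.0, 1.0, 1.0, 1.0]      # ESS equals the sample size
degenerate = [1.0, 0.0, 0.0, 0.0]   # ESS equals 1

always = should_resample(uniform, r_ess=1.0)      # threshold at sample_size
never = should_resample(degenerate, r_ess=0.2)    # 0.2 < 1/sample_size
```

With uniform weights the ESS reaches its maximum, so only r_ess >= 1 triggers resampling; with a single dominant weight the ESS drops to 1, and resampling is skipped only when r_ess is below 1/sample_size.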

resample(sims, freq, size, p)[source]

Re-sample from given samples.

Parameters:
  • sims (list[object]) – Input samples to re-sample from. Can be changed to accept other data types.

  • freq (list[int]) – Frequency of each input sample.

  • size (int) – Resample size.

  • p (numpy.ndarray[float]) – The probabilities associated with each entry in sims. If not given, a uniform distribution over all entries is assumed.

Returns:

re-sample – Re-sampling result.

Return type:

list of object
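
Weighted re-sampling with replacement can be sketched with the standard library. This is an illustration of the technique rather than the method's actual body: the helper below is hypothetical, it ignores the freq argument, and the seed parameter is added only to make the example reproducible.

```python
import random

def resample(sims, size, p=None, seed=None):
    """Draw `size` samples from `sims` with replacement.

    `p` gives the selection probability of each entry; None means uniform.
    """
    rng = random.Random(seed)
    return rng.choices(sims, weights=p, k=size)

pool = ["CCO", "CCN", "c1ccccc1"]
# All probability mass on the second entry, so every draw returns it.
picked = resample(pool, size=5, p=[0.0, 1.0, 0.0], seed=0)
```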

property estimator
property modifier
property timer

xenonpy.inverse.iqspr.iqspr4df module

class xenonpy.inverse.iqspr.iqspr4df.IQSPR4DF(*, estimator, modifier, r_ESS=1, sample_col=None)[source]

Bases: BaseSMC

SMC iqspr runner (assume data type of samples = pd.DataFrame).

Parameters:
  • estimator (BaseLogLikelihood or BaseLogLikelihoodSet) – Log likelihood estimator for given input samples.

  • modifier (BaseProposal) – Modify given input samples to new ones.

  • r_ESS (float) – r_ESS*sample_size = Upper threshold of ESS (effective sample size) using in SMC resampling. Resample will happen only if calculated ESS is smaller or equal to the upper threshold. As 1 <= ESS <= sample_size, picking any r_ESS < 1/sample_size will lead to never resample; picking any r_ESS >= 1 will lead to always resample. Default is 1, i.e., resample at each step of SMC.

  • sample_col (list or str) – Name(s) of columns that will be used to extract unique samples in the unique function. Default is None, which means all columns are used.

resample(sims, freq, size, p)[source]

Re-sample from given samples.

Parameters:
  • sims (list[object]) – Input samples to re-sample from. Can be changed to accept other data types.

  • freq (list[int]) – Frequency of each input sample.

  • size (int) – Resample size.

  • p (numpy.ndarray[float]) – The probabilities associated with each entry in sims. If not given, a uniform distribution over all entries is assumed.

Returns:

re-sample – Re-sampling result.

Return type:

list of object

unique(x)[source]
Parameters:

x (pd.DataFrame) – Input samples.

Returns:

  • unique (pd.DataFrame) – The sorted unique samples.

  • unique_counts (np.ndarray of int) – The number of times each of the unique values comes up in the original array
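
Extracting sorted unique samples together with their counts, optionally keyed on a subset of columns, can be sketched without pandas. The sketch below is a simplification: rows are plain dicts instead of DataFrame rows, the function name is hypothetical, and only the key columns are returned for each unique sample.

```python
from collections import Counter

def unique_rows(rows, sample_col=None):
    """Return (sorted unique keys, counts) for a list of dict rows.

    sample_col: column name or list of names used to decide uniqueness;
    None means all columns are used.
    """
    if isinstance(sample_col, str):
        sample_col = [sample_col]

    def key(row):
        cols = sorted(row) if sample_col is None else sample_col
        return tuple(row[c] for c in cols)

    counts = Counter(key(row) for row in rows)
    uniques = sorted(counts)
    return uniques, [counts[u] for u in uniques]

rows = [
    {"smiles": "CCO", "score": 1},
    {"smiles": "CCO", "score": 2},
    {"smiles": "CCN", "score": 1},
]
uniques, counts = unique_rows(rows, sample_col="smiles")
```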

property estimator
property modifier
property timer

xenonpy.inverse.iqspr.modifier module

exception xenonpy.inverse.iqspr.modifier.GetProbError(tmp_str, i_b, i_r)[source]

Bases: ProposalError

exception xenonpy.inverse.iqspr.modifier.MolConvertError(new_smi)[source]

Bases: ProposalError

exception xenonpy.inverse.iqspr.modifier.NGramTrainingError(error, smi)[source]

Bases: ProposalError

class xenonpy.inverse.iqspr.modifier.NGram(*, ngram_table=None, sample_order=(1, 10), del_range=(1, 10), min_len=1, max_len=1000, reorder_prob=0)[source]

Bases: BaseProposal

N-gram

Parameters:
  • ngram_table (NGram table) – NGram table used to modify SMILES.

  • sample_order (tuple[int, int] or int) – Range of NGram table orders used during proposal. When given an int, sample_order = (1, int).

  • del_range (tuple[int, int] or int) – Range of the random deletion applied to the SMILES string during proposal. When given an int, del_range = (1, int).

  • min_len (int) – Minimum length of the extended SMILES; should be smaller than the lower bound of sample_order.

  • max_len (int) – Maximum length of the extended SMILES; modification stops once this length is reached.

  • reorder_prob (float) – Probability of the SMILES being reordered during proposal.

classmethod add_char(ext_smi, next_char)[source]
classmethod del_char(ext_smi, n_char)[source]
classmethod esmi2smi(ext_smi)[source]
fit(smiles, *, train_order=(1, 10))[source]
Parameters:
  • smiles (list[str]) – SMILES for training.

  • train_order (tuple[int, int] or int) – Range of orders used when training an NGram table. When given an int, train_order = (1, int). train_order[0] must be > 0.
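
Training over a range of orders can be pictured with a toy character-level n-gram counter. This is a sketch of the general technique only: the real class works on extended SMILES, which this example does not attempt to reproduce, and normalize_order and train_char_ngrams are hypothetical names.

```python
from collections import Counter, defaultdict

def normalize_order(order):
    """Expand an int n into (1, n); pass a (low, high) tuple through."""
    if isinstance(order, int):
        order = (1, order)
    low, high = order
    if low < 1 or high < low:
        raise ValueError("order must satisfy 1 <= low <= high")
    return low, high

def train_char_ngrams(strings, train_order=(1, 3)):
    """Count next-character frequencies for each order in train_order.

    An order-n entry maps a context of n - 1 characters to a Counter
    of the characters that follow it.
    """
    low, high = normalize_order(train_order)
    table = defaultdict(Counter)
    for s in strings:
        for n in range(low, high + 1):
            for i in range(len(s) - n + 1):
                context, next_char = s[i:i + n - 1], s[i + n - 1]
                table[context][next_char] += 1
    return table

table = train_char_ngrams(["CCO", "CCN"], train_order=2)  # same as (1, 2)
```

In this toy table, the empty context holds the order-1 (unigram) counts and the context "C" holds the order-2 counts of characters that follow "C".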

get_prob(tmp_str, iB, iR)[source]
merge_table(*ngram_tab, weight=1, overwrite=True)[source]

Merge with a given NGram table

Parameters:
  • ngram_tab (NGram) – The table(s) of the given NGram instance(s) will be merged into the table of self.

  • weight (int/float or list/tuple/np.array/pd.Series[int/float]) – A scalar or vector used to scale the frequencies in the given NGram table(s) before merging. A vector must have the same length as ngram_tab.

  • overwrite (bool) – Whether to overwrite the original table (self). Setting this to False is not recommended because it may cause memory issues.

Returns:

tmp_n_gram – merged NGram tables

Return type:

NGram
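
A weighted merge of frequency tables can be sketched with nested counters. The representation (context string mapped to a Counter of next characters) and the function name are assumptions made for illustration; the actual NGram table layout is not described on this page.

```python
from collections import Counter

def merge_tables(base, *others, weight=1, overwrite=True):
    """Add the counts of `others` into `base`, scaled by `weight`.

    weight: one scalar for all tables, or one value per table.
    overwrite: if False, `base` is copied first and left untouched.
    """
    if isinstance(weight, (int, float)):
        weights = [weight] * len(others)
    else:
        weights = list(weight)
    if len(weights) != len(others):
        raise ValueError("weight must have one entry per table")

    target = base if overwrite else {ctx: Counter(c) for ctx, c in base.items()}
    for other, w in zip(others, weights):
        for context, counts in other.items():
            bucket = target.setdefault(context, Counter())
            for char, n in counts.items():
                bucket[char] += n * w
    return target

a = {"C": Counter({"C": 2})}
b = {"C": Counter({"C": 1, "O": 3}), "N": Counter({"C": 1})}
merged = merge_tables(a, b, weight=2, overwrite=False)
```

Copying the base table when overwrite is False doubles the memory held for it, which is one plausible reading of the memory warning above.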

modify(ext_smi)[source]
on_errors(error)[source]
Parameters:

error (ProposalError) – Error object.

proposal(smiles)[source]

Propose new SMILES based on the given SMILES. Make sure you always check the train_order against sample_order before using the proposal!

Parameters:

smiles (list of SMILES) – Given SMILES for modification.

Returns:

new_smiles – The proposed SMILES from the given SMILES.

Return type:

list of SMILES

remove_table(max_order=None)[source]

Remove higher-order entries from the NGram table.

Parameters:

max_order (int) – Maximum order to keep in the table; the rest is removed.

classmethod reorder_esmi(ext_smi)[source]
sample_next_char(ext_smi)[source]
classmethod smi2esmi(smi)[source]
classmethod smi2list(smiles)[source]
split_table(cut_order)[source]

Split NGram table into two

Parameters:

cut_order (int) – split NGram table between cut_order and cut_order+1

Returns:

  • n_gram1 (NGram)

  • n_gram2 (NGram)
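
Splitting a table at cut_order can be sketched on a toy table in which the order of an entry is the context length plus one. Both the table layout and the function name are assumptions for illustration, not the library's data structure.

```python
from collections import Counter

def split_by_order(table, cut_order):
    """Split a context -> Counter table into orders <= cut_order and > cut_order."""
    low, high = {}, {}
    for context, counts in table.items():
        order = len(context) + 1  # assumed: order n uses n - 1 context characters
        (low if order <= cut_order else high)[context] = Counter(counts)
    return low, high

table = {
    "": Counter({"C": 4}),    # order 1
    "C": Counter({"O": 1}),   # order 2
    "CC": Counter({"N": 1}),  # order 3
}
low_part, high_part = split_by_order(table, cut_order=2)
```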

validator(ext_smi)[source]
property del_range
property max_len
property min_len
property ngram_table
property reorder_prob
property sample_order
property timer

Module contents

exception xenonpy.inverse.iqspr.GetProbError(tmp_str, i_b, i_r)[source]

Bases: ProposalError

exception xenonpy.inverse.iqspr.MolConvertError(new_smi)[source]

Bases: ProposalError

exception xenonpy.inverse.iqspr.NGramTrainingError(error, smi)[source]

Bases: ProposalError

class xenonpy.inverse.iqspr.GaussianLogLikelihood(descriptor, *, targets={}, **estimators)[source]

Bases: BaseLogLikelihood

Gaussian log-likelihood.

Parameters:
  • descriptor (BaseFeaturizer or BaseDescriptor) – Descriptor calculator.

  • estimators (BaseEstimator) –

    Gaussian estimators that follow the scikit-learn style. Each estimator must provide a method named predict that accepts descriptors as input and returns (mean, std), in that order. By default, BayesianRidge is used.

  • targets (dict) – Lower and upper bounds for each property, used to calculate the Gaussian CDF probability.

fit(smiles, y=None, *, X_scaler=None, y_scaler=None, **kwargs)[source]

By default, data rows containing NaN are removed automatically.

Parameters:
  • smiles (list[str]) – SMILES for training.

  • y (pandas.DataFrame) – Target properties for training.

  • X_scaler (Scaler (optional, not implemented)) – Scaler for transforming X.

  • y_scaler (Scaler (optional, not implemented)) – Scaler for transforming y.

  • kwargs (dict) – Parameters passed to the BayesianRidge initializer.

log_likelihood(smis, *, log_0=-1000.0, **targets)[source]

Log likelihood

Parameters:
  • smis (list[object]) – Input samples for likelihood calculation.

  • targets (tuple[float, float]) – Target area for each property, given as a (lower, upper) tuple. For example, target1=(10, 20) means target1 should fall in the range [10, 20].

Returns:

log_likelihood – Estimated log-likelihood of each sample’s property values. Cannot be pd.Series!

Return type:

pd.DataFrame of float (columns: properties, rows: samples)

predict(smiles, **kwargs)[source]
remove_estimator(*properties)[source]

Remove estimators from estimator set.

Parameters:

properties (str) – Names of the properties whose estimators will be removed from the estimator set.

update_targets(*, reset=False, **targets)[source]

Update/set the target area.

Parameters:
  • reset (bool) – If True, reset the target area.

  • targets (tuple[float, float]) – Target area for each property, given as a (lower, upper) tuple. For example, target1=(10, 20) means target1 should fall in the range [10, 20].

property timer
class xenonpy.inverse.iqspr.IQSPR(*, estimator, modifier, r_ESS=1)[source]

Bases: BaseSMC

SMC iqspr runner (assume data type of samples = list or np.array).

Parameters:
  • estimator (BaseLogLikelihood or BaseLogLikelihoodSet) – Log likelihood estimator for given input samples.

  • modifier (BaseProposal) – Modify given input samples to new ones.

  • r_ESS (float) – r_ESS * sample_size is the upper threshold of the ESS (effective sample size) used in SMC resampling. Resampling happens only if the calculated ESS is smaller than or equal to this threshold. Since 1 <= ESS <= sample_size, any r_ESS < 1/sample_size means resampling never happens, and any r_ESS >= 1 means resampling always happens. Default is 1, i.e., resample at each step of SMC.

resample(sims, freq, size, p)[source]

Re-sample from given samples.

Parameters:
  • sims (list[object]) – Input samples to re-sample from. Can be changed to accept other data types.

  • freq (list[int]) – Frequency of each input sample.

  • size (int) – Resample size.

  • p (numpy.ndarray[float]) – The probabilities associated with each entry in sims. If not given, a uniform distribution over all entries is assumed.

Returns:

re-sample – Re-sampling result.

Return type:

list of object

property estimator
property modifier
property timer
class xenonpy.inverse.iqspr.IQSPR4DF(*, estimator, modifier, r_ESS=1, sample_col=None)[source]

Bases: BaseSMC

SMC iqspr runner (assume data type of samples = pd.DataFrame).

Parameters:
  • estimator (BaseLogLikelihood or BaseLogLikelihoodSet) – Log likelihood estimator for given input samples.

  • modifier (BaseProposal) – Modify given input samples to new ones.

  • r_ESS (float) – r_ESS * sample_size is the upper threshold of the ESS (effective sample size) used in SMC resampling. Resampling happens only if the calculated ESS is smaller than or equal to this threshold. Since 1 <= ESS <= sample_size, any r_ESS < 1/sample_size means resampling never happens, and any r_ESS >= 1 means resampling always happens. Default is 1, i.e., resample at each step of SMC.

  • sample_col (list or str) – Name(s) of columns that will be used to extract unique samples in the unique function. Default is None, which means all columns are used.

resample(sims, freq, size, p)[source]

Re-sample from given samples.

Parameters:
  • sims (list[object]) – Input samples to re-sample from. Can be changed to accept other data types.

  • freq (list[int]) – Frequency of each input sample.

  • size (int) – Resample size.

  • p (numpy.ndarray[float]) – The probabilities associated with each entry in sims. If not given, a uniform distribution over all entries is assumed.

Returns:

re-sample – Re-sampling result.

Return type:

list of object

unique(x)[source]
Parameters:

x (pd.DataFrame) – Input samples.

Returns:

  • unique (pd.DataFrame) – The sorted unique samples.

  • unique_counts (np.ndarray of int) – The number of times each of the unique values comes up in the original array

property estimator
property modifier
property timer
class xenonpy.inverse.iqspr.NGram(*, ngram_table=None, sample_order=(1, 10), del_range=(1, 10), min_len=1, max_len=1000, reorder_prob=0)[source]

Bases: BaseProposal

N-gram

Parameters:
  • ngram_table (NGram table) – NGram table used to modify SMILES.

  • sample_order (tuple[int, int] or int) – Range of NGram table orders used during proposal. When given an int, sample_order = (1, int).

  • del_range (tuple[int, int] or int) – Range of the random deletion applied to the SMILES string during proposal. When given an int, del_range = (1, int).

  • min_len (int) – Minimum length of the extended SMILES; should be smaller than the lower bound of sample_order.

  • max_len (int) – Maximum length of the extended SMILES; modification stops once this length is reached.

  • reorder_prob (float) – Probability of the SMILES being reordered during proposal.

classmethod add_char(ext_smi, next_char)[source]
classmethod del_char(ext_smi, n_char)[source]
classmethod esmi2smi(ext_smi)[source]
fit(smiles, *, train_order=(1, 10))[source]
Parameters:
  • smiles (list[str]) – SMILES for training.

  • train_order (tuple[int, int] or int) – Range of orders used when training an NGram table. When given an int, train_order = (1, int). train_order[0] must be > 0.

get_prob(tmp_str, iB, iR)[source]
merge_table(*ngram_tab, weight=1, overwrite=True)[source]

Merge with a given NGram table

Parameters:
  • ngram_tab (NGram) – The table(s) of the given NGram instance(s) will be merged into the table of self.

  • weight (int/float or list/tuple/np.array/pd.Series[int/float]) – A scalar or vector used to scale the frequencies in the given NGram table(s) before merging. A vector must have the same length as ngram_tab.

  • overwrite (bool) – Whether to overwrite the original table (self). Setting this to False is not recommended because it may cause memory issues.

Returns:

tmp_n_gram – merged NGram tables

Return type:

NGram

modify(ext_smi)[source]
on_errors(error)[source]
Parameters:

error (ProposalError) – Error object.

proposal(smiles)[source]

Propose new SMILES based on the given SMILES. Make sure you always check the train_order against sample_order before using the proposal!

Parameters:

smiles (list of SMILES) – Given SMILES for modification.

Returns:

new_smiles – The proposed SMILES from the given SMILES.

Return type:

list of SMILES

remove_table(max_order=None)[source]

Remove higher-order entries from the NGram table.

Parameters:

max_order (int) – Maximum order to keep in the table; the rest is removed.

classmethod reorder_esmi(ext_smi)[source]
sample_next_char(ext_smi)[source]
classmethod smi2esmi(smi)[source]
classmethod smi2list(smiles)[source]
split_table(cut_order)[source]

Split NGram table into two

Parameters:

cut_order (int) – split NGram table between cut_order and cut_order+1

Returns:

  • n_gram1 (NGram)

  • n_gram2 (NGram)

validator(ext_smi)[source]
property del_range
property max_len
property min_len
property ngram_table
property reorder_prob
property sample_order
property timer