xenonpy.inverse.iqspr package

Submodules

xenonpy.inverse.iqspr.estimator module

class xenonpy.inverse.iqspr.estimator.GaussianLogLikelihood(descriptor, *, targets={}, **estimators)[source]

Bases: BaseLogLikelihood

Gaussian log-likelihood.

Parameters:

descriptor (BaseFeaturizer or BaseDescriptor) – Descriptor calculator.
estimators (BaseEstimator) –
Gaussian estimators in the scikit-learn style. Each estimator must provide a predict method that accepts descriptors as input and returns (mean, std), in that order. By default, BayesianRidge is used.
targets (dict) – Lower and upper bounds for each property, used to calculate the Gaussian CDF probability.

fit(smiles, y=None, *, X_scaler=None, y_scaler=None, **kwargs)[source]

By default, data rows containing NaN are removed automatically.

Parameters:

smiles (list[str]) – SMILES for training.
y (pandas.DataFrame) – Target properties for training.
X_scaler (Scaler (optional, not implemented)) – Scaler for transforming X.
y_scaler (Scaler (optional, not implemented)) – Scaler for transforming y.
kwargs (dict) – Parameters passed to the BayesianRidge initialization.

log_likelihood(smis, *, log_0=-1000.0, **targets)[source]

Log likelihood

Parameters:

smis (list[object]) – Input samples for likelihood calculation.
targets (tuple[float, float]) – Target area, given as a tuple of lower and upper bounds, e.g. target1=(10, 20) means target1 should lie in the range [10, 20].

Returns:

log_likelihood – Estimated log-likelihood of each sample’s property values. Must not be a pd.Series.

Return type:

pd.DataFrame of float (col - properties, row - samples)
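
The target-range likelihood described above can be illustrated with a minimal, standard-library sketch. It assumes each estimator yields a predictive mean and standard deviation, and that the likelihood of a property is the Gaussian probability mass inside [lower, upper]. The helper names and the use of log_0 as a floor for zero probability are illustrative assumptions, not XenonPy's implementation.

```python
import math


def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of N(mean, std**2) at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def interval_log_likelihood(mean, std, target, log_0=-1000.0):
    """Log of the Gaussian probability that a property lies in target.

    Hypothetical helper: `target` is a (lower, upper) tuple, and `log_0`
    is returned when the probability underflows to zero (an assumption
    about how such a floor might be used).
    """
    lower, upper = target
    prob = gaussian_cdf(upper, mean, std) - gaussian_cdf(lower, mean, std)
    return math.log(prob) if prob > 0.0 else log_0


# A prediction centred in the target range scores higher than one far away.
inside = interval_log_likelihood(mean=15.0, std=2.0, target=(10, 20))
outside = interval_log_likelihood(mean=100.0, std=2.0, target=(10, 20))
```

A prediction whose mean sits inside the range with a small standard deviation gives a log-likelihood close to 0, while one far outside the range falls back to the floor value.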

predict(smiles, **kwargs)[source]

remove_estimator(*properties)[source]

Remove estimators from estimator set.

Parameters: properties (str) – Names of the properties whose estimators will be removed from the estimator set.

update_targets(*, reset=False, **targets)[source]

Update/set the target area.

Parameters:

reset (bool) – If True, reset the target area.
targets (tuple[float, float]) – Target area, given as a tuple of lower and upper bounds, e.g. target1=(10, 20) means target1 should lie in the range [10, 20].

property timer

xenonpy.inverse.iqspr.iqspr module

class xenonpy.inverse.iqspr.iqspr.IQSPR(*, estimator, modifier, r_ESS=1)[source]

Bases: BaseSMC

SMC iqspr runner (assume data type of samples = list or np.array).

Parameters:

estimator (BaseLogLikelihood or BaseLogLikelihoodSet) – Log likelihood estimator for given input samples.
modifier (BaseProposal) – Modify given input samples to new ones.
r_ESS (float) – r_ESS*sample_size is the upper threshold of the ESS (effective sample size) used in SMC resampling. Resampling happens only if the calculated ESS is smaller than or equal to this threshold. As 1 <= ESS <= sample_size, any r_ESS < 1/sample_size means resampling never happens; any r_ESS >= 1 means resampling always happens. Default is 1, i.e., resample at each step of SMC.
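
The r_ESS rule above can be sketched in a few lines. This example assumes the commonly used definition ESS = 1 / sum(w_i^2) over normalized weights; the source does not state which formula the class uses, and both function names are hypothetical.

```python
def effective_sample_size(weights):
    """ESS = 1 / sum(w_i**2) for weights normalized to sum to 1."""
    total = sum(weights)
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


def should_resample(weights, r_ess=1.0):
    """Resample when ESS <= r_ess * sample_size (hypothetical helper)."""
    return effective_sample_size(weights) <= r_ess * len(weights)


uniform = [0.25, 0.25, 0.25, 0.25]   # ESS equals the sample size (4)
degenerate = [1.0, 0.0, 0.0, 0.0]    # ESS equals 1
```

With uniform weights the ESS reaches its maximum, so only r_ess >= 1 triggers resampling; with all weight on one sample the ESS is 1, so any r_ess >= 1/sample_size triggers it.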

resample(sims, freq, size, p)[source]

Re-sample from given samples.

Parameters:

sims (list[object]) – Input samples for likelihood calculation. Can be changed to accept other data types.
freq (list[int]) – Frequency of each input sample.
size (int) – Resample size.
p (numpy.ndarray[float]) – The probabilities associated with each entry in sims. If not given, a uniform distribution over all entries is assumed.

Returns:

re-sample – Re-sampling result.

Return type:

list of object
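
As a rough illustration of drawing a new sample set according to p, the sketch below uses the standard library only. It is a hypothetical stand-in rather than the library's method: the freq argument is deliberately left out, and the seed parameter exists only to make the example repeatable.

```python
import random


def resample(samples, size, p=None, seed=None):
    """Draw `size` items from `samples` with replacement.

    Hypothetical stand-in: `p` holds one probability per sample; when it is
    None, every sample is equally likely.
    """
    rng = random.Random(seed)
    return rng.choices(samples, weights=p, k=size)


smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
picked = resample(smiles, size=5, p=[0.7, 0.2, 0.1], seed=0)
```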

property estimator

property modifier

property timer

xenonpy.inverse.iqspr.iqspr4df module

class xenonpy.inverse.iqspr.iqspr4df.IQSPR4DF(*, estimator, modifier, r_ESS=1, sample_col=None)[source]

Bases: BaseSMC

SMC iqspr runner (assume data type of samples = pd.DataFrame).

Parameters:

estimator (BaseLogLikelihood or BaseLogLikelihoodSet) – Log likelihood estimator for given input samples.
modifier (BaseProposal) – Modify given input samples to new ones.
r_ESS (float) – r_ESS*sample_size is the upper threshold of the ESS (effective sample size) used in SMC resampling. Resampling happens only if the calculated ESS is smaller than or equal to this threshold. As 1 <= ESS <= sample_size, any r_ESS < 1/sample_size means resampling never happens; any r_ESS >= 1 means resampling always happens. Default is 1, i.e., resample at each step of SMC.
sample_col (list or str) – Name(s) of columns that will be used to extract unique samples in the unique function. Default is None, which means all columns are used.

resample(sims, freq, size, p)[source]

Re-sample from given samples.

Parameters:

sims (list[object]) – Input samples for likelihood calculation. Can be changed to accept other data types.
freq (list[int]) – Frequency of each input sample.
size (int) – Resample size.
p (numpy.ndarray[float]) – The probabilities associated with each entry in sims. If not given, a uniform distribution over all entries is assumed.

Returns:

re-sample – Re-sampling result.

Return type:

list of object

unique(x)[source]

Parameters:

x (pd.DataFrame) – Input samples.

Returns:

unique (pd.DataFrame) – The sorted unique samples.
unique_counts (np.ndarray of int) – The number of times each unique sample occurs in the original input.
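
The idea of extracting sorted unique samples together with their counts, optionally restricted to the columns named by sample_col, can be sketched without pandas. In this hypothetical helper, rows are plain dicts standing in for DataFrame rows, so the return values are lists rather than a pd.DataFrame and an np.ndarray.

```python
from collections import Counter


def unique_rows(rows, sample_col=None):
    """Return sorted unique rows and how often each one occurs.

    Hypothetical helper: `rows` is a list of dicts standing in for DataFrame
    rows; `sample_col` names the column(s) compared, or None for all columns.
    """
    if isinstance(sample_col, str):
        sample_col = [sample_col]
    keys = [
        tuple(row[c] for c in (sample_col or sorted(row)))
        for row in rows
    ]
    counts = Counter(keys)
    unique = sorted(counts)
    return unique, [counts[k] for k in unique]


rows = [
    {"SMILES": "CCO", "batch": 1},
    {"SMILES": "CC", "batch": 1},
    {"SMILES": "CCO", "batch": 2},
]
unique, unique_counts = unique_rows(rows, sample_col="SMILES")
```

Restricting the comparison to the SMILES column treats the first and third rows as the same sample; comparing all columns keeps them distinct.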

property estimator

property modifier

property timer

xenonpy.inverse.iqspr.modifier module

exception xenonpy.inverse.iqspr.modifier.GetProbError(tmp_str, i_b, i_r)[source]: Bases: ProposalError

exception xenonpy.inverse.iqspr.modifier.MolConvertError(new_smi)[source]: Bases: ProposalError

exception xenonpy.inverse.iqspr.modifier.NGramTrainingError(error, smi)[source]: Bases: ProposalError

class xenonpy.inverse.iqspr.modifier.NGram(*, ngram_table=None, sample_order=(1, 10), del_range=(1, 10), min_len=1, max_len=1000, reorder_prob=0)[source]

Bases: BaseProposal

N-Gram

Parameters:

ngram_table (NGram table) – NGram table for modifying SMILES.
sample_order (tuple[int, int] or int) – Range of orders of the NGram table used during proposal. When given an int, sample_order = (1, int).
del_range (tuple[int, int] or int) – Range of random deletion of the SMILES string during proposal. When given an int, del_range = (1, int).
min_len (int) – Minimum length of the extended SMILES; should be smaller than the lower bound of sample_order.
max_len (int) – Maximum length of the extended SMILES, beyond which modification is terminated.
reorder_prob (float) – Probability of the SMILES being reordered during proposal.

classmethod add_char(ext_smi, next_char)[source]

classmethod del_char(ext_smi, n_char)[source]

classmethod esmi2smi(ext_smi)[source]

fit(smiles, *, train_order=(1, 10))[source]

Parameters:

smiles (list[str]) – SMILES for training.
train_order (tuple[int, int] or int) – Range of orders used when training an NGram table. When given an int, train_order = (1, int); train_order[0] must be > 0.
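
The int shorthand for order ranges can be made concrete with a small sketch. Both helpers are hypothetical. The compatibility rule, that the sampled orders must fall inside the trained orders, is an assumption about what a sensible consistency check between train_order and sample_order looks like, not a documented behaviour.

```python
def as_order_range(order):
    """Expand an int n to (1, n); pass a (low, high) tuple through."""
    if isinstance(order, int):
        order = (1, order)
    low, high = order
    if low < 1 or high < low:
        raise ValueError(f"invalid order range: {order}")
    return low, high


def orders_compatible(train_order, sample_order):
    """True when the sampled orders fall inside the trained orders."""
    t_low, t_high = as_order_range(train_order)
    s_low, s_high = as_order_range(sample_order)
    return t_low <= s_low and s_high <= t_high
```

For instance, a table trained with train_order=(1, 10) would pass this check for sample_order=5, whereas a table trained with train_order=5 would not for sample_order=(1, 10).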

get_prob(tmp_str, iB, iR)[source]

merge_table(*ngram_tab, weight=1, overwrite=True)[source]

Merge with a given NGram table

Parameters:

ngram_tab (NGram) – The table(s) in the given NGram object(s) will be merged into the table in self.
weight (int/float or list/tuple/np.array/pd.Series[int/float]) – A scalar or vector used to scale the frequencies in the given NGram table(s) to be merged. A vector must have the same length as ngram_tab.
overwrite (boolean) – Whether to overwrite the original table (self). Setting this to False is not recommended (may cause memory issues).

Returns:

tmp_n_gram – merged NGram tables

Return type:

NGram
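
Weighted merging of frequency tables can be illustrated as follows. The table layout here (a dict mapping a context string to a Counter of next-character counts) is a hypothetical simplification and not XenonPy's internal format. The sketch always returns a new table, loosely corresponding to overwrite=False.

```python
from collections import Counter


def merge_tables(base, *others, weight=1):
    """Add weighted n-gram counts from `others` into a copy of `base`.

    Hypothetical layout: each table maps a context string to a Counter of
    next-character counts. `weight` is a scalar or one value per table.
    """
    if isinstance(weight, (int, float)):
        weights = [weight] * len(others)
    else:
        weights = list(weight)
    if len(weights) != len(others):
        raise ValueError("weight must match the number of tables")

    merged = {ctx: Counter(nxt) for ctx, nxt in base.items()}
    for table, w in zip(others, weights):
        for ctx, nxt in table.items():
            target = merged.setdefault(ctx, Counter())
            for char, count in nxt.items():
                target[char] += w * count
    return merged


table_a = {"C": Counter({"C": 3, "O": 1})}
table_b = {"C": Counter({"O": 2}), "O": Counter({"C": 1})}
merged = merge_tables(table_a, table_b, weight=2)
```

Scaling the incoming counts lets one table dominate or be down-weighted relative to the other before the frequencies are combined.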

modify(ext_smi)[source]

on_errors(error)[source]

Parameters: error (ProposalError) – Error object.

proposal(smiles)[source]

Propose new SMILES based on the given SMILES. Make sure you always check the train_order against sample_order before using the proposal!

Parameters: smiles (list of SMILES) – Given SMILES for modification.
Returns: new_smiles – The proposed SMILES from the given SMILES.
Return type: list of SMILES

remove_table(max_order=None)[source]

Remove higher-order tables from the NGram table.

Parameters: max_order (int) – Maximum order to keep in the table; the rest are removed.

classmethod reorder_esmi(ext_smi)[source]

sample_next_char(ext_smi)[source]

classmethod smi2esmi(smi)[source]

classmethod smi2list(smiles)[source]

split_table(cut_order)[source]

Split NGram table into two

Parameters:

cut_order (int) – split NGram table between cut_order and cut_order+1

Returns:

n_gram1 (NGram)
n_gram2 (NGram)
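
Splitting between cut_order and cut_order+1 can be pictured with a plain list in which position i holds the table for order i+1. That layout, and the helper name, are assumptions made for illustration only.

```python
def split_by_order(tables, cut_order):
    """Split a list of per-order tables between cut_order and cut_order + 1.

    Hypothetical layout: tables[0] holds order 1, tables[1] order 2, etc.
    """
    if not 1 <= cut_order < len(tables):
        raise ValueError("cut_order must leave at least one order on each side")
    return tables[:cut_order], tables[cut_order:]


orders = ["order-1", "order-2", "order-3", "order-4"]
low, high = split_by_order(orders, cut_order=2)
```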

validator(ext_smi)[source]

property del_range

property max_len

property min_len

property ngram_table

property reorder_prob

property sample_order

property timer

Module contents

exception xenonpy.inverse.iqspr.GetProbError(tmp_str, i_b, i_r)[source]: Bases: ProposalError

exception xenonpy.inverse.iqspr.MolConvertError(new_smi)[source]: Bases: ProposalError

exception xenonpy.inverse.iqspr.NGramTrainingError(error, smi)[source]: Bases: ProposalError

class xenonpy.inverse.iqspr.GaussianLogLikelihood(descriptor, *, targets={}, **estimators)[source]

Bases: BaseLogLikelihood

Gaussian log-likelihood.

Parameters:

descriptor (BaseFeaturizer or BaseDescriptor) – Descriptor calculator.
estimators (BaseEstimator) –
Gaussian estimators in the scikit-learn style. Each estimator must provide a predict method that accepts descriptors as input and returns (mean, std), in that order. By default, BayesianRidge is used.
targets (dict) – Lower and upper bounds for each property, used to calculate the Gaussian CDF probability.

fit(smiles, y=None, *, X_scaler=None, y_scaler=None, **kwargs)[source]

By default, data rows containing NaN are removed automatically.

Parameters:

smiles (list[str]) – SMILES for training.
y (pandas.DataFrame) – Target properties for training.
X_scaler (Scaler (optional, not implemented)) – Scaler for transforming X.
y_scaler (Scaler (optional, not implemented)) – Scaler for transforming y.
kwargs (dict) – Parameters passed to the BayesianRidge initialization.

log_likelihood(smis, *, log_0=-1000.0, **targets)[source]

Log likelihood

Parameters:

smis (list[object]) – Input samples for likelihood calculation.
targets (tuple[float, float]) – Target area, given as a tuple of lower and upper bounds, e.g. target1=(10, 20) means target1 should lie in the range [10, 20].

Returns:

log_likelihood – Estimated log-likelihood of each sample’s property values. Must not be a pd.Series.

Return type:

pd.DataFrame of float (col - properties, row - samples)

predict(smiles, **kwargs)[source]

remove_estimator(*properties)[source]

Remove estimators from estimator set.

Parameters: properties (str) – Names of the properties whose estimators will be removed from the estimator set.

update_targets(*, reset=False, **targets)[source]

Update/set the target area.

Parameters:

reset (bool) – If True, reset the target area.
targets (tuple[float, float]) – Target area, given as a tuple of lower and upper bounds, e.g. target1=(10, 20) means target1 should lie in the range [10, 20].

property timer

class xenonpy.inverse.iqspr.IQSPR(*, estimator, modifier, r_ESS=1)[source]

Bases: BaseSMC

SMC iqspr runner (assume data type of samples = list or np.array).

Parameters:

estimator (BaseLogLikelihood or BaseLogLikelihoodSet) – Log likelihood estimator for given input samples.
modifier (BaseProposal) – Modify given input samples to new ones.
r_ESS (float) – r_ESS*sample_size is the upper threshold of the ESS (effective sample size) used in SMC resampling. Resampling happens only if the calculated ESS is smaller than or equal to this threshold. As 1 <= ESS <= sample_size, any r_ESS < 1/sample_size means resampling never happens; any r_ESS >= 1 means resampling always happens. Default is 1, i.e., resample at each step of SMC.

resample(sims, freq, size, p)[source]

Re-sample from given samples.

Parameters:

sims (list[object]) – Input samples for likelihood calculation. Can be changed to accept other data types.
freq (list[int]) – Frequency of each input sample.
size (int) – Resample size.
p (numpy.ndarray[float]) – The probabilities associated with each entry in sims. If not given, a uniform distribution over all entries is assumed.

Returns:

re-sample – Re-sampling result.

Return type:

list of object

property estimator

property modifier

property timer

class xenonpy.inverse.iqspr.IQSPR4DF(*, estimator, modifier, r_ESS=1, sample_col=None)[source]

Bases: BaseSMC

SMC iqspr runner (assume data type of samples = pd.DataFrame).

Parameters:

estimator (BaseLogLikelihood or BaseLogLikelihoodSet) – Log likelihood estimator for given input samples.
modifier (BaseProposal) – Modify given input samples to new ones.
r_ESS (float) – r_ESS*sample_size is the upper threshold of the ESS (effective sample size) used in SMC resampling. Resampling happens only if the calculated ESS is smaller than or equal to this threshold. As 1 <= ESS <= sample_size, any r_ESS < 1/sample_size means resampling never happens; any r_ESS >= 1 means resampling always happens. Default is 1, i.e., resample at each step of SMC.
sample_col (list or str) – Name(s) of columns that will be used to extract unique samples in the unique function. Default is None, which means all columns are used.

resample(sims, freq, size, p)[source]

Re-sample from given samples.

Parameters:

sims (list[object]) – Input samples for likelihood calculation. Can be changed to accept other data types.
freq (list[int]) – Frequency of each input sample.
size (int) – Resample size.
p (numpy.ndarray[float]) – The probabilities associated with each entry in sims. If not given, a uniform distribution over all entries is assumed.

Returns:

re-sample – Re-sampling result.

Return type:

list of object

unique(x)[source]

Parameters:

x (pd.DataFrame) – Input samples.

Returns:

unique (pd.DataFrame) – The sorted unique samples.
unique_counts (np.ndarray of int) – The number of times each unique sample occurs in the original input.

property estimator

property modifier

property timer

class xenonpy.inverse.iqspr.NGram(*, ngram_table=None, sample_order=(1, 10), del_range=(1, 10), min_len=1, max_len=1000, reorder_prob=0)[source]

Bases: BaseProposal

N-Gram

Parameters:

ngram_table (NGram table) – NGram table for modifying SMILES.
sample_order (tuple[int, int] or int) – Range of orders of the NGram table used during proposal. When given an int, sample_order = (1, int).
del_range (tuple[int, int] or int) – Range of random deletion of the SMILES string during proposal. When given an int, del_range = (1, int).
min_len (int) – Minimum length of the extended SMILES; should be smaller than the lower bound of sample_order.
max_len (int) – Maximum length of the extended SMILES, beyond which modification is terminated.
reorder_prob (float) – Probability of the SMILES being reordered during proposal.

classmethod add_char(ext_smi, next_char)[source]

classmethod del_char(ext_smi, n_char)[source]

classmethod esmi2smi(ext_smi)[source]

fit(smiles, *, train_order=(1, 10))[source]

Parameters:

smiles (list[str]) – SMILES for training.
train_order (tuple[int, int] or int) – Range of orders used when training an NGram table. When given an int, train_order = (1, int); train_order[0] must be > 0.

get_prob(tmp_str, iB, iR)[source]

merge_table(*ngram_tab, weight=1, overwrite=True)[source]

Merge with a given NGram table

Parameters:

ngram_tab (NGram) – The table(s) in the given NGram object(s) will be merged into the table in self.
weight (int/float or list/tuple/np.array/pd.Series[int/float]) – A scalar or vector used to scale the frequencies in the given NGram table(s) to be merged. A vector must have the same length as ngram_tab.
overwrite (boolean) – Whether to overwrite the original table (self). Setting this to False is not recommended (may cause memory issues).

Returns:

tmp_n_gram – merged NGram tables

Return type:

NGram

modify(ext_smi)[source]

on_errors(error)[source]

Parameters: error (ProposalError) – Error object.

proposal(smiles)[source]

Propose new SMILES based on the given SMILES. Make sure you always check the train_order against sample_order before using the proposal!

Parameters: smiles (list of SMILES) – Given SMILES for modification.
Returns: new_smiles – The proposed SMILES from the given SMILES.
Return type: list of SMILES

remove_table(max_order=None)[source]

Remove higher-order tables from the NGram table.

Parameters: max_order (int) – Maximum order to keep in the table; the rest are removed.

classmethod reorder_esmi(ext_smi)[source]

sample_next_char(ext_smi)[source]

classmethod smi2esmi(smi)[source]

classmethod smi2list(smiles)[source]

split_table(cut_order)[source]

Split NGram table into two

Parameters:

cut_order (int) – split NGram table between cut_order and cut_order+1

Returns:

n_gram1 (NGram)
n_gram2 (NGram)

validator(ext_smi)[source]

property del_range

property max_len

property min_len

property ngram_table

property reorder_prob

property sample_order

property timer