xenonpy.descriptor package
Submodules
xenonpy.descriptor.base module
- class xenonpy.descriptor.base.BaseCompositionFeaturizer(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
Base class for composition feature.
- featurize(comp)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- class xenonpy.descriptor.base.BaseDescriptor(*, featurizers='all', on_errors='raise')[source]
Bases:
BaseEstimator, TransformerMixin
Abstract class to organize featurizers. This class can take list-like[object] or pd.DataFrame as input for transformation or fitting. For pd.DataFrame, if any column name matches any group name, the matched group(s) will be calculated with corresponding column(s); otherwise, the pd.DataFrame will be passed on as is.
Examples
class MyDescriptor(BaseDescriptor):

    def __init__(self, n_jobs=-1):
        self.descriptor = SomeFeature1(n_jobs)
        self.descriptor = SomeFeature2(n_jobs)
        self.descriptor = SomeFeature3(n_jobs)
        self.descriptor = SomeFeature4(n_jobs)
- Parameters:
featurizers (Union[List[str], str]) – Specify which featurizer(s) will be used. Default is ‘all’.
on_errors (str) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
- property all_featurizers
- property elapsed
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
- property featurizers
- property on_errors
- property timer
- class xenonpy.descriptor.base.BaseFeaturizer(n_jobs=-1, *, on_errors='raise', return_type='any', target_col=None, parallel_verbose=0)[source]
Bases:
BaseEstimator, TransformerMixin
Abstract class to calculate features from pandas.Series input data. Each entry can be in any format, such as a compound formula or a pymatgen crystal structure, depending on the featurizer implementation.
This class has a structure similar to the matminer BaseFeaturizer but follows stricter conventions. That means you can embed a XenonPy featurizer directly into a matminer BaseFeaturizer implementation:
class MatFeature(BaseFeaturizer):
    def featurize(self, *x):
        return <xenonpy_featurizer>.featurize(*x)
Using a BaseFeaturizer Class
BaseFeaturizer implements sklearn.base.BaseEstimator and sklearn.base.TransformerMixin, which means you can use it in the scikit-learn way:
featurizer = SomeFeaturizer()
features = featurizer.fit_transform(X)
You can also employ the featurizer as part of a scikit-learn Pipeline object. You would then provide your input data as an array to the Pipeline, which would output the features as a pandas.DataFrame.
BaseFeaturizer also lets you retrieve the proper references for a featurizer. __citations__ returns a list of papers that should be cited. __authors__ returns a list of people who wrote the featurizer. Both can also be accessed through the properties citations and authors.
Implementing a New BaseFeaturizer Class
These operations must be implemented for each new featurizer:
featurize - Takes a single material as input and returns the features of that material.
feature_labels - Generates a human-meaningful name for each of the features. Implement this as a property.
We also suggest implementing these two properties:
citations - Returns a list of citations in BibTeX format.
authors - Returns a list of people who implemented the featurizer.
All options of the featurizer must be set by the __init__ function. All options must be listed as keyword arguments with default values, and each value must be saved as a class attribute with the same name or as a property (e.g., argument n should be stored in self.n). These requirements are necessary for compatibility with the get_params and set_params methods of BaseEstimator, which enable easy interoperability with scikit-learn.
featurize() must return a list of features in a numpy.ndarray.
Note
None of these operations should change the state of the featurizer. That is, running each method twice should not produce different results, no class attributes should be changed, and running one operation should not affect the output of another.
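The conventions above can be sketched with a small stand-alone class. This is an illustration only: LengthFeaturizer and its scale option are hypothetical and not part of XenonPy, and the class derives from the scikit-learn base classes directly rather than from BaseFeaturizer, so it shows only the get_params / stateless-featurize contract.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class LengthFeaturizer(TransformerMixin, BaseEstimator):
    """Hypothetical featurizer: maps a string to its scaled length."""

    def __init__(self, scale=1.0):
        # Every option is a keyword argument with a default value,
        # stored under an attribute of the same name.
        self.scale = scale

    @property
    def feature_labels(self):
        return ["scaled_length"]

    def featurize(self, x):
        # Stateless: depends only on x and the constructor options.
        return np.array([len(x) * self.scale])

    def fit(self, X, y=None):
        # Nothing to learn in this sketch.
        return self

    def transform(self, X):
        return np.vstack([self.featurize(x) for x in X])


featurizer = LengthFeaturizer(scale=2.0)
params = featurizer.get_params()                       # provided by BaseEstimator
features = featurizer.fit_transform(["H2O", "NaCl"])   # provided by TransformerMixin
```

Because scale is stored as self.scale, get_params can recover it by inspecting the __init__ signature, which is what makes cloning and grid search work in scikit-learn.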
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core. When set to 0, input X is treated as a single block and passed to Featurizer.featurize directly. The default parallel implementation does not support pd.DataFrame input, so make sure to set n_jobs=0 if the input is a pd.DataFrame.
on_errors (str) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specify the return type. Can be any, custom, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any or custom, the return type depends on multiple factors (see the transform function). Default is any.
target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
parallel_verbose (int) – The verbosity level: if non-zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. Default is 0.
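The three on_errors policies can be illustrated with a short stand-alone sketch. It does not use XenonPy; featurize_with_policy and reciprocal_pair are hypothetical helpers that only mimic the behavior described in the parameter text above.

```python
import math


def featurize_with_policy(featurize, entry, n_labels, on_errors="raise"):
    """Apply `featurize` to one entry under an on_errors policy (illustrative only)."""
    try:
        return list(featurize(entry))
    except Exception as err:
        if on_errors == "nan":
            # One NaN per feature label.
            return [math.nan] * n_labels
        if on_errors == "keep":
            # One copy of the exception object per feature label.
            return [err] * n_labels
        # 'raise': let the exception propagate to the caller.
        raise


def reciprocal_pair(x):
    # Hypothetical featurizer with two feature labels; fails for x == 0.
    return [1.0 / x, 2.0 / x]


ok_row = featurize_with_policy(reciprocal_pair, 4, 2, on_errors="nan")
nan_row = featurize_with_policy(reciprocal_pair, 0, 2, on_errors="nan")
kept_row = featurize_with_policy(reciprocal_pair, 0, 2, on_errors="keep")
```

With ‘nan’ and ‘keep’ the failed entry still occupies one row of the expected width, so the output stays aligned with the input entries.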
- abstract featurize(*x, **kwargs)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- fit(X, y=None, **fit_kwargs)[source]
Update the parameters of this featurizer based on available data.
- Parameters:
X (list of tuples) – training data.
- Returns:
self
- fit_transform(X, y=None, **fit_params)[source]
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (numpy array of shape [n_samples, n_features]) – Training set.
y (numpy array of shape [n_samples]) – Target values.
- Returns:
X_new – Transformed array.
- Return type:
numpy array of shape [n_samples, n_features_new]
- transform(entries, *, return_type=None, target_col=None, **kwargs)[source]
Featurize a list of entries. If featurize takes multiple inputs, supply the inputs as a list of tuples, or use a pd.DataFrame with the parameter target_col to specify the column name(s).
- Parameters:
entries (list-like or pd.DataFrame) – A list of entries to be featurized, or a pd.DataFrame with one specified column. See target_col for details when entries is a pd.DataFrame. Also, make sure n_jobs=0 for pd.DataFrame input.
return_type (str) – Specify the return type. Can be any, custom, array, or df. array or df forces the return type to np.ndarray or pd.DataFrame, respectively. If any, the return type follows predefined rules: (1) if the input type is pd.Series or pd.DataFrame, returns a pd.DataFrame; (2) else if the input type is np.array, returns an np.array; (3) else if n_jobs=0, follows the return of the featurize function; (4) otherwise, returns a list of objects (the outputs of the featurize function). If custom, the return type depends on the featurize function if n_jobs=0, or is a list of objects (the outputs of the featurize function) for other n_jobs values. This is a one-time change that only affects the current transformation. Default is None, which uses the setting from initialization.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. Default is None, which uses the setting from initialization (see __init__ for more information).
- Returns:
- DataFrame
features for each entry.
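The four rules for return_type='any' can be written as a small decision function. This is a sketch of the rule order as documented above, not XenonPy's implementation; resolve_any is a hypothetical name and it returns descriptive labels rather than converted data.

```python
import numpy as np
import pandas as pd


def resolve_any(entries, n_jobs):
    """Return a label for the output container under return_type='any'."""
    if isinstance(entries, (pd.Series, pd.DataFrame)):
        return "pd.DataFrame"               # rule 1: pandas in, DataFrame out
    if isinstance(entries, np.ndarray):
        return "np.ndarray"                 # rule 2: array in, array out
    if n_jobs == 0:
        return "featurize output"           # rule 3: whatever featurize returns
    return "list of featurize outputs"      # rule 4: one object per entry


from_series = resolve_any(pd.Series(["H2O", "NaCl"]), n_jobs=-1)
from_array = resolve_any(np.array(["H2O", "NaCl"]), n_jobs=-1)
from_list_serial = resolve_any(["H2O", "NaCl"], n_jobs=0)
from_list_parallel = resolve_any(["H2O", "NaCl"], n_jobs=2)
```

The checks are ordered, so the input type takes precedence over n_jobs; n_jobs only matters for inputs that are neither pandas objects nor NumPy arrays.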
- property authors
List of implementors of the feature.
- Returns:
(list) each element should either be a string with the author name (e.g., “Anubhav Jain”) or a dictionary with the required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).
- property citations
Citation(s) and reference(s) for this feature.
- Returns:
(list) each element should be a string citation, ideally in BibTeX format.
- abstract property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
- property n_jobs
- property on_errors
- property parallel_verbose
- property return_type
xenonpy.descriptor.cgcnn module
- class xenonpy.descriptor.cgcnn.CrystalGraphFeaturizer(*, max_num_nbr=12, radius=8, atom_feature='origin', n_jobs=-1, on_errors='raise', return_type='any')[source]
Bases:
BaseFeaturizer
This featurizer is a port of the original paper [CGCNN].
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to numpy.ndarray and pandas.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
- featurize(structure)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
xenonpy.descriptor.compositions module
- class xenonpy.descriptor.compositions.Compositions(*, elemental_info=None, n_jobs=-1, featurizers='classic', on_errors='nan')[source]
Bases:
BaseDescriptor
Calculate elemental descriptors from compound’s composition.
- Parameters:
elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number, atomic radius, etc. By default (None), the XenonPy embedded information is used.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core.
featurizers (Union[str, List[str]]) – Names of the featurizers that will be used. Set to classic to be compatible with the old version; this is equivalent to setting featurizers=['WeightedAverage', 'WeightedSum', 'WeightedVariance', 'MaxPooling', 'MinPooling']. Default is ‘classic’.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. When ‘raise’, the exception is re-raised. The default is ‘nan’.
- classic = ['WeightedAverage', 'WeightedSum', 'WeightedVariance', 'MaxPooling', 'MinPooling']
- property timer
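To make the five classic featurizer names concrete, the sketch below computes the corresponding statistics for one composition and one elemental property in plain Python. It does not call XenonPy. The formulas are the usual definitions (average and variance weighted by atomic fraction, sum weighted by raw amount) and are an assumption about what each name means, not a copy of the library code; classic_stats is a hypothetical helper.

```python
def classic_stats(composition, prop):
    """composition: {element: amount}; prop: {element: property value}."""
    total = sum(composition.values())
    # Atomic fractions, i.e. amounts normalized to sum to 1.
    fracs = {el: amt / total for el, amt in composition.items()}
    avg = sum(fracs[el] * prop[el] for el in composition)
    return {
        "WeightedAverage": avg,
        "WeightedSum": sum(amt * prop[el] for el, amt in composition.items()),
        "WeightedVariance": sum(fracs[el] * (prop[el] - avg) ** 2 for el in composition),
        "MaxPooling": max(prop[el] for el in composition),
        "MinPooling": min(prop[el] for el in composition),
    }


# Atomic numbers of hydrogen and oxygen as the elemental property.
atomic_number = {"H": 1, "O": 8}
stats = classic_stats({"H": 2, "O": 1}, atomic_number)
```

In the real descriptor each statistic is computed for every column of elemental_info, so one composition yields (number of statistics) x (number of elemental properties) features.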
- class xenonpy.descriptor.compositions.Counting(*, one_hot_vec=False, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseCompositionFeaturizer
- Parameters:
one_hot_vec (bool) – Set to True to use one-hot-vector encoding.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
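The effect of one_hot_vec can be sketched without XenonPy. The helper below is hypothetical and assumes that the counting mode reports the amount of each element, while the one-hot mode only marks which elements are present; the element list is shortened to three entries for readability.

```python
def count_elements(composition, elements, one_hot_vec=False):
    """Encode a composition over a fixed element order (illustrative only)."""
    if one_hot_vec:
        # Presence indicator: 1 if the element occurs, else 0.
        return [1 if composition.get(el, 0) > 0 else 0 for el in elements]
    # Plain counting: the amount of each element, 0 if absent.
    return [composition.get(el, 0) for el in elements]


elements = ["H", "C", "O"]
counts = count_elements({"H": 2, "O": 1}, elements)
one_hot = count_elements({"H": 2, "O": 1}, elements, one_hot_vec=True)
```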
- class xenonpy.descriptor.compositions.GeometricMean(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseCompositionFeaturizer
- Parameters:
elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number, atomic radius, etc. By default (None), the XenonPy embedded information is used.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
Base class for composition feature.
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
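As a rough illustration of what a composition-weighted geometric mean looks like, the sketch below uses the standard definition exp(sum(w_i * ln f_i) / sum(w_i)). Whether XenonPy uses exactly this form is an assumption; weighted_geometric_mean and the property values are hypothetical.

```python
import math


def weighted_geometric_mean(composition, prop):
    """composition: {element: amount}; prop: {element: positive property value}."""
    total = sum(composition.values())
    # Work in log space: sum of amount-weighted logs, divided by the total amount.
    log_sum = sum(amt * math.log(prop[el]) for el, amt in composition.items())
    return math.exp(log_sum / total)


# (1^2 * 8^1)^(1/3) for a 2:1 composition with property values 1 and 8.
value = weighted_geometric_mean({"H": 2, "O": 1}, {"H": 1.0, "O": 8.0})
```

The log-space form requires strictly positive property values, which is worth keeping in mind when an elemental property can be zero or negative.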
- class xenonpy.descriptor.compositions.HarmonicMean(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseCompositionFeaturizer
- Parameters:
elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number, atomic radius, etc. By default (None), the XenonPy embedded information is used.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
Base class for composition feature.
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
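A composition-weighted harmonic mean can be sketched in the same way, using the standard definition sum(w_i) / sum(w_i / f_i). This is an assumed formula for illustration, not taken from the XenonPy source; weighted_harmonic_mean and the property values are hypothetical.

```python
def weighted_harmonic_mean(composition, prop):
    """composition: {element: amount}; prop: {element: non-zero property value}."""
    total = sum(composition.values())
    # Total amount divided by the amount-weighted sum of reciprocals.
    return total / sum(amt / prop[el] for el, amt in composition.items())


# 3 / (2/1 + 1/8) for a 2:1 composition with property values 1 and 8.
value = weighted_harmonic_mean({"H": 2, "O": 1}, {"H": 1.0, "O": 8.0})
```

The harmonic mean is dominated by the smallest property values, so here it stays close to 1 even though one element has the value 8.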
- class xenonpy.descriptor.compositions.MaxPooling(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseCompositionFeaturizer
- Parameters:
elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number, atomic radius, etc. By default (None), the XenonPy embedded information is used.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
Base class for composition feature.
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
- class xenonpy.descriptor.compositions.MinPooling(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseCompositionFeaturizer
- Parameters:
elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number, atomic radius, etc. By default (None), the XenonPy embedded information is used.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
Base class for composition feature.
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
- class xenonpy.descriptor.compositions.WeightedAverage(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseCompositionFeaturizer
- Parameters:
elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number, atomic radius, etc. By default (None), the XenonPy embedded information is used.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
Base class for composition feature.
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
- class xenonpy.descriptor.compositions.WeightedSum(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseCompositionFeaturizer
- Parameters:
elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number, atomic radius, etc. By default (None), the XenonPy embedded information is used.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
Base class for composition feature.
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
- class xenonpy.descriptor.compositions.WeightedVariance(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseCompositionFeaturizer
- Parameters:
elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number, atomic radius, etc. By default (None), the XenonPy embedded information is used.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X will be split into blocks, and each block runs on one CPU core.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
Base class for composition feature.
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
xenonpy.descriptor.fingerprint module
- class xenonpy.descriptor.fingerprint.AtomPairFP(n_jobs=-1, *, n_bits=2048, bit_per_entry=None, counting=False, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
Atom Pair fingerprints. Returns the atom-pair fingerprint for a molecule. The algorithm used is described here: R.E. Carhart, D.H. Smith, R. Venkataraghavan; “Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications” JCICS 25, 64-73 (1985). This is currently just in binary bits with fixed length after folding.
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).
n_bits (int) – Fixed bit length based on folding.
bit_per_entry (int) – Number of bits used to represent a single entry (only for the non-counting case). The default value follows the RDKit default.
counting (boolean) – Record counts of the entries instead of bits only.
input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method can use a list of SMILES strings as input. Set to any to use both. If the input is SMILES, the Chem.MolFromSmiles function is used internally; if it returns None, a ValueError exception is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(x)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
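The idea of folding a sparse fingerprint to a fixed length of n_bits, and the difference between bit and counting output, can be sketched in plain Python. This is a simplified illustration: the modulo folding below is an assumption chosen for clarity, RDKit's actual hashing differs, and bit_per_entry is ignored. fold_features and the feature ids are hypothetical.

```python
def fold_features(feature_counts, n_bits=2048, counting=False):
    """Fold {feature_id: count} into a vector of length n_bits (illustrative only)."""
    vec = [0] * n_bits
    for feature_id, count in feature_counts.items():
        idx = feature_id % n_bits      # several features may share one position
        if counting:
            vec[idx] += count          # accumulate occurrences
        else:
            vec[idx] = 1               # only record presence
    return vec


# Three hypothetical feature ids that collide when folded to 8 positions.
sparse = {5: 2, 13: 1, 21: 3}
bits = fold_features(sparse, n_bits=8)
counts = fold_features(sparse, n_bits=8, counting=True)
```

Collisions like the one above are why a larger n_bits generally preserves more information, at the cost of a wider feature matrix.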
- class xenonpy.descriptor.fingerprint.DescriptorFeature(n_jobs=-1, *, input_type='mol', on_errors='raise', return_type='any', target_col=None, desc_list='all', add_Hs=False)[source]
Bases:
BaseFeaturizer
- All descriptors in RDKit (length = 200) [may include NaN]
see https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors for the full list
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).
input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method can use a list of SMILES strings as input. Set to any to use both. If the input is SMILES, the Chem.MolFromSmiles function is used internally; if it returns None, a ValueError exception is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
desc_list (string or list) – List of descriptor names to be called in RDKit to calculate molecule descriptors. If classic, the full list of RDKit v.2020.03.xx is used (length = 200). The default is to use the latest list available in RDKit (length = 208 in RDKit v.2020.09.xx).
add_Hs (boolean) – Whether to add hydrogen atoms to the RDKit mol object. This may affect a few physical descriptors (e.g., charge-related ones).
- featurize(x)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- classic = ['MaxEStateIndex', 'MinEStateIndex', 'MaxAbsEStateIndex', 'MinAbsEStateIndex', 'qed', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt', 'NumValenceElectrons', 'NumRadicalElectrons', 'MaxPartialCharge', 'MinPartialCharge', 'MaxAbsPartialCharge', 'MinAbsPartialCharge', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'BalabanJ', 'BertzCT', 'Chi0', 'Chi0n', 'Chi0v', 'Chi1', 'Chi1n', 'Chi1v', 'Chi2n', 'Chi2v', 'Chi3n', 'Chi3v', 'Chi4n', 'Chi4v', 'HallKierAlpha', 'Ipc', 'Kappa1', 'Kappa2', 'Kappa3', 'LabuteASA', 'PEOE_VSA1', 'PEOE_VSA10', 'PEOE_VSA11', 'PEOE_VSA12', 'PEOE_VSA13', 'PEOE_VSA14', 'PEOE_VSA2', 'PEOE_VSA3', 'PEOE_VSA4', 'PEOE_VSA5', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'PEOE_VSA9', 'SMR_VSA1', 'SMR_VSA10', 'SMR_VSA2', 'SMR_VSA3', 'SMR_VSA4', 'SMR_VSA5', 'SMR_VSA6', 'SMR_VSA7', 'SMR_VSA8', 'SMR_VSA9', 'SlogP_VSA1', 'SlogP_VSA10', 'SlogP_VSA11', 'SlogP_VSA12', 'SlogP_VSA2', 'SlogP_VSA3', 'SlogP_VSA4', 'SlogP_VSA5', 'SlogP_VSA6', 'SlogP_VSA7', 'SlogP_VSA8', 'SlogP_VSA9', 'TPSA', 'EState_VSA1', 'EState_VSA10', 'EState_VSA11', 'EState_VSA2', 'EState_VSA3', 'EState_VSA4', 'EState_VSA5', 'EState_VSA6', 'EState_VSA7', 'EState_VSA8', 'EState_VSA9', 'VSA_EState1', 'VSA_EState10', 'VSA_EState2', 'VSA_EState3', 'VSA_EState4', 'VSA_EState5', 'VSA_EState6', 'VSA_EState7', 'VSA_EState8', 'VSA_EState9', 'FractionCSP3', 'HeavyAtomCount', 'NHOHCount', 'NOCount', 'NumAliphaticCarbocycles', 'NumAliphaticHeterocycles', 'NumAliphaticRings', 'NumAromaticCarbocycles', 'NumAromaticHeterocycles', 'NumAromaticRings', 'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRotatableBonds', 'NumSaturatedCarbocycles', 'NumSaturatedHeterocycles', 'NumSaturatedRings', 'RingCount', 'MolLogP', 'MolMR', 'fr_Al_COO', 'fr_Al_OH', 'fr_Al_OH_noTert', 'fr_ArN', 'fr_Ar_COO', 'fr_Ar_N', 'fr_Ar_NH', 'fr_Ar_OH', 'fr_COO', 'fr_COO2', 'fr_C_O', 'fr_C_O_noCOO', 'fr_C_S', 'fr_HOCCN', 'fr_Imine', 'fr_NH0', 'fr_NH1', 'fr_NH2', 'fr_N_O', 'fr_Ndealkylation1', 'fr_Ndealkylation2', 'fr_Nhpyrrole', 
'fr_SH', 'fr_aldehyde', 'fr_alkyl_carbamate', 'fr_alkyl_halide', 'fr_allylic_oxid', 'fr_amide', 'fr_amidine', 'fr_aniline', 'fr_aryl_methyl', 'fr_azide', 'fr_azo', 'fr_barbitur', 'fr_benzene', 'fr_benzodiazepine', 'fr_bicyclic', 'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_ester', 'fr_ether', 'fr_furan', 'fr_guanido', 'fr_halogen', 'fr_hdrzine', 'fr_hdrzone', 'fr_imidazole', 'fr_imide', 'fr_isocyan', 'fr_isothiocyan', 'fr_ketone', 'fr_ketone_Topliss', 'fr_lactam', 'fr_lactone', 'fr_methoxy', 'fr_morpholine', 'fr_nitrile', 'fr_nitro', 'fr_nitro_arom', 'fr_nitro_arom_nonortho', 'fr_nitroso', 'fr_oxazole', 'fr_oxime', 'fr_para_hydroxylation', 'fr_phenol', 'fr_phenol_noOrthoHbond', 'fr_phos_acid', 'fr_phos_ester', 'fr_piperdine', 'fr_piperzine', 'fr_priamide', 'fr_prisulfonamd', 'fr_pyridine', 'fr_quatN', 'fr_sulfide', 'fr_sulfonamd', 'fr_sulfone', 'fr_term_acetylene', 'fr_tetrazole', 'fr_thiazole', 'fr_thiocyan', 'fr_thiophene', 'fr_unbrch_alkane', 'fr_urea']
- property feature_labels
Generate attribute names.
- Returns:
([str]) attribute labels.
- class xenonpy.descriptor.fingerprint.ECFP(n_jobs=-1, *, radius=3, n_bits=2048, counting=False, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
Morgan (Circular) fingerprints (ECFP). The algorithm used is described in the paper: Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. JCIM 50:742-54 (2010).
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).
radius (int) – The radius parameter in the Morgan fingerprints, which is roughly half of the diameter parameter in ECFP, i.e., radius=2 is roughly equivalent to ECFP4.
n_bits (int) – Fixed bit length based on folding.
counting (boolean) – Record counts of the entries instead of bits only.
input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method can use a list of SMILES strings as input. Set to any to use both. If the input is SMILES, the Chem.MolFromSmiles function is used internally; if it returns None, a ValueError exception is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the length of the column corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(x)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
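The radius, n_bits and counting parameters above can be illustrated with a minimal pure-Python sketch. It is not the XenonPy or RDKit implementation: the helper names ecfp_name and fold are hypothetical, and the integer identifiers stand in for hashed substructure entries. It only shows the naming convention (diameter is twice the radius) and what folding into a fixed bit length means.

```python
from collections import Counter


def ecfp_name(radius):
    """Conventional ECFP label for a Morgan radius (diameter = 2 * radius)."""
    return f"ECFP{2 * radius}"


def fold(entry_ids, n_bits=2048, counting=False):
    """Fold arbitrary integer identifiers into a vector of length n_bits.

    With counting=False each position is 0 or 1; with counting=True each
    position holds how many entries landed there.
    """
    vec = [0] * n_bits
    folded = Counter(i % n_bits for i in entry_ids)
    for pos, count in folded.items():
        vec[pos] = count if counting else 1
    return vec


# Hypothetical entry identifiers; 2053 % 2048 == 5, so it collides with 5.
ids = [5, 13, 13, 2053]
bits = fold(ids, n_bits=2048)
counts = fold(ids, n_bits=2048, counting=True)
```

Folding trades uniqueness for a fixed length: distinct entries can share a position, which is why a larger n_bits reduces collisions.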
- class xenonpy.descriptor.fingerprint.FCFP(n_jobs=-1, *, radius=3, n_bits=2048, counting=False, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
Feature-based Morgan (circular) fingerprints (FCFP). The algorithm is described in Rogers, D. & Hahn, M., Extended-Connectivity Fingerprints, JCIM 50:742-54 (2010).
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).
radius (int) – The radius parameter in the Morgan fingerprints, which is roughly half of the diameter parameter in FCFP, i.e., radius=2 is roughly equivalent to FCFP4.
n_bits (int) – Fixed bit length based on folding.
counting (boolean) – Record counts of the entries instead of bits only.
input_type (string) – The type of input accepted by transform. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. SMILES input is converted internally with Chem.MolFromSmiles; if the conversion returns None, a ValueError is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(x)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
- class xenonpy.descriptor.fingerprint.Fingerprints(n_jobs=-1, *, radius=3, n_bits=2048, bit_per_entry=None, counting=False, input_type='mol', featurizers='all', on_errors='raise', target_col=None, desc_list='all', add_Hs=False)[source]
Bases:
BaseDescriptor
Calculate fingerprints or descriptors of organic molecules. Note that MHFP currently does not support parallel computing, so n_jobs is fixed to 1.
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or # of cpus. Set -1 to use all cpu cores (default).
radius (int) – The radius parameter in the Morgan fingerprints, which is roughly half of the diameter parameter in ECFP/FCFP, i.e., radius=2 is roughly equivalent to ECFP4/FCFP4.
n_bits (int) – Fixed bit length based on folding.
bit_per_entry (int) – Number of bits used to represent a single entry (only for non-counting case) in RDKitFP, AtomPairFP, and TopologicalTorsionFP. Default value follows rdkit default.
counting (boolean) – Record counts of the entries instead of bits only.
featurizers (list[str] or str or 'all') – Featurizer(s) that will be used. Default is ‘all’.
input_type (string) – The type of input accepted by transform. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. SMILES input is converted internally with Chem.MolFromSmiles; if the conversion returns None, a ValueError is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
desc_list (string or list) – List of descriptor names to be called in rdkit to calculate molecule descriptors. If classic, the full list of rdkit v.2020.03.xx is used (length = 200). Default is to use the latest list available in the installed rdkit (length = 208 in rdkit v.2020.09.xx).
add_Hs (boolean) – Whether to add hydrogen atoms to the mol object in RDKit. This may affect a few physical descriptors (e.g., charge-related ones) and currently has no effect on fingerprints.
- property timer
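The three on_errors modes described above can be sketched in plain Python. This is only an illustration of the documented behaviour, not XenonPy's code: featurize_all and toy_featurize are hypothetical names, and filling one placeholder per feature label is an assumption based on the parameter description.

```python
import math


def featurize_all(samples, featurize, n_labels, on_errors="raise"):
    """Apply featurize to each sample, handling failures per on_errors."""
    if on_errors not in ("nan", "keep", "raise"):
        raise ValueError(f"unknown on_errors: {on_errors!r}")
    rows = []
    for sample in samples:
        try:
            rows.append(list(featurize(sample)))
        except Exception as exc:
            if on_errors == "nan":
                # One NaN per feature label for the failed sample.
                rows.append([math.nan] * n_labels)
            elif on_errors == "keep":
                # Keep the exception object in place of each feature.
                rows.append([exc] * n_labels)
            else:
                raise
    return rows


def toy_featurize(x):
    """Hypothetical two-feature calculation that fails for x == 0."""
    return [x, 1.0 / x]


nan_rows = featurize_all([2.0, 0.0], toy_featurize, n_labels=2, on_errors="nan")
keep_rows = featurize_all([2.0, 0.0], toy_featurize, n_labels=2, on_errors="keep")
```

‘nan’ and ‘keep’ are convenient for large batches where a few inputs are expected to fail; ‘raise’ is the safer default while debugging a new featurizer.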
- class xenonpy.descriptor.fingerprint.LayeredFP(n_jobs=-1, *, n_bits=2048, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
A substructure fingerprint that is more complex than PatternFP (unique to RDKit).
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).
n_bits (int) – Fixed bit length based on folding.
input_type (string) – The type of input accepted by transform. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. SMILES input is converted internally with Chem.MolFromSmiles; if the conversion returns None, a ValueError is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(x)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
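The input_type behaviour shared by the fingerprint featurizers can be sketched without RDKit as follows. Everything here is a stand-in: ToyMol plays the role of rdkit.Chem.rdchem.Mol, toy_parse plays the role of Chem.MolFromSmiles (returning None on failure), and to_mol is a hypothetical helper that mirrors the documented mol / smiles / any dispatch.

```python
class ToyMol:
    """Stand-in for rdkit.Chem.rdchem.Mol (hypothetical, illustration only)."""

    def __init__(self, smiles):
        self.smiles = smiles


def toy_parse(smiles):
    """Stand-in for Chem.MolFromSmiles: returns None when parsing fails.

    The validity rule (non-empty, no spaces) is a toy assumption.
    """
    if smiles and " " not in smiles:
        return ToyMol(smiles)
    return None


def to_mol(x, input_type="mol"):
    """Convert one input item according to input_type."""
    if input_type not in ("mol", "smiles", "any"):
        raise ValueError(f"unknown input_type: {input_type!r}")
    treat_as_smiles = input_type == "smiles" or (
        input_type == "any" and isinstance(x, str)
    )
    if not treat_as_smiles:
        return x  # already a Mol-like object
    mol = toy_parse(x)
    if mol is None:
        raise ValueError(f"cannot convert SMILES to Mol: {x!r}")
    return mol


existing = ToyMol("C")
from_smiles = to_mol("CCO", input_type="smiles")
passed_through = to_mol(existing, input_type="any")
```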
- class xenonpy.descriptor.fingerprint.MACCS(n_jobs=-1, *, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
The MACCS keys for a molecule. The result is a 167-bit vector. There are 166 public keys, but to maintain consistency with other software packages they are numbered from 1.
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).
input_type (string) – The type of input accepted by transform. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. SMILES input is converted internally with Chem.MolFromSmiles; if the conversion returns None, a ValueError is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(x)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
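The 167-bit layout mentioned in the MACCS description can be made concrete with a small sketch. It assumes the usual reading of that sentence: keys are numbered 1 to 166 and position 0 of the vector is simply left unused. The helper names keys_to_vector and vector_to_keys are hypothetical.

```python
N_MACCS_BITS = 167  # positions 0..166; position 0 is assumed to stay unset


def keys_to_vector(on_keys):
    """Build a 167-element 0/1 vector from MACCS key numbers (1..166)."""
    vec = [0] * N_MACCS_BITS
    for key in on_keys:
        if not 1 <= key <= 166:
            raise ValueError(f"MACCS key out of range: {key}")
        vec[key] = 1  # key k is stored at position k, not k - 1
    return vec


def vector_to_keys(vec):
    """Recover the sorted key numbers that are set in a 167-element vector."""
    return [i for i, bit in enumerate(vec) if bit and i > 0]


vec = keys_to_vector({1, 42, 166})
```

Keeping key k at position k means column indices in the output line up with key numbers quoted in other software, at the cost of one always-empty column.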
- class xenonpy.descriptor.fingerprint.MHFP(n_jobs=1, *, radius=3, n_bits=2048, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
A variation of the MinHash fingerprint, which is based on ECFP with locality-sensitive hashing to increase the compactness of information during hashing. The algorithm is described in Probst, D. & Reymond, J.-L., A probabilistic molecular fingerprint for big data settings, Journal of Cheminformatics, 10:66 (2018).
Note that MHFP currently does not support parallel computing, so please fix n_jobs to 1.
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. MHFP currently does not support parallel computing, so keep this at 1 (default).
radius (int) – The radius parameter in the SECFP (RDKit version) fingerprints, which is roughly half of the diameter parameter in ECFP, i.e., radius=2 is roughly equivalent to ECFP4.
n_bits (int) – Fixed bit length based on folding.
input_type (string) – The type of input accepted by transform. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. SMILES input is converted internally with Chem.MolFromSmiles; if the conversion returns None, a ValueError is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(x)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
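For readers unfamiliar with MinHash, the idea behind this family of fingerprints can be sketched generically. This is not the MHFP algorithm or its RDKit implementation: the seeded BLAKE2 hash, the n_perm parameter, and the substructure strings are all assumptions chosen for illustration. The sketch shows only that a set is summarized by the minimum hash value under several hash functions, and that the fraction of matching positions estimates set similarity.

```python
import hashlib


def _hash(item, seed):
    """Seeded 64-bit hash of a string (one 'hash function' per seed)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(items, n_perm=64):
    """MinHash signature: the minimum hash of the set under each seed."""
    items = set(items)
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    return [min(_hash(it, seed) for it in items) for seed in range(n_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical substructure strings for two molecules.
set_a = {"C", "CC", "CCO", "O"}
set_b = {"C", "CC", "CCN", "N"}
sig_a, sig_b = minhash(set_a), minhash(set_b)
similarity = estimate_jaccard(sig_a, sig_b)
```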
- class xenonpy.descriptor.fingerprint.PatternFP(n_jobs=-1, *, n_bits=2048, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
A fingerprint designed for substructure screening using SMARTS patterns (unique to RDKit).
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).
n_bits (int) – Fixed bit length based on folding.
input_type (string) – The type of input accepted by transform. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. SMILES input is converted internally with Chem.MolFromSmiles; if the conversion returns None, a ValueError is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(x)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
- class xenonpy.descriptor.fingerprint.RDKitFP(n_jobs=-1, *, n_bits=2048, bit_per_entry=None, counting=False, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
RDKit fingerprint.
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).
n_bits (int) – Fingerprint size.
bit_per_entry (int) – Number of bits used to represent a single entry (only for the non-counting case). The default follows the rdkit default.
counting (boolean) – Record counts of the entries instead of bits only.
input_type (string) – The type of input accepted by transform. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. SMILES input is converted internally with Chem.MolFromSmiles; if the conversion returns None, a ValueError is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(x)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
- class xenonpy.descriptor.fingerprint.TopologicalTorsionFP(n_jobs=-1, *, n_bits=2048, bit_per_entry=None, counting=False, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
Topological Torsion fingerprints. Returns the topological-torsion fingerprint for a molecule. This is currently just in binary bits with fixed length after folding.
- Parameters:
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).
n_bits (int) – Fixed bit length based on folding.
bit_per_entry (int) – Number of bits used to represent a single entry (only for the non-counting case). The default follows the rdkit default.
counting (boolean) – Record counts of the entries instead of bits only.
input_type (string) – The type of input accepted by transform. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. SMILES input is converted internally with Chem.MolFromSmiles; if the conversion returns None, a ValueError is raised.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(x)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
xenonpy.descriptor.frozen_featurizer module
- class xenonpy.descriptor.frozen_featurizer.FrozenFeaturizer(model=None, *, cuda=False, depth=None, n_layer=None, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
A featurizer that extracts hidden layers from an NN model.
- Parameters:
model (torch.nn.Module) – Source model.
cuda (bool) – If true, run on GPU.
depth (int) – The depth to be retrieved from the NN model.
n_layer (int) – Number of layers to be retrieved, starting from the given depth.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specifies the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(descriptor, *, depth=None, n_layer=None)[source]
Main featurizer function, which has to be implemented in any derived featurizer subclass.
- Parameters:
x (depends on featurizer) – input data to featurize.
- Returns:
any – one or more features.
- Return type:
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
xenonpy.descriptor.structure module
- class xenonpy.descriptor.structure.OrbitalFieldMatrix(including_d=True, *, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
Representation based on the valence shell electrons of neighboring atoms.
Each atom is described by a 32-element vector uniquely representing the valence subshell. A 32x32 (39x39) matrix is formed by multiplying two atomic vectors. An OFM for an atomic environment is the sum of these matrices over each atom that the center atom coordinates with, multiplied by a distance function (in this case, 1/r times the weight of the coordinating atom in the Voronoi polyhedron).
- Parameters:
including_d (bool) – If true, add distance information.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set -1 to use all cpu cores (default).
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specifies the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(structure, is_including_d=True)[source]
Generate OFM descriptor
- Parameters:
structure (pymatgen.Structure) – The input structure for OFM calculation.
- static get_element_representation(name)[source]
Generate a one-hot representation for an element, e.g., Si = [0.0, 1.0, 0.0, 0.0, …]
- Parameters:
name (string) – element symbol
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
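The construction described for OrbitalFieldMatrix (an outer product of two atomic vectors, scaled by weight / r and summed over coordinating atoms) can be sketched with toy data. This is not the XenonPy implementation: the 4-element vectors are a hypothetical stand-in for the 32-element representation, and the weights and distances are made-up values rather than results of a Voronoi analysis.

```python
def outer(u, v):
    """Outer product of two vectors as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def local_ofm(center, neighbors):
    """Sum of outer(center, neighbor) * weight / r over coordinating atoms.

    neighbors is an iterable of (vector, weight, distance) tuples; all
    vectors are assumed to have the same length as center.
    """
    n = len(center)
    total = [[0.0] * n for _ in range(n)]
    for vec, weight, r in neighbors:
        scale = weight / r
        block = outer(center, vec)
        for i in range(n):
            for j in range(n):
                total[i][j] += scale * block[i][j]
    return total


# Toy 4-element representations (the documented vectors have 32 elements).
center = [0.0, 1.0, 0.0, 1.0]
neighbor_a = [1.0, 0.0, 0.0, 0.0]
neighbor_b = [0.0, 1.0, 1.0, 0.0]
ofm = local_ofm(center, [(neighbor_a, 0.5, 2.0), (neighbor_b, 1.0, 4.0)])
```

Rows of the result correspond to entries of the center atom's vector and columns to entries of its neighbors, so a row is non-zero only where the center atom's vector is non-zero.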
- class xenonpy.descriptor.structure.RadialDistributionFunction(n_bins=201, r_max=20.0, *, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]
Bases:
BaseFeaturizer
Calculate pair distribution descriptor for machine learning.
- Parameters:
n_bins (int) – Number of radial grid points.
r_max (float) – Maximum of radial grid (the minimum is always set zero).
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set -1 to use all cpu cores (default).
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
return_type (str) – Specifies the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- featurize(structure)[source]
Get RDF of the input structure.
- Parameters:
structure – Pymatgen Structure object.
- Returns:
rdf, dist – (tuple of arrays) the first element is the normalized RDF, whereas the second element is the inner radius of the RDF bin.
- property feature_labels
Generate attribute names.
- Returns:
([str]) – attribute labels.
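How n_bins and r_max define the radial grid can be sketched as follows. This is an assumption-laden illustration, not the XenonPy code: it reads "number of radial grid points" as n_bins evenly spaced points from 0 to r_max inclusive (so n_bins - 1 intervals), and it only counts pair distances per interval without the normalization a real RDF applies. The names radial_grid and bin_distances are hypothetical.

```python
def radial_grid(n_bins=201, r_max=20.0):
    """n_bins evenly spaced grid points from 0 to r_max (assumed layout)."""
    step = r_max / (n_bins - 1)
    return [i * step for i in range(n_bins)]


def bin_distances(distances, n_bins=201, r_max=20.0):
    """Count distances per grid interval; values outside [0, r_max) are dropped."""
    step = r_max / (n_bins - 1)
    counts = [0] * (n_bins - 1)
    for d in distances:
        if 0.0 <= d < r_max:
            # Clamp guards against rounding pushing d / step to n_bins - 1.
            counts[min(int(d / step), n_bins - 2)] += 1
    return counts


grid = radial_grid()
# Hypothetical pair distances; 25.0 lies beyond r_max and is ignored.
counts = bin_distances([0.25, 1.05, 19.95, 25.0])
```

With the defaults (n_bins=201, r_max=20.0) this reading gives a grid spacing of 0.1, which is one plausible reason for the odd-looking default of 201.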
- class xenonpy.descriptor.structure.Structures(n_bins=201, r_max=20.0, including_d=True, *, n_jobs=-1, featurizers='all', on_errors='raise', target_col=None)[source]
Bases:
BaseDescriptor
Calculate structure descriptors from a compound’s structure.
- Parameters:
n_bins (int) – Number of radial grid points.
r_max (float) – Maximum of radial grid (the minimum is always set zero).
including_d (bool) – If true, add distance information.
n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set -1 to use all cpu cores (default).
featurizers (list[str] or 'all') – Featurizers that will be used. Default is ‘all’.
on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan; the column length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specifies a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
- property timer