xenonpy.descriptor package

Submodules

xenonpy.descriptor.base module

class xenonpy.descriptor.base.BaseCompositionFeaturizer(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

Base class for composition feature.

featurize(comp)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

abstract mix_function(elems, nums)[source]
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray
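As a rough illustration of how featurize and mix_function relate, the sketch below assumes that a composition is given as a dict mapping element symbols to amounts and that featurize only splits it into elems and nums before delegating. The class names are hypothetical, and plain Python lists stand in for numpy.ndarray.

```python
from abc import ABC, abstractmethod


class SketchCompositionFeaturizer(ABC):
    """Hypothetical stand-in for a composition featurizer base class."""

    def featurize(self, comp):
        # Assumption: `comp` maps element symbols to amounts, e.g. {"H": 2, "O": 1}.
        elems = list(comp.keys())
        nums = list(comp.values())
        return self.mix_function(elems, nums)

    @abstractmethod
    def mix_function(self, elems, nums):
        """Combine per-element information into one descriptor."""


class TotalAtoms(SketchCompositionFeaturizer):
    """Toy subclass: the descriptor is just the total number of atoms."""

    def mix_function(self, elems, nums):
        return [sum(nums)]


water_descriptor = TotalAtoms().featurize({"H": 2, "O": 1})
```

A subclass therefore only needs to supply mix_function; the splitting of the composition stays in one place.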

class xenonpy.descriptor.base.BaseDescriptor(*, featurizers='all', on_errors='raise')[source]

Bases: BaseEstimator, TransformerMixin

Abstract class to organize featurizers. This class can take list-like[object] or pd.DataFrame as input for transformation or fitting. For pd.DataFrame, if any column name matches any group name, the matched group(s) will be calculated with corresponding column(s); otherwise, the pd.DataFrame will be passed on as is.

Examples

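class MyDescriptor(BaseDescriptor):

    def __init__(self, n_jobs=-1):
        super().__init__()
        # Repeated assignment is intentional: each featurizer is added
        # to the group named `descriptor` rather than overwriting it.
        self.descriptor = SomeFeature1(n_jobs)
        self.descriptor = SomeFeature2(n_jobs)
        self.descriptor = SomeFeature3(n_jobs)
        self.descriptor = SomeFeature4(n_jobs)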
Parameters:
  • featurizers (Union[List[str], str]) – Specify which Featurizer(s) will be used. Default is ‘all’.

  • on_errors (str) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.
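The column-matching rule described for pd.DataFrame input can be sketched as follows. This is a minimal illustration, not the library's implementation: a dict of column name to list stands in for a pd.DataFrame, and select_group_inputs is a hypothetical helper.

```python
def select_group_inputs(table, group_names):
    """Pick the input for each featurizer group from a column-oriented table.

    `table` is a dict of column name -> list of entries, standing in for a
    pd.DataFrame. If any column name matches a group name, each matched group
    gets its own column; otherwise every group receives the whole table.
    """
    matched = [name for name in group_names if name in table]
    if matched:
        return {name: table[name] for name in matched}
    return {name: table for name in group_names}


table = {"composition": [{"H": 2, "O": 1}], "notes": ["water"]}
by_group = select_group_inputs(table, ["composition"])
passthrough = select_group_inputs({"formula": ["H2O"]}, ["composition"])
```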

fit(X, y=None, **kwargs)[source]
transform(X, **kwargs)[source]
property all_featurizers
property elapsed
property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

property featurizers
property on_errors
property timer
class xenonpy.descriptor.base.BaseFeaturizer(n_jobs=-1, *, on_errors='raise', return_type='any', target_col=None, parallel_verbose=0)[source]

Bases: BaseEstimator, TransformerMixin

Abstract class to calculate features from pandas.Series input data. Each entry can be in any format, such as a compound formula or a pymatgen crystal structure, depending on the featurizer implementation.

This class has a structure similar to the matminer BaseFeaturizer but follows a stricter convention. That means you can embed a XenonPy featurizer directly in a matminer BaseFeaturizer implementation:

class MatFeature(BaseFeaturizer):
    def featurize(self, *x):
        return <xenonpy_featurizer>.featurize(*x)

Using a BaseFeaturizer Class

BaseFeaturizer implements sklearn.base.BaseEstimator and sklearn.base.TransformerMixin, which means you can use it in a scikit-learn way:

featurizer = SomeFeaturizer()
features = featurizer.fit_transform(X)

You can also employ the featurizer as part of a scikit-learn Pipeline object. You would then provide your input data as an array to the Pipeline, which would output the features as a pandas.DataFrame.

BaseFeaturizer also lets you retrieve the proper references for a featurizer. __citations__ returns a list of papers that should be cited. __authors__ returns a list of people who wrote the featurizer. Both can also be accessed through the properties citations and authors.

Implementing a New BaseFeaturizer Class

These operations must be implemented for each new featurizer:

  • featurize - Takes a single material as input, returns the features of that material.

  • feature_labels - Generates a human-meaningful name for each of the features. Implement this as a property.

It is also recommended to implement these two properties:

  • citations - Returns a list of citations in BibTeX format.

  • authors - Returns a list of people who implemented the featurizer.

All options of the featurizer must be set by the __init__ function. All options must be listed as keyword arguments with default values, and each value must be saved as an instance attribute with the same name or as a property (e.g., argument n should be stored in self.n). These requirements are necessary for compatibility with the get_params and set_params methods of BaseEstimator, which enable easy interoperability with scikit-learn. featurize() must return the features as a numpy.ndarray.

Note

None of these operations should change the state of the featurizer. That is, running a method twice should not produce different results, no class attributes should be changed, and running one operation should not affect the output of another.
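To see why the __init__ convention matters, the sketch below imitates the parameter handling with a small stand-in class. SketchEstimator is not sklearn's BaseEstimator, only a simplified assumption of how get_params can read option names from the __init__ signature; WindowFeaturizer and its options are hypothetical.

```python
import inspect


class SketchEstimator:
    """Simplified stand-in for the parameter handling of a scikit-learn estimator."""

    def get_params(self):
        # Read the keyword names from __init__ and look up same-named attributes.
        signature = inspect.signature(self.__init__)
        return {name: getattr(self, name) for name in signature.parameters}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self


class WindowFeaturizer(SketchEstimator):
    """Hypothetical featurizer following the convention described above."""

    def __init__(self, n=3, normalize=False):
        # Every option has a default and is stored under its own name.
        self.n = n
        self.normalize = normalize


featurizer = WindowFeaturizer(n=5)
params = featurizer.get_params()
clone = WindowFeaturizer(**params).set_params(normalize=True)
```

If an option were stored under a different attribute name, the lookup in get_params would fail, which is the reason for the naming requirement.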

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core. When set to 0, input X is treated as a single block and passed to Featurizer.featurize directly. The default parallel implementation does not support pd.DataFrame input, so make sure to set n_jobs=0 if the input is a pd.DataFrame.

  • on_errors (str) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, custom, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any or custom, the return type depends on multiple factors (see the transform function). Default is any.

  • target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

  • parallel_verbose (int) – The verbosity level: if non-zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it is more than 10, all iterations are reported. Default is 0.
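The three on_errors policies can be illustrated with a small pure-Python loop. This is a sketch under assumptions: featurize_with_policy is a hypothetical helper, rows are plain lists rather than arrays, and filling a whole row with the same exception object for ‘keep’ is a guess at the layout, not confirmed behaviour.

```python
import math


def featurize_with_policy(featurize, entries, n_labels, on_errors="raise"):
    """Apply `featurize` to each entry under one of the three error policies."""
    if on_errors not in ("nan", "keep", "raise"):
        raise ValueError(f"unknown on_errors value: {on_errors!r}")
    rows = []
    for entry in entries:
        try:
            rows.append(featurize(entry))
        except Exception as exc:
            if on_errors == "raise":
                raise
            if on_errors == "nan":
                # One NaN per feature label keeps the row width consistent.
                rows.append([math.nan] * n_labels)
            else:  # "keep"
                rows.append([exc] * n_labels)
    return rows


def reciprocal_pair(x):
    # Toy featurizer with two features; fails for x == 0.
    return [1 / x, 2 / x]


nan_rows = featurize_with_policy(reciprocal_pair, [1, 0], 2, on_errors="nan")
kept_rows = featurize_with_policy(reciprocal_pair, [1, 0], 2, on_errors="keep")
```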

abstract featurize(*x, **kwargs)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

fit(X, y=None, **fit_kwargs)[source]

Update the parameters of this featurizer based on available data.

Parameters:

X (list of tuples) – training data.

Returns:

self

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (numpy array of shape [n_samples, n_features]) – Training set.

  • y (numpy array of shape [n_samples]) – Target values.

Returns:

X_new – Transformed array.

Return type:

numpy array of shape [n_samples, n_features_new]

transform(entries, *, return_type=None, target_col=None, **kwargs)[source]

Featurize a list of entries. If featurize takes multiple inputs, supply inputs as a list of tuples, or use pd.DataFrame with parameter target_col to specify the column name(s).

Parameters:
  • entries (list-like or pd.DataFrame) – A list of entries to be featurized or pd.DataFrame with one specified column. See detail of target_col if entries is pd.DataFrame. Also, make sure n_jobs=0 for pd.DataFrame.

  • return_type (str) – Specify the return type. Can be any, custom, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type follows these rules: (1) if the input type is pd.Series or pd.DataFrame, returns pd.DataFrame; (2) else if the input type is np.array, returns np.array; (3) else if n_jobs=0, follows the return type of the featurize function; (4) otherwise, returns a list of objects (the outputs of the featurize function). If custom, the return type depends on the featurize function if n_jobs=0; for other n_jobs values it is a list of objects (the outputs of the featurize function). This is a one-time change that only takes effect in the current transformation. Default is None, which uses the setting from initialization.

  • target_col – Only relevant when input is pd.DataFrame, otherwise ignored. Specify a single column to be used for transformation. Default is None for using the setting at initialization step. (see __init__ for more information)

Returns:

DataFrame

features for each entry.
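The return_type rules listed for transform can be summarised as a small decision function. This is only a sketch: resolve_return_kind is hypothetical, and plain strings ("series", "dataframe", "array", "other") stand in for real pandas and numpy type checks.

```python
def resolve_return_kind(return_type, input_kind, n_jobs):
    """Return a label describing the output container chosen by the rules."""
    if return_type == "array":
        return "np.ndarray"
    if return_type == "df":
        return "pd.DataFrame"
    if return_type == "any":
        if input_kind in ("series", "dataframe"):
            return "pd.DataFrame"  # rule (1)
        if input_kind == "array":
            return "np.ndarray"  # rule (2)
        if n_jobs == 0:
            return "featurize output"  # rule (3)
        return "list of featurize outputs"  # rule (4)
    if return_type == "custom":
        if n_jobs == 0:
            return "featurize output"
        return "list of featurize outputs"
    raise ValueError(f"unknown return_type: {return_type!r}")


from_series = resolve_return_kind("any", "series", n_jobs=-1)
from_list_serial = resolve_return_kind("any", "other", n_jobs=0)
from_list_parallel = resolve_return_kind("custom", "other", n_jobs=4)
```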

property authors

List of implementors of the feature.

Returns:

each element should be either a string with the author name (e.g., “Anubhav Jain”) or a dictionary with the required key “name” and other keys like “email” or “institution” (e.g., {“name”: “Anubhav Jain”, “email”: “ajain@lbl.gov”, “institution”: “LBNL”}).

Return type:

list

property citations

Citation(s) and reference(s) for this feature.

Returns:

each element should be a string citation, ideally in BibTeX format.

Return type:

list

abstract property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

property n_jobs
property on_errors
property parallel_verbose
property return_type

xenonpy.descriptor.cgcnn module

class xenonpy.descriptor.cgcnn.CrystalGraphFeaturizer(*, max_num_nbr=12, radius=8, atom_feature='origin', n_jobs=-1, on_errors='raise', return_type='any')[source]

Bases: BaseFeaturizer

This featurizer is a port of the featurizer described in the original paper [CGCNN].

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to numpy.ndarray and pandas.DataFrame, respectively. If any, the return type depends on the input type. Default is any.

edge_features(structure, **kwargs)[source]
featurize(structure)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

node_features(structure)[source]
property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

xenonpy.descriptor.compositions module

class xenonpy.descriptor.compositions.Compositions(*, elemental_info=None, n_jobs=-1, featurizers='classic', on_errors='nan')[source]

Bases: BaseDescriptor

Calculate elemental descriptors from compound’s composition.

Parameters:
  • elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number and atomic radius. By default (None), the information embedded in XenonPy is used.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core.

  • featurizers (Union[str, List[str]]) – Names of the featurizers to use. Set to classic to be compatible with the old version; this is equivalent to setting featurizers=['WeightedAverage', 'WeightedSum', 'WeightedVariance', 'MaxPooling', 'MinPooling']. Default is ‘classic’.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. When ‘raise’, the exception is re-raised. The default is ‘nan’.

classic = ['WeightedAverage', 'WeightedSum', 'WeightedVariance', 'MaxPooling', 'MinPooling']
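As a rough guide to what the five classic featurizers compute, the sketch below applies the conventional definitions of each statistic to a single elemental property. These formulas are assumptions based on the featurizer names, not taken from the library source; classic_statistics is a hypothetical helper.

```python
def classic_statistics(values, nums):
    """Compute five composition statistics for one elemental property.

    `values` holds the property value of each element and `nums` the amount
    of each element in the compound.
    """
    total = sum(nums)
    weights = [n / total for n in nums]
    average = sum(w * v for w, v in zip(weights, values))
    return {
        "WeightedAverage": average,
        "WeightedSum": sum(n * v for n, v in zip(nums, values)),
        "WeightedVariance": sum(
            w * (v - average) ** 2 for w, v in zip(weights, values)
        ),
        "MaxPooling": max(values),
        "MinPooling": min(values),
    }


# H2O with the atomic number as the elemental property (H = 1, O = 8).
stats = classic_statistics(values=[1, 8], nums=[2, 1])
```

For H2O this gives a weighted average of 10/3, a weighted sum of 10, and a weighted variance of 98/9, with 8 and 1 as the pooled maximum and minimum.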
property timer
class xenonpy.descriptor.compositions.Counting(*, one_hot_vec=False, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseCompositionFeaturizer

Parameters:
  • one_hot_vec (bool) – Set to True to use one-hot-vector encoding.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

mix_function(elems, nums)[source]
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray
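One plausible reading of the Counting descriptor is a vector with one slot per known element. The sketch below assumes that layout and assumes that one_hot_vec replaces each amount with a 0/1 presence flag; count_elements and the truncated element list are hypothetical.

```python
def count_elements(elems, nums, vocabulary, one_hot_vec=False):
    """Return one value per element in `vocabulary`."""
    amounts = dict(zip(elems, nums))
    if one_hot_vec:
        # Presence flag only: 1 if the element occurs in the compound.
        return [1 if symbol in amounts else 0 for symbol in vocabulary]
    return [amounts.get(symbol, 0) for symbol in vocabulary]


vocabulary = ["H", "C", "N", "O"]  # hypothetical, truncated element list
counts = count_elements(["H", "O"], [2, 1], vocabulary)
presence = count_elements(["H", "O"], [2, 1], vocabulary, one_hot_vec=True)
```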

property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

class xenonpy.descriptor.compositions.GeometricMean(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseCompositionFeaturizer

Parameters:
  • elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number and atomic radius. By default (None), the information embedded in XenonPy is used.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.

  • target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

Base class for composition feature.

mix_function(elems, nums)[source]
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray
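A composition-weighted geometric mean can be sketched as below. The formula is the conventional weighted geometric mean and is an assumption about what this featurizer computes; weighted_geometric_mean is a hypothetical helper that handles one elemental property at a time.

```python
import math


def weighted_geometric_mean(values, nums):
    """Geometric mean of `values`, each weighted by its amount in `nums`."""
    total = sum(nums)
    # exp(sum(n_i * ln(x_i)) / sum(n_i)); requires strictly positive values.
    log_sum = sum(n * math.log(v) for v, n in zip(values, nums))
    return math.exp(log_sum / total)


# H2O with the atomic number as the elemental property (H = 1, O = 8).
geo = weighted_geometric_mean(values=[1, 8], nums=[2, 1])
```

For H2O this is the cube root of 1 · 1 · 8, which is 2.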

property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

class xenonpy.descriptor.compositions.HarmonicMean(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseCompositionFeaturizer

Parameters:
  • elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number and atomic radius. By default (None), the information embedded in XenonPy is used.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.

  • target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

Base class for composition feature.

mix_function(elems, nums)[source]
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray
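Similarly, a composition-weighted harmonic mean might look like the sketch below. The formula is the conventional weighted harmonic mean and is assumed rather than taken from the library source; weighted_harmonic_mean is a hypothetical helper for a single elemental property.

```python
def weighted_harmonic_mean(values, nums):
    """Harmonic mean of `values`, each weighted by its amount in `nums`."""
    # sum(n_i) / sum(n_i / x_i); requires non-zero values.
    return sum(nums) / sum(n / v for v, n in zip(values, nums))


# H2O with the atomic number as the elemental property (H = 1, O = 8).
har = weighted_harmonic_mean(values=[1, 8], nums=[2, 1])
```

For H2O this evaluates to 3 / (2/1 + 1/8) = 24/17.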

property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

class xenonpy.descriptor.compositions.MaxPooling(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseCompositionFeaturizer

Parameters:
  • elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number and atomic radius. By default (None), the information embedded in XenonPy is used.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.

  • target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

Base class for composition feature.

mix_function(elems, _)[source]
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

class xenonpy.descriptor.compositions.MinPooling(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseCompositionFeaturizer

Parameters:
  • elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number and atomic radius. By default (None), the information embedded in XenonPy is used.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.

  • target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

Base class for composition feature.

mix_function(elems, _)[source]
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

class xenonpy.descriptor.compositions.WeightedAverage(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseCompositionFeaturizer

Parameters:
  • elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number and atomic radius. By default (None), the information embedded in XenonPy is used.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.

  • target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

Base class for composition feature.

mix_function(elems, nums)[source]
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

class xenonpy.descriptor.compositions.WeightedSum(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseCompositionFeaturizer

Parameters:
  • elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number and atomic radius. By default (None), the information embedded in XenonPy is used.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.

  • target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

Base class for composition feature.

mix_function(elems, nums)[source]
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

class xenonpy.descriptor.compositions.WeightedVariance(*, elemental_info=None, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseCompositionFeaturizer

Parameters:
  • elemental_info (Optional[DataFrame]) – Element-level information for each element, for example the atomic number and atomic radius. By default (None), the information embedded in XenonPy is used.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set to -1 to use all CPU cores (default). Input X is split into blocks, and each block runs on one CPU core.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame, respectively. If any, the return type depends on the input type. Default is any.

  • target_col (Union[List[str], str, None]) – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

Base class for composition feature.

mix_function(elems, nums)[source]
Parameters:
  • elems (list) – Elements in compound.

  • nums (list) – Number of each element.

Returns:

descriptor

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

xenonpy.descriptor.fingerprint module

class xenonpy.descriptor.fingerprint.AtomPairFP(n_jobs=-1, *, n_bits=2048, bit_per_entry=None, counting=False, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

Atom Pair fingerprints. Returns the atom-pair fingerprint for a molecule. The algorithm used is described in: R.E. Carhart, D.H. Smith, R. Venkataraghavan; “Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications” JCICS 25, 64-73 (1985). The fingerprint is currently returned only as binary bits with a fixed length after folding.

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).

  • n_bits (int) – Fixed bit length based on folding.

  • bit_per_entry (int) – Number of bits used to represent a single entry (only for the non-counting case). The default value follows the RDKit default.

  • counting (boolean) – Record counts of the entries instead of bits only.

  • input_type (string) – Set the type of input accepted by transform. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, the Chem.MolFromSmiles function is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
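The effect of n_bits and counting can be pictured with a generic folding scheme. The sketch below uses simple modulo folding of integer identifiers, which is an assumption for illustration only; RDKit's actual hashing and its bit_per_entry handling are more involved. fold_fingerprint and the identifiers are hypothetical.

```python
def fold_fingerprint(feature_ids, n_bits=2048, counting=False):
    """Fold integer feature identifiers into a fixed-length vector."""
    vector = [0] * n_bits
    for feature_id in feature_ids:
        position = feature_id % n_bits
        if counting:
            vector[position] += 1  # record how often the slot was hit
        else:
            vector[position] = 1  # record presence only
    return vector


ids = [3, 11, 11, 6]  # hypothetical hashed atom-pair identifiers
bits = fold_fingerprint(ids, n_bits=8)
counts = fold_fingerprint(ids, n_bits=8, counting=True)
```

With n_bits=8, identifiers 3 and 11 collide in slot 3, so a smaller n_bits shortens the vector at the cost of more collisions.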

featurize(x)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

attribute labels.

Return type:

list of str

class xenonpy.descriptor.fingerprint.DescriptorFeature(n_jobs=-1, *, input_type='mol', on_errors='raise', return_type='any', target_col=None, desc_list='all', add_Hs=False)[source]

Bases: BaseFeaturizer

All descriptors in RDKit (length = 200) [may include NaN]

see https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors for the full list

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set to -1 to use all CPU cores (default).

  • input_type (string) – Set the type of input accepted by transform. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, the Chem.MolFromSmiles function is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions raised during feature calculation. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

  • desc_list (string or list) – List of descriptor names to be called in RDKit to calculate molecule descriptors. If classic, the full list from RDKit v.2020.03.xx is used (length = 200). The default is to use the latest list available in the installed RDKit (length = 208 in RDKit v.2020.09.xx).

  • add_Hs (boolean) – Whether to add hydrogen atoms to the RDKit mol object. This may affect a few physical descriptors (e.g., charge-related ones).
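The input_type options can be read as a small dispatch step applied to each entry before featurizing. The sketch below is an assumption about that logic, not the library's code: to_molecule is hypothetical, and toy_parser stands in for a SMILES parser such as Chem.MolFromSmiles, which returns None when parsing fails.

```python
def to_molecule(x, input_type="mol", parse_smiles=None):
    """Normalise one input according to `input_type`."""
    if input_type not in ("mol", "smiles", "any"):
        raise ValueError(f"unknown input_type: {input_type!r}")
    if input_type == "mol":
        return x  # already a molecule object
    if input_type == "any" and not isinstance(x, str):
        return x  # treat non-string input as a molecule object
    mol = parse_smiles(x)
    if mol is None:
        raise ValueError(f"cannot convert {x!r} to a molecule")
    return mol


def toy_parser(smiles):
    # Hypothetical parser: accepts only strings made of letters.
    return {"smiles": smiles} if smiles.isalpha() else None


parsed = to_molecule("CCO", input_type="smiles", parse_smiles=toy_parser)
try:
    to_molecule("C(", input_type="smiles", parse_smiles=toy_parser)
    rejected = False
except ValueError:
    rejected = True
```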

featurize(x)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

classic = ['MaxEStateIndex', 'MinEStateIndex', 'MaxAbsEStateIndex', 'MinAbsEStateIndex', 'qed', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt', 'NumValenceElectrons', 'NumRadicalElectrons', 'MaxPartialCharge', 'MinPartialCharge', 'MaxAbsPartialCharge', 'MinAbsPartialCharge', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'BalabanJ', 'BertzCT', 'Chi0', 'Chi0n', 'Chi0v', 'Chi1', 'Chi1n', 'Chi1v', 'Chi2n', 'Chi2v', 'Chi3n', 'Chi3v', 'Chi4n', 'Chi4v', 'HallKierAlpha', 'Ipc', 'Kappa1', 'Kappa2', 'Kappa3', 'LabuteASA', 'PEOE_VSA1', 'PEOE_VSA10', 'PEOE_VSA11', 'PEOE_VSA12', 'PEOE_VSA13', 'PEOE_VSA14', 'PEOE_VSA2', 'PEOE_VSA3', 'PEOE_VSA4', 'PEOE_VSA5', 'PEOE_VSA6', 'PEOE_VSA7', 'PEOE_VSA8', 'PEOE_VSA9', 'SMR_VSA1', 'SMR_VSA10', 'SMR_VSA2', 'SMR_VSA3', 'SMR_VSA4', 'SMR_VSA5', 'SMR_VSA6', 'SMR_VSA7', 'SMR_VSA8', 'SMR_VSA9', 'SlogP_VSA1', 'SlogP_VSA10', 'SlogP_VSA11', 'SlogP_VSA12', 'SlogP_VSA2', 'SlogP_VSA3', 'SlogP_VSA4', 'SlogP_VSA5', 'SlogP_VSA6', 'SlogP_VSA7', 'SlogP_VSA8', 'SlogP_VSA9', 'TPSA', 'EState_VSA1', 'EState_VSA10', 'EState_VSA11', 'EState_VSA2', 'EState_VSA3', 'EState_VSA4', 'EState_VSA5', 'EState_VSA6', 'EState_VSA7', 'EState_VSA8', 'EState_VSA9', 'VSA_EState1', 'VSA_EState10', 'VSA_EState2', 'VSA_EState3', 'VSA_EState4', 'VSA_EState5', 'VSA_EState6', 'VSA_EState7', 'VSA_EState8', 'VSA_EState9', 'FractionCSP3', 'HeavyAtomCount', 'NHOHCount', 'NOCount', 'NumAliphaticCarbocycles', 'NumAliphaticHeterocycles', 'NumAliphaticRings', 'NumAromaticCarbocycles', 'NumAromaticHeterocycles', 'NumAromaticRings', 'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRotatableBonds', 'NumSaturatedCarbocycles', 'NumSaturatedHeterocycles', 'NumSaturatedRings', 'RingCount', 'MolLogP', 'MolMR', 'fr_Al_COO', 'fr_Al_OH', 'fr_Al_OH_noTert', 'fr_ArN', 'fr_Ar_COO', 'fr_Ar_N', 'fr_Ar_NH', 'fr_Ar_OH', 'fr_COO', 'fr_COO2', 'fr_C_O', 'fr_C_O_noCOO', 'fr_C_S', 'fr_HOCCN', 'fr_Imine', 'fr_NH0', 'fr_NH1', 'fr_NH2', 'fr_N_O', 'fr_Ndealkylation1', 'fr_Ndealkylation2', 'fr_Nhpyrrole', 
'fr_SH', 'fr_aldehyde', 'fr_alkyl_carbamate', 'fr_alkyl_halide', 'fr_allylic_oxid', 'fr_amide', 'fr_amidine', 'fr_aniline', 'fr_aryl_methyl', 'fr_azide', 'fr_azo', 'fr_barbitur', 'fr_benzene', 'fr_benzodiazepine', 'fr_bicyclic', 'fr_diazo', 'fr_dihydropyridine', 'fr_epoxide', 'fr_ester', 'fr_ether', 'fr_furan', 'fr_guanido', 'fr_halogen', 'fr_hdrzine', 'fr_hdrzone', 'fr_imidazole', 'fr_imide', 'fr_isocyan', 'fr_isothiocyan', 'fr_ketone', 'fr_ketone_Topliss', 'fr_lactam', 'fr_lactone', 'fr_methoxy', 'fr_morpholine', 'fr_nitrile', 'fr_nitro', 'fr_nitro_arom', 'fr_nitro_arom_nonortho', 'fr_nitroso', 'fr_oxazole', 'fr_oxime', 'fr_para_hydroxylation', 'fr_phenol', 'fr_phenol_noOrthoHbond', 'fr_phos_acid', 'fr_phos_ester', 'fr_piperdine', 'fr_piperzine', 'fr_priamide', 'fr_prisulfonamd', 'fr_pyridine', 'fr_quatN', 'fr_sulfide', 'fr_sulfonamd', 'fr_sulfone', 'fr_term_acetylene', 'fr_tetrazole', 'fr_thiazole', 'fr_thiocyan', 'fr_thiophene', 'fr_unbrch_alkane', 'fr_urea']
property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

class xenonpy.descriptor.fingerprint.ECFP(n_jobs=-1, *, radius=3, n_bits=2048, counting=False, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

Morgan (Circular) fingerprints (ECFP). The algorithm used is described in the paper Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. JCIM 50:742-54 (2010).

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set -1 to use all CPU cores (default).

  • radius (int) – The radius parameter in the Morgan fingerprints, which is roughly half of the diameter parameter in ECFP, i.e., radius=2 is roughly equivalent to ECFP4.

  • n_bits (int) – Fixed bit length based on folding.

  • counting (boolean) – Record counts of the entries instead of bits only.

  • input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, Chem.MolFromSmiles is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
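The radius, n_bits and counting parameters can be illustrated without RDKit. The sketch below is not the XenonPy or RDKit implementation: `ecfp_name` and `fold` are hypothetical helpers, and the integer identifiers are made-up stand-ins for hashed atom environments.

```python
def ecfp_name(radius, prefix="ECFP"):
    """Conventional name for a Morgan fingerprint: diameter = 2 * radius."""
    return f"{prefix}{2 * radius}"


def fold(identifiers, n_bits=2048, counting=False):
    """Fold arbitrary integer identifiers into a fixed-length vector.

    With counting=False each position is 0 or 1; with counting=True each
    position holds the number of identifiers that landed on it.
    """
    vec = [0] * n_bits
    for ident in identifiers:
        pos = ident % n_bits  # folding: map a large identifier onto n_bits slots
        if counting:
            vec[pos] += 1
        else:
            vec[pos] = 1
    return vec


# Made-up identifiers; 5 and 13 collide when folded to 8 positions.
ids = [5, 13, 13, 2]
bits = fold(ids, n_bits=8)                   # presence/absence only
counts = fold(ids, n_bits=8, counting=True)  # occurrences per position
```

Folding trades vector length for collisions: a smaller n_bits gives a shorter vector, but more distinct entries share a position.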

featurize(x)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

class xenonpy.descriptor.fingerprint.FCFP(n_jobs=-1, *, radius=3, n_bits=2048, counting=False, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

Morgan (Circular) fingerprints + feature-based (FCFP). The algorithm used is described in the paper Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. JCIM 50:742-54 (2010).

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set -1 to use all CPU cores (default).

  • radius (int) – The radius parameter in the Morgan fingerprints, which is roughly half of the diameter parameter in FCFP, i.e., radius=2 is roughly equivalent to FCFP4.

  • n_bits (int) – Fixed bit length based on folding.

  • counting (boolean) – Record counts of the entries instead of bits only.

  • input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, Chem.MolFromSmiles is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
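The input_type rule described above can be sketched in plain Python. This is an illustration only, not XenonPy code: `FakeMol` and `fake_mol_from_smiles` are hypothetical stand-ins for rdkit.Chem.rdchem.Mol and Chem.MolFromSmiles, and raising TypeError on a mismatched input is an assumption; only the ValueError on a None result comes from the parameter description.

```python
class FakeMol:
    """Hypothetical stand-in for rdkit.Chem.rdchem.Mol."""

    def __init__(self, smiles):
        self.smiles = smiles


def fake_mol_from_smiles(smiles):
    """Stand-in for Chem.MolFromSmiles: returns None when parsing fails."""
    return FakeMol(smiles) if smiles and " " not in smiles else None


def to_mol(x, input_type="mol", parser=fake_mol_from_smiles):
    """Normalise one input according to input_type ('mol', 'smiles' or 'any')."""
    if input_type not in ("mol", "smiles", "any"):
        raise ValueError(f"unknown input_type: {input_type!r}")
    if isinstance(x, str):
        if input_type == "mol":
            # Assumption: a string is rejected when only Mol objects are expected.
            raise TypeError("got a SMILES string but input_type='mol'")
        mol = parser(x)
        if mol is None:
            # Documented behaviour: a None result becomes a ValueError.
            raise ValueError(f"cannot convert SMILES to Mol: {x!r}")
        return mol
    if input_type == "smiles":
        # Assumption: a Mol object is rejected when only SMILES are expected.
        raise TypeError("got a Mol object but input_type='smiles'")
    return x
```

With input_type set to any, strings are parsed and Mol objects pass through unchanged, so mixed lists can be handled by the same call.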

featurize(x)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

class xenonpy.descriptor.fingerprint.Fingerprints(n_jobs=-1, *, radius=3, n_bits=2048, bit_per_entry=None, counting=False, input_type='mol', featurizers='all', on_errors='raise', target_col=None, desc_list='all', add_Hs=False)[source]

Bases: BaseDescriptor

Calculate fingerprints or descriptors of organic molecules. Note that MHFP currently does not support parallel computing, so n_jobs is fixed to 1.

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or # of cpus. Set -1 to use all cpu cores (default).

  • radius (int) – The radius parameter in the Morgan fingerprints, which is roughly half of the diameter parameter in ECFP/FCFP, i.e., radius=2 is roughly equivalent to ECFP4/FCFP4.

  • n_bits (int) – Fixed bit length based on folding.

  • bit_per_entry (int) – Number of bits used to represent a single entry (only for non-counting case) in RDKitFP, AtomPairFP, and TopologicalTorsionFP. Default value follows rdkit default.

  • counting (boolean) – Record counts of the entries instead of bits only.

  • featurizers (list[str] or str or 'all') – Featurizer(s) that will be used. Default is ‘all’.

  • input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, Chem.MolFromSmiles is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

  • desc_list (string or list) – List of descriptor names to be called in rdkit to calculate molecule descriptors. If classic, the full list of rdkit v.2020.03.xx is used (length = 200). Default is to use the latest list available in rdkit (length = 208 in rdkit v.2020.09.xx).

  • add_Hs (boolean) – Whether to add hydrogen atoms to the RDKit mol object. This may affect a few physical descriptors (e.g., charge-related ones) and currently has no effect on fingerprints.
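The three on_errors modes can be sketched as follows. This is a minimal pure-Python illustration of the behaviour described above, not the XenonPy implementation; `featurize_all` and `reciprocal_features` are hypothetical names, and `math.nan` stands in for np.nan.

```python
import math


def featurize_all(samples, featurize, n_features, on_errors="raise"):
    """Apply `featurize` to each sample, handling failures per `on_errors`."""
    if on_errors not in ("nan", "keep", "raise"):
        raise ValueError(f"unknown on_errors: {on_errors!r}")
    rows = []
    for sample in samples:
        try:
            rows.append(list(featurize(sample)))
        except Exception as exc:
            if on_errors == "raise":
                raise  # propagate the original exception
            # One placeholder per feature label keeps the output rectangular.
            filler = math.nan if on_errors == "nan" else exc
            rows.append([filler] * n_features)
    return rows


def reciprocal_features(x):
    """Toy featurizer (hypothetical): fails with ZeroDivisionError for x == 0."""
    return [1.0 / x, float(x)]


as_nan = featurize_all([1, 0, 2], reciprocal_features, 2, on_errors="nan")
as_exc = featurize_all([1, 0, 2], reciprocal_features, 2, on_errors="keep")
```

‘nan’ suits batch runs where a few failures are acceptable, while ‘keep’ preserves the exception objects so that failures can be inspected afterwards.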

property timer

class xenonpy.descriptor.fingerprint.LayeredFP(n_jobs=-1, *, n_bits=2048, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

A substructure fingerprint that is more complex than PatternFP (unique in RDKit).

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set -1 to use all CPU cores (default).

  • n_bits (int) – Fixed bit length based on folding.

  • input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, Chem.MolFromSmiles is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

featurize(x)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

class xenonpy.descriptor.fingerprint.MACCS(n_jobs=-1, *, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

The MACCS keys for a molecule. The result is a 167-bit vector. There are 166 public keys, but to maintain consistency with other software packages they are numbered from 1.

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set -1 to use all CPU cores (default).

  • input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, Chem.MolFromSmiles is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
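The 167-versus-166 layout mentioned in the class description can be shown with a short sketch. It assumes that index 0 of the vector is simply left unused so that key k sits at index k; `maccs_vector` is a hypothetical helper and the key numbers are arbitrary examples, not the result of any substructure matching.

```python
N_PUBLIC_KEYS = 166


def maccs_vector(key_numbers):
    """Build a 167-element vector from 1-based key numbers; index 0 stays 0."""
    vec = [0] * (N_PUBLIC_KEYS + 1)
    for k in key_numbers:
        if not 1 <= k <= N_PUBLIC_KEYS:
            raise ValueError(f"key number out of range: {k}")
        vec[k] = 1  # key k is stored at index k, not k - 1
    return vec


# Arbitrary key numbers chosen for illustration only.
vec = maccs_vector([1, 42, 166])
```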

featurize(x)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

class xenonpy.descriptor.fingerprint.MHFP(n_jobs=1, *, radius=3, n_bits=2048, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

A variation of the MinHash fingerprint, which is based on ECFP and uses locality-sensitive hashing to increase the compactness of information during hashing. The algorithm used is described in the paper Probst, D. & Reymond, J.-L., A probabilistic molecular fingerprint for big data settings. Journal of Cheminformatics, 10:66 (2018).

Note that MHFP currently does not support parallel computing, so please fix n_jobs to 1.

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. The default is 1; because MHFP does not currently support parallel computing, leave it at 1.

  • radius (int) – The radius parameter in the SECFP (RDKit version) fingerprints, which is roughly half of the diameter parameter in ECFP, i.e., radius=2 is roughly equivalent to ECFP4.

  • n_bits (int) – Fixed bit length based on folding.

  • input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, Chem.MolFromSmiles is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
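For readers unfamiliar with MinHash, the following is a generic sketch of the idea that MHFP builds on. It is not the MHFP algorithm itself and does not use RDKit: the string tokens are made-up placeholders for substructure descriptions, and `minhash` and `estimated_jaccard` are hypothetical helper names.

```python
import hashlib


def _hash(token, seed):
    """Deterministic 64-bit hash of a string token for a given seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(tokens, n_perm=64):
    """MinHash signature: the smallest hash of the set under each seed."""
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot hash an empty token set")
    return [min(_hash(t, seed) for t in tokens) for seed in range(n_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Placeholder tokens; a real fingerprint would derive them from the molecule.
sig_a = minhash(["C", "CC", "CCO"])
sig_b = minhash(["C", "CC", "CCN"])
similarity = estimated_jaccard(sig_a, sig_b)
```

The signature length is fixed by n_perm regardless of how many tokens a molecule produces, which is what makes this kind of fingerprint compact.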

featurize(x)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

class xenonpy.descriptor.fingerprint.PatternFP(n_jobs=-1, *, n_bits=2048, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

A fingerprint designed to be used in substructure screening using SMARTS patterns (unique in RDKit).

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set -1 to use all CPU cores (default).

  • n_bits (int) – Fixed bit length based on folding.

  • input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, Chem.MolFromSmiles is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

featurize(x)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

class xenonpy.descriptor.fingerprint.RDKitFP(n_jobs=-1, *, n_bits=2048, bit_per_entry=None, counting=False, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

RDKit fingerprint.

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set -1 to use all CPU cores (default).

  • n_bits (int) – Fingerprint size.

  • bit_per_entry (int) – Number of bits used to represent a single entry (only for non-counting case). Default value follows rdkit default.

  • counting (boolean) – Record counts of the entries instead of bits only.

  • input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, Chem.MolFromSmiles is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

featurize(x)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

class xenonpy.descriptor.fingerprint.TopologicalTorsionFP(n_jobs=-1, *, n_bits=2048, bit_per_entry=None, counting=False, input_type='mol', on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

Topological Torsion fingerprints. Returns the topological-torsion fingerprint for a molecule. This is currently just in binary bits with fixed length after folding.

Parameters:
  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Can be -1 or the number of CPUs. Set -1 to use all CPU cores (default).

  • n_bits (int) – Fixed bit length based on folding.

  • bit_per_entry (int) – Number of bits used to represent a single entry (only for non-counting case). Default value follows rdkit default.

  • counting (boolean) – Record counts of the entries instead of bits only.

  • input_type (string) – Set the specific type of transform input. Set to mol (default) to use rdkit.Chem.rdchem.Mol objects as input. When set to smiles, the transform method accepts a list of SMILES strings. Set to any to accept both. If the input is SMILES, Chem.MolFromSmiles is used internally; if it returns None, a ValueError exception is raised.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

featurize(x)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

xenonpy.descriptor.frozen_featurizer module

class xenonpy.descriptor.frozen_featurizer.FrozenFeaturizer(model=None, *, cuda=False, depth=None, n_layer=None, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

A featurizer that extracts hidden layers from a NN model.

Parameters:
  • model (torch.nn.Module) – Source model.

  • cuda (bool) – If true, run on GPU.

  • depth (int) – The depth at which hidden layers are retrieved from the NN model.

  • n_layer (int) – Number of layers to retrieve, starting from the given depth.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame respectively. If any, the return type depends on the input type. Default is any.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

featurize(descriptor, *, depth=None, n_layer=None)[source]

Main featurizer function, which has to be implemented in any derived featurizer subclass.

Parameters:

x (depends on featurizer) – input data to featurize.

Returns:

any – one or more features.

Return type:

numpy.ndarray

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

xenonpy.descriptor.structure module

class xenonpy.descriptor.structure.OrbitalFieldMatrix(including_d=True, *, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

Representation based on the valence shell electrons of neighboring atoms.

Each atom is described by a 32-element vector uniquely representing its valence subshell. A 32x32 (39x39) matrix is formed by multiplying two atomic vectors. The OFM for an atomic environment is the sum of these matrices over every atom that the center atom coordinates with, each multiplied by a distance function (in this case, 1/r times the weight of the coordinating atom in the Voronoi polyhedron).

Parameters:
  • including_d (bool) – If true, add distance information.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set -1 to use all cpu cores (default).

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame respectively. If any, the return type depends on the input type. Default is any.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
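The construction in the class description (outer products of atomic vectors, summed over coordinating atoms and scaled by weight / r) can be sketched in a few lines. This is an illustration under simplifying assumptions, not the XenonPy implementation: the vectors have 4 elements instead of 32, the weights and distances are made up, and `outer` and `ofm_environment` are hypothetical helpers.

```python
def outer(u, v):
    """Outer product of two vectors as a nested list."""
    return [[a * b for b in v] for a in u]


def ofm_environment(center, neighbors):
    """Sum of outer(center, neighbor) * weight / r over coordinating atoms.

    neighbors is an iterable of (vector, weight, r) tuples, where every
    vector has the same length as `center`.
    """
    n = len(center)
    total = [[0.0] * n for _ in range(n)]
    for vec, weight, r in neighbors:
        scale = weight / r  # distance function: 1/r times the Voronoi weight
        contrib = outer(center, vec)
        for i in range(n):
            for j in range(n):
                total[i][j] += scale * contrib[i][j]
    return total


# Toy one-hot vectors (4 elements here; the real representation uses 32).
center = [0, 1, 0, 0]
neighbors = [
    ([1, 0, 0, 0], 0.5, 2.0),  # (vector, made-up weight, made-up distance)
    ([0, 0, 1, 0], 1.0, 4.0),
]
ofm = ofm_environment(center, neighbors)
```

Only the row selected by the center atom's occupied entry is non-zero, and each column records the weighted presence of that entry among the neighbors.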

featurize(structure, is_including_d=True)[source]

Generate OFM descriptor

Parameters:

structure (pymatgen.Structure) – The input structure for OFM calculation.

static get_element_representation(name)[source]

Generate the one-hot representation for an element, e.g., si = [0.0, 1.0, 0.0, 0.0, …]

Parameters:

name (string) – element symbol

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

class xenonpy.descriptor.structure.RadialDistributionFunction(n_bins=201, r_max=20.0, *, n_jobs=-1, on_errors='raise', return_type='any', target_col=None)[source]

Bases: BaseFeaturizer

Calculate pair distribution descriptor for machine learning.

Parameters:
  • n_bins (int) – Number of radial grid points.

  • r_max (float) – Maximum of radial grid (the minimum is always set zero).

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set -1 to use all cpu cores (default).

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • return_type (str) – Specify the return type. Can be any, array, or df. array and df force the return type to np.ndarray and pd.DataFrame respectively. If any, the return type depends on the input type. Default is any.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.
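How n_bins and r_max define a radial grid, and how pair distances fall into it, can be sketched as below. This is an assumption-laden illustration rather than the XenonPy implementation: it assumes the grid points run from 0 to r_max inclusive (so n_bins must be at least 2), it works on a finite cluster of points instead of a periodic pymatgen structure, and it returns raw counts without the normalization a real RDF applies. `radial_grid` and `pair_counts` are hypothetical helpers.

```python
import math
from itertools import combinations


def radial_grid(n_bins=201, r_max=20.0):
    """Evenly spaced radii from 0 to r_max inclusive (assumed convention)."""
    dr = r_max / (n_bins - 1)
    return [i * dr for i in range(n_bins)]


def pair_counts(points, n_bins=201, r_max=20.0):
    """Histogram of pair distances; bin i starts at inner radius i * dr."""
    dr = r_max / (n_bins - 1)
    counts = [0] * n_bins
    for p, q in combinations(points, 2):
        index = int(math.dist(p, q) / dr)
        if index < n_bins:  # ignore pairs beyond the grid
            counts[index] += 1
    return counts


# Small grid and a three-point cluster, chosen so the bins are easy to check.
grid = radial_grid(n_bins=5, r_max=4.0)
cluster = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.5, 0.0)]
counts = pair_counts(cluster, n_bins=5, r_max=4.0)
```

Under this convention the defaults n_bins=201 and r_max=20.0 give a grid spacing of 0.1.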

featurize(structure)[source]

Get RDF of the input structure.

Parameters:

structure – Pymatgen Structure object.

Returns:

(tuple of arrays) – the first element is the normalized RDF, whereas the second element is the inner radius of the RDF bin.

Return type:

rdf, dist

property feature_labels

Generate attribute names.

Returns:

([str]) attribute labels.

class xenonpy.descriptor.structure.Structures(n_bins=201, r_max=20.0, including_d=True, *, n_jobs=-1, featurizers='all', on_errors='raise', target_col=None)[source]

Bases: BaseDescriptor

Calculate structure descriptors from a compound’s structure.

Parameters:
  • n_bins (int) – Number of radial grid points.

  • r_max (float) – Maximum of radial grid (the minimum is always set zero).

  • including_d (bool) – If true, add distance information.

  • n_jobs (int) – The number of jobs to run in parallel for both fit and predict. Set -1 to use all cpu cores (default).

  • featurizers (list[str] or 'all') – Featurizers that will be used. Default is ‘all’.

  • on_errors (string) – How to handle exceptions in feature calculations. Can be ‘nan’, ‘keep’, or ‘raise’. When ‘nan’, return a column of np.nan whose length corresponds to the number of feature labels. When ‘keep’, return a column of exception objects. The default is ‘raise’, which re-raises the exception.

  • target_col – Only relevant when the input is a pd.DataFrame; otherwise ignored. Specify a single column to be used for transformation. If None, all columns of the pd.DataFrame are used. Default is None.

property timer

Module contents