Descriptor calculation

XenonPy comes with a general interface for descriptor calculation. By using this interface, users can implement their descriptor calculator with only a few lines of codes and run it smoothly.

We also use this system to provide built-in calculators. Currently, 15 featurizers in 4 types are available out-of-the-box. The following list shows a summary.

Summary of built-in featurizers
Featurizer	Type	Description
Counting	Composition	Encoding number of compounds elements in to vector: \(f_{min, i} = min{f_{A,i}, f_{B,i}}\)
WeightedAverage	Composition	Weighted average (abbr: ave): \(f_{ave, i} = w_{A}^* f_{A,i} + w_{B}^* f_{B,i}\)
WeightedVariance	Composition	Weighted variance (abbr: var): \(f_{var, i} = w_{A}^* (f_{A,i} - f_{ave, i})^2 + w_{B}^* (f_{B,i} - f_{ave, i})^2\)
WeightedSum	Composition	Weighted sum (abbr: sum): \(f_{sum, i} = w_{A} f_{A,i} + w_{B} f_{B,i}\)
GeometricMean	Composition	Geometric mean (abbr: gmean): \(f_{gmean, i} = \sqrt[w_A + w_B]{f_{A,i}^{w_A} * f_{V,i}^{w_B}}\)
HarmonicMean	Composition	Harmonic mean (abbr: hmean): \(f_{hmean, i} = \frac{w_A +w_B}{\frac{1}{f_{A,i}}w_A + \frac{1}{f_{B,i}}w_B}\)
MaxPooling	Composition	Max-pooling (abbr: max): \(f_{max, i} = max{f_{A,i}, f_{B,i}}\)
MinPooling	Composition	Min-pooling (abbr: min): \(f_{min, i} = min{f_{A,i}, f_{B,i}}\)
RDKitFP	Fingerprint	RDKit fingerprint
AtomPairFP	Fingerprint	Atom Pair fingerprints
MACCS	Fingerprint	The MACCS keys for a molecule
ECFP	Fingerprint	Morgan (Circular) fingerprints (ECFP)
FCFP	Fingerprint	Morgan (Circular) fingerprints + feature-based (FCFP)
TopologicalTorsionFP	Fingerprint	Topological Torsion fingerprints
OrbitalFieldMatrix	Structure	Representation based on the valence shell electrons of neighboring atoms
RadialDistributionFunction	Structure	Radial distribution in crystal
FrozenFeaturizer	NN	Neural Network Extracted

Compositional descriptors

XenonPy can calculate 290 compositional features for a given chemical composition. This calculation uses the information of the 58 element-level property data recorded in elements_completed. See Data access for details.

>>> from xenonpy.descriptor import Compositions
>>> cal = Compositions()
>>> cal
Compositions:
  |- composition:
  |  |- Counting
  |  |- WeightedAverage
  |  |- WeightedSum
  |  |- WeightedVariance
  |  |- GeometricMean
  |  |- HarmonicMean
  |  |- MaxPooling
  |  |- MinPooling

The structure information of the calculator Cal is shown above. This information tells us Cal has one featurizer group called composition with featurizers WeightedAvgFeature, WeightedSumFeature, WeightedVarFeature, MaxFeature and MinFeature in it.

To use this calculator, users have to structure an iterable object that contains the information of compounds’ composition, then feed it to the method transform or fit_transform in cal. These methods accept two types of input, the pymatgen.Structure objects, or dicts which have the structure like {‘H’: 2, ‘O’: 1}.

Using our sample data, users will obtain a pandas.DataFrame object that contains all the compositional descriptors.

>>> from xenonpy.datatools import preset
>>> samples = preset.mp_samples
>>> comps = samples['composition']
>>> descriptor = cal.transform(comps)
>>> descriptor
     ave:atomic_number  ...  min:Polarizability
0            24.666667  ...            0.802000
1            33.000000  ...            1.100000
2            21.600000  ...            0.802000
...                ...  ...                 ...
928          44.500000  ...            5.500000
929          24.250000  ...           25.000000
930          26.750000  ...            4.800000
931          36.000000  ...            6.600000
932          16.500000  ...            0.802000
[933 rows x 290 columns]

where

>>> comps.__class__
pandas.core.series.Series
>>> comps[0].__class__
dict

If the input is a pandas.DataFrame object, the calculator will first try to read the data columns that have the same name as the featurizer groups. For example, the name of the featurizer group in the example above is composition. Therefore, the whole object entry can be fed into the calculator’s methods without explicitly extracting the composition column in the samples:

>>> descriptor = cal.transform(samples)
>>> descriptor
     ave:atomic_number  ...  min:Polarizability
0            24.666667  ...            0.802000
1            33.000000  ...            1.100000
2            21.600000  ...            0.802000
...                ...  ...                 ...
928          44.500000  ...            5.500000
929          24.250000  ...           25.000000
930          26.750000  ...            4.800000
931          36.000000  ...            6.600000
932          16.500000  ...            0.802000
[933 rows x 290 columns]

This does the same work as the previous one.

Structural descriptors

Similar to the Compositions calculator, Structures accepts pymatgen.Structure objects as its input, and then return calculated results as a pandas.DataFrame.

>>> from xenonpy.descriptor import Structures
>>> cal = Structures()
>>> cal
Structures:
  |- structure:
  |  |- RadialDistributionFunction
  |  |- OrbitalFieldMatrix

Structures contains one featurizer group called structure with RadialDistributionFunction and OrbitalFieldMatrix in it. samples also has the structure information. We can use these to calculate structural descriptors.

>>> descriptor = cal.transform(samples)

This will take 3 ~ 5 min to run and finally, you will get:

>>> descriptor.head(5)
            0.1  0.2  0.30000000000000004  ...  f14_f12  f14_f13  f14_f14
mp-1008807  0.0  0.0                  0.0  ...      0.0      0.0   0.0000
mp-1009640  0.0  0.0                  0.0  ...      0.0      0.0   0.0000
mp-1016825  0.0  0.0                  0.0  ...      0.0      0.0   0.0000
mp-1017582  0.0  0.0                  0.0  ...      0.0      0.0   0.3851
mp-1021511  0.0  0.0                  0.0  ...      0.0      0.0   0.0000
[5 rows x 1224 columns]

Advance

There are more details of the descriptor calculator system that are not yet included in this tutorial. Before we complete this document, you can check out https://github.com/yoshida-lab/XenonPy/blob/master/samples/custom_descriptor_calculator.ipynb for more information.