Descriptor calculation

XenonPy comes with a general interface for descriptor calculation. With this interface, users can implement their own descriptor calculator in only a few lines of code and run it smoothly.

We also use this system to provide built-in calculators. Currently, 17 featurizers in 4 types are available out of the box. They are summarized below.

Summary of built-in featurizers

| Featurizer | Type | Description |
|---|---|---|
| Counting | Composition | Encodes the number of each element in a compound into a vector |
| WeightedAverage | Composition | Weighted average (abbr: ave): \(f_{ave, i} = w_{A}^* f_{A,i} + w_{B}^* f_{B,i}\) |
| WeightedVariance | Composition | Weighted variance (abbr: var): \(f_{var, i} = w_{A}^* (f_{A,i} - f_{ave, i})^2 + w_{B}^* (f_{B,i} - f_{ave, i})^2\) |
| WeightedSum | Composition | Weighted sum (abbr: sum): \(f_{sum, i} = w_{A} f_{A,i} + w_{B} f_{B,i}\) |
| GeometricMean | Composition | Geometric mean (abbr: gmean): \(f_{gmean, i} = \sqrt[w_A + w_B]{f_{A,i}^{w_A} \cdot f_{B,i}^{w_B}}\) |
| HarmonicMean | Composition | Harmonic mean (abbr: hmean): \(f_{hmean, i} = \frac{w_A + w_B}{\frac{w_A}{f_{A,i}} + \frac{w_B}{f_{B,i}}}\) |
| MaxPooling | Composition | Max-pooling (abbr: max): \(f_{max, i} = \max\{f_{A,i}, f_{B,i}\}\) |
| MinPooling | Composition | Min-pooling (abbr: min): \(f_{min, i} = \min\{f_{A,i}, f_{B,i}\}\) |
| RDKitFP | Fingerprint | RDKit fingerprint |
| AtomPairFP | Fingerprint | Atom Pair fingerprints |
| MACCS | Fingerprint | The MACCS keys for a molecule |
| ECFP | Fingerprint | Morgan (circular) fingerprints (ECFP) |
| FCFP | Fingerprint | Morgan (circular) fingerprints, feature-based (FCFP) |
| TopologicalTorsionFP | Fingerprint | Topological Torsion fingerprints |
| OrbitalFieldMatrix | Structure | Representation based on the valence shell electrons of neighboring atoms |
| RadialDistributionFunction | Structure | Radial distribution in crystal |
| FrozenFeaturizer | NN | Features extracted from a neural network |
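To make the composition formulas in the summary above concrete, here is a minimal sketch that evaluates them for a single element-level property. It is not XenonPy's implementation. The function name `composition_features` is hypothetical, and the sketch assumes that \(w^*\) denotes the weight normalized by the total, i.e. \(w_A^* = w_A / (w_A + w_B)\), which the summary does not state explicitly.

```python
import math


def composition_features(weights, values):
    """Evaluate the summary formulas for one element-level property.

    weights: amount of each element in the compound (w_A, w_B, ...)
    values:  the property value of each element (f_A, f_B, ...)
    """
    total = sum(weights)
    # Assumption: w^* is the weight normalized so that all weights sum to 1.
    norm = [w / total for w in weights]

    ave = sum(n * f for n, f in zip(norm, values))
    var = sum(n * (f - ave) ** 2 for n, f in zip(norm, values))
    wsum = sum(w * f for w, f in zip(weights, values))
    # (w_A + w_B)-th root of the product of f^w
    gmean = math.prod(f ** w for w, f in zip(weights, values)) ** (1.0 / total)
    hmean = total / sum(w / f for w, f in zip(weights, values))

    return {
        "ave": ave,
        "var": var,
        "sum": wsum,
        "gmean": gmean,
        "hmean": hmean,
        "max": max(values),
        "min": min(values),
    }


# H2O with the atomic numbers of H (1) and O (8) as the property.
features = composition_features(weights=[2, 1], values=[1.0, 8.0])
for name, value in features.items():
    print(f"{name}: {value:.4f}")
```

For H2O this gives an average atomic number of 10/3 and a weighted sum of 10, which shows the difference between the normalized and the raw weights.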

Compositional descriptors

XenonPy can calculate 290 compositional features for a given chemical composition. This calculation uses the 58 element-level properties recorded in elements_completed. See Data access for details.

>>> from xenonpy.descriptor import Compositions
>>> cal = Compositions()
>>> cal
Compositions:
  |- composition:
  |  |- Counting
  |  |- WeightedAverage
  |  |- WeightedSum
  |  |- WeightedVariance
  |  |- GeometricMean
  |  |- HarmonicMean
  |  |- MaxPooling
  |  |- MinPooling

The structure of the calculator cal is shown above. It has one featurizer group called composition, which contains the featurizers Counting, WeightedAverage, WeightedSum, WeightedVariance, GeometricMean, HarmonicMean, MaxPooling and MinPooling.

To use this calculator, build an iterable object that contains the composition of each compound, then pass it to the transform or fit_transform method of cal. These methods accept two types of input: pymatgen.Structure objects, or dicts of the form {'H': 2, 'O': 1}.
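As a minimal sketch of the dict input format, the snippet below builds a small list of compositions and checks each entry before it is handed to the calculator. The helper `check_composition` is hypothetical and not part of XenonPy; the rules it enforces (string element symbols, positive amounts) are assumptions about what a sensible composition dict looks like.

```python
from numbers import Real


def check_composition(comp):
    """Return comp unchanged if it looks like {'H': 2, 'O': 1}, else raise."""
    if not isinstance(comp, dict) or not comp:
        raise TypeError("a composition must be a non-empty dict")
    for element, amount in comp.items():
        if not isinstance(element, str):
            raise TypeError(f"element symbol must be a str, got {element!r}")
        if isinstance(amount, bool) or not isinstance(amount, Real) or amount <= 0:
            raise ValueError(f"amount of {element} must be a positive number")
    return comp


# Any iterable of such dicts works as input; a plain list is the simplest.
comps = [
    check_composition(c)
    for c in [{'H': 2, 'O': 1}, {'Na': 1, 'Cl': 1}, {'Fe': 2, 'O': 3}]
]

# With XenonPy installed, the list can then be passed on as described above:
# descriptor = cal.transform(comps)
```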

Using our sample data, users will obtain a pandas.DataFrame object that contains all the compositional descriptors.

>>> from xenonpy.datatools import preset
>>> samples = preset.mp_samples
>>> comps = samples['composition']
>>> descriptor = cal.transform(comps)
>>> descriptor
     ave:atomic_number  ...  min:Polarizability
0            24.666667  ...            0.802000
1            33.000000  ...            1.100000
2            21.600000  ...            0.802000
...                ...  ...                 ...
928          44.500000  ...            5.500000
929          24.250000  ...           25.000000
930          26.750000  ...            4.800000
931          36.000000  ...            6.600000
932          16.500000  ...            0.802000
[933 rows x 290 columns]

where

>>> comps.__class__
<class 'pandas.core.series.Series'>
>>> comps[0].__class__
<class 'dict'>

If the input is a pandas.DataFrame object, the calculator first tries to read the columns that have the same names as its featurizer groups. In the example above, the featurizer group is named composition. Therefore, the whole samples object can be passed to the calculator's methods without explicitly extracting its composition column:

>>> descriptor = cal.transform(samples)
>>> descriptor
     ave:atomic_number  ...  min:Polarizability
0            24.666667  ...            0.802000
1            33.000000  ...            1.100000
2            21.600000  ...            0.802000
...                ...  ...                 ...
928          44.500000  ...            5.500000
929          24.250000  ...           25.000000
930          26.750000  ...            4.800000
931          36.000000  ...            6.600000
932          16.500000  ...            0.802000
[933 rows x 290 columns]

This produces the same result as the previous example.

Structural descriptors

Similar to the Compositions calculator, Structures accepts pymatgen.Structure objects as its input and returns the calculated results as a pandas.DataFrame.

>>> from xenonpy.descriptor import Structures
>>> cal = Structures()
>>> cal
Structures:
  |- structure:
  |  |- RadialDistributionFunction
  |  |- OrbitalFieldMatrix

Structures contains one featurizer group called structure, with RadialDistributionFunction and OrbitalFieldMatrix in it. Because samples also contains structure information, it can be passed to the calculator directly to calculate structural descriptors.

>>> descriptor = cal.transform(samples)

This takes about 3 to 5 minutes to run. When it finishes, you will get:

>>> descriptor.head(5)
            0.1  0.2  0.30000000000000004  ...  f14_f12  f14_f13  f14_f14
mp-1008807  0.0  0.0                  0.0  ...      0.0      0.0   0.0000
mp-1009640  0.0  0.0                  0.0  ...      0.0      0.0   0.0000
mp-1016825  0.0  0.0                  0.0  ...      0.0      0.0   0.0000
mp-1017582  0.0  0.0                  0.0  ...      0.0      0.0   0.3851
mp-1021511  0.0  0.0                  0.0  ...      0.0      0.0   0.0000
[5 rows x 1224 columns]
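The column label 0.30000000000000004 in the output above is ordinary floating-point round-off, not a data problem. The sketch below assumes that the leading columns are radial positions generated in steps of 0.1; this is an assumption, since the text does not say how the labels are produced. It reproduces the same label and shows one way to get tidier values for display.

```python
# Assumption: the labels are multiples of a 0.1 step.
step = 0.1
labels = [step * k for k in range(1, 4)]
print(labels)

# 0.1 has no exact binary representation, so 3 * 0.1 is not exactly 0.3.
print(labels[2] == 0.3)

# Rounding to one decimal gives cleaner values for display or lookup.
rounded = [round(x, 1) for x in labels]
print(rounded)
```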

Advanced

The descriptor calculator system has more features that are not yet covered in this tutorial. Until this document is completed, see https://github.com/yoshida-lab/XenonPy/blob/master/samples/custom_descriptor_calculator.ipynb for more information.