# Topic Extraction

Topic extraction determines the dominant topics in the text.

This functionality is also known as:

* theme identification
* subject detection
* key topic recognition


Tisane stores the topics under the `topics` array (strings without `topic_stats`, objects with `topic_stats`). The topics are document level.

When a particular word has multiple interpretations, the sense of the word must be determined in the current context. For example, *Jupiter* is a planet and a Roman deity. Whether it's the planet or the deity, depends on the text.

For example, the sentence *Juno is the wife of Jupiter* refers to the deity. Tisane determines the relevant topics as `Roman mythology`, `supernatural` (gods), `relationship`, and `family` (since the spousal connection is mentioned).


```json
{
	"text": "Juno is the wife of Jupiter",
	"topics": [
		"supernatural",
		"Roman mythology",
		"relationship",
		"family"
	]
}
```

On the other hand, the sentence *Jupiter is farther from the sun than Mars* refers to planets. Tisane determines the topics to be `outer space` and `astronomy`.


```json
{
	"text": "Jupiter is farther from the sun than Mars",
	"topics": [
		"outer space",
		"astronomy"
	]
}
```

## Topic Statistics

If the setting `topic_stats` is set to `true`, then the portion of the input where the topic is active is provided. The topic is then not provided as a string but as an object made of the topic itself (`topic` (string) attribute) and its distribution statistic (`coverage` (float) attribute).

**Example**

Request:


```json
{
  "language":"en",
  "content":"Jupiter is farther from the sun than Mars. Which is not important in the current context",
  "settings": 
  {
    "topic_stats": true
  }
}
```

Response:


```json
{
	"text": "Jupiter is farther from the sun than Mars. Which is not important in the current context",
	"topics": [
		{
			"topic": "outer space",
			"coverage": 0.5
		},
		{
			"topic": "astronomy",
			"coverage": 0.5
		}
	]
}
```

(both detected topics appear in 1 sentence out of 2, which is 0.5 of all sentences)

## Standards

There are common taxonomy standards that Tisane can use with `topic_standard` setting:

* `native` - native Tisane topic names; based on standard English terms for the topic. The default standard.
* `iptc_code` - codes of the [IPTC (International Press Telecommunications Council) Media Topics](https://iptc.org/standards/media-topics/) classification - a standard used in the media.
* `iptc_description` - English descriptions of the IPTC codes.
* `iab_code` - codes of the [IAB (Interactive Advertising Bureau)](https://www.iab.com/guidelines/content-taxonomy/) content taxonomy.
* `iab_description` - English descriptions of the IAB codes.
* `wikidata` - Wikidata codes (usually of the form Qnnnnn, e.g. Q123).


To specify the standard, add the `topic_standard` setting.

**Example**

Request:


```json
{
  "language":"en",
  "content":"Jupiter is farther from the sun than Mars.",
  "settings": 
  {
    "topic_standard": "wikidata"
  }
}
```

Response:


```json
{
	"text": "Jupiter is farther from the sun than Mars. Which is not important in the current contex",
	"topics": [
		"Q4169",
		"Q333"
	]
}
```

The standard taxonomies cover a small fraction of the native standard. When a concept is not covered by a taxonomy, it is omitted from the response.