# Create Knowledge

Steps to upload documents to create a knowledge base:

1. Create a knowledge base and import either local document file or online data.

{% content-ref url="create-knowledge-and-upload-documents/import-content-data" %}
[import-content-data](https://legacy-docs.dify.ai/guides/knowledge-base/create-knowledge-and-upload-documents/import-content-data)
{% endcontent-ref %}

2. Choose a chunking mode and preview the spliting results. This stage involves content preprocessing and structuring, where long texts are divided into multiple smaller chunks.

{% content-ref url="create-knowledge-and-upload-documents/chunking-and-cleaning-text" %}
[chunking-and-cleaning-text](https://legacy-docs.dify.ai/guides/knowledge-base/create-knowledge-and-upload-documents/chunking-and-cleaning-text)
{% endcontent-ref %}

3. Configure the indexing method and retrieval setting. Once the knowledge base receives a user query, it searches existing documents according to preset retrieval methods and extracts highly relevant content chunks.

{% content-ref url="create-knowledge-and-upload-documents/setting-indexing-methods" %}
[setting-indexing-methods](https://legacy-docs.dify.ai/guides/knowledge-base/create-knowledge-and-upload-documents/setting-indexing-methods)
{% endcontent-ref %}

4. Wait for the chunk embeddings to complete.
5. Once finished, link the knowledge base to your application and start using it. You can then [integrate it into your application](https://legacy-docs.dify.ai/guides/knowledge-base/integrate-knowledge-within-application) to build an LLM that are capable of Q\&A based on knowledge-bases. If you want to modify and manage the knowledge base further, take refer to [Knowledge Base and Document Maintenance](https://legacy-docs.dify.ai/guides/knowledge-base/knowledge-and-documents-maintenance).

<figure><img src="https://assets-docs.dify.ai/2024/12/a3362a1cd384cb2b539c9858de555518.png" alt=""><figcaption><p>Complete the creation of the knowledge base</p></figcaption></figure>

***

### Reference

#### ETL

In production-level applications of RAG, to achieve better data retrieval, multi-source data needs to be preprocessed and cleaned, i.e., ETL (extract, transform, load). To enhance the preprocessing capabilities of unstructured/semi-structured data, Dify supports optional ETL solutions: **Dify ETL** and [**Unstructured ETL**](https://unstructured.io/).

> Unstructured can efficiently extract and transform your data into clean data for subsequent steps.

ETL solution choices in different versions of Dify:

* The SaaS version defaults to using Unstructured ETL and cannot be changed;
* The community version defaults to using Dify ETL but can enable Unstructured ETL through [environment variables](https://legacy-docs.dify.ai/getting-started/install-self-hosted/environments#zhi-shi-ku-pei-zhi);

Differences in supported file formats for parsing:

| DIFY ETL                                                | Unstructured ETL                                                                        |
| ------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv | txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv, eml, msg, pptx, ppt, xml, epub |

{% hint style="info" %}
Different ETL solutions may have differences in file extraction effects. For more information on Unstructured ETL’s data processing methods, please refer to the [official documentation](https://docs.unstructured.io/open-source/core-functionality/partitioning).
{% endhint %}

#### **Embedding**

**Embedding** transforms discrete variables (words, sentences, documents) into continuous vector representations, mapping high-dimensional data to lower-dimensional spaces. This technique preserves crucial semantic information while reducing dimensionality, enhancing content retrieval efficiency.

**Embedding models**, specialized large language models, excel at converting text into dense numerical vectors, effectively capturing semantic nuances for improved data processing and analysis.

#### **Metadata**

For managing the knowledge base with metadata, see [*Metadata*](https://docs.dify.ai/guides/knowledge-base/metadata).
