GraphRAG


alchemine 2025. 8. 20. 21:02

RAG evolution timeline. Image from elvis (https://x.com/omarsar0)

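We live in an era where technology advances at an almost unimaginable pace. New papers show up on Hugging Face every day, and it feels like not only last year's techniques but even SoTA results published a few months ago quickly become obsolete.

RAG is no exception: countless techniques keep appearing and feeding follow-up research, and the field keeps growing in many directions.

In this post, we'll look at Microsoft's GraphRAG, the godfather of graph-based RAG, which uses a graph to answer global questions about a document collection!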

From Local to Global: A GraphRAG Approach to Query-Focused Summarization
by Microsoft Research, Microsoft Strategic Missions and Technologies, Microsoft Office of the CTO

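Limitations of prior work
Conventional RAG systems struggle to answer questions that span the entire corpus, such as "What are the main themes in this dataset?"

Proposed method
1. Document indexing stage: build a graph index using an LLM
1) Generate an entity knowledge graph from the source documents
2) Group closely related entities into communities and generate community summaries
2. User query stage
1) Generate a partial response to the query from each community summary
2) Summarize the partial responses into the final answer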

References
- https://microsoft.github.io/graphrag/
- https://github.com/microsoft/graphrag
- https://arxiv.org/pdf/2404.16130

Data flow of the GraphRAG pipeline. Image from paper.

The core idea of GraphRAG is fairly intuitive: map-reduce / roll-up summarization over chunks.

Let's go through each step in detail.

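1. Split the source documents into text chunks

2. Use an LLM to extract entities and the relationships between them from each chunk

- Uses few-shot examples selected for the domain (in-context learning)

- A description is extracted along with each entity and relationship

- Related statements (claims) about each entity are extracted as well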

Example for entities & relationships extraction. Image from paper.

Example for claim description. Image from paper.
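Step 1 (splitting documents into text chunks) can be sketched roughly as follows. This is a minimal sketch under my own assumptions, not GraphRAG's actual code: whitespace-separated words stand in for real tokenizer tokens, and the function name `split_into_chunks` and the default `chunk_size` / `overlap` values are illustrative only.

```python
def split_into_chunks(text: str, chunk_size: int = 600, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks of whitespace-separated tokens.

    Assumption: a real pipeline would count tokens with the LLM's tokenizer;
    plain words are used here only to keep the sketch self-contained.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then be sent to the LLM for entity, relationship, and claim extraction. The overlap keeps an entity mentioned near a chunk boundary from losing its surrounding context.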

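3. Convert the entities & relationships into a knowledge graph

- Aggregate and summarize entities: duplicate instances merge into a single node (node degree, the number of edges at a node, is used later as a prominence measure)

- Aggregate relationships: the number of duplicates becomes the edge weight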
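To make step 3 concrete, here is a small sketch of the aggregation using only the standard library. The input shapes (`(name, description)` and `(source, target, description)` tuples) and the name `build_graph` are assumptions of mine. The LLM summarization of the collected descriptions is left out, and edges are treated as undirected for simplicity.

```python
from collections import Counter, defaultdict


def build_graph(entity_mentions, relationship_mentions):
    """Aggregate per-chunk extraction results into one graph.

    entity_mentions: iterable of (name, description)
    relationship_mentions: iterable of (source, target, description)
    """
    # Duplicate entity mentions collapse into one node that keeps every description.
    node_descriptions = defaultdict(list)
    for name, description in entity_mentions:
        node_descriptions[name].append(description)

    # Duplicate relationships collapse into one edge; the duplicate count is its weight.
    edge_weights = Counter()
    edge_descriptions = defaultdict(list)
    for source, target, description in relationship_mentions:
        key = tuple(sorted((source, target)))  # assumption: undirected edges
        edge_weights[key] += 1
        edge_descriptions[key].append(description)

    # Node degree: number of distinct edges touching each node.
    degree = Counter()
    for a, b in edge_weights:
        degree[a] += 1
        degree[b] += 1
    return node_descriptions, edge_weights, edge_descriptions, degree
```

In a fuller pipeline, the description lists for each node and edge would be passed to an LLM and condensed into a single summary.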

4. Convert the knowledge graph into graph communities

- Applies the Leiden community detection algorithm hierarchically

(probably fine to think of it as divisive/top-down hierarchical clustering?)

- Communities are stored per level (level-0, level-1, ...)

Graph communities detected using the Leiden algorithm. Image from paper.
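The "partition, then recurse into each community, and store the result per level" structure of step 4 can be sketched like this. Please note that this is NOT the Leiden algorithm: to keep the example dependency-free, I swapped in a toy partitioner (`components_partition`, connected components over edges heavier than the current level). In practice you would plug a real Leiden implementation into the `partition` argument. All names here are hypothetical.

```python
def components_partition(nodes, edge_weights, level):
    """Toy stand-in for Leiden: connected components over edges with weight > level."""
    adjacency = {node: set() for node in nodes}
    for (a, b), weight in edge_weights.items():
        if a in adjacency and b in adjacency and weight > level:
            adjacency[a].add(b)
            adjacency[b].add(a)
    parts, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:  # depth-first walk over one component
            node = stack.pop()
            if node not in part:
                part.add(node)
                stack.extend(adjacency[node] - part)
        seen |= part
        parts.append(part)
    return parts


def build_hierarchy(nodes, edge_weights, partition=components_partition, max_level=3):
    """Return {level: [community, ...]} by recursively splitting each community."""
    levels = {}
    frontier = [set(nodes)]  # start from the whole graph
    for level in range(max_level + 1):
        next_frontier = []
        for community in frontier:
            parts = partition(community, edge_weights, level)
            if level > 0 and len(parts) <= 1:
                continue  # cannot be split further: it stays a leaf at the previous level
            next_frontier.extend(parts)
        if not next_frontier:
            break
        levels[level] = next_frontier
        frontier = next_frontier
    return levels
```

With this layout, level-0 holds the coarsest communities, and each deeper level holds the sub-communities of those that could still be split.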

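5. Generate a community summary for each graph community

- A report-like summary is generated for every community in the community hierarchy

- For example, you can find a topic of interest at level-1 and then look for more detail on that topic at level-2

- How summaries are generated

- Leaf-level communities

Starting from the edges whose endpoint nodes have the largest combined degree, the descriptions of the connected elements (source node, target node, edge) and their claims are added to the context until the token limit is reached

- Higher-level communities

If every element summary fits in the context window, proceed as for leaf-level communities; otherwise, starting from the sub-communities with the most element-summary tokens, replace their element summaries with the shorter sub-community summaries until everything fits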
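The leaf-level prioritization is easier to see in code. Below is a minimal sketch of how the LLM context for one leaf community could be filled, assuming whitespace word counts as a stand-in for tokens. The function name `leaf_context` and the dictionary-based inputs are hypothetical, not GraphRAG's API.

```python
def leaf_context(edges, degree, node_desc, edge_desc, claims, token_limit):
    """Collect descriptions for one leaf community, most prominent edges first.

    edges: list of (source, target) pairs inside the community
    degree: node -> degree; node_desc: node -> text; edge_desc: (source, target) -> text
    claims: node -> list of claim texts (may be missing for a node)
    """
    # Rank edges by the combined degree of their endpoints, highest first.
    ranked = sorted(edges, key=lambda e: degree[e[0]] + degree[e[1]], reverse=True)
    context, used, seen = [], 0, set()
    for source, target in ranked:
        candidates = [
            (("node", source), node_desc[source]),
            (("node", target), node_desc[target]),
            (("edge", source, target), edge_desc[(source, target)]),
        ]
        for node in (source, target):
            for i, claim in enumerate(claims.get(node, [])):
                candidates.append((("claim", node, i), claim))
        for key, text in candidates:
            if key in seen:
                continue  # a node shared by several edges is added only once
            cost = len(text.split())  # assumption: word count approximates tokens
            if used + cost > token_limit:
                return context  # token limit reached
            seen.add(key)
            context.append(text)
            used += cost
    return context
```

The returned texts would then go into a summarization prompt that produces the community's report-like summary.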

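6. When a user query comes in, generate the answer from the community summaries

1) A specific community level is given (e.g., 0, 1, ..., 3)

2) The community summaries are shuffled and split into chunks

3) For each chunk, generate an intermediate answer plus a score (0-100) for how helpful that answer is. Answers scoring 0 are filtered out

4) Fill the context window starting from the highest-scoring answers and generate the final answer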
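The query stage above is a classic map-reduce, which can be sketched as follows. This is only a sketch under assumptions: `map_fn` and `reduce_fn` are hypothetical callables standing in for the two LLM prompts (one returns an `(answer, score)` pair per chunk, the other writes the final answer), and word counts again approximate tokens.

```python
import random


def global_answer(question, summaries, map_fn, reduce_fn,
                  chunk_tokens=50, context_tokens=100, seed=0):
    """Answer a global question from community summaries via map-reduce."""
    # 2) Shuffle the summaries, then pack them into fixed-size chunks.
    shuffled = list(summaries)
    random.Random(seed).shuffle(shuffled)
    tokens = " ".join(shuffled).split()
    chunks = [" ".join(tokens[i:i + chunk_tokens])
              for i in range(0, len(tokens), chunk_tokens)]

    # 3) Map: one intermediate answer and helpfulness score (0-100) per chunk.
    scored = [map_fn(question, chunk) for chunk in chunks]
    scored = [(answer, score) for answer, score in scored if score > 0]  # drop 0-score answers

    # 4) Reduce: fill the context with the highest-scoring answers first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    selected, used = [], 0
    for answer, _ in scored:
        cost = len(answer.split())
        if used + cost > context_tokens:
            break
        selected.append(answer)
        used += cost
    return reduce_fn(question, selected)
```

Shuffling before chunking spreads related information across chunks, so one unlucky chunk is less likely to hold everything relevant and then get dropped.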

Through this process, GraphRAG goes beyond plain chunk-similarity matching and can answer global questions like the following.

Examples of generated questions targeting global understanding. Image from paper.

By combining graphs with RAG to overcome a limitation of conventional RAG, GraphRAG drew a lot of attention, helped along by the release of its code.

Star history for https://github.com/microsoft/graphrag

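However, one big problem holds GraphRAG back in real-world use: indexing.

By nature, RAG deals with data that is continuously added and changed, but GraphRAG has to repeat steps such as community detection and community summary generation every time the data changes.

Community summary generation is also designed to pack the context window as full as possible, and indexing takes enough cost and resources that the GitHub repository carries the following warning.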

⚠️ Warning: GraphRAG indexing can be an expensive operation, please read all of the documentation to understand the process and costs involved, and start small.

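To address this, graph-based RAG algorithms started to appear that index more efficiently while achieving similar or even better performance.

In barely more than a year, GraphRAG has ended up as the yardstick that every new method measures itself against.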

Evaluation figures from "EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora"

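The next post will cover LightRAG, which eases GraphRAG's indexing problem and has evolved to handle diverse input types such as images, equations, and tables.

LightRAG is the most popular graph-based RAG project after GraphRAG and solves the problem with some fun logic, so we'll take a close look at its core principles and how it works.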

Image from https://github.com/HKUDS/LightRAG

See you in the next post!