Making Data, a.k.a. Data Production

BIOLOGY/Bioinformatics · 2024. 10. 19. 20:50 · @Ungbae

Table of Contents

Illumina sequencing
PacBio Sequencing
Nanopore Sequencing

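This post summarizes my own understanding of an undergraduate 'Bioinformatics' lecture.

Even for third- and fourth-year biology majors, NGS is likely unfamiliar. Admittedly, how the data are generated is not critical for most of us: it would matter if we ran the sequencers ourselves, but today most sequencing is outsourced to service providers. We still have to prepare the DNA or RNA sample, but once we ship it as DNA or RNA, we get data back. That is exactly why it matters to understand what the returned data look like and where they come from. The sequencing companies explain their methods well on YouTube.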

Illumina - short read
PacBio - long read; works completely differently from Illumina, and offers two long-read modes.
Nanopore - long read

Illumina sequencing

Reference video: https://www.youtube.com/watch?v=fCd6B5HRaZ8&t=70s

Sample prep

↓

Cluster generation

↓

Sequencing

↓

Data analysis

Summary of the Illumina sequencing workflow

Sample Preparation

Adapter ligation: Every sample preparation method attaches adapters to both ends of each DNA fragment. The key adapter elements to remember are the barcode (index), the primer site, and the attachment site.
Reduced-cycle amplification: Additional motifs are then introduced, namely the sequencing primer binding site, the indices, and the regions complementary to the flow cell oligos. After this step the library is ready for sequencing.

Source: https://www.slideshare.net/slideshow/an-introduction-to-illumina-sequencing/251216218

Adapter: barcode, primer site, attachment site

Cluster Generation

Isothermal amplification: Each DNA fragment is amplified isothermally on the flow cell. The flow cell is a glass slide with lanes; each lane is a channel coated with a lawn of two types of oligos.

Hybridization and Amplification

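The first type of oligo is complementary to the adapter region of the fragment, which enables hybridization.
A polymerase synthesizes the complement of the hybridized fragment. The double-stranded molecule is then denatured and the original template is washed away.
Bridge amplification: The strand folds over, its adapter region hybridizes to the second type of oligo, and a polymerase generates the complementary strand, forming a double-stranded bridge.
Repetition: This process is repeated over and over, simultaneously for millions of clusters on the flow cell.
Reverse strand removal: After amplification the reverse strands are cleaved and washed off, leaving only the forward strands.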

Sequencing

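The first sequencing primer binds and Read 1 begins.
Fluorescently labeled nucleotides are added one at a time according to the template. After each addition the clusters are excited by a light source and emit a characteristic fluorescent signal, from which the base is called. A, T, G, and C each flash a different color, and the camera records which color appears at each cluster's coordinates on every cycle. Because the incorporated base is complementary to the template, a T is recorded where the template has A, and an A where the template has T. This method is called Sequencing-by-Synthesis (SBS).
Index reads: Once Read 1 is complete, the Index 1 primer binds and the index is read in the same way. Index 2 is then read likewise. The first stretch read becomes Read 1 of the pair (see the previous post); the second is read from the opposite direction, so it is recorded as Read 2.
Read 2: The strand folds over and binds the second oligo, forming a bridge; after the forward strand is removed, the reverse strand is read to produce the second read.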

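Millions to tens of millions of clusters are generated. Suppose we sequence a 500 bp fragment and can read at most 150 bp at a time. Reading 150 bp from the front and 150 bp from the back leaves the middle 200 bp unread, so that fragment alone tells us nothing about its middle. But the fragments were cut at random positions, so other fragments start and end in slightly different places and their reads cover the neighboring bases. Overlap tens of millions of such reads and the sequence ends up covered without gaps.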
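The 500 bp / 150 bp arithmetic can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the fragment is a made-up repeating pattern, the function names are hypothetical, and real reads also carry quality scores and sequencing errors that are ignored here.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A<->T, G<->C)."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq))


def paired_end_reads(fragment: str, read_length: int = 150) -> tuple[str, str, int]:
    """Simulate Read 1 / Read 2 for one fragment and count the unread middle.

    Read 1 is the first `read_length` bases of the fragment; Read 2 is the
    first `read_length` bases of its reverse complement, i.e. it is read
    from the opposite end on the opposite strand.
    """
    read1 = fragment[:read_length]
    read2 = reverse_complement(fragment)[:read_length]
    unread = max(0, len(fragment) - 2 * read_length)
    return read1, read2, unread


# Hypothetical 500 bp fragment built from a repeating pattern, for illustration only.
fragment = ("ACGTTGCA" * 63)[:500]
read1, read2, unread = paired_end_reads(fragment)
print(len(read1), len(read2), unread)  # 150 150 200
```

Note that Read 2 comes out as the reverse complement of the last 150 bp of the fragment, which is why the two reads of a pair point toward each other.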

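Quiz) Suppose a G nucleotide is incorporated and a green flash is observed. What does the computer record, and what was the base on the template?

Answer) The base call is the incorporated nucleotide, so G is recorded in the read. Since G pairs with C, the template strand on the flow cell has a C at that position; the read is the complement of the template.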

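Data Analysis

Read generation: The whole process generates millions of reads that together represent all the DNA fragments.
Sample separation and pairing: Reads are assigned to samples using the unique indices added during sample preparation. Within each sample, reads with similar stretches of base calls are locally clustered, and forward and reverse reads are paired to create contiguous sequences.
Reference genome alignment: The contiguous sequences are aligned to the reference genome to identify variants.
Data management: The analyzed genomic data can be securely transferred, stored, analyzed, and shared through BaseSpace Sequence Hub.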
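The sample-separation step (demultiplexing) boils down to a lookup on the index read. The sketch below is an illustration only: the 4-base index sequences, the sample names, and the `demultiplex` helper are all hypothetical, and real demultiplexers also tolerate mismatches in the index.

```python
from collections import defaultdict

# Hypothetical index-to-sample table; real index sequences come from the library prep kit.
INDEX_TO_SAMPLE = {"ACGT": "sample_A", "TGCA": "sample_B"}


def demultiplex(reads):
    """Group (index, sequence) pairs by sample name using the index read."""
    by_sample = defaultdict(list)
    for index, sequence in reads:
        # Reads whose index is not in the table go to an "undetermined" bin.
        sample = INDEX_TO_SAMPLE.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


reads = [
    ("ACGT", "GATTACA"),
    ("TGCA", "CCGGAAT"),
    ("ACGT", "TTAGGCA"),
    ("NNNN", "ACACACA"),  # unreadable index
]
result = demultiplex(reads)
for sample, sequences in result.items():
    print(sample, sequences)
```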

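Additional notes

In the video, the white part of each molecule is the DNA we want to read. It is DNA that was extracted and then cut at random. The fragment lengths therefore vary: they can be made similar, but never identical. The pieces attached to both ends of the white part are the adapters (the purple and light-blue regions in the video).

Why attach adapters: a priming site is needed, that is, a place where sequencing starts and where the polymerase binds. The machine has to know where to begin reading, so we add a sequence that we already know.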

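Attaching adapters = making a library

When you outsource sequencing, you do not cut the DNA yourself. You extract total DNA and hand it over, and the first thing the provider does is shear it at random. They then attach the adapters, that is, they make the library.

Then there is the flow cell. It has 8 lanes, and the DNA is loaded through a small port on each lane. Inside each lane, two types of oligos are anchored to the surface. This is why an Illumina run needs an Illumina-specific library: the adapters must be complementary to those oligos. Once loaded, the library molecules flow through the lane and bind when they meet a complementary oligo.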

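If DNA is read only a little at a time, say 150 bp, how can a long genome ever be completed?

As the video shows, we can only read 150 bp twice per fragment. But there are tens or hundreds of millions of such reads. Lining them up against each other is called alignment. However short each read is, there are so many that they pile up without gaps. Once stacked, they can be assembled like a jigsaw puzzle to recover the original DNA sequence. This is the principle of genome assembly. It is also the principle that made the Human Genome Project feasible, which finished in 13 years (1990 to 2003), ahead of its 15-year plan.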
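A quick simulation makes the "many short reads leave no gaps" argument concrete. This is a minimal sketch under simplifying assumptions that are not in the lecture: reads start at uniformly random positions, all have the same length, and the genome is just a range of coordinates (the function name and the numbers are hypothetical).

```python
import random


def covered_fraction(genome_length: int, read_length: int, n_reads: int, seed: int = 0) -> float:
    """Place reads at random start positions and return the fraction of bases covered."""
    rng = random.Random(seed)
    covered = [False] * genome_length
    for _ in range(n_reads):
        # randint is inclusive on both ends, so the read always fits in the genome.
        start = rng.randint(0, genome_length - read_length)
        for pos in range(start, start + read_length):
            covered[pos] = True
    return sum(covered) / genome_length


few = covered_fraction(10_000, 150, 20)      # at most 0.3x average depth
many = covered_fraction(10_000, 150, 2_000)  # about 30x average depth
print(f"20 reads: {few:.1%} covered, 2000 reads: {many:.1%} covered")
```

With only 20 reads most of the toy genome is untouched; with 2,000 reads nearly every position is covered many times over, apart from a few bases at the very ends where fewer reads can start.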

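If assembly works even with reads this short, longer reads should make it work even better. That is why long-read sequencing was developed. Short reads have real limits: the more repetitive sequence a genome contains, the harder it is to assemble from short reads. Think of a jigsaw puzzle. Repeat sequences, whether functional or not, are like identical-looking pieces, and just as a puzzle gets harder as the pieces get smaller, assembly gets harder as the reads get shorter.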
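The repeat problem can be shown with plain string matching. The toy genome below is entirely made up for illustration (hypothetical flanks and repeat unit), and exact matching stands in for a real aligner: a short read that lies inside the repeat fits in several places, while a longer read that reaches into unique flanking sequence fits in exactly one.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)  # allow overlapping matches
    return positions


# Hypothetical toy genome: the same repeat unit appears twice between unique flanks.
repeat = "ACACACACAC"
genome = "GGTTC" + repeat + "TTGAG" + repeat + "CCATG"

short_read = "ACACAC"                    # lies entirely inside the repeat
long_read = "TTGAG" + repeat + "CCATG"   # spans the repeat plus unique flanks

print(find_all(genome, short_read))  # several candidate positions
print(find_all(genome, long_read))   # a single position
```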

PacBio Sequencing

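Reference video: https://www.youtube.com/watch?v=_lD8JyAbwEo

Long-read sequencing was developed for the reasons above. The long-read library also looks different from a short-read library: inserts can be up to about 20 kb, that is, roughly 20,000 bases. The instrument is different too. Where Illumina has a flow cell, PacBio has a SMRT Cell, and its library molecule is called a SMRTbell. The SMRT Cell contains millions of tiny wells, each sized to hold exactly one library molecule, and the wells fill up one molecule at a time. In Illumina, the DNA is fixed to the flow cell and the polymerase moves along it. In PacBio, the polymerase is fixed at the bottom of the well and the circular DNA keeps going around it like a hamster wheel.

PacBio offers two sequencing modes.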

CCS(Circular Consensus Sequencing)

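Produces highly accurate long reads (HiFi reads)
Reads the same DNA molecule multiple times to ensure high accuracy
Errors occur at random, so a position that is wrong in one pass is unlikely to be wrong in the other passes, which is what makes correction possible

Because the library molecule is circular, running around it repeatedly yields overlapping passes that correct each other's errors. It is said that repeating about 90 cycles raises the accuracy to roughly 99%. The technology has since improved further and is now said to match Illumina short reads in accuracy, to the point where the reads can be treated as nearly error-free.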
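The idea that random errors cancel out over repeated passes can be sketched as a per-position majority vote. This is a deliberately simplified illustration, not PacBio's actual consensus algorithm: it assumes substitution errors only and passes of equal length, whereas real data also contain insertions and deletions and need alignment first. The sequences and the `consensus` helper are hypothetical.

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Majority-vote consensus over equal-length reads of the same molecule."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this sketch assumes all passes have the same length")
    # zip(*passes) yields one column per position; keep the most common base in each.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Hypothetical passes over one molecule; each error falls at a different position.
truth = "GATTACAGGC"
passes = [
    "GATTACAGGC",
    "GATAACAGGC",  # error at position 3
    "GATTACTGGC",  # error at position 6
    "CATTACAGGC",  # error at position 0
    "GATTACAGGC",
]
print(consensus(passes) == truth)  # True
```

Each individual pass here is only 90% to 100% correct, yet the vote recovers the true sequence because no two passes are wrong at the same position.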

HiFi sequencing is done in CCS mode!

CLR(Continuous Long Read Sequencing)

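Produces long reads, but as one long continuous sequence per molecule rather than repeated passes
Provides longer sequence information
The template is also capped with hairpin adapters, but the insert is so long that the polymerase cannot keep cycling around it as in CCS.
Reads longer than CCS but cannot read the molecule multiple times. It reads once, which means errors cannot be corrected.
Can read 50 kb or more.

Because of these characteristics, CCS is preferred these days. Rather than running CLR, you are better off using Nanopore: it reads even longer, and the reads contain errors either way.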

Nanopore Sequencing

Reference video: https://www.youtube.com/watch?v=RcP85JHLmnI

The device in the video is very small in real life, about the size of two fingers.

Official name: Oxford Nanopore Technologies (ONT)

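Unlike Illumina and PacBio, a Nanopore library is not sheared to a fixed size. Ideally the DNA is not cut at all, because the longer the molecule, the better. Adapter sequences carrying a motor protein are attached to the DNA, and the motor protein leads the DNA to a nanopore, a nanometer-scale hole that the strand passes through like a train. The catch is that there are not many nanopores. In terms of the number of sites read in parallel, Illumina has the most, followed by PacBio, and Nanopore has the fewest, so its output is inevitably lower. On the other hand, it can read very long molecules: in theory, if a chromosome stayed unbroken from end to end, it could be read in one go. The motor protein is a helicase: it unwinds the double strand and feeds only one strand into the nanopore. As that single strand passes through, the change in ionic current is measured and decoded into bases.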

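If the genome has many repeat sequences, Nanopore is a good choice, because the very long reads assemble well.

If moderately long reads are enough and you want higher accuracy, use PacBio. That is why PacBio is the current favorite.

In the past, Illumina short reads were combined with PacBio or Nanopore reads, an approach called hybrid sequencing. With today's improved technology, this approach is largely no longer used.

The author's impromptu English listening test. Transcripts for reference.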

Illumina

Sample prep

All preparation methods add adapters to the ends of the DNA fragments. Through reduced cycle amplification, additional motifs are introduced, such as the sequencing binding site, indices and regions complementary to the flow cell oligos.

Cluster generation

Clustering is a process where each fragment molecule is isothermally amplified. The flow cell is a glass slide with lanes. Each lane is a channel coated with a lawn, composed of two types of oligos. Hybridization is enabled by the first of the two types of oligos on the surface. This oligo is complementary to the adapter region on one of the fragment strands. A polymerase creates a complement of the hybridized fragment. The double stranded molecule is denatured and the original template is washed away. The strands are clonally amplified through bridge amplification. In this process the strand folds over and the adapter region hybridizes to the second type of oligo on the flow cell. Polymerases generate the complementary strand forming a double stranded bridge. This bridge is denatured, resulting in 2 single stranded copies of the molecule that are tethered to the flow cell. The process is then repeated over and over, and occurs simultaneously for millions of clusters resulting in clonal amplification of all the fragments. After bridge amplification the reverse strands are cleaved and washed off, leaving only the forward strands. The three prime ends are blocked to prevent unwanted priming.

Sequencing

Sequencing begins with the extension of the first sequencing primer to produce the first read. With each cycle, fluorescently tagged nucleotides compete for addition to the growing chain. Only one is incorporated based on the sequence of the template. After the addition of each nucleotide the clusters are excited by a light source and a characteristic fluorescent signal is emitted. This proprietary process is called Sequencing-by-Synthesis. The number of cycles determines the length of the read. The emission wavelength, along with the signal intensity, determines the base call. For a given cluster, all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel process. This image represents a small fraction of the flow cell. After the completion of the first read, the read product is washed away. In this step, the index 1 read primer is introduced and hybridized to the template. The read is generated, similar to the first read. After completion of the index read, the read product is washed off, and the three prime ends of the template are deprotected. The template now folds over and binds the second oligo on the flow cell. Index 2 is read in the same manner as index 1. Polymerases extend the second flow cell oligo forming a double stranded bridge. This double stranded DNA is then linearized and the three prime ends are blocked. The original forward strand is cleaved off and washed away leaving only the reverse strand. Read 2 begins with the introduction of the read 2 sequencing primer. As with read 1, the sequencing steps are repeated until the desired read length is achieved. The read 2 product is then washed away.

Data analysis

This entire process generates millions of reads representing all the fragments. Sequences from pooled sample libraries are separated based on the unique indices introduced during the sample preparation. For each sample, reads with similar stretches of base calls are locally clustered. Forward and reverse reads are paired, creating contiguous sequences. These contiguous sequences are aligned back to the reference genome for variant identification. The paired end information is used to resolve ambiguous alignments. Genomic data can be securely transferred, stored, analyzed and shared in BaseSpace Sequence Hub. Discover the possibilities of Next Generation Sequencing.

PacBio

Introducing the PacBio sequencing system, powered by Single Molecule, Real-Time (SMRT) Sequencing technology.

First, from any sample type, ranging from viruses to vertebrates, DNA or RNA is isolated. Next, a SMRTbell library is created by ligating adapters to the double stranded DNA, creating a circular template. Primer and polymerase are added to the library, which is placed on the instrument for sequencing. At the core of SMRT sequencing is the SMRT Cell, which contains millions of tiny wells called Zero-Mode Waveguides (ZMWs). A single molecule of DNA is immobilized in each ZMW, and as the polymerase incorporates labeled nucleotides, light is emitted. With this approach, nucleotide incorporation is measured in real time. With the two systems, you can optimize your results with two sequencing modes: Circular Consensus Sequencing (CCS) and Continuous Long Read (CLR) Sequencing. Use Circular Consensus Sequencing (CCS) mode to produce highly accurate long reads, known as HiFi reads, or use Continuous Long Read (CLR) Sequencing mode to generate long reads.

Nanopore

MinION, DNA(or RNA), Motor protein, Adapter sequence,

2048 membrane wells, each containing a nanopore

The tether helps ensure that only a single strand enters the nanopore.

Disruption of ionic current

Disruption of ionic current measured in signal trace. 400 bases per sec.

Sequenced strands.
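Taking the 400 bases per second figure from the notes above at face value, a back-of-the-envelope calculation shows how long a single pore is occupied by one molecule. This is a rough sketch only: the constant speed is an assumption, the read lengths are hypothetical examples, and the helper name is made up.

```python
BASES_PER_SECOND = 400  # translocation speed quoted in the notes above


def seconds_to_read(read_length_bases: int, speed: int = BASES_PER_SECOND) -> float:
    """Time for one strand of the given length to pass through a single pore."""
    return read_length_bases / speed


print(seconds_to_read(20_000))     # 50.0 seconds for a 20 kb molecule
print(seconds_to_read(1_000_000))  # 2500.0 seconds (about 42 minutes) for a 1 Mb molecule
```

Since each pore handles one strand at a time, total output scales with the number of pores working in parallel, which fits the earlier point that Nanopore yields less data per run than Illumina.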
