
What's the fastest way to bulk insert a lot of data into SQL Server (C# client)?

yourdevel 2021. 1. 8. 22:27



I'm hitting a performance bottleneck while bulk inserting large amounts of data into a SQL Server 2005 database from a C# client, and I'm looking for ways to speed up the process.

I'm already using SqlClient.SqlBulkCopy (which is based on TDS) to speed up the data transfer across the wire, which helped a lot, but I'm still looking for more.

I have a simple table that looks like this:

 CREATE TABLE [BulkData](
 [ContainerId] [int] NOT NULL,
 [BinId] [smallint] NOT NULL,
 [Sequence] [smallint] NOT NULL,
 [ItemId] [int] NOT NULL,
 [Left] [smallint] NOT NULL,
 [Top] [smallint] NOT NULL,
 [Right] [smallint] NOT NULL,
 [Bottom] [smallint] NOT NULL,
 CONSTRAINT [PKBulkData] PRIMARY KEY CLUSTERED 
 (
  [ContainerId] ASC,
  [BinId] ASC,
  [Sequence] ASC
))

I'm inserting the data in chunks that average about 300 rows, where ContainerId and BinId are constant within each chunk, the Sequence value runs from 0 to n, and the values are pre-sorted by primary key.
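(For orientation, here is a minimal sketch of the kind of chunked SqlBulkCopy call described above; the connection string and the sample values are made up, not the actual code.)

 // Minimal sketch (not the original code): push one pre-sorted ~300-row
 // chunk into BulkData with SqlBulkCopy; connection string and values are fake.
 using System.Data;
 using System.Data.SqlClient;

 class BulkChunkDemo
 {
     static void Main()
     {
         const string connectionString =
             "Data Source=.;Initial Catalog=BulkDb;Integrated Security=SSPI";

         // DataTable mirroring the BulkData schema (int -> int, smallint -> short).
         var chunk = new DataTable("BulkData");
         chunk.Columns.Add("ContainerId", typeof(int));
         chunk.Columns.Add("BinId", typeof(short));
         chunk.Columns.Add("Sequence", typeof(short));
         chunk.Columns.Add("ItemId", typeof(int));
         chunk.Columns.Add("Left", typeof(short));
         chunk.Columns.Add("Top", typeof(short));
         chunk.Columns.Add("Right", typeof(short));
         chunk.Columns.Add("Bottom", typeof(short));

         // ContainerId/BinId constant per chunk, Sequence running 0..n.
         for (short seq = 0; seq < 300; seq++)
             chunk.Rows.Add(1, (short)1, seq, 42, (short)0, (short)0, (short)10, (short)10);

         using (var bulk = new SqlBulkCopy(connectionString))
         {
             bulk.DestinationTableName = "[BulkData]";
             bulk.WriteToServer(chunk);
         }
     }
 }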

The % Disk Time performance counter spends a lot of its time at 100%, so it's clear that disk I/O is the main issue, but the rates I'm getting are still several times lower than a raw file copy.

Would any of the following help?

  1. Drop the primary key while inserting and recreate it later.
  2. Insert into a temporary table with the same schema and periodically transfer the rows into the main table, keeping the table the inserts land in small (see the staging sketch after this list).
  3. Anything else?
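(For option 2, a hedged sketch of what the staging flow might look like from the client; the BulkDataStaging table and the INSERT ... SELECT are assumptions for illustration, reusing connectionString and chunk from the sketch above.)

 // Hypothetical staging flow: bulk copy into a small staging table,
 // then move the rows into BulkData with one set-based statement.
 using (var conn = new SqlConnection(connectionString))
 {
     conn.Open();

     using (var bulk = new SqlBulkCopy(conn))
     {
         bulk.DestinationTableName = "[BulkDataStaging]"; // same schema as BulkData
         bulk.WriteToServer(chunk);
     }

     const string moveSql =
         "INSERT INTO [BulkData] " +
         "SELECT [ContainerId],[BinId],[Sequence],[ItemId],[Left],[Top],[Right],[Bottom] " +
         "FROM [BulkDataStaging]; " +
         "TRUNCATE TABLE [BulkDataStaging];";
     using (var move = new SqlCommand(moveSql, conn))
         move.ExecuteNonQuery();
 }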

- Based on the answers I've received, let me clarify a little:

Portman: I'm using a clustered index because once the data is all imported, I will need to access it sequentially in that order. I don't particularly need the index while importing the data. Is there any benefit to having a nonclustered PK index while doing the inserts, as opposed to dropping the constraint entirely for the import?

Chopeen: The data is being generated remotely on many other machines (my SQL server can currently handle only about 10 of them, but I would love to be able to add more). It's not practical to run the entire process on the local machine, because it would then have to process 50 times as much input data to generate the output.

Jason: I'm not running any concurrent queries against the table during the import process. I'll try dropping the primary key and see if that helps.


Here is how you can disable/enable indexes in SQL Server:

--Disable Index
ALTER INDEX [IX_Users_UserID] ON SalesDB.Users DISABLE
GO

--Enable Index
ALTER INDEX [IX_Users_UserID] ON SalesDB.Users REBUILD
GO

Here are some resources to help you find a solution:

Bulk Loading Speed Comparisons

Using SqlBulkCopy to Quickly Load Data from the Client to SQL Server

Optimizing Bulk Copy Performance

Definitely look into the NOCHECK and TABLOCK options:

Table Hints (Transact-SQL)

INSERT (Transact-SQL)


You're already using SqlBulkCopy, which is a good start.

However, just using the SqlBulkCopy class does not necessarily mean that SQL will actually perform a bulk copy. In particular, there are a few requirements that must be met for SQL Server to perform an efficient bulk insert.
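(As an illustration, not the author's code: on the client side the main lever is SqlBulkCopyOptions.TableLock, which requests the bulk-update table lock that minimally logged loads generally need; server-side conditions, such as a non-FULL recovery model, still apply. Names reused from the earlier sketch.)

 // Sketch: ask for a table lock so SQL Server has a chance to minimally
 // log the load; BatchSize = 0 sends everything as a single batch.
 using (var bulk = new SqlBulkCopy(connectionString, SqlBulkCopyOptions.TableLock))
 {
     bulk.DestinationTableName = "[BulkData]";
     bulk.BatchSize = 0;
     bulk.BulkCopyTimeout = 0; // no timeout for a long-running load
     bulk.WriteToServer(chunk);
 }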

Further reading:

Out of curiosity, why is your index set up like that? It seems like ContainerId/BinId/Sequence is much better suited to be a nonclustered index. Is there a particular reason you wanted this index to be clustered?


My guess is that you'll see a dramatic improvement if you change that index to be nonclustered. This leaves you with two options:

  1. Change the index to nonclustered, and leave it as a heap table, without a clustered index
  2. Change the index to nonclustered, but then add a surrogate key (like "id") and make it an identity, primary key, and clustered index

Either one will speed up your inserts without noticeably slowing down your reads.

Think about it this way -- right now, you're telling SQL to do a bulk insert, but then you're asking SQL to reorder the entire table every time you add anything. With a nonclustered index, you'll add the records in whatever order they come in, and then build a separate index indicating their desired order.
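(A hedged sketch of that second option, run from the client; the new column, constraint, and index names are invented for illustration.)

 // Swap the clustered PK for a surrogate identity key and recreate the
 // old key columns as a nonclustered index (names are made up).
 const string ddl =
     "ALTER TABLE [BulkData] DROP CONSTRAINT [PKBulkData]; " +
     "ALTER TABLE [BulkData] ADD [Id] INT IDENTITY(1,1) NOT NULL " +
     "    CONSTRAINT [PK_BulkData_Id] PRIMARY KEY CLUSTERED; " +
     "CREATE NONCLUSTERED INDEX [IX_BulkData_Key] " +
     "    ON [BulkData] ([ContainerId], [BinId], [Sequence]);";

 using (var conn = new SqlConnection(connectionString))
 {
     conn.Open();
     using (var cmd = new SqlCommand(ddl, conn))
         cmd.ExecuteNonQuery();
 }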


Have you tried using transactions?

From what you describe, with the server committing to disk 100% of the time, it sounds like you are sending each row of data in its own atomic SQL statement, forcing the server to commit (write to disk) every single row.

If you used transactions instead, the server would only commit once at the end of the transaction.
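(For instance, SqlBulkCopy can take an explicit SqlTransaction, or you can let its BatchSize property control how many rows are committed at a time; a sketch, reusing the hypothetical names from above.)

 // Sketch: one explicit transaction around the whole load, so the server
 // commits once at the end instead of once per statement.
 using (var conn = new SqlConnection(connectionString))
 {
     conn.Open();
     using (var tx = conn.BeginTransaction())
     {
         using (var bulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.Default, tx))
         {
             bulk.DestinationTableName = "[BulkData]";
             bulk.WriteToServer(chunk);
         }
         tx.Commit();
     }
 }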

For further help: what method are you using to insert the data into the server? Updating a DataTable using a DataAdapter, or executing each statement as a string?


BCP - it's a pain to set up, but it's been around since the dawn of DBs and it's very very quick.

Unless you're inserting the data in that order, the three-part index will really slow things down. Applying it later will also take real time, but that happens in a second step.

Compound keys in SQL are always quite slow: the bigger the key, the slower.


I'm not really a bright guy and I don't have a lot of experience with the SqlClient.SqlBulkCopy method but here's my 2 cents for what it's worth. I hope it helps you and others (or at least causes people to call out my ignorance ;).

You will never match a raw file copy speed unless your database data file (mdf) is on a separate physical disk from your transaction log file (ldf). Additionally, any clustered indexes would also need to be on a separate physical disk for a fairer comparison.

Your raw copy isn't logging anything, or maintaining a sort order over selected fields (columns) for indexing purposes.

I agree with Portman about creating a surrogate identity key as the clustered index and changing your existing clustered index into a nonclustered one.

As for which construct you're using on the client (DataAdapter, DataSet, DataTable, etc.): if your disk I/O on the server is at 100%, I don't think your time is best spent analyzing client constructs, since they already appear to be faster than the server can currently handle.

If you follow Portman's links about minimal logging, I wouldn't expect surrounding your bulk copies in transactions to help much, if at all, but I've been wrong many times in my life ;)

This won't necessarily help you right now but if you figure out your current issue, this next comment might help with the next bottleneck (network throughput) - especially if it's over the Internet...

Chopeen asked an interesting question too: how did you arrive at chunks of 300 records? SQL Server has a default packet size (I believe it is 4096 bytes), and it would make sense to work out the size of your records and make sure you're making efficient use of the packets travelling between client and server. (Note that you can change the packet size in your client code, as opposed to the server option, which would obviously change it for all server communications - probably not a good idea.) For instance, if your record size means that a 300-record batch requires 4500 bytes, you will send 2 packets, with the second packet mostly wasted. If the batch record count was assigned arbitrarily, it might make sense to do some quick, easy math.

From what I can tell (and remember about data type sizes), you have exactly 20 bytes per record (if int = 4 bytes and smallint = 2 bytes). If you're using 300-record batches, you're sending 300 x 20 = 6,000 bytes (plus, I'm guessing, a little overhead for the connection, etc.). You might be more efficient sending these in 200-record batches (200 x 20 = 4,000 bytes + room for overhead = 1 packet). Then again, your bottleneck still appears to be the server's disk I/O.
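(If you want to experiment with this, the packet size can be set per connection from the client code; a sketch with made-up values, reusing the earlier imports.)

 // Sketch: raise the TDS packet size for this one connection instead of
 // changing the server-wide "network packet size" option.
 var builder = new SqlConnectionStringBuilder
 {
     DataSource = ".",
     InitialCatalog = "BulkDb",
     IntegratedSecurity = true,
     PacketSize = 8000 // bytes; size batches so packets aren't mostly wasted
 };
 using (var conn = new SqlConnection(builder.ConnectionString))
 {
     conn.Open();
     System.Console.WriteLine(conn.PacketSize); // packet size actually negotiated
 }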

I realize you're comparing a raw data transfer to SqlBulkCopy on the same hardware/configuration, but here's where I would go next if the challenge were mine:

This post probably won't help you anymore as it's rather old, but I would next ask what your disks' RAID configuration is and what speed of disks you're using. Try putting the log file on a drive that uses RAID 10, with RAID 5 (ideally 1) for your data file. This can help reduce a lot of spindle movement to different sectors on the disk, resulting in more time spent reading/writing instead of in the unproductive "moving" state. If you already separate your data and log files, do you have your indexes on a different physical disk drive from your data file? (You can only do this with nonclustered indexes.) That would allow not only logging information to be updated concurrently with the data inserts, but would let the index inserts (and any costly index page operations) happen concurrently as well.


It sounds like this could be done using SSIS packages. They're similar to SQL 2000's DTS packages. I've used them to successfully transform everything from plain-text CSV files, to existing SQL tables, to XLS files with six-digit row counts spread across multiple worksheets. You could use C# to transform the data into an importable format (CSV, XLS, etc.), then have your SQL server run a scheduled SSIS job to import the data.
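(A tiny sketch of the C#-to-CSV step that idea implies; the output path and the bare comma-joining, with no quoting or escaping, are simplifications, and chunk comes from the first sketch.)

 // Sketch: dump a chunk to a CSV file that a scheduled SSIS (or BULK INSERT)
 // job can pick up later; path is hypothetical, no quoting/escaping handled.
 using (var writer = new System.IO.StreamWriter(@"C:\import\bulkdata.csv"))
 {
     foreach (DataRow row in chunk.Rows)
         writer.WriteLine(string.Join(",",
             System.Array.ConvertAll(row.ItemArray, v => v.ToString())));
 }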

It's pretty easy to create an SSIS package; there's a wizard built into SQL Server's Enterprise Manager tool (labeled "Import Data", I think), and at the end of the wizard it gives you the option of saving it as an SSIS package. There's a bunch more info on Technet as well.


Yes, your ideas will help.
Lean on option 1 if there are no reads happening while you're loading.
Lean on option 2 if your destination table is being queried during your processing.

@Andrew
Question: you're inserting in chunks of 300. What is the total amount you're inserting? SQL Server should be able to handle 300 plain old inserts very fast.

ReferenceURL : https://stackoverflow.com/questions/24200/whats-the-fastest-way-to-bulk-insert-a-lot-of-data-in-sql-server-c-client
