Creating batches in LINQ

yourdevel 2020. 9. 25. 23:43

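Can someone suggest a way to create batches of a certain size in LINQ?

Ideally, I want to be able to perform operations in chunks of some configurable amount.

You don't need to write any code. Use the MoreLINQ Batch method, which batches the source sequence into sized buckets (MoreLINQ is available as a NuGet package you can install):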

int size = 10;
var batches = sequence.Batch(size);

It is implemented as follows:

public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
                  this IEnumerable<TSource> source, int size)
{
    TSource[] bucket = null;
    var count = 0;

    foreach (var item in source)
    {
        if (bucket == null)
            bucket = new TSource[size];

        bucket[count++] = item;
        if (count != size)
            continue;

        yield return bucket;

        bucket = null;
        count = 0;
    }

    if (bucket != null && count > 0)
        yield return bucket.Take(count);
}
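
If adding a package is not an option and the project targets .NET 6 or later, the framework's own Enumerable.Chunk appears to cover the same need. The sketch below is not part of the original answer; the ChunkDemo class and its Describe helper are made-up names for illustration. Chunk returns arrays (IEnumerable<T[]>), so each batch is buffered in memory, much like the MoreLINQ buckets.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ChunkDemo
{
    // Hypothetical helper: formats each batch as a comma-separated line.
    public static List<string> Describe(IEnumerable<int> sequence, int size)
    {
        // Enumerable.Chunk (System.Linq, .NET 6+) yields int[] buckets of at most `size` items.
        return sequence.Chunk(size)
                       .Select(bucket => string.Join(",", bucket))
                       .ToList();
    }

    public static void Main()
    {
        foreach (var line in Describe(Enumerable.Range(0, 10), 3))
            Console.WriteLine(line); // 0,1,2 / 3,4,5 / 6,7,8 / 9
    }
}
```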

public static class MyExtensions
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> items,
                                                       int maxItems)
    {
        return items.Select((item, inx) => new { item, inx })
                    .GroupBy(x => x.inx / maxItems)
                    .Select(g => g.Select(x => x.item));
    }
}

And the usage would be:

List<int> list = new List<int>() { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };

foreach(var batch in list.Batch(3))
{
    Console.WriteLine(String.Join(",",batch));
}

Output:

0,1,2
3,4,5
6,7,8
9

All of the above perform terribly with large batches or when memory is tight. I had to write my own version that pipelines the items (note that nothing is accumulated anywhere):

public static class BatchLinq {
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size) {
        if (size <= 0)
            throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");

        using (IEnumerator<T> enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
                yield return TakeIEnumerator(enumerator, size);
    }

    private static IEnumerable<T> TakeIEnumerator<T>(IEnumerator<T> source, int size) {
        int i = 0;
        do
            yield return source.Current;
        while (++i < size && source.MoveNext());
    }
}

EDIT: A known issue with this approach is that each batch must be enumerated, and enumerated fully, before moving on to the next batch. For example, this does not work:

//Select first item of every 100 items
Batch(list, 100).Select(b => b.First())
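
To make the failure mode concrete, here is a small self-contained sketch. It repeats the pipelined Batch from above (with the helper renamed to TakeFromEnumerator, and a batch size of 3 on ten items, both chosen only for illustration) and compares a consumer that drains every batch with one that reads only the first item.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PipelinedBatch
{
    // Same shape as the pipelined Batch above: every batch reads from one shared enumerator.
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0)
            throw new ArgumentOutOfRangeException(nameof(size), "Must be greater than zero.");

        using (IEnumerator<T> enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
                yield return TakeFromEnumerator(enumerator, size);
    }

    private static IEnumerable<T> TakeFromEnumerator<T>(IEnumerator<T> source, int size)
    {
        int i = 0;
        do
            yield return source.Current;
        while (++i < size && source.MoveNext());
    }
}

public static class PitfallDemo
{
    public static void Main()
    {
        var list = Enumerable.Range(1, 10).ToList();

        // Each batch is drained before the next one is requested: the batches are as expected.
        var drained = list.Batch(3).Select(b => b.Sum()).ToList();
        Console.WriteLine(string.Join(",", drained));   // 6,15,24,10

        // Only the first item of each batch is read, so the shared enumerator
        // advances one item per "batch" instead of three.
        var firsts = list.Batch(3).Select(b => b.First()).ToList();
        Console.WriteLine(string.Join(",", firsts));    // 1,2,3,4,5,6,7,8,9,10
    }
}
```

The second query returns every element rather than 1,4,7,10, which is the behavior the edit note warns about.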

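If you start with sequence defined as an IEnumerable<T>, and you know that it can safely be enumerated multiple times (for example, because it is an array or a list), you can use this simple pattern to process the elements in batches: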

while (sequence.Any())
{
    var batch = sequence.Take(10);
    sequence = sequence.Skip(10);

    // do whatever you need to do with each batch here
}
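
As a runnable sketch of that pattern (the SkipTakeLoopDemo class, the sample list, and the batch size of 3 are arbitrary choices for illustration): every Any() and Take call starts a fresh enumeration of the sequence, which is why the source must be safe to enumerate repeatedly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SkipTakeLoopDemo
{
    // Hypothetical helper: collects each batch as a comma-separated string.
    public static List<string> Run(IEnumerable<int> sequence, int batchSize)
    {
        var lines = new List<string>();
        while (sequence.Any())
        {
            // Take is bound to the current sequence object before it is reassigned below.
            var batch = sequence.Take(batchSize);
            sequence = sequence.Skip(batchSize);

            lines.Add(string.Join(",", batch));
        }
        return lines;
    }

    public static void Main()
    {
        foreach (var line in Run(new List<int> { 1, 2, 3, 4, 5, 6, 7 }, 3))
            Console.WriteLine(line); // 1,2,3 / 4,5,6 / 7
    }
}
```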

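This is a fully lazy, low-overhead, single-function implementation of Batch that does no accumulation. It is based on Nick Whaley's solution (and fixes issues in it), with help from EricRoller.

Iteration comes directly from the underlying IEnumerable, so elements must be enumerated in strict order and accessed no more than once. If some elements are not consumed in an inner loop, they are discarded (and trying to access them again through a saved iterator throws InvalidOperationException: Enumeration already finished.).

You can test a complete sample on .NET Fiddle.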

public static class BatchLinq
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0)
            throw new ArgumentOutOfRangeException("size", "Must be greater than zero.");
        using (var enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
            {
                int i = 0;
                // Batch is a local function closing over `i` and `enumerator` that
                // executes the inner batch enumeration
                IEnumerable<T> Batch()
                {
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }

                yield return Batch();
                while (++i < size && enumerator.MoveNext()); // discard skipped items
            }
    }
}

Same approach as MoreLINQ, but using a List instead of an Array. I haven't done any benchmarking, but readability matters more to some people:

    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        List<T> batch = new List<T>();

        foreach (var item in source)
        {
            batch.Add(item);

            if (batch.Count >= size)
            {
                yield return batch;
                // Start a fresh list: Clear() would also empty the batch just handed to the caller.
                batch = new List<T>(size);
            }
        }

        if (batch.Count > 0)
        {
            yield return batch;
        }
    }

I am joining this very late, but I found something more interesting.

We can use Skip and Take here for better performance.

public static class MyExtensions
    {
        public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> items, int maxItems)
        {
            return items.Select((item, index) => new { item, index })
                        .GroupBy(x => x.index / maxItems)
                        .Select(g => g.Select(x => x.item));
        }

        public static IEnumerable<T> Batch2<T>(this IEnumerable<T> items, int skip, int take)
        {
            return items.Skip(skip).Take(take);
        }

    }

Next, I checked this with 100000 records. Looping takes more time only in the case of Batch.

Console application code:

static void Main(string[] args)
{
    List<string> Ids = GetData("First");
    List<string> Ids2 = GetData("tsriF");

    Stopwatch FirstWatch = new Stopwatch();
    FirstWatch.Start();
    foreach (var batch in Ids2.Batch(5000))
    {
        // Console.WriteLine("Batch Ouput:= " + string.Join(",", batch));
    }
    FirstWatch.Stop();
    Console.WriteLine("Done Processing time taken:= "+ FirstWatch.Elapsed.ToString());


    Stopwatch Second = new Stopwatch();

    Second.Start();
    int Length = Ids2.Count;
    int StartIndex = 0;
    int BatchSize = 5000;
    while (Length > 0)
    {
        var SecBatch = Ids2.Batch2(StartIndex, BatchSize);
        // Console.WriteLine("Second Batch Ouput:= " + string.Join(",", SecBatch));
        Length = Length - BatchSize;
        StartIndex += BatchSize;
    }

    Second.Stop();
    Console.WriteLine("Done Processing time taken Second:= " + Second.Elapsed.ToString());
    Console.ReadKey();
}

static List<string> GetData(string name)
{
    List<string> Data = new List<string>();
    for (int i = 0; i < 100000; i++)
    {
        Data.Add(string.Format("{0} {1}", name, i.ToString()));
    }

    return Data;
}

The time taken is as follows.

First: 00:00:00.0708, 00:00:00.0660

Second (the Skip and Take one): 00:00:00.0008, 00:00:00.0008
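
One caveat when reading these numbers: Skip and Take use deferred execution, and the second loop in the console code never enumerates SecBatch (the WriteLine is commented out), so it mostly measures the cost of creating iterator objects. The foreach over Batch, by contrast, forces GroupBy to read the whole list. The following sketch makes the deferral visible; CountingSource and ItemsRead are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    public static int ItemsRead; // how many items the source has handed out so far

    // Hypothetical source that counts every item it yields.
    public static IEnumerable<int> CountingSource(int count)
    {
        for (int i = 0; i < count; i++)
        {
            ItemsRead++;
            yield return i;
        }
    }

    public static void Main()
    {
        var batch = CountingSource(100000).Skip(5000).Take(5000);
        Console.WriteLine(ItemsRead);          // 0: nothing has been enumerated yet

        var materialized = batch.ToList();     // enumeration happens here
        Console.WriteLine(materialized.Count); // 5000
        Console.WriteLine(ItemsRead);          // 10000: 5000 skipped + 5000 taken
    }
}
```

A fairer comparison would materialize every batch in both loops (for example with ToList()) before stopping the stopwatch.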

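So with a functional hat on, this looks trivial, but in C# there are some significant downsides.

You would probably view this as an unfold of IEnumerable (search for it and you will probably end up in some Haskell documentation, which should still make sense).

Unfold is related to fold ("aggregate"), except that rather than iterating through the input IEnumerable, it iterates through the output data structures (a relationship similar to the one between IEnumerable and IObservable; in fact, I think IObservable implements an "unfold" called generate).

Anyway, first you need an unfold method. I think this works (unfortunately it will eventually blow the stack for large "lists"; you can write this safely in F# using yield! rather than concat):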

    static IEnumerable<T> Unfold<T, U>(Func<U, IEnumerable<Tuple<U, T>>> f, U seed)
    {
        var maybeNewSeedAndElement = f(seed);

        return maybeNewSeedAndElement.SelectMany(x => new[] { x.Item2 }.Concat(Unfold(f, x.Item1)));
    }

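This is a bit obtuse because C# doesn't implement some of the things functional languages take for granted, but it basically takes a seed and then generates a "Maybe" answer of the next element in the IEnumerable and the next seed (Maybe doesn't exist in C#, so we have used IEnumerable to fake it), and concatenates the rest of the answer (I can't vouch for the "O(n?)" complexity of this).

Once you have done that: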

    static IEnumerable<IEnumerable<T>> Batch<T>(IEnumerable<T> xs, int n)
    {
        return Unfold(ys =>
            {
                var head = ys.Take(n);
                var tail = ys.Skip(n);
                return head.Take(1).Select(_ => Tuple.Create(tail, head));
            },
            xs);
    }

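It all looks quite clean: you take "n" elements as the "next" element in the IEnumerable, and the "tail" is the rest of the unprocessed list.

If there is nothing in the head, you are done, and you return "Nothing" (but faked as an empty IEnumerable<Tuple<U, T>>). Otherwise you return the head element and the tail to process.

You can probably do this using IObservable; there is probably a "Batch"-like method already there, and you could probably use that.

If the risk of stack overflows worries you (it probably should), then you should implement this in F# (and there is probably some F# library (FSharpX?) that already has this).

(I have only done some rudimentary tests of this, so there may be odd bugs in there.)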
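
For readers who want to stay in C#, one possible way around the recursion inside Unfold is to write it as a loop. This is only a sketch under an assumption the recursive version does not make: f is expected to return either an empty sequence ("Nothing") or a single (next seed, element) tuple, so only the first tuple is used. The class name UnfoldSketch is made up for illustration. It removes the recursion in Unfold itself; the chained Skip calls in Batch remain and carry their own cost.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UnfoldSketch
{
    // Iterative unfold. Assumption: f returns either an empty sequence ("Nothing")
    // or a sequence whose first tuple is (next seed, next element).
    public static IEnumerable<T> Unfold<T, U>(Func<U, IEnumerable<Tuple<U, T>>> f, U seed)
    {
        while (true)
        {
            var step = f(seed).FirstOrDefault(); // null when f signals "Nothing"
            if (step == null)
                yield break;

            yield return step.Item2;
            seed = step.Item1;
        }
    }

    // Same Batch definition as above, now driven by the iterative Unfold.
    public static IEnumerable<IEnumerable<T>> Batch<T>(IEnumerable<T> xs, int n)
    {
        return Unfold(ys =>
            {
                var head = ys.Take(n);
                var tail = ys.Skip(n);
                return head.Take(1).Select(_ => Tuple.Create(tail, head));
            },
            xs);
    }

    public static void Main()
    {
        foreach (var batch in Batch(Enumerable.Range(1, 10), 4))
            Console.WriteLine(string.Join(",", batch)); // 1,2,3,4 / 5,6,7,8 / 9,10
    }
}
```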

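I wrote a custom IEnumerable implementation that works without LINQ and guarantees a single enumeration over the data. It also does all this without backing lists or arrays, which cause memory explosions over large data sets.

Here are some basic tests: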

    [Fact]
    public void ShouldPartition()
    {
        var ints = new List<int> {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        var data = ints.PartitionByMaxGroupSize(3);
        data.Count().Should().Be(4);

        data.Skip(0).First().Count().Should().Be(3);
        data.Skip(0).First().ToList()[0].Should().Be(0);
        data.Skip(0).First().ToList()[1].Should().Be(1);
        data.Skip(0).First().ToList()[2].Should().Be(2);

        data.Skip(1).First().Count().Should().Be(3);
        data.Skip(1).First().ToList()[0].Should().Be(3);
        data.Skip(1).First().ToList()[1].Should().Be(4);
        data.Skip(1).First().ToList()[2].Should().Be(5);

        data.Skip(2).First().Count().Should().Be(3);
        data.Skip(2).First().ToList()[0].Should().Be(6);
        data.Skip(2).First().ToList()[1].Should().Be(7);
        data.Skip(2).First().ToList()[2].Should().Be(8);

        data.Skip(3).First().Count().Should().Be(1);
        data.Skip(3).First().ToList()[0].Should().Be(9);
    }

This is the extension method that partitions the data:

/// <summary>
/// A set of extension methods for <see cref="IEnumerable{T}"/>. 
/// </summary>
public static class EnumerableExtender
{
    /// <summary>
    /// Splits an enumerable into chunks, by a maximum group size.
    /// </summary>
    /// <param name="source">The source to split</param>
    /// <param name="maxSize">The maximum number of items per group.</param>
    /// <typeparam name="T">The type of item to split</typeparam>
    /// <returns>A list of lists of the original items.</returns>
    public static IEnumerable<IEnumerable<T>> PartitionByMaxGroupSize<T>(this IEnumerable<T> source, int maxSize)
    {
        return new SplittingEnumerable<T>(source, maxSize);
    }
}

This is the implementing class:

    using System.Collections;
    using System.Collections.Generic;

    internal class SplittingEnumerable<T> : IEnumerable<IEnumerable<T>>
    {
        private readonly IEnumerable<T> backing;
        private readonly int maxSize;
        private bool hasCurrent;
        private T lastItem;

        public SplittingEnumerable(IEnumerable<T> backing, int maxSize)
        {
            this.backing = backing;
            this.maxSize = maxSize;
        }

        public IEnumerator<IEnumerable<T>> GetEnumerator()
        {
            return new Enumerator(this, this.backing.GetEnumerator());
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return this.GetEnumerator();
        }

        private class Enumerator : IEnumerator<IEnumerable<T>>
        {
            private readonly SplittingEnumerable<T> parent;
            private readonly IEnumerator<T> backingEnumerator;
            private NextEnumerable current;

            public Enumerator(SplittingEnumerable<T> parent, IEnumerator<T> backingEnumerator)
            {
                this.parent = parent;
                this.backingEnumerator = backingEnumerator;
                this.parent.hasCurrent = this.backingEnumerator.MoveNext();
                if (this.parent.hasCurrent)
                {
                    this.parent.lastItem = this.backingEnumerator.Current;
                }
            }

            public bool MoveNext()
            {
                if (this.current == null)
                {
                    this.current = new NextEnumerable(this.parent, this.backingEnumerator);
                    return true;
                }
                else
                {
                    if (!this.current.IsComplete)
                    {
                        using (var enumerator = this.current.GetEnumerator())
                        {
                            while (enumerator.MoveNext())
                            {
                            }
                        }
                    }
                }

                if (!this.parent.hasCurrent)
                {
                    return false;
                }

                this.current = new NextEnumerable(this.parent, this.backingEnumerator);
                return true;
            }

            public void Reset()
            {
                throw new System.NotImplementedException();
            }

            public IEnumerable<T> Current
            {
                get { return this.current; }
            }

            object IEnumerator.Current
            {
                get { return this.Current; }
            }

            public void Dispose()
            {
            }
        }

        private class NextEnumerable : IEnumerable<T>
        {
            private readonly SplittingEnumerable<T> splitter;
            private readonly IEnumerator<T> backingEnumerator;
            private int currentSize;

            public NextEnumerable(SplittingEnumerable<T> splitter, IEnumerator<T> backingEnumerator)
            {
                this.splitter = splitter;
                this.backingEnumerator = backingEnumerator;
            }

            public bool IsComplete { get; private set; }

            public IEnumerator<T> GetEnumerator()
            {
                return new NextEnumerator(this.splitter, this, this.backingEnumerator);
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return this.GetEnumerator();
            }

            private class NextEnumerator : IEnumerator<T>
            {
                private readonly SplittingEnumerable<T> splitter;
                private readonly NextEnumerable parent;
                private readonly IEnumerator<T> enumerator;
                private T currentItem;

                public NextEnumerator(SplittingEnumerable<T> splitter, NextEnumerable parent, IEnumerator<T> enumerator)
                {
                    this.splitter = splitter;
                    this.parent = parent;
                    this.enumerator = enumerator;
                }

                public bool MoveNext()
                {
                    this.parent.currentSize += 1;
                    this.currentItem = this.splitter.lastItem;
                    var hasCurrent = this.splitter.hasCurrent;

                    this.parent.IsComplete = this.parent.currentSize > this.splitter.maxSize;

                    if (this.parent.IsComplete)
                    {
                        return false;
                    }

                    if (hasCurrent)
                    {
                        var result = this.enumerator.MoveNext();

                        this.splitter.lastItem = this.enumerator.Current;
                        this.splitter.hasCurrent = result;
                    }

                    return hasCurrent;
                }

                public void Reset()
                {
                    throw new System.NotImplementedException();
                }

                public T Current
                {
                    get { return this.currentItem; }
                }

                object IEnumerator.Current
                {
                    get { return this.Current; }
                }

                public void Dispose()
                {
                }
            }
        }
    }

I know everybody has used complex systems to do this work, and I really don't get why. Take and Skip allow all of these operations using the common Select with the Func<TSource, Int32, TResult> transform function. Like:

public IEnumerable<IEnumerable<T>> Buffer<T>(IEnumerable<T> source, int size)=>
    source.Select((item, index) => source.Skip(size * index).Take(size)).TakeWhile(bucket => bucket.Any());

Another one-line implementation. It works even with an empty list; in that case you get a batch collection of size zero.

var aList = Enumerable.Range(1, 100).ToList(); //a given list
var size = 9; //the wanted batch size
//number of batches are: (aList.Count() + size - 1) / size;

var batches = Enumerable.Range(0, (aList.Count() + size - 1) / size).Select(i => aList.GetRange( i * size, Math.Min(size, aList.Count() - i * size)));

Assert.True(batches.Count() == 12);
Assert.AreEqual(batches.ToList().ElementAt(0), new List<int>() { 1, 2, 3, 4, 5, 6, 7, 8, 9 });
Assert.AreEqual(batches.ToList().ElementAt(1), new List<int>() { 10, 11, 12, 13, 14, 15, 16, 17, 18 });
Assert.AreEqual(batches.ToList().ElementAt(11), new List<int>() { 100 });

Another way is to use the Rx Buffer operator:

//using System.Linq;
//using System.Reactive.Linq;
//using System.Reactive.Threading.Tasks;

var observableBatches = anEnumerable.ToObservable().Buffer(size);

var batches = aList.ToObservable().Buffer(size).ToList().ToTask().GetAwaiter().GetResult();

I wonder why nobody has posted an old-school for-loop solution. Here is one:

List<int> source = Enumerable.Range(1,23).ToList();
int batchsize = 10;
for (int i = 0; i < source.Count; i+= batchsize)
{
    var batch = source.Skip(i).Take(batchsize);
}

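This simplicity is possible because the Take method:

... enumerates source and yields elements until count elements have been yielded or source contains no more elements. If count exceeds the number of elements in source, all elements of source are returned.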
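
A short sketch of that behavior, reusing the 23-item list and batch size of 10 from the loop above (the TakeDemo class and its BatchSizes helper are hypothetical names for illustration): the last batch simply comes back shorter, and nothing throws.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TakeDemo
{
    // Hypothetical helper mirroring the for loop: returns the size of every batch.
    public static List<int> BatchSizes(List<int> source, int batchSize)
    {
        var sizes = new List<int>();
        for (int i = 0; i < source.Count; i += batchSize)
            sizes.Add(source.Skip(i).Take(batchSize).Count());
        return sizes;
    }

    public static void Main()
    {
        var source = Enumerable.Range(1, 23).ToList();
        Console.WriteLine(string.Join(",", BatchSizes(source, 10)));   // 10,10,3

        // Asking for 10 items when only 3 remain does not throw.
        Console.WriteLine(string.Join(",", source.Skip(20).Take(10))); // 21,22,23
    }
}
```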

Here is an attempted improvement on Nick Whaley's and infogulch's lazy Batch implementations. This one is strict: you either enumerate the batches in the correct order, or you get an exception.

public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
    this IEnumerable<TSource> source, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    using (var enumerator = source.GetEnumerator())
    {
        int i = 0;
        while (enumerator.MoveNext())
        {
            if (i % size != 0) throw new InvalidOperationException(
                "The enumeration is out of order.");
            i++;
            yield return GetBatch();
        }
        IEnumerable<TSource> GetBatch()
        {
            while (true)
            {
                yield return enumerator.Current;
                if (i % size == 0 || !enumerator.MoveNext()) break;
                i++;
            }
        }
    }
}

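And here is a lazy Batch implementation for sources of type IList<T>. This one imposes no restrictions on the enumeration: the batches can be enumerated partially, in any order, and more than once. The restriction of not modifying the collection during the enumeration still applies, though. This is achieved by making a dummy call to enumerator.MoveNext() before yielding any chunk or element. The downside is that the enumerator is left undisposed, since it is unknown when the enumeration is going to finish.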

public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
    this IList<TSource> source, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    var enumerator = source.GetEnumerator();
    for (int i = 0; i < source.Count; i += size)
    {
        enumerator.MoveNext();
        yield return GetChunk(i, Math.Min(i + size, source.Count));
    }
    IEnumerable<TSource> GetChunk(int from, int toExclusive)
    {
        for (int j = from; j < toExclusive; j++)
        {
            enumerator.MoveNext();
            yield return source[j];
        }
    }
}

    static IEnumerable<IEnumerable<T>> TakeBatch<T>(IEnumerable<T> ts,int batchSize)
    {
        return from @group in ts.Select((x, i) => new { x, i }).ToLookup(xi => xi.i / batchSize)
               select @group.Select(xi => xi.x);
    }

Reference URL: https://stackoverflow.com/questions/13731796/create-batches-in-linq
