Fastest possible grep

yourdevel 2020. 10. 6. 19:33

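I'd like to know if there are any tips for making grep as fast as possible. I have a rather large base of text files to search in the quickest way possible. I've made them all lowercase so that I could get rid of the -i option. This makes the search much faster.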

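Also, I've found that the -F and -P modes are quicker than the default one. I use the former when the search string is not a regular expression (just plain text), and the latter when a regex is involved.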
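To make the difference between the modes concrete, here is a minimal sketch. The sample file and its contents are made up for illustration, and it says nothing about speed, only about what each mode matches:

```shell
# Hypothetical sample data: three lines, only one contains the literal "a.b".
sample=$(mktemp)
printf 'a.b\naxb\nabc\n' > "$sample"

# Default mode treats "." as "any character", so "a.b" and "axb" both match.
regex_hits=$(grep -c 'a.b' "$sample")

# -F treats the pattern as a plain string, so only the literal "a.b" matches.
fixed_hits=$(grep -c -F 'a.b' "$sample")

echo "regex: $regex_hits, fixed: $fixed_hits"
rm -f "$sample"
```

With -F, grep can skip regex compilation entirely, which is why it is the natural choice for plain-text search strings.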

Does anyone have experience in speeding up grep? Maybe compile it from scratch with some particular flag (I'm on Linux CentOS), organize the files in a certain fashion, or make the search parallel in some way?

Try GNU parallel, which includes an example of how to use it with grep:

grep -r greps recursively through directories. On multicore CPUs GNU parallel can often speed this up.
find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
This will run 1.5 jobs per core and give 1000 arguments to each grep.
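If GNU parallel is not installed, a roughly similar effect can be sketched with find and xargs -P (available in GNU and BSD xargs, though not required by POSIX). This is a substitute technique, not an exact equivalent of the command above: the directory, the file names and the job count of 4 are assumptions chosen for the example, and output from concurrent greps may interleave.

```shell
# Hypothetical test tree; in practice point find at your own directory.
dir=$(mktemp -d)
printf 'needle here\n'    > "$dir/plain.txt"
printf 'no match\n'       > "$dir/other.txt"
printf 'another needle\n' > "$dir/name with spaces.txt"

# -print0 / -0 keep file names with spaces intact.
# -n 1000 passes up to 1000 files per grep; -P 4 runs up to 4 greps at once.
matches=$(find "$dir" -type f -print0 |
  xargs -0 -n 1000 -P 4 grep -H -n -F 'needle' | sort)
match_count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

printf '%s\n' "$matches"
rm -rf "$dir"
```

The sort is only there to make the output order stable; drop it when order does not matter.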

For big files, it can split the input into several chunks with the --pipe and --block arguments:

 parallel --pipe --block 2M grep foo < bigfile

You could also run it on several different machines through SSH (ssh-agent is needed to avoid password prompts):

parallel --pipe --sshlogin server.example.com,server2.example.net grep foo < bigfile

If you're searching very large files, then setting your locale can really help.

GNU grep goes a lot faster in the C locale than with UTF-8.

export LC_ALL=C
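
The locale can also be set for a single command instead of being exported for the whole shell session. A small sketch, with a made-up sample file; note that in the C locale grep works on bytes, so patterns that rely on multi-byte characters or locale-aware case folding may behave differently:

```shell
# Hypothetical sample file.
sample=$(mktemp)
printf 'foo bar\nbaz\n' > "$sample"

# The variable assignment applies only to this one grep invocation.
c_locale_hits=$(LC_ALL=C grep -c -F 'foo' "$sample")

echo "matching lines: $c_locale_hits"
rm -f "$sample"
```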

Ripgrep now claims to be the fastest.

https://github.com/BurntSushi/ripgrep

It also includes parallelism by default:

 -j, --threads ARG
              The number of threads to use.  Defaults to the number of logical CPUs (capped at 6).  [default: 0]

From the README:

It is built on top of Rust's regex engine. Rust's regex engine uses finite automata, SIMD and aggressive literal optimizations to make searching very fast.
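
A hedged sketch of setting the thread count explicitly. It assumes rg is on the PATH and otherwise falls back to recursive grep so that the snippet still runs; the test directory and the choice of 4 threads are illustrative only:

```shell
# Hypothetical test directory with one matching file.
dir=$(mktemp -d)
printf 'needle\n'  > "$dir/a.txt"
printf 'nothing\n' > "$dir/b.txt"

if command -v rg >/dev/null 2>&1; then
  # -j 4: use 4 threads; -F: literal pattern; -l: print only matching file names.
  files=$(rg -j 4 -F -l 'needle' "$dir")
else
  # Fallback when ripgrep is not installed.
  files=$(grep -r -l -F 'needle' "$dir")
fi
file_count=$(printf '%s\n' "$files" | wc -l | tr -d ' ')

printf '%s\n' "$files"
rm -rf "$dir"
```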

Using --mmap can help on some systems:

http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html

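Not strictly a code improvement, but something I found helpful after running grep on 2+ million files.

I moved the operation onto a cheap SSD drive (120GB). At about $100, it's an affordable option if you are crunching lots of files regularly.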

If you don't care which files contain the string, you might want to separate reading and grepping into two jobs, since it can be costly to spawn grep many times, once for each small file.
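
One way to sketch that separation with standard tools: a single reader streams all files into one grep, so grep is started once rather than once per file. The directory and file contents below are made up, and file names are lost by design:

```shell
# Hypothetical directory of small files, each with one matching line.
dir=$(mktemp -d)
for i in 1 2 3 4 5; do
  printf 'line %s\nneedle %s\n' "$i" "$i" > "$dir/file$i.txt"
done

# find batches the files into as few cat invocations as possible ("+"),
# and one grep scans the combined stream.
total=$(find "$dir" -type f -exec cat {} + | grep -c -F 'needle')

echo "matching lines: $total"
rm -rf "$dir"
```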

If you have one very large file:

parallel -j100% --pipepart --block 100M -a <very large SEEKABLE file> grep <...>

Many small compressed files (sorted by inode):

ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j80% --group "gzcat {}" | parallel -j50% --pipe --round-robin -u -N1000 grep <..>

I usually compress my files with lz4 for maximum throughput.

If you want just the filename with the match:

ls -i | sort -n | cut -d' ' -f2 | fgrep \.gz | parallel -j100% --group "gzcat {} | grep -lq <..> && echo {}"
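
gzcat is not present on every Linux distribution, but gzip -dc does the same job wherever gzip is installed. The filename-only idea can be sketched sequentially as below. The files are generated for the example, and the parallel fan-out is left out on purpose to keep the sketch dependency-free:

```shell
# Hypothetical compressed files: one with a match, one without.
dir=$(mktemp -d)
printf 'needle inside\n' > "$dir/hit.log"
printf 'nothing here\n'  > "$dir/miss.log"
gzip "$dir/hit.log" "$dir/miss.log"

matched=""
for f in "$dir"/*.gz; do
  # grep -q stops at the first match and prints nothing itself;
  # the file name is recorded only when a match was found.
  if gzip -dc "$f" | grep -q -F 'needle'; then
    matched="$matched${f##*/} "
  fi
done

echo "files with a match: $matched"
rm -rf "$dir"
```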

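Building on Sandro's response, I looked at the reference he provided and played around with BSD grep vs. GNU grep. My quick benchmark results showed that GNU grep is way, way faster.

So my recommendation for the original question, "fastest possible grep": make sure you are using GNU grep rather than BSD grep (which is the default on MacOS, for example).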
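
A quick way to check which implementation is on the PATH is to look at the version banner. This sketch assumes GNU grep identifies itself with the string "(GNU grep)" in the first line of its --version output; anything else is reported as "other":

```shell
# Read the first line of the version banner; some greps may print nothing.
banner=$(grep --version 2>/dev/null | head -n 1)

case "$banner" in
  *"(GNU grep)"*) flavor="GNU" ;;
  *)              flavor="other" ;;
esac

echo "grep flavor: $flavor"
```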

I personally use ag (the silver searcher) instead of grep. It's way faster, and you can also combine it with parallel and pipe blocks.

https://github.com/ggreer/the_silver_searcher

Update: I now use https://github.com/BurntSushi/ripgrep, which is faster than ag depending on your use case.

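One thing I've found to be faster when using grep to search a single big file (especially with changing patterns) is to use split + grep + xargs with its parallel flag. For instance:

Say you have a file of IDs called my_ids.txt that you want to search for in a big file called bigfile.txt.

Use split to split the file into parts: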

# Use split to split the file into x number of files, consider your big file
# size and try to stay under 26 split files to keep the filenames 
# easy from split (xa[a-z]), in my example I have 10 million rows in bigfile
split -l 1000000 bigfile.txt
# With 10 million rows at 1,000,000 lines per file this produces 10 files, xa[a-j]

# Now use split files + xargs to iterate and launch parallel greps with output
while IFS= read -r id ; do ls xa* | xargs -n 1 -P 20 grep -- "$id" >> matches.txt ; done < my_ids.txt
# Here you can tune your parallel greps with -P, in my case I am being greedy
# Also be aware that there's no point in allocating more greps than x files

In my case this cut what would have been a 17 hour job down to 1 hour 20 minutes. I'm sure there's some sort of bell curve here on efficiency, and obviously going over the available cores won't do you any good, but this was a much better solution than any of the above comments for my requirements as stated above. It has the added benefit over the parallel script of using mostly native (Linux) tools.
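
When the IDs are plain strings, another option worth timing against the loop above is to hand grep the whole ID list at once with -f, so the big file is scanned once instead of once per ID. This is a different technique from the split + xargs approach, shown here on tiny generated stand-ins for my_ids.txt and bigfile.txt:

```shell
# Generated stand-ins for the real files (hypothetical contents).
work=$(mktemp -d)
printf 'id-17\nid-42\n' > "$work/my_ids.txt"
# 100 generated rows of the form "id-N payload".
seq 1 100 | sed 's/^/id-/; s/$/ payload/' > "$work/bigfile.txt"

# -F: patterns are literal strings; -w: whole-word matches only, so id-17
# does not match id-170; -f: read one pattern per line from the file.
grep -F -w -f "$work/my_ids.txt" "$work/bigfile.txt" > "$work/matches.txt"
match_count=$(wc -l < "$work/matches.txt" | tr -d ' ')

cat "$work/matches.txt"
rm -rf "$work"
```

Whether this beats the parallel loop depends on the number of IDs and the file size, so it is best measured on the real data.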

cgrep, if it's available, can be orders of magnitude faster than grep.

MCE 1.508 includes a dual chunk-level {file, list} wrapper script supporting many C binaries; agrep, grep, egrep, fgrep, and tre-agrep.

https://metacpan.org/source/MARIOROY/MCE-1.509/bin/mce_grep

https://metacpan.org/release/MCE

One does not need to convert to lowercase when wanting -i to run fast. Simply pass --lang=C to mce_grep.

Output order is preserved. The -n and -b output is also correct. Unfortunately, that is not the case for GNU parallel mentioned on this page. I was really hoping for GNU Parallel to work here. In addition, mce_grep does not sub-shell (sh -c /path/to/grep) when calling the binary.

Another alternative is the MCE::Grep module included with MCE.

A slight deviation from the original topic: the indexed search command line utilities from the Google Code Search project are way faster than grep: https://github.com/google/codesearch

Once you compile it (the golang package is needed), you can index a folder with:

# index current folder
cindex .

The index will be created under ~/.csearchindex

Now you can search:

# search folders previously indexed with cindex
csearch eggs

I'm still piping the results through grep to get colorized matches.

Reference URL: https://stackoverflow.com/questions/9066609/fastest-possible-grep
