Text processing - Python vs Perl performance

yourdevel 2020. 11. 15. 11:53

Here are my Perl and Python scripts, which do some simple text processing on about 21 log files, each roughly 300 KB to 1 MB (maximum), repeated 5 times (125 files in total, because the logs are repeated 5 times).

Python code (modified to use compiled re and re.I)

#!/usr/bin/python

import re
import fileinput

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for line in fileinput.input():
    fn = fileinput.filename()
    currline = line.rstrip()

    mprev = exists_re.search(currline)

    if(mprev):
        xlogtime = mprev.group(1)

    mcurr = location_re.search(currline)

    if(mcurr):
        print fn, xlogtime, mcurr.group(1)

Perl code

#!/usr/bin/perl

while (<>) {
    chomp;

    if (m/^(.*?) INFO.*Such a record already exists/i) {
        $xlogtime = $1;
    }

    if (m/^AwbLocation (.*?) insert into/i) {
        print "$ARGV $xlogtime $1\n";
    }
}

On my PC, both scripts generate exactly the same result file of 10,790 lines. Here are the timings for Cygwin's Perl and Python implementations.

User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.py *log* *log* *log* *log* *log* > summarypy.log

real    0m8.185s
user    0m8.018s
sys     0m0.092s

User@UserHP /cygdrive/d/tmp/Clipboard
# time /tmp/scripts/python/afs/process_file.pl *log* *log* *log* *log* *log* > summarypl.log

real    0m1.481s
user    0m1.294s
sys     0m0.124s

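Originally, this simple text processing took 10.2 seconds with Python and only 1.9 seconds with Perl.

(UPDATE) After switching to the compiled re version in Python, it now takes 8.2 seconds in Python and 1.5 seconds in Perl. Perl is still much faster.

Is there any way to improve Python's speed at all, or is it obvious that Perl will be the fast one for simple text processing?

By the way, this is not the only test I did for simple text processing. Whichever way I write the source code, Perl always wins by a large margin. Not once did Python perform better for a simple m/regex/ match and print.

Please do not suggest using C, C++, Assembly, other flavours of Python, etc.

I am looking for a solution using standard Python with its built-in modules, compared against standard Perl (not even using modules). Boy, I wish I could use Python for all my tasks because of its readability, but not if it means giving up speed.

So please suggest how the code can be improved to get results comparable to Perl.

UPDATE: 2012-10-18

As other users suggested, Perl has its place and Python has its place.

So, for this question, one can safely conclude that for simple regex matching on each line of hundreds or thousands of text files, with the results written to a file (or printed to the screen), Perl will always win on performance. It is as simple as that.

Please note that when I say Perl wins on performance, only standard Perl and standard Python are compared: no resorting to obscure modules (obscure for a normal user like me), and no calling C, C++, or assembly libraries from Python or Perl. We don't have time to learn all these extra steps and installations for a simple text matching job.

So, Perl rocks for text processing and regex.

Python has its place to rock in other areas.

Update 2013-05-29: An excellent article that does a similar comparison is here. Perl again wins for simple text matching. For more details, read the article.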

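This is exactly the sort of thing Perl was designed to do, so it doesn't surprise me that it is faster.

One easy optimization in your Python code would be to precompile those regexes, so the pattern string does not have to be looked up or recompiled on each call.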

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists')
location_re = re.compile(r'^AwbLocation (.*?) insert into')

And then in your loop:

mprev = exists_re.search(currline)

and

mcurr = location_re.search(currline)

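That by itself won't magically bring your Python script in line with your Perl script, but repeatedly calling the module-level re functions in a loop without compiling first is bad practice in Python. (The re module does cache recently compiled patterns, so the saving comes mostly from skipping the per-call cache lookup rather than from avoiding a full recompile.)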
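To see how much precompiling matters on a given machine, a small timing sketch along these lines can help. It is written for Python 3, the sample log lines and repeat counts are made up for illustration, and the absolute numbers will not match the timings quoted in the question.

```python
import re
import timeit

PATTERN = r'^AwbLocation (.*?) insert into'
location_re = re.compile(PATTERN, re.I)

# Hypothetical sample data; real log lines will look different.
LINES = [
    "AwbLocation 12345 insert into awb_table values (...)",
    "2012-10-08 12:00:01 INFO some unrelated message",
] * 1000


def module_level():
    # re.search() has to find the pattern in re's internal cache on every call.
    return [m.group(1) for m in (re.search(PATTERN, s, re.I) for s in LINES) if m]


def precompiled():
    # The compiled pattern object is reused directly.
    return [m.group(1) for m in (location_re.search(s) for s in LINES) if m]


if __name__ == "__main__":
    print("module-level:", timeit.timeit(module_level, number=20))
    print("precompiled: ", timeit.timeit(precompiled, number=20))
```

Both functions return the same captures; only the time spent getting to the compiled pattern differs.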

Hypothesis: Perl spends less time backtracking in lines that don't match, thanks to optimisations it has that Python doesn't.

What do you get by replacing

^(.*?) INFO.*Such a record already exists

with

^((?:(?! INFO).)*?) INFO.*Such a record already exists

or

^(?>(.*?) INFO).*Such a record already exists
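Before timing either alternative, it is worth confirming that it captures the same text as the original pattern. The sketch below does that for the negative-lookahead variant on a few invented lines, assuming Python 3. The atomic-group form `(?>...)` is left out because Python's `re` module only accepts that syntax from version 3.11 onward.

```python
import re

original = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
# Tempered variant: group 1 can never run across " INFO", so a failed
# match cannot retry with a longer prefix.
tempered = re.compile(r'^((?:(?! INFO).)*?) INFO.*Such a record already exists', re.I)

# Hypothetical sample lines; the real log format is not shown in the question.
samples = [
    "2012-10-08 12:00:01 INFO insert failed: Such a record already exists",
    "2012-10-08 12:00:02 INFO something else entirely",
    "AwbLocation 12345 insert into awb_table",
]


def captured(pattern, line):
    """Return group 1 of the match, or None when the line does not match."""
    m = pattern.search(line)
    return m.group(1) if m else None


results = [(captured(original, s), captured(tempered, s)) for s in samples]
for line, (a, b) in zip(samples, results):
    print(repr(a), repr(b), "<-", line)
```

If the two columns agree on a representative slice of the real logs, the variant can be swapped in and timed.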

Function calls are a bit expensive in terms of time in Python. And yet you have a loop-invariant function call inside the loop to get the file name:

fn = fileinput.filename()

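Move this call out of the per-line path and you should see some improvement in your Python timing. It probably won't be enough to beat Perl, though. Note that fileinput walks through several files, so the name is only invariant within one file, and filename() returns None before the first line has been read. Refresh the name once per file rather than hoisting the call above the whole for loop.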
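A minimal sketch of how the file name behaves, assuming Python 3 and two throwaway files created only for the demonstration. It refreshes a cached name when `isfirstline()` reports a new file. Since `isfirstline()` is itself a per-line call, the sketch illustrates the semantics rather than a guaranteed speedup.

```python
import fileinput
import os
import shutil
import tempfile

# Two throwaway files stand in for the log files (hypothetical names and content).
tmpdir = tempfile.mkdtemp()
paths = []
for name in ("a.log", "b.log"):
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write("first line\nsecond line\n")
    paths.append(path)

seen = []
with fileinput.input(files=paths) as stream:
    name_before_reading = stream.filename()   # None: nothing has been read yet
    fn = None
    for line in stream:
        if stream.isfirstline():              # true once per file
            fn = stream.filename()            # refresh the cached name
        seen.append((os.path.basename(fn), line.rstrip()))

shutil.rmtree(tmpdir)                         # clean up the demo files

print(name_before_reading)
print(seen)
```

Opening each file explicitly and looping over its lines gives the same per-file name without any per-line bookkeeping.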

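Generally, all artificial benchmarks are evil. However, everything else being equal (the algorithmic approach), you can make improvements on a relative basis. It should be noted that I don't use Perl, so I can't argue in its favor. That said, with Python you can try Pyrex or Cython to improve performance. Or, if you are adventurous, you can try converting the Python code to C++ via ShedSkin (which works for most of the core language and some, but not all, of the core modules).

Nevertheless, you can follow some of the tips posted here: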

http://wiki.python.org/moin/PythonSpeed/PerformanceTips

I expect Perl to be faster. Just out of curiosity, can you try the following?

#!/usr/bin/python

import re
import glob
import sys
import os

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for mask in sys.argv[1:]:
    for fname in glob.glob(mask):
        if os.path.isfile(fname):
            f = open(fname)
            for line in f:
                mex = exists_re.search(line)
                if mex:
                    xlogtime = mex.group(1)

                mloc = location_re.search(line)
                if mloc:
                    print fname, xlogtime, mloc.group(1)
            f.close()

Update in reaction to "it is too complex".

Of course it looks more complex than the Perl version. Perl was built around regular expressions, so you can hardly find an interpreted language that is faster at them. The Perl syntax...

while (<>) {
    ...
}

... also hides a lot of things that have to be done somehow in a more general language. On the other hand, it is quite easy to make the Python code more readable if you move the unreadable part out:

#!/usr/bin/python

import re
import glob
import sys
import os

def input_files():
    '''The generator loops through the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                yield fname


exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname in input_files():
    with open(fname) as f:        # Now the f.close() is done automatically
        for line in f:
            mex = exists_re.search(line)
            if mex:
                xlogtime = mex.group(1)

            mloc = location_re.search(line)
            if mloc:
                print fname, xlogtime, mloc.group(1)

Here the def input_files() could be placed elsewhere (say, in another module), and it can be reused. It is even possible to mimic Perl's while (<>) {...} easily, though not with the same syntax:

#!/usr/bin/python

import re
import glob
import sys
import os

def input_lines():
    '''The generator loops through the lines of the files defined by masks from cmd.'''
    for mask in sys.argv[1:]:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                with open(fname) as f: # now the f.close() is done automatically
                    for line in f:
                        yield fname, line

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)

for fname, line in input_lines():
    mex = exists_re.search(line)
    if mex:
        xlogtime = mex.group(1)

    mloc = location_re.search(line)
    if mloc:
        print fname, xlogtime, mloc.group(1)

Then the last for loop may look as simple (in principle) as Perl's while (<>) {...}. Such readability enhancements are more difficult in Perl.

Anyway, it will not make the Python program faster. Perl will be faster again here. Perl is a file/text cruncher. But--in my opinion--Python is a better programming language for more general purposes.
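The listings above use Python 2's print statement. For readers on Python 3, the last variant might be rendered roughly as follows. This is a sketch rather than the answerer's code. The `extract` helper and the `masks` parameter are introduced here so the logic can be called without `sys.argv`. `xlogtime` starts as `None` so that a location line seen before any "already exists" line does not raise `NameError`.

```python
import glob
import os
import re
import sys

exists_re = re.compile(r'^(.*?) INFO.*Such a record already exists', re.I)
location_re = re.compile(r'^AwbLocation (.*?) insert into', re.I)


def input_lines(masks):
    """Yield (file name, line) for every line of every file matching the masks."""
    for mask in masks:
        for fname in glob.glob(mask):
            if os.path.isfile(fname):
                with open(fname) as f:
                    for line in f:
                        yield fname, line


def extract(masks):
    """Yield (file name, log time, location) for each matching location line."""
    xlogtime = None  # assumption: no timestamp is known until one is seen
    for fname, line in input_lines(masks):
        mex = exists_re.search(line)
        if mex:
            xlogtime = mex.group(1)
        mloc = location_re.search(line)
        if mloc:
            yield fname, xlogtime, mloc.group(1)


if __name__ == "__main__":
    for fname, xlogtime, location in extract(sys.argv[1:]):
        print(fname, xlogtime, location)
```

Keeping the matching in a generator separates it from the printing, which also makes it easy to time the two parts separately.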

Reference URL: https://stackoverflow.com/questions/12793562/text-processing-python-vs-perl-performance
