Development Tip

How to import a text file from AWS S3 into pandas without writing to disk

yourdevel 2020. 11. 30. 20:05



I have a text file saved on S3 that is a tab-delimited table. I want to load it into pandas, but I cannot save it to disk first because I'm running on a Heroku server. Here is what I have so far.

import io
import boto3
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]


pd.read_csv(file, header=14, delimiter="\t", low_memory=False)

The error is:

OSError: Expected file path name or file-like object, got <class 'bytes'> type

How do I convert the response body into a format pandas will accept?

pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: initial_value must be str or None, not StreamingBody

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)

returns

TypeError: 'StreamingBody' does not support the buffer interface

Update - the following worked:

file = response["Body"].read()

pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
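
The difference is that .read() drains the StreamingBody into a plain bytes object, and io.BytesIO wraps those bytes in a file-like buffer that read_csv accepts. As a minimal sketch of the same fix wrapped in a function (read_s3_tsv is a hypothetical helper name; the bucket, key, header and delimiter values are the ones from the question):

import io
import boto3
import pandas as pd

def read_s3_tsv(bucket, key):
    # Fetch the whole object into memory and hand pandas a file-like buffer
    s3_client = boto3.client('s3')
    response = s3_client.get_object(Bucket=bucket, Key=key)
    body = response["Body"].read()  # bytes, fully buffered in memory
    return pd.read_csv(io.BytesIO(body), header=14, delimiter="\t", low_memory=False)

df = read_s3_tsv("my_bucket", "filename.txt")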

pandas uses boto for read_csv, so you should be able to do:

import boto
import pandas as pd

data = pd.read_csv('s3://bucket....csv')

If you need boto3 because you are on Python 3.4+, you can do the following:

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
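
If the file needs an explicit text encoding (for example the tab-separated file from the question), you can also decode the bytes yourself and hand pandas a StringIO. A sketch, assuming the file is UTF-8; the bucket and key are placeholders:

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')

# Decode the raw bytes to text first, then parse from an in-memory text buffer
body_text = obj['Body'].read().decode('utf-8')
df = pd.read_csv(io.StringIO(body_text), delimiter="\t")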

Pandas can now handle S3 URLs directly. You can simply do:

import pandas as pd
import s3fs

df = pd.read_csv('s3://bucket-name/file.csv')

You need to install s3fs if you don't have it: pip install s3fs

Authentication

If your S3 bucket is private and requires authentication, you have two options:

1- Add access credentials to the ~/.aws/credentials config file:

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Or

2- Set the following environment variables with their proper values:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN
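
A third possibility on recent pandas versions (storage_options was added around pandas 1.2 and is passed through to s3fs) is to supply the credentials directly in the call. A sketch; the URL is a placeholder and the key values are the example credentials from above:

import pandas as pd

# storage_options is forwarded to s3fs; "key"/"secret" are the access key pair
df = pd.read_csv(
    "s3://bucket-name/file.csv",
    storage_options={
        "key": "AKIAIOSFODNN7EXAMPLE",
        "secret": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    },
)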

This is now supported in the latest pandas. See

http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files

e.g.,

df = pd.read_csv('s3://pandas-test/tips.csv')

With s3fs it can be done as follows:

import s3fs
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)

# CSV
with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)

# Pickle
with fs.open('mybucket/path/to/object/foo.pkl') as f:
    df = pd.read_pickle(f)
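
Continuing the example above, the same s3fs handle can also be used in the other direction to write a dataframe back to the bucket. A quick sketch; the output path is a placeholder and text mode 'w' is assumed to be acceptable for CSV output:

# Write the dataframe back to S3 through the same s3fs handle
with fs.open('mybucket/path/to/object/out.csv', 'w') as f:
    df.to_csv(f, index=False)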

Since the files can be very large, it is not wise to load them into the dataframe all at once. Instead, read line by line and accumulate the rows into the dataframe. Yes, we could also pass a chunk size to read_csv, but then we have to keep track of the number of rows read.
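
For comparison, the chunked alternative mentioned above would look roughly like this (a sketch; the S3 URL and chunk size are placeholders, and process() is a hypothetical per-chunk handler):

import pandas as pd

# Iterate over fixed-size chunks instead of loading everything at once
for chunk in pd.read_csv('s3://bucket-name/file.csv', chunksize=100000):
    process(chunk)  # hypothetical per-chunk processing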

Hence, I came up with this approach:

import codecs
from io import StringIO

import pandas as pd

def create_file_object_for_streaming(self):
    print("creating file object for streaming")
    self.file_object = self.bucket.Object(key=self.package_s3_key)
    print("File object is: " + str(self.file_object))
    print("Object file created.")
    return self.file_object

# Decode the streaming body and parse it one line at a time
for row in codecs.getreader(self.encoding)(self.response[u'Body']).readlines():
    row_string = StringIO(row)
    df = pd.read_csv(row_string, sep=",")

I also delete the df once the work is done: del df


An option is to convert the CSV to JSON via df.to_dict() and then store it as a string. Note this is only relevant if the CSV format is not a requirement and you just want to quickly put the dataframe in an S3 bucket and retrieve it again.

from boto.s3.connection import S3Connection
import pandas as pd
import yaml

conn = S3Connection()
mybucket = conn.get_bucket('mybucketName')
myKey = mybucket.get_key("myKeyName")

myKey.set_contents_from_string(str(df.to_dict()))

This will convert the df to a dict string and save it to S3. You can later read it back and rebuild the dataframe:

df = pd.DataFrame(yaml.load(myKey.get_contents_as_string()))

The other solutions are also good, but this is a little simpler. YAML may not strictly be required, but you need something to parse the stored string. If the S3 file doesn't necessarily need to be a CSV, this can be a quick fix.
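
Note that str(df.to_dict()) produces a Python dict literal with single quotes rather than strict JSON, which is why a YAML parser happens to be able to read it back. An alternative sketch using only the standard library's ast.literal_eval (the decode step is an assumption about boto returning bytes on Python 3):

import ast
import pandas as pd

raw = myKey.get_contents_as_string()
if isinstance(raw, bytes):  # boto may return bytes on Python 3
    raw = raw.decode('utf-8')

# The stored string is a Python dict literal, so literal_eval can rebuild it
df = pd.DataFrame(ast.literal_eval(raw))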

Reference: https://stackoverflow.com/questions/37703634/how-to-import-a-text-file-on-aws-s3-into-pandas-without-writing-to-disk
