Debugging high memory allocations and GC cycles when downloading large S3 objects

Eran Levy
7 min read · Mar 7, 2020

AWS S3 is our data lake storage platform at Cyren, and the objects stored there are consumed mostly by our backend services written in Go. The size of the objects we store there varies depending on the data type. We are using the AWS SDK for Go for that purpose.

We started to experience pod restarts because of OOMKilled events (Exit Code 137). We found out that as soon as the service downloads large objects (~30MB) from S3, its memory allocation grows up to the memory resource limit, which triggers the OOM termination:

Grafana

Go has a great set of powerful runtime tools for profiling, and running benchmarks along with these profilers is pretty simple. To investigate further, I decided it made sense to isolate the problem and create a separate benchmark that downloads a large object from one of our test S3 buckets, to improve profile accuracy.

Before we dive into the details, let's first understand the environment:

  • Go 1.13
  • aws-sdk-go (NOTE: AWS announced a developer preview of the aws-sdk-go-v2 release)

Running benchmarks in Go is pretty easy and there is plenty of testing and benchmarking documentation out there, so I will skip that part; you can read more here.

Following is the benchmark code that downloads a given object from S3:

import (
	"testing"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func BenchmarkDownloader(b *testing.B) {
	s := session.Must(session.NewSession(
		&aws.Config{
			Region:      aws.String("YOUR-AWS-REGION"),
			Credentials: credentials.NewEnvCredentials(),
		}))
	s3Client := s3.New(s)
	s3Downloader := s3manager.NewDownloaderWithClient(s3Client)

	benchmarkDownloader("emptyfile", s3Downloader, b)
}

func benchmarkDownloader(k string, dw *s3manager.Downloader, b *testing.B) {
	for i := 0; i < b.N; i++ {
		var ba []byte
		buffer := aws.NewWriteAtBuffer(ba)

		_, err := dw.Download(buffer, &s3.GetObjectInput{
			Bucket: aws.String("BUCKET-NAME"),
			Key:    aws.String(k),
		})
		if err != nil {
			b.Errorf("%s", err)
		}

		sizeInBytes := len(buffer.Bytes())
		b.Logf("Downloaded %d bytes", sizeInBytes)
	}
}

Running the benchmark with the relevant profilers (setting -benchtime=45s to improve accuracy, since we are dealing with ~30MB objects):

go test -bench=BenchmarkDownloader -benchtime=45s -benchmem -memprofile memprofile-downloader.out -cpuprofile profile-downloader.out -trace trace-downloader.out

This will run the benchmark and create the requested profile files (memory, CPU and trace) that can be parsed and served using the go tool.

Since I knew that it's memory related, I wanted to get a bird's-eye view of the trace; it's as simple as running the following command:

go tool trace trace-downloader.out

NOTE: if you are using Chrome v80.0+ (chrome://version) and you see a blank page along with this error in the developer tools console: Uncaught ReferenceError: tr is not defined
there are some compatibility issues between the trace viewer and the latest Chrome release, and the Go team is working to fix it for the 1.14 release. In the meantime, open your Chrome browser with the following flag:
--enable-blink-features=ShadowDOMV0,CustomElementsV0,HTMLImports

go tool trace

You can immediately notice, just by looking at the GC row, that the number of GC cycles is huge. You also can't miss the high number of syscalls there, probably because of the massive allocations. It looked suspicious, so I jumped directly into the memory profile using pprof (I mostly use the web UI since it's much more convenient for me, but you can obviously use the CLI):

go tool pprof -http=:8980 memprofile-downloader.out

go tool pprof

Zooming out and looking at the graph, you can get a sense of the big picture. It seems like there are many allocations around the WriteAtBuffer, but I couldn't understand much from the call chain (this post provides a pretty good hello-world explanation of pprof, with references to additional posts on the same topic). At this stage, I wasn't sure what exactly was being done by the SDK internals. I started to read the provided documentation, the API reference and the developer guide, but unfortunately none of them contains the details needed to understand the downloader client internals. The next step was to start reading the code in order to drill down and get some sense of the internals of this execution path.

The SDK provides an API to create a downloader client instance that, in its basic behavior, downloads chunks of bytes concurrently into the given buffer.

In the underlying layer, the client spins up multiple goroutines to perform concurrent GET requests, based on the downloader configuration you provided when creating the client instance. If you haven't provided any, the default configuration is 5 goroutines and a 5MB part size (the range of bytes that the downloader gets per request). You can configure the Downloader while creating the client by providing options that will be applied on creation:

func newDownloader(client s3iface.S3API, options ...func(*Downloader)) *Downloader

For instance, you can set Concurrency to 1 in order to download the parts sequentially:

downloader := s3manager.NewDownloaderWithClient(s3Svc, func(d *s3manager.Downloader) {
	d.Concurrency = 1
})
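Part size can be tuned the same way. A minimal sketch, assuming you want 10MB ranges per request instead of the default 5MB (the value here is only for illustration):

downloader := s3manager.NewDownloaderWithClient(s3Svc, func(d *s3manager.Downloader) {
	// Fetch 10MB byte ranges per request instead of the default 5MB.
	d.PartSize = 10 * 1024 * 1024
})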

The SDK takes care of the overall request-response interaction and error handling; you can configure retries or other request options using the s3manager.WithDownloaderRequestOptions helper function.
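For example, here is a sketch of passing per-request options to a single Download call. It assumes the standard option helpers from the aws/request package (request.WithLogLevel, request.WithResponseReadTimeout) and a downloader and buffer like the ones in the benchmark above:

n, err := downloader.Download(buffer,
	&s3.GetObjectInput{
		Bucket: aws.String("BUCKET-NAME"),
		Key:    aws.String("OBJECT-KEY"),
	},
	// These options are applied to every part request issued by the downloader.
	s3manager.WithDownloaderRequestOptions(
		request.WithLogLevel(aws.LogDebugWithHTTPBody),
		request.WithResponseReadTimeout(30*time.Second),
	),
)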

The way it performs the actual download internally, in its default behavior, is that the downloader first fetches the first chunk of bytes, and based on the response parameters, such as the total size, spins up the necessary goroutines along with a WaitGroup to download the rest of the parts by advancing the byte range with the pre-configured part size. The response body is copied into the buffer we passed to the downloader (aws.NewWriteAtBuffer). It seems that, for large files, this causes many allocations by growing the existing buffer and copying the response into it:

newBuf := make([]byte, expLen, int64(b.GrowthCoeff*float64(expLen)))
copy(newBuf, b.buf)
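To make that flow more concrete, here is a simplified, hypothetical sketch of the concurrent ranged-GET pattern described above. It is not the SDK's actual implementation; retries, context handling and part-count limits are omitted for brevity:

import (
	"fmt"
	"io"
	"io/ioutil"
	"sync"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/s3"
)

// downloadParts fetches fixed-size byte ranges concurrently and writes each
// part at its own offset in the destination io.WriterAt.
func downloadParts(svc *s3.S3, bucket, key string, totalSize, partSize int64, w io.WriterAt) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	setErr := func(err error) {
		mu.Lock()
		if firstErr == nil {
			firstErr = err
		}
		mu.Unlock()
	}
	for start := int64(0); start < totalSize; start += partSize {
		end := start + partSize - 1
		if end >= totalSize {
			end = totalSize - 1
		}
		wg.Add(1)
		go func(start, end int64) {
			defer wg.Done()
			out, err := svc.GetObject(&s3.GetObjectInput{
				Bucket: aws.String(bucket),
				Key:    aws.String(key),
				// HTTP Range header, e.g. "bytes=0-5242879"
				Range: aws.String(fmt.Sprintf("bytes=%d-%d", start, end)),
			})
			if err != nil {
				setErr(err)
				return
			}
			defer out.Body.Close()
			part, err := ioutil.ReadAll(out.Body)
			if err != nil {
				setErr(err)
				return
			}
			if _, err := w.WriteAt(part, start); err != nil {
				setErr(err)
			}
		}(start, end)
	}
	wg.Wait()
	return firstErr
}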

I started looking at the downloader configuration options and found out that it's possible to pass a different buffer strategy:

// Defines the buffer strategy used when downloading a part.
//
// If a WriterReadFromProvider is given the Download manager
// will pass the io.WriterAt of the Download request to the provider
// and will use the returned WriterReadFrom from the provider as the
// destination writer when copying from http response body.
BufferProvider WriterReadFromProvider

The SDK provides another buffer strategy that uses sync.Pool to manage the allocations, which is pretty cool and usually a preferred way to deal with such cases. As you can see in the following piece of the downloader code, it is used while downloading:

if d.cfg.BufferProvider != nil {
	w, cleanup = d.cfg.BufferProvider.GetReadFrom(w)
}

Hence, using sync.Pool enables the SDK to create pre-allocated buffers and reuse that space from different goroutines safely. There are no examples in the documentation that use that strategy, but I found some code samples in service/s3/s3manager/download_test.go.
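To see the pattern in isolation, here is a generic sync.Pool sketch (not the SDK's code): goroutines borrow a pre-allocated buffer from the pool and return it when done, so the same memory is reused instead of being re-allocated on every download:

package main

import (
	"fmt"
	"sync"
)

// bufPool hands out pre-allocated 1MB buffers; Get reuses a previously
// returned buffer when one is available, otherwise it calls New.
var bufPool = sync.Pool{
	New: func() interface{} {
		return make([]byte, 1024*1024)
	},
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			buf := bufPool.Get().([]byte)
			defer bufPool.Put(buf) // return the buffer for reuse by other goroutines
			// ... copy a response part into buf instead of allocating a new slice ...
			fmt.Printf("worker %d using a %d-byte pooled buffer\n", i, len(buf))
		}(i)
	}
	wg.Wait()
}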

I modified my benchmark a bit to use that strategy:

const (
	bufCap   = 1024 * 1024 * 1
	poolSize = 1024 * 1024 * 5
)

func BenchmarkDownloader(b *testing.B) {
	s := session.Must(session.NewSession(
		&aws.Config{
			Region:      aws.String("YOUR-AWS-REGION"),
			Credentials: credentials.NewEnvCredentials(),
		}))
	s3Client := s3.New(s)
	s3Downloader := s3manager.NewDownloaderWithClient(s3Client, func(d *s3manager.Downloader) {
		d.BufferProvider = s3manager.NewPooledBufferedWriterReadFromProvider(poolSize)
	})

	benchmarkDownloader("emptyfile", s3Downloader, b)
}

func benchmarkDownloader(k string, dw *s3manager.Downloader, b *testing.B) {
	for i := 0; i < b.N; i++ {
		ba := make([]byte, bufCap)
		buffer := aws.NewWriteAtBuffer(ba)

		_, err := dw.Download(buffer, &s3.GetObjectInput{
			Bucket: aws.String("BUCKET-NAME"),
			Key:    aws.String(k),
		})
		if err != nil {
			b.Errorf("%s", err)
		}

		sizeInBytes := len(buffer.Bytes())
		b.Logf("Downloaded %d bytes", sizeInBytes)
	}
}

Note the changes: s3manager.NewPooledBufferedWriterReadFromProvider(poolSize) and the buffer sizes, setting 5MB for the sync.Pool buffers and 1MB for the initial WriteAtBuffer.

Running the modified benchmark and reviewing the trace reveals a much better picture:

go tool trace

Note how the GC cycles have been reduced dramatically, and how the Syscalls and Heap rows look much closer to what you would expect for this kind of process that involves networking and buffering. I was also interested to see how the memory profile graph for alloc_space looks:

go tool pprof

Note the boxes at the edges of the chains, which show much lower allocations during the memory samples. The "Top" view confirms the improvement; the "Cum" and "Flat" columns show that allocations have stabilized compared to before:

go tool pprof

Using the sync.Pool buffer strategy provided by the AWS SDK for Go solved the issue.

Looking at profile results using pprof obviously deserves its own post. In the meantime, I found this post to be a very good reference.

Happy Coding!
