oracle aide

December 24, 2016

Redshift “Failed temporary block read” / code: 1075

Filed under: Uncategorized — oracleaide @ 12:37 am

 Problem

A Redshift query fails with a “Failed temporary block read / code: 1075” error.

We check for failed disks and see none:

select host as node_id, count(*) as failed_disk_qty
from stv_partitions
where part_begin = 0 and failed = 1
group by host;

-- returns nothing

Only after the failure repeats X times (10?) is the disk marked as failed; from then on, the query above returns a row with the count of failed disks.
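To notice the moment the disk gets flagged, the check can be polled instead of run by hand. The following is only a sketch: `wait_for_failed_disk` is a made-up helper name, the polling interval and limit are arbitrary, and `cursor` is assumed to be any Python DB-API cursor already connected to the cluster.

```python
import time

# The failed-disk check from this post, unchanged apart from whitespace.
FAILED_DISK_SQL = """
select host as node_id, count(*) as failed_disk_qty
from stv_partitions
where part_begin = 0 and failed = 1
group by host
"""


def wait_for_failed_disk(cursor, poll_seconds=60, max_polls=30, sleep=time.sleep):
    """Re-run the check until it returns rows or max_polls is reached.

    `cursor` is assumed to be a DB-API cursor (execute/fetchall).
    Returns the rows from the first non-empty result, or [] if the
    disk was never reported as failed within max_polls attempts.
    """
    for _ in range(max_polls):
        cursor.execute(FAILED_DISK_SQL)
        rows = cursor.fetchall()
        if rows:
            return rows
        sleep(poll_seconds)
    return []
```

An empty list here means only that no disk was flagged during the polling window, not that the disk is healthy.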

As soon as the disk is marked as failed, Redshift starts avoiding the bad block, and the Redshift support team generously replaces the whole failed node in no time.

Essentially, because of a single failed block, we get a whole brand new computer.

Great!

 Why is this a problem?

The failing query takes XX-YY minutes to complete.

If we have to repeat the cycle 10 times, it will take between XX*10 and YY*10 minutes for the cluster to recognize and blacklist the failing disk.

Since only some queries fail (we suspect those with a WITH clause, which creates temporary tables behind the scenes), the process could take even longer.

 Workaround

The workaround we came up with is to run several copies of the same query in parallel, so that the failures pile up faster and the disk is marked as bad sooner.
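A rough sketch of the parallel run, in Python. The names `run_in_parallel` and `run_query` are invented for illustration: `run_query` stands for whatever callable opens its own connection and executes the failing query. The default of 10 copies simply mirrors the guess above about how many failures it takes; the actual threshold is not known to us.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(run_query, copies=10):
    """Call run_query() `copies` times concurrently.

    `run_query` is a hypothetical zero-argument callable; it should open
    its own connection, since database connections are generally not
    safe to share between threads.
    Returns the list of exceptions raised, one per failed copy.
    """
    def attempt(_):
        try:
            run_query()
            return None
        except Exception as exc:  # keep the error so failures can be counted
            return exc

    with ThreadPoolExecutor(max_workers=copies) as pool:
        results = list(pool.map(attempt, range(copies)))
    return [r for r in results if r is not None]
```

Calling `len(run_in_parallel(my_query))` then gives the number of copies that failed in a single round, where `my_query` is your own function. Whether every copy actually lands on the bad block is something we cannot guarantee.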
