Tuesday, May 15, 2012

Greenplum's distribution keys

Greenplum is a distributed database.  Specifically, when you create a table, it creates the same table on all it's child nodes and distributes out the data based on the key you provide.   If you omit the key, Greenplum uses round-robin to distribute the data, and you can get nodes end up shouldering too much of the load.   At ClickFox we had an issue with omitted and bad distribution keys.   I wrote this 'little' query this afternoon to generate a report on all the distribution keys.  Hopefully someone find it useful:



SELECT pgn.nspname,
       pgc.relname,
       pga.attname
FROM   (SELECT gdp.localoid,
               CASE
                 WHEN ( Array_upper(gdp.attrnums, 1) > 0 ) THEN
                 Unnest(gdp.attrnums)
                 ELSE NULL
               END AS attnum
        FROM   gp_distribution_policy gdp
        ORDER  BY gdp.localoid) AS distrokey
       INNER JOIN pg_class AS pgc
         ON distrokey.localoid = pgc.oid
       INNER JOIN pg_namespace pgn
         ON pgc.relnamespace = pgn.oid
       LEFT OUTER JOIN pg_attribute pga
         ON distrokey.attnum = pga.attnum
            AND distrokey.localoid = pga.attrelid
ORDER  BY pgn.nspname,
          pgc.relname 

4 comments:

Zodan said...

Hi David,

Thank you for the code that you provided here. This is helping us with some synchronisation work we are trying to achieve between MSSQL databases and Greenplum.

In particular when we want to create partitions with the same distribution key so that we can easily exchange /switch partitions.

Kind regards,
Hendrik Groenewald
hcgroenewald@gmail.com

RI said...

Thanks for your code. It helped to resolve some messy situation in GP database.

Regards

RI
iramsor@yahoo.com

Jun said...

Many Thanks for your code. You make me have more knowledge about Greenplum.

Dave Nellis said...

I haven't looked at my blog in months! Glad someone found this helpful