Administration

DistCp Between HA Clusters

To copy data between HA clusters, set the dfs.internal.nameservices property in the hdfs-site.xml file to the nameservices that belong to the local cluster, and continue to use the dfs.nameservices property to list all of the nameservices in both the local and remote clusters. This keeps the local DataNodes from trying to register with the NameNodes of the remote cluster.

Use the following steps to copy data between HA clusters:

  1. Create a new directory and copy the contents of the /etc/hadoop/conf directory on the local cluster to this directory. The local cluster is the cluster where you plan to run the distcp command.

    The following steps use distcpConf as the directory name. Substitute the name of the directory you created for distcpConf.

  2. In the hdfs-site.xml file in the distcpConf directory, add the nameservice ID for the remote cluster to the dfs.nameservices property, and set the dfs.internal.nameservices property to the nameservice ID of the local cluster.

    [Note]Note

    localns is the nameservice ID of the local cluster and externalns is the nameservice ID of the remote cluster.

    <property>
      <name>dfs.nameservices</name>
      <value>localns,externalns</value>
    </property>
    <property>
      <name>dfs.internal.nameservices</name>
      <value>localns</value>
    </property>
    
            
  3. On the remote cluster, find the hdfs-site.xml file and copy the properties that refer to the remote nameservice ID to the end of the hdfs-site.xml file in the distcpConf directory you created in step 1:

    dfs.ha.namenodes.<nameserviceID>

    dfs.namenode.rpc-address.<nameserviceID>.<namenode1>

    dfs.namenode.servicerpc-address.<nameserviceID>.<namenode1>

    dfs.namenode.http-address.<nameserviceID>.<namenode1>

    dfs.namenode.https-address.<nameserviceID>.<namenode1>

    dfs.namenode.rpc-address.<nameserviceID>.<namenode2>

    dfs.namenode.servicerpc-address.<nameserviceID>.<namenode2>

    dfs.namenode.http-address.<nameserviceID>.<namenode2>

    dfs.namenode.https-address.<nameserviceID>.<namenode2>

  4. Enter the following command to copy data from the remote cluster to the local cluster:

    hadoop --config distcpConf distcp hdfs://externalns/<source_directory> hdfs://localns/<destination_directory>

  5. To run distcp on a secure cluster, you must also pass the mapreduce.job.send-token-conf property with the distcp command, as follows:

    hadoop --config distcpConf distcp \
      -Dmapreduce.job.send-token-conf="yarn.http.policy|^yarn.timeline-service.webapp.*$|^yarn.timeline-service.client.*$|hadoop.security.key.provider.path|hadoop.rpc.protection|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$|dfs.namenode.kerberos.principal|dfs.namenode.kerberos.principal.pattern|mapreduce.jobhistory.principal" \
      hdfs://externalns/<source_directory> hdfs://localns/<destination_directory>
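
The procedure above can be sketched as a few small shell helpers. This is a minimal illustration, not a supported tool: the function names (prepare_distcp_conf, has_nameservice_props, build_distcp_cmd) are hypothetical, the nameservice IDs localns and externalns are the example IDs used in this section, and the distcp command is only printed so that you can review it before running it. Steps 2 and 3 still require editing hdfs-site.xml by hand.

```shell
# Sketch only: the helper names below are illustrative and are not part of Hadoop.

# Step 1: copy the local client configuration into a working directory.
# Arguments: source conf dir (default /etc/hadoop/conf), target dir (default distcpConf).
prepare_distcp_conf() {
  src_conf="${1:-/etc/hadoop/conf}"
  distcp_conf="${2:-distcpConf}"
  mkdir -p "$distcp_conf" && cp -R "$src_conf"/. "$distcp_conf"/
}

# Step 2 check: succeed only if the edited hdfs-site.xml names both properties.
# This checks for the property names, not for correct values.
has_nameservice_props() {
  hdfs_site="$1/hdfs-site.xml"
  grep -qF '<name>dfs.nameservices</name>' "$hdfs_site" &&
    grep -qF '<name>dfs.internal.nameservices</name>' "$hdfs_site"
}

# Step 4: print the distcp command for the given source and destination
# directories. A leading "/" on either directory is dropped so the URI
# does not contain a double slash.
build_distcp_cmd() {
  distcp_conf="$1"
  src_dir="$2"
  dest_dir="$3"
  printf 'hadoop --config %s distcp hdfs://externalns/%s hdfs://localns/%s\n' \
    "$distcp_conf" "${src_dir#/}" "${dest_dir#/}"
}

# Example use (the paths are placeholders):
#   prepare_distcp_conf /etc/hadoop/conf distcpConf
#   ... edit distcpConf/hdfs-site.xml as described in steps 2 and 3 ...
#   has_nameservice_props distcpConf && build_distcp_cmd distcpConf /data/in /data/out
```

Printing the command rather than running it keeps the sketch free of side effects on either cluster; on a secure cluster, the mapreduce.job.send-token-conf option from step 5 would still need to be added to the printed command.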