google app engine - Why does BigQuery fail to parse an Avro file that is accepted by avro-tools?


I'm trying to export Google Cloud Datastore data to Avro files in Google Cloud Storage, and then load those files into BigQuery.

Firstly, I know that BigQuery can load Datastore backups directly, but that approach has several disadvantages I'd like to avoid.

With that motivation clarified, here is the Dataflow pipeline I'm experimenting with to export the data in Avro format:

package com.example.dataflow;

import com.google.api.services.datastore.DatastoreV1;
import com.google.api.services.datastore.DatastoreV1.Entity;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.io.AvroIO;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.ByteArrayOutputStream;

public class GcdsEntitiesToAvroSSCCEPipeline {

    private static final String GCS_TARGET_URI = "gs://mybucket/datastore/dummy";
    private static final String ENTITY_KIND = "dummy";

    private static Schema getSchema() {
        return ProtobufData.get().getSchema(Entity.class);
    }

    private static final Logger LOG = LoggerFactory.getLogger(GcdsEntitiesToAvroSSCCEPipeline.class);

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder()
                .addKind(DatastoreV1.KindExpression.newBuilder().setName(ENTITY_KIND));

        p.apply(Read.named("DatastoreQuery").from(DatastoreIO.source()
                .withDataset(options.as(DataflowPipelineOptions.class).getProject())
                .withQuery(q.build())))
            .apply(ParDo.named("ProtobufToAvro").of(new ProtobufToAvro()))
            .setCoder(AvroCoder.of(getSchema()))
            .apply(AvroIO.Write.named("WriteToAvro")
                    .to(GCS_TARGET_URI)
                    .withSchema(getSchema())
                    .withSuffix(".avro"));
        p.run();
    }

    private static class ProtobufToAvro extends DoFn<Entity, GenericRecord> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            Schema schema = getSchema();
            ProtobufDatumWriter<Entity> pbWriter = new ProtobufDatumWriter<>(Entity.class);
            DataFileWriter<Entity> dataFileWriter = new DataFileWriter<>(pbWriter);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            dataFileWriter.create(schema, bos);
            dataFileWriter.append(c.element());
            dataFileWriter.close();

            DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
            DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(
                    new SeekableByteArrayInput(bos.toByteArray()), datumReader);

            c.output(dataFileReader.next());
        }
    }
}
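For reference, the schema the pipeline writes can be inspected directly. This is a minimal sketch (the class name is mine, it is not part of the pipeline) that prints the record name and schema JSON that ProtobufData derives from the Entity protobuf class, i.e. the same schema handed to AvroCoder and AvroIO.Write above:

package com.example.dataflow;

import com.google.api.services.datastore.DatastoreV1.Entity;
import org.apache.avro.Schema;
import org.apache.avro.protobuf.ProtobufData;

// Hypothetical debugging helper: print the Avro schema that ProtobufData
// derives from the Datastore Entity protobuf class.
public class PrintEntitySchema {
    public static void main(String[] args) {
        Schema schema = ProtobufData.get().getSchema(Entity.class);
        System.out.println(schema.getFullName());   // record name, including its namespace
        System.out.println(schema.toString(true));  // pretty-printed schema JSON
    }
}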

The pipeline runs fine, but when I try to load the resultant Avro file into BigQuery, I get the following error:

bq load --project_id=roodev001 --source_format=AVRO dummy.dummy_1 gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro
Waiting on bqjob_r5c9b81a49572a53b_00000154951eb523_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'roodev001:bqjob_r5c9b81a49572a53b_00000154951eb523_1':
The Apache Avro library failed to parse file gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro.

However, if I load the resultant Avro file with avro-tools, it's fine:

avro-tools tojson datastore-dummy-00000-of-00001.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"key":{"com.google.api.services.datastore.DatastoreV1$.Key":{"partition_id":{"com.google.api.services.datastore.DatastoreV1$.PartitionId":{"dataset_id":"s~roodev001","namespace":""}},"path_element":[{"kind":"dummy","id":4503905778008064,"name":""}]}},"property":[{"name":"number","value":{"boolean_value":false,"integer_value":879,"double_value":0.0,"timestamp_microseconds_value":0,"key_value":null,"blob_key_value":"","string_value":"","blob_value":"","entity_value":null,"list_value":[],"meaning":0,"indexed":true}}]}
...
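To see exactly which schema ended up embedded in that file, the writer schema can also be dumped in Java. This is a minimal sketch (the class name is mine, and the path assumes the same locally downloaded copy used with avro-tools above):

package com.example.dataflow;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

// Hypothetical debugging helper: print the writer schema embedded in the Avro
// file that avro-tools reads without complaint.
public class PrintEmbeddedSchema {
    public static void main(String[] args) throws IOException {
        File avroFile = new File("datastore-dummy-00000-of-00001.avro");
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(avroFile, new GenericDatumReader<>())) {
            System.out.println(reader.getSchema().getNamespace());  // note the trailing "$"
            System.out.println(reader.getSchema().toString(true));
        }
    }
}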

I used this code to populate Datastore with dummy data before running the Dataflow pipeline:

package com.example.datastore;

import com.google.gcloud.AuthCredentials;
import com.google.gcloud.datastore.*;

import java.io.IOException;

// Enclosing class name is assumed; the original snippet showed only the main method.
public class PopulateDummyEntities {

    public static void main(String[] args) throws IOException {

        Datastore datastore = DatastoreOptions.builder()
                .projectId("myprojectid")
                .authCredentials(AuthCredentials.createApplicationDefaults())
                .build().service();

        KeyFactory dummyKeyFactory = datastore.newKeyFactory().kind("dummy");

        Batch batch = datastore.newBatch();
        int batchCount = 0;
        for (int i = 0; i < 4000; i++) {
            IncompleteKey key = dummyKeyFactory.newKey();
            System.out.println("Adding entity " + i);
            batch.add(Entity.builder(key).set("number", i).build());
            batchCount++;
            if (batchCount > 99) {
                batch.submit();
                batch = datastore.newBatch();
                batchCount = 0;
            }
        }

        System.out.println("Done");
    }
}

So why is BigQuery rejecting my Avro files?

BigQuery uses the C++ Avro library, and apparently it doesn't like the "$" in the namespace. Here's the error message:

Invalid namespace: com.google.api.services.datastore.DatastoreV1$

We're working on getting these Avro error messages out to the end user.
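The offending "$" comes from the generated protobuf outer class name (DatastoreV1) that ProtobufData copies into the Avro namespace. One possible workaround, sketched here rather than confirmed by the answer above, is to rewrite the schema before handing it to AvroCoder and AvroIO.Write, for example by stripping the "$" from the schema JSON and re-parsing it. The class and method names below are mine:

package com.example.dataflow;

import com.google.api.services.datastore.DatastoreV1.Entity;
import org.apache.avro.Schema;
import org.apache.avro.protobuf.ProtobufData;

// Hypothetical helper, not part of the original pipeline: derive the schema the
// same way the pipeline does, then remove the "$" that ProtobufData copies from
// the generated outer class name into the namespace, since that character is
// what BigQuery's C++ Avro parser rejects.
public class SanitizedEntitySchema {

    public static Schema get() {
        Schema generated = ProtobufData.get().getSchema(Entity.class);
        // Crude string-level rewrite of the schema JSON; assumes "DatastoreV1$"
        // only appears in namespaces, which holds for this particular schema.
        String json = generated.toString().replace("DatastoreV1$", "DatastoreV1");
        return new Schema.Parser().parse(json);
    }
}

Note that records written by ProtobufDatumWriter still carry the original record names, so the ParDo above would also need to read them back against the sanitized schema (or convert them field by field) before AvroIO writes the file; the sketch only shows where the namespace could be rewritten.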

