Why does BigQuery fail to parse an Avro file that is accepted by avro-tools?
I'm trying to export Google Cloud Datastore data to Avro files in Google Cloud Storage, and then load those files into BigQuery.
Firstly, I know that BigQuery can load Datastore backups directly, but that approach has several disadvantages I'd like to avoid:
- the backup tool is closed source
- the backup tool format is undocumented
- the backup tool format cannot be read directly by Dataflow
- backup scheduling for App Engine is in (apparently perpetual) alpha
- it is possible to implement your own backup handler in App Engine, but it is fire and forget: you won't know when the backup has finished or what the file name will be
With that motivation clarified, here is the Dataflow pipeline I am experimenting with to export the data in Avro format:
package com.example.dataflow;

import com.google.api.services.datastore.DatastoreV1;
import com.google.api.services.datastore.DatastoreV1.Entity;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.io.AvroIO;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.ByteArrayOutputStream;

public class GcdsEntitiesToAvroSSCCEPipeline {

    private static final String GCS_TARGET_URI = "gs://mybucket/datastore/dummy";
    private static final String ENTITY_KIND = "dummy";

    private static Schema getSchema() {
        return ProtobufData.get().getSchema(Entity.class);
    }

    private static final Logger LOG = LoggerFactory.getLogger(GcdsEntitiesToAvroSSCCEPipeline.class);

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder()
                .addKind(DatastoreV1.KindExpression.newBuilder().setName(ENTITY_KIND));

        p.apply(Read.named("DatastoreQuery").from(DatastoreIO.source()
                .withDataset(options.as(DataflowPipelineOptions.class).getProject())
                .withQuery(q.build())))
            .apply(ParDo.named("ProtoBufToAvro").of(new ProtoBufToAvro()))
            .setCoder(AvroCoder.of(getSchema()))
            .apply(AvroIO.Write.named("WriteToAvro")
                    .to(GCS_TARGET_URI)
                    .withSchema(getSchema())
                    .withSuffix(".avro"));

        p.run();
    }

    private static class ProtoBufToAvro extends DoFn<Entity, GenericRecord> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            Schema schema = getSchema();
            ProtobufDatumWriter<Entity> pbWriter = new ProtobufDatumWriter<>(Entity.class);

            // Round-trip each entity through an in-memory Avro container file
            // so that it can be read back as a GenericRecord.
            DataFileWriter<Entity> dataFileWriter = new DataFileWriter<>(pbWriter);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            dataFileWriter.create(schema, bos);
            dataFileWriter.append(c.element());
            dataFileWriter.close();

            DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
            DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(
                    new SeekableByteArrayInput(bos.toByteArray()), datumReader);

            c.output(dataFileReader.next());
        }
    }
}
The pipeline runs fine, but when I try to load the resultant Avro file into BigQuery, I get the following error:
bq load --project_id=roodev001 --source_format=AVRO dummy.dummy_1 gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro
Waiting on bqjob_r5c9b81a49572a53b_00000154951eb523_1 ... (0s) Current status: DONE
BigQuery error in load operation: Error processing job 'roodev001:bqjob_r5c9b81a49572a53b_00000154951eb523_1':
The Apache Avro library failed to parse file gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro.
However, if I inspect the resultant Avro file with avro-tools, it reads fine:
avro-tools tojson datastore-dummy-00000-of-00001.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"key":{"com.google.api.services.datastore.DatastoreV1$.Key":{"partition_id":{"com.google.api.services.datastore.DatastoreV1$.PartitionId":{"dataset_id":"s~roodev001","namespace":""}},"path_element":[{"kind":"dummy","id":4503905778008064,"name":""}]}},"property":[{"name":"number","value":{"boolean_value":false,"integer_value":879,"double_value":0.0,"timestamp_microseconds_value":0,"key_value":null,"blob_key_value":"","string_value":"","blob_value":"","entity_value":null,"list_value":[],"meaning":0,"indexed":true}}]}
...
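For completeness, the writer schema embedded in the file header can be dumped directly as well; here's a minimal sketch, assuming the pipeline output has been copied to a local file (avro-tools getschema shows the same thing from the command line):

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class InspectAvroSchema {
    public static void main(String[] args) throws IOException {
        // Open the Avro container file and read the schema stored in its header.
        DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new File("datastore-dummy-00000-of-00001.avro"),
                new GenericDatumReader<GenericRecord>());
        Schema schema = reader.getSchema();
        System.out.println(schema.toString(true)); // pretty-printed schema JSON
        reader.close();
    }
}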
I used this code to populate the Datastore with dummy data before running the Dataflow pipeline:
package com.example.datastore;

import com.google.gcloud.AuthCredentials;
import com.google.gcloud.datastore.*;

import java.io.IOException;

public class PopulateDummyData {

    public static void main(String[] args) throws IOException {
        Datastore datastore = DatastoreOptions.builder()
                .projectId("myProjectId")
                .authCredentials(AuthCredentials.createApplicationDefaults())
                .build().service();

        KeyFactory dummyKeyFactory = datastore.newKeyFactory().kind("dummy");

        Batch batch = datastore.newBatch();
        int batchCount = 0;
        for (int i = 0; i < 4000; i++) {
            IncompleteKey key = dummyKeyFactory.newKey();
            System.out.println("Adding entity " + i);
            batch.add(Entity.builder(key).set("number", i).build());
            batchCount++;
            // Submit every 100 entities to stay under the batch size limit.
            if (batchCount > 99) {
                batch.submit();
                batch = datastore.newBatch();
                batchCount = 0;
            }
        }
        System.out.println("Done");
    }
}
So why is BigQuery rejecting my Avro files?
BigQuery uses the C++ Avro library, and apparently it doesn't like the "$" in the namespace. Here's the error message:
Invalid namespace: com.google.api.services.datastore.DatastoreV1$
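The trailing "$" apparently comes from the way Avro's ProtobufData derives names: Entity is a nested class, so the namespace is built from the binary name of its enclosing class, DatastoreV1$ (this matches the namespace visible in the avro-tools output above). A quick sketch to confirm locally, using the same classes as the pipeline:

import com.google.api.services.datastore.DatastoreV1.Entity;
import org.apache.avro.Schema;
import org.apache.avro.protobuf.ProtobufData;

public class PrintNamespace {
    public static void main(String[] args) {
        Schema schema = ProtobufData.get().getSchema(Entity.class);
        // Prints: com.google.api.services.datastore.DatastoreV1$
        // (the binary name of Entity's enclosing class, hence the "$")
        System.out.println(schema.getNamespace());
    }
}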
We're working on getting these Avro error messages out to the end user.
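In the meantime, one possible workaround (a sketch, untested against BigQuery; it assumes "$" only ever appears in the generated record names and namespaces, never in field names or default values) is to rewrite the generated schema's JSON and use the sanitized schema everywhere in the pipeline, i.e. in DataFileWriter.create, AvroCoder.of, and AvroIO's withSchema:

import com.google.api.services.datastore.DatastoreV1.Entity;
import org.apache.avro.Schema;
import org.apache.avro.protobuf.ProtobufData;

public class SanitizedSchema {
    // Hypothetical replacement for getSchema() in the pipeline above: strips
    // "$" from the schema JSON so BigQuery's Avro parser accepts the
    // namespaces. Crude, but workable if no field name or default contains "$".
    static Schema getSanitizedSchema() {
        String json = ProtobufData.get().getSchema(Entity.class).toString();
        return new Schema.Parser().parse(json.replace("$", ""));
    }
}

Since Avro's binary encoding does not depend on record names, writing with the sanitized schema should produce the same bytes as before; only the schema in the file header changes.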