There is a data size constraint in protobuf: it uses an `int` for the buffer size, which limits the maximum serialized message size to 2 GB.
There are two possible workarounds, both built on the same concept:
Flush to the same stream by batch
As an example, the loop could flush either every 1 million rows, or whenever the serialized size grows above 268 MB (2^28 bytes):
while (rs != null && rs.next()) {
    models.addModels(..newBuilder().set...(rs.getString("..")...)
            .build());
    if (++rowcount >= 1_000_000) {
        // alternatively, flush on serialized size as well:
        // if (rowcount >= 1_000_000 || models.build().getSerializedSize() > Math.pow(2, 28)) {
        rowcount = 0;
        // flush the batch, appending to the same file
        try (FileOutputStream fos = new FileOutputStream(Constants.MODEL_PB_FILE, true)) {
            models.build().writeTo(fos);
        } catch (IOException e) {
            e.printStackTrace();
        }
        models.clear();
    }
}
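Reading this single file back works because protobuf treats the concatenation of serialized messages of the same type as one message: repeated fields are appended during parsing. Below is a minimal read-back sketch, assuming the generated type is called Models (the actual name is elided in the snippet above) and that the combined data still fits within the parser's own 2 GB limit.

// parse the appended batches back into one merged message; the repeated
// "models" field then contains the rows from every flushed batch
try (FileInputStream fis = new FileInputStream(Constants.MODEL_PB_FILE)) {
    Models merged = Models.parseFrom(fis);
    // merged.getModelsList() now holds all batches, assuming the total
    // serialized size stays below the 2 GB parse-side limit
} catch (IOException e) {
    e.printStackTrace();
}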
Alternatively, each batch could be pushed to, and later read from, a separate stream.
if (++rowcount >= 1_000_000) {
    rowcount = 0;
    // flush the batch to its own numbered file
    try (FileOutputStream fos = new FileOutputStream(CACHE_FILE + currentFileIndex++, true)) {
        models.build().writeTo(fos);
    } catch (IOException e) {
        e.printStackTrace();
    }
    models.clear();
}
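One caveat with either variant: when the result set is exhausted, any rows accumulated since the last full batch are still sitting in the builder and have not been written. A final flush after the loop is needed; a minimal sketch, assuming the same models builder, the generated getModelsCount() accessor for the repeated models field, and the CACHE_FILE naming used above:

// flush the final partial batch; without this, the rows added after the
// last 1-million-row flush would never reach a file
if (models.getModelsCount() > 0) {
    try (FileOutputStream fos = new FileOutputStream(CACHE_FILE + currentFileIndex++, true)) {
        models.build().writeTo(fos);
    } catch (IOException e) {
        e.printStackTrace();
    }
}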
The same batching applies when reading the files back:
Files.list(Paths.get(Constants.CACHE_FILE_DIR))
        .filter(Files::isRegularFile)
        .map(Path::toFile)
        .filter(file -> file.getName().startsWith(Constants.PB_FILE))
        .parallel()
        .map(file -> readFile(file))
        .reduce(....)
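The readFile helper and the reduce are left open above. One possible shape, treating the Models type and this readFile signature as assumptions rather than part of the original, is to parse each file into a message and merge the results through a builder:

// hypothetical readFile helper: parses one batch file into a Models message
private static Models readFile(File file) {
    try (FileInputStream fis = new FileInputStream(file)) {
        return Models.parseFrom(fis);
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}

// one possible shape for the elided reduce: merging the per-file messages
// concatenates their repeated fields into a single Models instance
.reduce(Models.getDefaultInstance(),
        (left, right) -> left.toBuilder().mergeFrom(right).build());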